I have log files like this:
   #timestamp (ms), server, user, action, domain, x, y, z
   1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
   1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
   1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar

I have the following Pig script to count how many times each domain appears
in the logs (for example, that we have seen facebook.com 10 times).
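
To make the intended result concrete, here is a minimal Python sketch of the
same group, count, and sort logic, run on the three sample lines above. It is
only an illustration of the expected output, not the job that is slow:
count_domains is a made-up helper name, and stripping the spaces around each
field is an assumption about how the domains should be compared.

```python
from collections import Counter

# The three sample records copied from the log excerpt above.
SAMPLE_LINES = [
    "1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar",
    "1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar",
    "1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar",
]


def count_domains(lines):
    """Return (domain, count) pairs sorted by count, highest first.

    Hypothetical helper: mirrors GROUP BY domain, COUNT, ORDER BY count DESC.
    """
    counts = Counter()
    for line in lines:
        # Skip blank lines and the '#timestamp ...' header line.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        # Assumption: whitespace around each comma-separated field is noise.
        fields = [field.strip() for field in line.split(",")]
        counts[fields[4]] += 1  # domain is the fifth column
    return counts.most_common()


if __name__ == "__main__":
    for domain, count in count_domains(SAMPLE_LINES):
        print(domain, count)
```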

Here is the Pig script:

--------------------------------
records = LOAD '/logs-in/*.log' USING PigStorage(',')
    AS (ts:long, server:int, user:int, action_id:int, domain:chararray,
        price:int);

-- DUMP records;
grouped_by_domain = GROUP records BY domain;
-- DUMP grouped_by_domain;
-- DESCRIBE grouped_by_domain;

freq = FOREACH grouped_by_domain
    GENERATE group AS domain, COUNT(records) AS mycount;
-- DESCRIBE freq;
-- DUMP freq;

sorted = ORDER freq BY mycount DESC;
DUMP sorted;
--------------------------------

This script takes an hour to run. I also wrote a simple Java MR job to
count the domains, and it takes about 15 minutes. So the Pig script is
taking 4x longer to complete.

Any suggestions on what I am doing wrong in Pig?

Thanks,
Sujee
http://sujee.net
