A couple of possibilities off the top of my head:

1) Does your MR job also sort the results afterwards? In Pig, the ORDER is
going to kick off another MR job.
2) Does your MR job compile all the results into a single output?

My guess is that the ORDER + DUMP steps are what make it take longer.
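
To check that guess, one option is to time a version of the script that does
only what the Java job does: count, then write to HDFS, with no ORDER and no
DUMP. This is a rough sketch rather than a tested fix. The output path
'/logs-out/domain-freq' and the PARALLEL value of 10 are placeholders I made
up, so adjust both for your cluster.

```pig
-- Same LOAD as in the original script.
records = LOAD '/logs-in/*.log' USING PigStorage(',') AS (ts:long,
    server:int, user:int, action_id:int, domain:chararray, price:int);

-- Keep only the column we need before grouping.
domains = FOREACH records GENERATE domain;

-- PARALLEL sets the number of reducers; 10 is an assumed value.
grouped_by_domain = GROUP domains BY domain PARALLEL 10;

freq = FOREACH grouped_by_domain GENERATE group AS domain,
    COUNT(domains) AS mycount;

-- Write to HDFS instead of DUMPing to the console.
-- The output path is hypothetical.
STORE freq INTO '/logs-out/domain-freq' USING PigStorage(',');
```

If this runs in roughly the same time as the Java job, the extra time is
most likely coming from the sort and the DUMP.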

2011/6/17 Sujee Maniyam <su...@sujee.net>

> I have log files like this:
>   #timestamp (ms), server, user, action, domain, x, y, z
>   1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
>   1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
>   1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar
>
> I have the following pig script to count how many times each domain appears
> in the logs (for example, we have seen facebook.com 10 times, etc.).
>
> Here is the pig script:
>
> --------------------------------
> records = LOAD '/logs-in/*.log' using PigStorage(',') AS (ts:long,
> server:int, user:int, action_id:int, domain:chararray, price:int);
>
> -- DUMP records;
> grouped_by_domain = GROUP records BY domain;
> -- DUMP grouped_by_domain;
> -- DESCRIBE grouped_by_domain;
>
> freq = FOREACH grouped_by_domain GENERATE group as domain,
>     COUNT(records) as mycount;
> -- DESCRIBE freq;
> -- DUMP freq;
>
> sorted = ORDER freq BY mycount DESC;
> DUMP sorted;
> --------------------------------
>
> This script takes an hour to run. I also wrote a simple Java MR job to
> count the domains, and it takes about 15 mins. So the pig script is taking
> 4x longer to complete.
>
> Any suggestions on what I am doing wrong in pig?
>
> thanks
> Sujee
> http://sujee.net
>
