I have log files like this:

#timestamp (ms), server, user, action, domain, x, y, z
1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar
I have the following Pig script to count how many times each domain appears in the logs (for example, we have seen facebook.com 10 times, etc.). Here is the Pig script:

--------------------------------
records = LOAD '/logs-in/*.log' USING PigStorage(',')
    AS (ts:long, server:int, user:int, action_id:int, domain:chararray, price:int);
-- DUMP records;

grouped_by_domain = GROUP records BY domain;
-- DUMP grouped_by_domain;
-- DESCRIBE grouped_by_domain;

freq = FOREACH grouped_by_domain GENERATE group AS domain, COUNT(records) AS mycount;
-- DESCRIBE freq;
-- DUMP freq;

sorted = ORDER freq BY mycount DESC;
DUMP sorted;
--------------------------------

This script takes an hour to run. I also wrote a simple Java MR job to count the domains, and it finishes in about 15 minutes, so the Pig script takes 4x longer to complete. Any suggestions on what I am doing wrong in Pig?

Thanks,
Sujee
http://sujee.net
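
P.S. For reference, the Java MR job is essentially a word count keyed on the domain column. A simplified sketch of it (not the exact code I ran; class and variable names are just illustrative) looks roughly like this:

--------------------------------
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DomainCount {

    // Mapper: emit (domain, 1) for every non-header log line.
    public static class DomainMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text domain = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if (line.startsWith("#")) {           // skip the header line
                return;
            }
            String[] fields = line.split(",");
            if (fields.length > 4) {
                domain.set(fields[4].trim());     // domain is the 5th column
                context.write(domain, ONE);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts per domain.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "domain count");
        job.setJarByClass(DomainCount.class);
        job.setMapperClass(DomainMapper.class);
        job.setCombinerClass(SumReducer.class);   // combiner cuts shuffle volume
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
--------------------------------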