On 12/20/11 10:59 AM, "Alan Gates" <[email protected]> wrote:
>
>On Dec 14, 2011, at 12:41 PM, Dmitriy Ryaboy wrote:
>
>>
>> 2) I noticed that L7 can be greatly optimized. Currently it does this:
>>
>> register pigperf.jar;
>> %default PIGMIX_DIR /user/pig/tests/data/pigmix
>> A = load '$PIGMIX_DIR/page_views' using
>> org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user,
>> action, timespent, query_term, ip_addr, timestamp,
>> estimated_revenue, page_info, page_links);
>> B = foreach A generate user, timestamp;
>> C = group B by user;
>> D = foreach C {
>> morning = filter B by timestamp < 43200;
>> afternoon = filter B by timestamp >= 43200;
>> generate group, COUNT(morning), COUNT(afternoon);
>> }
>> store D into 'L7out';
>>
>> It can be improved to use combiners:
>>
>> register pigperf.jar;
>> %default PIGMIX_DIR /user/pig/tests/data/pigmix
>> A = load '$PIGMIX_DIR/page_views' using
>> org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user,
>> action, timespent, query_term, ip_addr, timestamp,
>> estimated_revenue, page_info, page_links);
>> B = foreach A generate user, timestamp,
>> (timestamp < 43200 ? 1 : 0) as morning,
>> (timestamp >= 43200 ? 1 : 0) as afternoon;
>> C = group B by user;
>> D = foreach C {
>> generate group, SUM(B.morning), SUM(B.afternoon);
>> }
>> store D into 'L7out';
>>
>> Is L7 supposed to test something that precludes the use of combiners,
>> or is improving the query fair game?
>
>According to https://cwiki.apache.org/confluence/display/PIG/PigMix L7
>was designed to test nested split. Your changes would fundamentally alter
>the test.
To me this looks like something that Pig could optimize on its own:
filter(condition) followed by COUNT can be rewritten as
SUM(condition ? 1 : 0). More generally, filter(condition) followed by any
algebraic aggregate can be transformed to use combiners.
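
To make the equivalence concrete, here is a rough Python sketch (not Pig
code, and not how Pig's optimizer is actually implemented). It compares the
filter-then-count form against per-split partial sums that are merged
afterwards, which is roughly what a combiner would do. The function names
and the sample rows are made up for illustration; only the 43200 threshold
comes from the script above.

```python
from collections import defaultdict

NOON = 43200  # threshold taken from the L7 script quoted above


def filter_then_count(rows):
    """Reference form: group all rows by user, then filter and count."""
    groups = defaultdict(list)
    for user, ts in rows:
        groups[user].append(ts)
    return {
        user: (
            sum(1 for t in stamps if t < NOON),   # COUNT(morning)
            sum(1 for t in stamps if t >= NOON),  # COUNT(afternoon)
        )
        for user, stamps in groups.items()
    }


def partial_sums(split):
    """Combiner-style step: one (morning, afternoon) pair per user per split."""
    acc = defaultdict(lambda: [0, 0])
    for user, ts in split:
        acc[user][0] += 1 if ts < NOON else 0   # (timestamp < 43200 ? 1 : 0)
        acc[user][1] += 1 if ts >= NOON else 0  # (timestamp >= 43200 ? 1 : 0)
    return acc


def merge(partials):
    """Reducer-style step: add up the per-split pairs for each user."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for user, (morning, afternoon) in partial.items():
            total[user][0] += morning
            total[user][1] += afternoon
    return {user: tuple(pair) for user, pair in total.items()}


# Hypothetical (user, timestamp) rows, spread over two input splits.
rows = [
    ("alice", 100),
    ("bob", 50000),
    ("alice", 43200),
    ("alice", 43199),
    ("bob", 60000),
]
splits = [rows[:2], rows[2:]]

result_reference = filter_then_count(rows)
result_combined = merge(partial_sums(s) for s in splits)
```

Both forms give the same per-user counts on this sample. The second one
only ships one small pair per user per split to the reducer instead of
every row, which is the point of using combiners.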
>
>Alan.
>>
>> D
>