Re: How do we feel about improving pigmix queries?

Scott Carey Mon, 09 Jan 2012 21:08:20 -0800


On 12/20/11 10:59 AM, "Alan Gates" <[email protected]> wrote:


>
>On Dec 14, 2011, at 12:41 PM, Dmitriy Ryaboy wrote:
>
>> 
>> 2) I noticed that L7 can be greatly optimized. Currently it does this:
>> 
>> register pigperf.jar;
>> %default PIGMIX_DIR /user/pig/tests/data/pigmix
>> A = load '$PIGMIX_DIR/page_views' using
>>     org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user,
>>     action, timespent, query_term, ip_addr, timestamp, estimated_revenue,
>>     page_info, page_links);
>> B = foreach A generate user, timestamp;
>> C = group B by user;
>> D = foreach C {
>>    morning = filter B by timestamp < 43200;
>>    afternoon = filter B by timestamp >= 43200;
>>    generate group, COUNT(morning), COUNT(afternoon);
>> }
>> store D into 'L7out';
>> 
>> It can be improved to use combiners:
>> 
>> register pigperf.jar;
>> %default PIGMIX_DIR /user/pig/tests/data/pigmix
>> A = load '$PIGMIX_DIR/page_views' using
>>     org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user,
>>     action, timespent, query_term, ip_addr, timestamp, estimated_revenue,
>>     page_info, page_links);
>> B = foreach A generate user, timestamp,
>>     (timestamp < 43200 ? 1 : 0) as morning,
>>     (timestamp >= 43200 ? 1 : 0) as afternoon;
>> C = group B by user;
>> D = foreach C {
>>    generate group, SUM(B.morning), SUM(B.afternoon);
>> }
>> store D into 'L7out';
>> 
>> Is L7 supposed to test something that precludes the use of combiners, or
>> is improving the query fair game?
>
>According to https://cwiki.apache.org/confluence/display/PIG/PigMix L7
>was designed to test nested split.  Your changes would fundamentally alter
>the test.

To me this looks like something that Pig could optimize on its own:
filter(condition) followed by COUNT can be rewritten as
SUM(condition ? 1 : 0).  More generally, filter(condition) followed by an
algebraic aggregate can be transformed to use combiners.
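
A rough sketch of why the rewrite is safe, written in Python rather than
Pig Latin so it can be run standalone. It models the two plans for the L7
query above: the first collects each user's whole bag before filtering and
counting, the second sums (condition ? 1 : 0) per input split and merges the
partial sums, which is the step a combiner could perform. The function names,
the sample rows, and the two-way split are made up for illustration, and null
handling is ignored.

```python
MORNING_CUTOFF = 43200  # same threshold as in the L7 script


def filter_then_count(rows):
    """Reducer-only plan: build each user's bag, then filter and count."""
    bags = {}
    for user, ts in rows:
        bags.setdefault(user, []).append(ts)
    return {
        user: (sum(1 for t in bag if t < MORNING_CUTOFF),
               sum(1 for t in bag if t >= MORNING_CUTOFF))
        for user, bag in bags.items()
    }


def partial_sums(rows):
    """Map + combine for one split: per-user sums of (condition ? 1 : 0)."""
    acc = {}
    for user, ts in rows:
        morning, afternoon = acc.get(user, (0, 0))
        acc[user] = (morning + (1 if ts < MORNING_CUTOFF else 0),
                     afternoon + (1 if ts >= MORNING_CUTOFF else 0))
    return acc


def merge(partials):
    """Reduce: add up the partial sums produced for every split."""
    out = {}
    for part in partials:
        for user, (morning, afternoon) in part.items():
            prev_m, prev_a = out.get(user, (0, 0))
            out[user] = (prev_m + morning, prev_a + afternoon)
    return out


# Hypothetical sample data, split in two as if read by two map tasks.
rows = [("alice", 100), ("bob", 50000), ("alice", 43200),
        ("alice", 43199), ("bob", 60000)]
splits = [rows[:2], rows[2:]]

assert merge([partial_sums(s) for s in splits]) == filter_then_count(rows)
```

Only one small pair of integers per user leaves each split in the second
plan, instead of every (user, timestamp) row, which is where the savings
would come from.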

>
>Alan.
>> 
>> D
>
