.. which would also make the nested split test go away, just in a more
subtle way. Sounds like we should
a) implement the optimization
b) write a new PigMix query that can't be optimized that way to test the split.
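
To make (a) concrete, here is a rough sketch in plain Python (not Pig
internals) of why the rewrite should be safe to do automatically: counting
the rows that pass a filter gives the same answer as summing a 0/1
indicator, and the indicator sums can be computed per input split and
merged later, which is what a combiner does. The function names, the
sample rows, and the two-way split are all made up for illustration; only
the 43200 threshold comes from the L7 script.

```python
from collections import defaultdict

NOON = 43200  # same threshold the L7 script uses


def nested_filter_count(rows):
    """Current L7 shape: group by user, then filter each bag and count."""
    bags = defaultdict(list)
    for user, ts in rows:
        bags[user].append(ts)
    return {
        user: (
            sum(1 for t in bag if t < NOON),   # COUNT(morning)
            sum(1 for t in bag if t >= NOON),  # COUNT(afternoon)
        )
        for user, bag in bags.items()
    }


def partial_sums(rows):
    """Combiner-style step: per-split (morning, afternoon) indicator sums."""
    acc = defaultdict(lambda: [0, 0])
    for user, ts in rows:
        acc[user][0] += 1 if ts < NOON else 0
        acc[user][1] += 1 if ts >= NOON else 0
    return acc


def merge(partials):
    """Reducer-style step: add up the partial sums from every split."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for user, (morning, afternoon) in partial.items():
            total[user][0] += morning
            total[user][1] += afternoon
    return {user: tuple(counts) for user, counts in total.items()}


# Hypothetical (user, timestamp) rows, split in two as if read by two mappers.
rows = [("alice", 100), ("bob", 50000), ("alice", 43200),
        ("alice", 43199), ("bob", 60000)]
splits = [rows[:2], rows[2:]]

baseline = nested_filter_count(rows)
combined = merge(partial_sums(split) for split in splits)
assert baseline == combined
```

The baseline has to see a user's whole bag before it can count, while the
partial sums never hold more than two integers per user. That is the gain
the rewrite is after.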

As far as parallelization selection not being mature enough -- that's
perfect! As we improve parallelization selection, timing on the
queries will improve. Maybe we can have a couple of separate queries
that test this, instead of changing the existing ones.

D

On Mon, Jan 9, 2012 at 9:10 PM, Scott Carey <[email protected]> wrote:
>
>
> On 12/20/11 10:59 AM, "Alan Gates" <[email protected]> wrote:
>
>>
>>On Dec 14, 2011, at 12:41 PM, Dmitriy Ryaboy wrote:
>>
>>>
>>> 2) I noticed that L7 can be greatly optimized. Currently it does this:
>>>
>>> register pigperf.jar;
>>> %default PIGMIX_DIR /user/pig/tests/data/pigmix
>>> A = load '$PIGMIX_DIR/page_views' using
>>>     org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user,
>>>     action, timespent, query_term, ip_addr, timestamp,
>>>     estimated_revenue, page_info, page_links);
>>> B = foreach A generate user, timestamp;
>>> C = group B by user;
>>> D = foreach C {
>>>    morning = filter B by timestamp < 43200;
>>>    afternoon = filter B by timestamp >= 43200;
>>>    generate group, COUNT(morning), COUNT(afternoon);
>>> }
>>> store D into 'L7out';
>>>
>>> It can be improved to use combiners:
>>>
>>> register pigperf.jar;
>>> %default PIGMIX_DIR /user/pig/tests/data/pigmix
>>> A = load '$PIGMIX_DIR/page_views' using
>>>     org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user,
>>>     action, timespent, query_term, ip_addr, timestamp,
>>>     estimated_revenue, page_info, page_links);
>>> B = foreach A generate user, timestamp,
>>>     (timestamp < 43200 ? 1 : 0) as morning,
>>>     (timestamp >= 43200 ? 1 : 0) as afternoon;
>>> C = group B by user;
>>> D = foreach C {
>>>    generate group, SUM(B.morning), SUM(B.afternoon);
>>> }
>>> store D into 'L7out';
>>>
>>> Is L7 supposed to test something that precludes the use of combiners,
>>> or is improving the query fair game?
>>
>>According to https://cwiki.apache.org/confluence/display/PIG/PigMix L7
>>was designed to test nested split.  Your changes would fundamentally alter
>>the test.
>
> To me this looks like something that Pig could optimize on its own.
> Filter(condition) then count --> Sum (condition ? 1 : 0).  More generally
> filter(condition) then algebraic_aggregate can be transformed to use
> combiners.
>
>>
>>Alan.
>>>
>>> D
>>
>
