On Dec 14, 2011, at 12:41 PM, Dmitriy Ryaboy wrote:

> Two questions relating to that:
> 
> 1) we currently hardcode parallel 40 in pigmix. Since Pig can now
> automatically select parallelism, would it be better to let it do so?
I agree the hard wiring is bad.  But my take is that the auto-parallel feature 
isn't mature enough to pick well yet.  Perhaps we should instead change the 
base PigMix to use parameter substitution so that users are required to set 
this value.

> 
> 2) I noticed that L7 can be greatly optimized. Currently it does this:
> 
> register pigperf.jar;
> %default PIGMIX_DIR /user/pig/tests/data/pigmix
> A = load '$PIGMIX_DIR/page_views' using
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action,
> timespent, query_term,
>            ip_addr, timestamp, estimated_revenue, page_info, page_links);
> B = foreach A generate user, timestamp;
> C = group B by user;
> D = foreach C {
>    morning = filter B by timestamp < 43200;
>    afternoon = filter B by timestamp >= 43200;
>    generate group, COUNT(morning), COUNT(afternoon);
> }
> store D into 'L7out';
> 
> It can be improved to use combiners:
> 
> register pigperf.jar;
> %default PIGMIX_DIR /user/pig/tests/data/pigmix
> A = load '$PIGMIX_DIR/page_views' using
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action,
> timespent, query_term,
>            ip_addr, timestamp, estimated_revenue, page_info, page_links);
> B = foreach A generate user, timestamp,
>      (timestamp < 43200 ? 1 : 0) as morning, (timestamp >= 43200 ? 1 : 0)
> as afternoon;
> C = group B by user;
> D = foreach C {
>    generate group, SUM(B.morning), SUM(B.afternoon);
> }
> store D into 'L7out';
> 
> Is L7 supposed to test something that precludes the use of combiners, or
> is improving the query fair game?

According to https://cwiki.apache.org/confluence/display/PIG/PigMix L7 was 
designed to test nested split.  Your changes would fundamentally alter the test.
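
To make the difference concrete, here is a rough Python sketch (not Pig, and not part of PigMix) of the two shapes. It assumes non-null timestamps and uses made-up rows; the function names are hypothetical. The first function mimics the nested split, which needs each user's whole bag before filtering it twice. The other two mimic the combiner-friendly rewrite, where each input split produces partial per-user counts that are merged later.

```python
from collections import defaultdict

NOON = 43200  # same threshold as in the scripts quoted above


def nested_split(rows):
    """Nested-split shape: build each user's full bag, then filter it twice."""
    bags = defaultdict(list)
    for user, ts in rows:
        bags[user].append(ts)
    return {
        user: (sum(1 for t in bag if t < NOON), sum(1 for t in bag if t >= NOON))
        for user, bag in bags.items()
    }


def partial_sums(rows):
    """Combiner-style step: per-user [morning, afternoon] counts for one split."""
    acc = defaultdict(lambda: [0, 0])
    for user, ts in rows:
        acc[user][0 if ts < NOON else 1] += 1
    return acc


def merge(partials):
    """Reduce-style step: add up the partial counts from every split."""
    total = defaultdict(lambda: [0, 0])
    for part in partials:
        for user, (morning, afternoon) in part.items():
            total[user][0] += morning
            total[user][1] += afternoon
    return {user: tuple(counts) for user, counts in total.items()}


# Hypothetical sample data: (user, timestamp) pairs spread over two splits.
rows = [("alice", 100), ("bob", 50000), ("alice", 43200),
        ("alice", 43199), ("bob", 60000)]
splits = [rows[:2], rows[2:]]

assert nested_split(rows) == merge(partial_sums(s) for s in splits)
```

Both paths give the same counts, but only the first one exercises the "materialize the bag, then split it" work, which is the part L7 is meant to measure.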

Alan.
> 
> D
