.. which would also make the nested split test go away, just in a more subtle way. Sounds like we should (a) implement the optimization and (b) write a new PigMix query that can't be optimized that way, to test the split.
As far as parallelization selection not being mature enough -- that's perfect! As we improve parallelization selection, timing on the queries will improve. Maybe we can have a couple of separate queries that test this, instead of changing the existing ones.

D

On Mon, Jan 9, 2012 at 9:10 PM, Scott Carey wrote:
>
>
> On 12/20/11 10:59 AM, "Alan Gates" wrote:
>
>>
>>On Dec 14, 2011, at 12:41 PM, Dmitriy Ryaboy wrote:
>>
>>>
>>> 2) I noticed that L7 can be greatly optimized. Currently it does this:
>>>
>>> register pigperf.jar;
>>> %default PIGMIX_DIR /user/pig/tests/data/pigmix
>>> A = load '$PIGMIX_DIR/page_views' using
>>>     org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action,
>>>     timespent, query_term,
>>>     ip_addr, timestamp, estimated_revenue, page_info, page_links);
>>> B = foreach A generate user, timestamp;
>>> C = group B by user;
>>> D = foreach C {
>>>     morning = filter B by timestamp < 43200;
>>>     afternoon = filter B by timestamp >= 43200;
>>>     generate group, COUNT(morning), COUNT(afternoon);
>>> }
>>> store D into 'L7out';
>>>
>>> It can be improved to use combiners:
>>>
>>> register pigperf.jar;
>>> %default PIGMIX_DIR /user/pig/tests/data/pigmix
>>> A = load '$PIGMIX_DIR/page_views' using
>>>     org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action,
>>>     timespent, query_term,
>>>     ip_addr, timestamp, estimated_revenue, page_info, page_links);
>>> B = foreach A generate user, timestamp,
>>>     (timestamp < 43200 ? 1 : 0) as morning,
>>>     (timestamp >= 43200 ? 1 : 0) as afternoon;
>>> C = group B by user;
>>> D = foreach C {
>>>     generate group, SUM(B.morning), SUM(B.afternoon);
>>> }
>>> store D into 'L7out';
>>>
>>> Is L7 supposed to test something that precludes the use of combiners, or
>>> is improving the query fair game?
>>
>>According to https://cwiki.apache.org/confluence/display/PIG/PigMix L7
>>was designed to test nested split. Your changes would fundamentally alter
>>the test.
>
> To me this looks like something that Pig could optimize on its own.
> Filter(condition) then count --> Sum (condition ? 1 : 0). More generally
> filter(condition) then algebraic_aggregate can be transformed to use
> combiners.
>
>>
>>Alan.
>>>
>>> D
>>
>
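
For anyone following along, here is a rough Python sketch (not Pig code, and not how Pig's optimizer is implemented) of why the rewrite Scott describes is combiner-friendly. Filter-then-count needs the whole group in one place, while summing a 0/1 indicator can be done as partial sums per input chunk and merged afterwards. The function names and sample rows are made up for illustration, and null handling is ignored.

```python
from collections import defaultdict

NOON = 43200  # seconds since midnight, the threshold used in L7


def nested_filter_count(rows):
    """Original shape: collect every row per user, then filter and count."""
    groups = defaultdict(list)
    for user, timestamp in rows:
        groups[user].append(timestamp)
    return {
        user: (
            sum(1 for t in ts if t < NOON),   # COUNT(morning)
            sum(1 for t in ts if t >= NOON),  # COUNT(afternoon)
        )
        for user, ts in groups.items()
    }


def partial_sums(rows):
    """Combiner-style step: reduce one chunk to per-user partial sums."""
    acc = defaultdict(lambda: [0, 0])
    for user, timestamp in rows:
        acc[user][0] += 1 if timestamp < NOON else 0
        acc[user][1] += 1 if timestamp >= NOON else 0
    return acc


def merge_partials(partials):
    """Reducer-style step: add up the partial sums from every chunk."""
    total = defaultdict(lambda: [0, 0])
    for part in partials:
        for user, (morning, afternoon) in part.items():
            total[user][0] += morning
            total[user][1] += afternoon
    return {user: tuple(counts) for user, counts in total.items()}


# Hypothetical (user, timestamp) rows, split across two "map" chunks.
chunk_1 = [("alice", 100), ("bob", 50000), ("alice", 43200)]
chunk_2 = [("alice", 43199), ("bob", 10), ("carol", 86399)]

baseline = nested_filter_count(chunk_1 + chunk_2)
combined = merge_partials([partial_sums(chunk_1), partial_sums(chunk_2)])
```

Both paths give the same per-user (morning, afternoon) pairs, but the second never has to hold a user's full bag of rows at once, which is the property a combiner relies on.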
