Two questions relating to that:
1) We currently hardcode "parallel 40" in PigMix. Since Pig can now
select parallelism automatically, would it be better to let it do so?
2) I noticed that L17 can be greatly optimized. Currently it does this:
register pigperf.jar;
%default PIGMIX_DIR /user/pig/tests/data/pigmix
A = load '$PIGMIX_DIR/page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, timestamp;
C = group B by user;
D = foreach C {
    morning = filter B by timestamp < 43200;
    afternoon = filter B by timestamp >= 43200;
    generate group, COUNT(morning), COUNT(afternoon);
}
store D into 'L7out';
It can be rewritten to use the combiner by replacing the nested filters
with per-row 0/1 flags that are summed after the group:
register pigperf.jar;
%default PIGMIX_DIR /user/pig/tests/data/pigmix
A = load '$PIGMIX_DIR/page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, timestamp,
    (timestamp < 43200 ? 1 : 0) as morning,
    (timestamp >= 43200 ? 1 : 0) as afternoon;
C = group B by user;
D = foreach C {
    generate group, SUM(B.morning), SUM(B.afternoon);
}
store D into 'L7out';
Is L17 supposed to test something that precludes the use of combiners, or
is improving the query fair game?
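
For anyone who wants to sanity-check the rewrite without a cluster, here is a rough Python sketch of the two formulations. It is only an illustration: the sample rows and function names are made up, and it does not model nulls, where COUNT and SUM may behave differently in Pig. It just shows that counting filtered rows per user and summing 0/1 flags per user give the same numbers on ordinary data.

```python
from collections import defaultdict

NOON = 43200  # cutoff in seconds, the same constant both scripts use

# Hypothetical (user, timestamp) rows; not taken from the PigMix data set.
rows = [
    ("alice", 100),
    ("alice", 50000),
    ("bob", 43200),
    ("alice", 43199),
    ("carol", 10),
]


def nested_filter_counts(rows):
    """Mimic the current script: group by user, then filter and count per group."""
    groups = defaultdict(list)
    for user, ts in rows:
        groups[user].append(ts)
    return {
        user: (
            sum(1 for ts in stamps if ts < NOON),   # COUNT(morning)
            sum(1 for ts in stamps if ts >= NOON),  # COUNT(afternoon)
        )
        for user, stamps in groups.items()
    }


def indicator_sums(rows):
    """Mimic the rewrite: compute 0/1 flags per row, then add them up per user.

    Because this is plain addition, partial sums can be merged in any order,
    which is the property a combiner relies on.
    """
    totals = defaultdict(lambda: [0, 0])
    for user, ts in rows:
        totals[user][0] += 1 if ts < NOON else 0
        totals[user][1] += 1 if ts >= NOON else 0
    return {user: tuple(pair) for user, pair in totals.items()}
```

Calling both functions on the same rows should return identical dictionaries keyed by user, with a (morning, afternoon) pair for each.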
D