Distinct UDF progress reports
-----------------------------
Key: PIG-685
URL: https://issues.apache.org/jira/browse/PIG-685
Project: Pig
Issue Type: Bug
Affects Versions: types_branch
Environment: Hadoop 0.18.3 on redhat, PIG svn from feb-01
Reporter: Tamir Kamara
When using the DISTINCT function, many of the map tasks are killed because they
fail to report progress for 600 seconds. It seems that PIG-646 should have
addressed this, but I'm still seeing many errors like this:
2009-02-21 11:41:53,916 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2009-02-21 11:41:57,727 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
2009-02-21 11:41:57,730 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
My query:
r0 = load 'domain-org/*' as (domain:chararray, org:chararray);
r3 = GROUP r0 BY org parallel 18;
r4 = FOREACH r3 {
    r5 = r0.domain;
    r6 = distinct r5;
    GENERATE group as org, COUNT(r6) as domains;
}
store r4 into 'org-domain-count';
The source files are 21GB in total, with some 800M lines, 60M distinct domains,
and 80K distinct orgs. Some orgs have 50M domains in them.
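A possible reformulation, offered only as an untested sketch and not as part of
the original report: if the goal is just the number of distinct domains per org,
the nested distinct can in principle be replaced by a top-level DISTINCT on
(org, domain) pairs followed by a plain GROUP and COUNT. The assumption here is
that a top-level DISTINCT runs as its own map-reduce step and does not go
through the Distinct$Intermediate UDF seen in the log, so it should spread the
50M-domain orgs across reducers rather than deduplicating one large bag inside a
single task. The aliases p0, p1 and p2 are new names introduced for this sketch.

    -- Sketch only: same input and output paths as the query above.
    r0 = load 'domain-org/*' as (domain:chararray, org:chararray);
    -- Keep only the two fields that matter and deduplicate the pairs.
    p0 = FOREACH r0 GENERATE org, domain;
    p1 = DISTINCT p0 PARALLEL 18;
    -- Each remaining row is one distinct domain for its org, so COUNT suffices.
    p2 = GROUP p1 BY org PARALLEL 18;
    r4 = FOREACH p2 GENERATE group as org, COUNT(p1) as domains;
    store r4 into 'org-domain-count';

Whether this avoids the 600-second timeout on this data set has not been
verified. It changes the execution plan (an extra map-reduce job), not the
intended result.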