[ https://issues.apache.org/jira/browse/PIG-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693532#action_12693532 ]
Tamir Kamara commented on PIG-685:
----------------------------------

I've replaced 1000 with 10 in Distinct.java (lines 129 & 148), and mappers are still failing because of failure to report for 600 seconds. There's also a heap space error on some mappers (same as before).

By the way, if I use the same script without COUNT (i.e. GENERATE group as org, r6;), the mappers all finish just fine, but the reducers fail because the GC overhead limit is exceeded. I'm running my tasks with 1024MB.

> Distinct UDF progress reports
> -----------------------------
>
>                 Key: PIG-685
>                 URL: https://issues.apache.org/jira/browse/PIG-685
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>         Environment: Hadoop 0.18.3 on redhat, PIG svn from feb-01
>            Reporter: Tamir Kamara
>
> When using the DISTINCT function many of the map tasks are being killed
> because of failure to report for 600 seconds. It seems that PIG-646 should
> have addressed this but I'm still seeing many errors like this:
>
> 2009-02-21 11:41:53,916 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
> 2009-02-21 11:41:57,727 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> 2009-02-21 11:41:57,730 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
>
> My query:
>
> r0 = load 'domain-org/*' as (domain:chararray, org:chararray);
> r3 = GROUP r0 BY org parallel 18;
> r4 = FOREACH r3 {
>     r5 = r0.domain;
>     r6 = distinct r5;
>     GENERATE group as org, COUNT(r6) as domains;
> }
> store r4 into 'org-domain-count';
>
> The source files are 21GB in total with some 800M lines, 60M distinct domains
> and 80K distinct orgs. Some orgs have 50M domains in them.