[ https://issues.apache.org/jira/browse/BEAM-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951462#comment-15951462 ]
Kenneth Knowles commented on BEAM-1848:
---------------------------------------

This is also now fixed in the Dataflow runner for the 0.6.0 release. Please try it out and see if your problem recurs.

> GroupByKey stuck with more than one worker on Dataflow
> ------------------------------------------------------
>
>                 Key: BEAM-1848
>                 URL: https://issues.apache.org/jira/browse/BEAM-1848
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow, sdk-java-core
>    Affects Versions: 0.6.0
>            Reporter: peay
>            Assignee: Kenneth Knowles
>            Priority: Blocker
>
> I have a simple pipeline which has a sliding window, a {{GroupByKey}} and
> then a simple stateful {{DoFn}}. I run in batch mode ({{--streaming=false}})
> on Dataflow.
> On a very small dataset of a couple of KB, I can run the pipeline to
> completion, and Dataflow shows "successful". On a larger dataset (but still
> very small, hundreds of MB read by the source), the pipeline stays stuck, no
> matter how long I wait. In addition, it never gets stuck at the same point. I
> expect about 340k output records, and never get more than 70k before it gets
> stuck.
> Dataflow always autoscales from 1 to 8 workers, which is my limit.
> Run A: after 30+ minutes, no elements have been output by {{GroupByKey}},
> but the logs have repeating occurrences of
> {code}
> Proposing dynamic split of work unit xxxxxx;aaaaaa;bbbbbb at {"position":{"shufflePosition":"AAAAAAD_AP8A_wEAAQ"}}
> Refusing to split <unstarted in shuffle range [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAAD_AP8A_wEAAQ): unstarted
> Refused to split GroupingShuffleReader <unstarted in shuffle range [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAAD_AP8A_wEAAQ)
> {code}
> Run B: after a couple of minutes, elements get added to the output of
> {{GroupByKey}}, up to 56,128, and then it stays stuck doing nothing, but the
> logs have repeating occurrences of
> {code}
> Proposing dynamic split of work unit xxxxxx;ccccccc;dddddd at {"position":{"shufflePosition":"AAAAAQD_AP8A_wEAAQ"}}
> Refusing to split <unstarted in shuffle range [ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAgD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAQD_AP8A_wEAAQ): unstarted
> Refused to split GroupingShuffleReader <unstarted in shuffle range [ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAgD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAQD_AP8A_wEAAQ)
> {code}
> Run C: after 10 minutes, elements get added to the output of {{GroupByKey}},
> up to 70,262, and then it stays stuck doing nothing. No logs like the above,
> as far as I can find.
> I've run this about a dozen times and it always gets stuck. I am now trying
> to run the pipeline with the worker limit set to one, and {{GroupByKey}} has
> output 150k so far, still increasing. This seems like a workaround, but
> using only one worker is not ideal.
> cc [~dhalp...@google.com]

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)