[ https://issues.apache.org/jira/browse/TEZ-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095816#comment-14095816 ]
Bikas Saha commented on TEZ-1400:
---------------------------------

Can you confirm that ShuffleVertexManager is being explicitly enabled for certain (or all) vertices by calling vertex.setVertexManager() and providing it a payload that sets TEZ_AM_SHUFFLE_VERTEX_MANAGER_ENABLE_AUTO_PARALLEL to true? This should not be turned on via the main job configuration, because it would then be inadvertently turned on for vertices that should not change their parallelism.

If it is being enabled explicitly via setVertexManager() with a payload, then that is where the bug should be. If it is not being explicitly turned on via setVertexManager(), then that should change.

One other thing you could try is to create a formal payload object for this manager, with a configurer that can set up all of its parameters. By default it could pick up params from the client-side tez-site.xml. Also, remove the code that creates a payload from the AM conf when no payload is supplied, so that the payload becomes required.

> Reducers stuck when enabling auto-reduce parallelism (MRR case)
> ---------------------------------------------------------------
>
>                 Key: TEZ-1400
>                 URL: https://issues.apache.org/jira/browse/TEZ-1400
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: TEZ-1400.1.patch, dag.dot
>
>
> In M -> R1 -> R2 case, if R1 is optimized by auto-parallelism R2 gets stuck waiting for events.
> e.g
> Map 1: 0/1 Map 2: -/- Map 5: 0/1 Map 6: 0/1 Map 7: 0/1 Reducer 3: 0/23 Reducer 4: 0/1
> ...
> ...
> Map 1: 1/1 Map 2: 148(+13)/161 Map 5: 1/1 Map 6: 1/1 Map 7: 1/1 Reducer 3: 0(+3)/3 Reducer 4: 0(+1)/1 <== Auto reduce parallelism kicks in
> ..
> Map 1: 1/1 Map 2: 161/161 Map 5: 1/1 Map 6: 1/1 Map 7: 1/1 Reducer 3: 3/3 Reducer 4: 0(+1)/1
> Job is stuck waiting for events in Reducer 4.
> [fetcher [Reducer_3] #23] org.apache.tez.runtime.library.common.shuffle.impl.ShuffleScheduler: copy(3 of 23 at 0.02 MB/s) <=== *Waiting for 20 more partitions, even though Reducer3 has been optimized to use 3 reducers



--
This message was sent by Atlassian JIRA (v6.2#6252)