[ https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129680#comment-14129680 ]

Jeff Zhang commented on TEZ-1345:
---------------------------------

[~hitesh] Attached the new patch.

* Removed vertexName from VertexDataMovementEventsGeneratedEvent; the unit test now uses vertexId.
* bq. any reason for using synchronized as compared to using something like a LinkedBlockingQueue for the cached events? Does not need to be changed but just curious as to whether other options were considered?
  Using LinkedBlockingQueue may still cause onRootVertexInitialized to return init_events from 2 inputs. On second thought, I think ConcurrentHashMap would be much better, so the new patch uses ConcurrentHashMap.
* bq. Regd. the test in TestDAGRecovery, the test should likely pass even if the caching fix is not applied. The issue only shows up in cases where there is a vertex which has an additional input as well as an inbound edge to it from another vertex. This can be addressed as part of the overall recovery end-to-end regression tests jira.
  If the issue is not fixed, the test won't pass even when the root vertex has only one additional input. The init event would be written after VertexInitedEvent, which would cause the recovery issue.

> Add checks to guarantee all init events are written to recovery to consider vertex initialized
> ----------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1345
>                 URL: https://issues.apache.org/jira/browse/TEZ-1345
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Hitesh Shah
>            Assignee: Jeff Zhang
>         Attachments: Tez-1345-10.patch, Tez-1345-2.patch, Tez-1345-3.patch, Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, Tez-1345-7.patch, Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch
>
>
> Related to issue discovered in TEZ-1033

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)