[ https://issues.apache.org/jira/browse/TEZ-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100295#comment-14100295 ]
Jeff Zhang commented on TEZ-850:
--------------------------------

[~hitesh] There is already a class for this kind of test: TestDAGRecovery, which introduces a customized VertexManager that fails at certain points. Here are the cases I can think of that need this kind of test:
* Test AM recovery with all data movement types, including 1-1, broadcast, and scatter-gather with/without shuffle. The AM should die in 2 scenarios: when the first vertex's tasks have all finished, and when only some of them have finished.
* TezCounter recovery

But I have one concern about the current implementation of TestDAGRecovery: it is limited to verifying that the DAG can finish successfully in recovery mode. A successful finish may not be enough to say that recovery works correctly (e.g. a task attempt may fail to recover and simply rerun, which would also let the DAG finish successfully in the end). One way to verify it is to read the recovery log and check whether the recovery worked as expected. What do you think?

> Recovery unit tests
> -------------------
>
> Key: TEZ-850
> URL: https://issues.apache.org/jira/browse/TEZ-850
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Hitesh Shah
> Assignee: Jeff Zhang
>
> Tests for custom edge managers, groups handling, etc.

--
This message was sent by Atlassian JIRA
(v6.2#6252)