[ 
https://issues.apache.org/jira/browse/TEZ-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013085#comment-14013085
 ] 

Bikas Saha commented on TEZ-1060:
---------------------------------

In the random failure case, all these configurations should be ignored.
The Processor can call processorContext.getTaskAttemptNumber() to learn its 
attempt number. Once the attempt number reaches the limit, the Processor 
should never fail itself; otherwise, it should fail itself randomly.
The Input can learn the attempt number of the source task from 
DataMovementEvent.getVersion(). Once the source task attempt reaches the 
limit, the Input should never fail that task; otherwise, it should fail that 
source task randomly.
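
The rule described above could be sketched roughly as follows. This is only an illustration, not code from any patch: the class RandomFailurePolicy, its fields attemptLimit and failProbability, and the seed parameter are all hypothetical names. The attempt numbers are passed in as plain ints; in the real test classes they would come from processorContext.getTaskAttemptNumber() and DataMovementEvent.getVersion().

```java
import java.util.Random;

// Hypothetical helper, not part of Tez: decides whether to inject a failure.
final class RandomFailurePolicy {
    private final int attemptLimit;        // assumed: attempt number at which failures must stop
    private final double failProbability;  // assumed: chance of failing below the limit
    private final Random random;

    RandomFailurePolicy(int attemptLimit, double failProbability, long seed) {
        if (attemptLimit < 0) {
            throw new IllegalArgumentException("attemptLimit must be >= 0");
        }
        if (failProbability < 0.0 || failProbability > 1.0) {
            throw new IllegalArgumentException("failProbability must be in [0, 1]");
        }
        this.attemptLimit = attemptLimit;
        this.failProbability = failProbability;
        this.random = new Random(seed); // a seed makes a failing run reproducible
    }

    // Processor side: attemptNumber would be the processor's own attempt number.
    boolean shouldProcessorFail(int attemptNumber) {
        return decide(attemptNumber);
    }

    // Input side: sourceAttemptNumber would be the version of the source task's output.
    boolean shouldFailSourceTask(int sourceAttemptNumber) {
        return decide(sourceAttemptNumber);
    }

    private boolean decide(int attemptNumber) {
        if (attemptNumber >= attemptLimit) {
            return false; // at or past the limit: never fail
        }
        return random.nextDouble() < failProbability; // below the limit: fail randomly
    }

    public static void main(String[] args) {
        RandomFailurePolicy policy = new RandomFailurePolicy(3, 0.5, 1234L);
        for (int attempt = 0; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + " fails: "
                + policy.shouldProcessorFail(attempt));
        }
    }
}
```

Because failures stop once the limit is reached, the task can still succeed as long as the limit is lower than the number of attempts the AM allows, which is what keeps a random scenario's outcome predictable.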
Does that clarify?

> Add randomness to fault tolerance tests
> ---------------------------------------
>
>                 Key: TEZ-1060
>                 URL: https://issues.apache.org/jira/browse/TEZ-1060
>             Project: Apache Tez
>          Issue Type: Sub-task
>    Affects Versions: 0.5.0
>            Reporter: Tassapol Athiapinya
>            Assignee: Tassapol Athiapinya
>         Attachments: TEZ-1060.1.patch, TEZ-1060.2.patch
>
>
> We have TestFaultTolerance for unit tests that check whether the AM 
> correctly handles processor failures and input failures. 
> TestFaultTolerance uses TestProcessor and TestInput to simulate 
> controlled failure scenarios for a DAG. In each test, on the processor 
> side, we select which tasks fail (do-fail), which physical task indexes 
> fail (failing-task-index) and up to which attempt these physical tasks fail 
> (failing-upto-task-attempt). On the input side, we select which tasks have 
> failed inputs (do-fail), which physical task indexes fail 
> (failing-task-index), up to which attempt these physical tasks have failed 
> inputs (failing-task-attempt), which physical inputs fail 
> (failing-input-index) and up to which version of the physical inputs the 
> tasks reject (failing-upto-input-attempt). In addition to task and input 
> failures, we also check the values of specific physical tasks to see 
> whether the inputs of downstream vertices match the outputs of upstream 
> vertices (verify-value, verify-task-index). These tests were added during 
> 0.3.0 and 0.4.0. With them we found several issues in the Tez AM, fixed 
> them and improved its stability. 
> Though the current unit tests are useful, they are limited to scenarios 
> carefully chosen by individual contributors. When Tez is used under heavy 
> load, more issues are likely to arise. To bring the fault tolerance tests 
> to a new level, we should add tests that generate randomized failure 
> scenarios. Each time a contributor runs the unit tests, a new scenario will 
> be generated. This gives the community more opportunity to report and fix 
> new issues.
> There are a few criteria for the new tests:
> - We want to keep the time needed to run unit tests minimal. Each 
> contributor runs different hardware. It is inconvenient if people with 
> slow machines need to spend too much time running tests for any patch.
> - Random scenarios need to be controlled enough that the expected behavior 
> is known. This means the parameters have to be validated by the test 
> itself first.



--
This message was sent by Atlassian JIRA
(v6.2#6252)