[jira] [Commented] (TEZ-902) Fetch failure issues in shuffle Input

Siddharth Seth (JIRA) Wed, 05 Mar 2014 13:01:27 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921391#comment-13921391
 ]


Siddharth Seth commented on TEZ-902:
------------------------------------

bq. I disagree that fixing it in the remaining list is the right solution. 1) 
There should be 1 way to handle bad inputs - there is already such a way == 
obsolete inputs - so let's stick to it. 2) While reading the code, it's 
immediately obvious to anyone that obsolete inputs are used to invalidate 
inputs. The alternate solution that uses the return value of copyMapOutputs and 
then removes them from remaining list in the fetcher and then updates the items 
in the MapHost via the remaining list is non-obvious. Also, from what I see the 
remaining list is an internal object used in the Fetcher to track the inputs it 
has fetched while the input is being consumed via the input stream.

The current mechanism in Shuffle + Fetcher is to drain all Inputs for a host 
and assign them to a Fetcher. Once this happens, the only place in the system 
that is aware of these Inputs is the Fetcher. On failure, the Fetcher 
determines whether all pending Inputs have failed, or only a single one. 
Depending on the scenario, it puts the remaining Inputs back into the list 
maintained for that host.

The mechanism used right now re-populates the list based on the Fetcher 
determining how many of the Inputs need to be marked as failed. 
Obsoletion is used primarily for InputFailures determined by the AM (the 
specific task may or may not have seen this failure).

If obsoletion is to be used for registering local fetch failures, and for 
avoiding retries, quite a bit more should ideally be changed along with it; 
otherwise we end up with a mix of mechanisms used by the Fetcher to indicate 
failure.

In the current case, not putting the first Input back would fix the problem. My 
guess as to why this isn't being done is that it has something to do with the 
penalty calculation etc. done in Shuffle, which is obviously not doing much.
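
To make the drain / put-back flow concrete, here is a minimal sketch. It is not 
the actual Tez code: HostInputQueue, drain, putBack and the String input ids 
are all hypothetical names chosen for illustration, and the sketch assumes the 
Fetcher knows which single Input failed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-host pending list kept by Shuffle.
// None of these names come from the Tez code base.
class HostInputQueue {
    private final Map<String, Deque<String>> pendingByHost = new HashMap<>();

    // Register an Input as pending for the given host.
    void addInput(String host, String inputId) {
        pendingByHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(inputId);
    }

    // Drain every pending Input for the host. After this call only the
    // caller (the Fetcher, in this sketch) knows about these Inputs.
    List<String> drain(String host) {
        Deque<String> queue = pendingByHost.remove(host);
        return queue == null ? new ArrayList<>() : new ArrayList<>(queue);
    }

    // Put the Fetcher's remaining Inputs back for the host, skipping the
    // one that failed so it is not immediately retried.
    void putBack(String host, List<String> remaining, String failedInputId) {
        for (String inputId : remaining) {
            if (!inputId.equals(failedInputId)) {
                addInput(host, inputId);
            }
        }
    }

    // Number of Inputs currently pending for the host.
    int pendingCount(String host) {
        Deque<String> queue = pendingByHost.get(host);
        return queue == null ? 0 : queue.size();
    }
}
```

In this sketch, draining three Inputs and then calling putBack with the first 
one as the failed id leaves two Inputs pending for the host. How the failed 
Input would feed into the penalty calculation is left out.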

Verification of the mapId is fine - typically we end up throwing an exception 
while reading from the stream itself, but it's possible to end up reading 
incorrect data.

On de-duping, I still think it's better to fail early and identify bugs on 
InputFailedEvent processing rather than masking them, but if you feel we should 
leave it as is right now, a log message will definitely be useful - and remove 
as soon as possible.
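
The two options can be sketched as follows. This is only an illustration under 
the assumption that de-duping means ignoring a second failure notification for 
the same Input; ObsoleteInputTracker, markObsolete and the failOnDuplicate 
flag are hypothetical and do not exist in Tez.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical tracker for Inputs reported as failed; not a Tez class.
class ObsoleteInputTracker {
    private static final Logger LOG =
        Logger.getLogger(ObsoleteInputTracker.class.getName());

    private final Set<String> obsoleteInputs = new HashSet<>();
    private final boolean failOnDuplicate;

    ObsoleteInputTracker(boolean failOnDuplicate) {
        this.failOnDuplicate = failOnDuplicate;
    }

    // Returns true if the Input was newly marked obsolete. On a duplicate,
    // either fails early (surfacing the bug) or logs and ignores it.
    boolean markObsolete(String inputId) {
        if (obsoleteInputs.add(inputId)) {
            return true;
        }
        if (failOnDuplicate) {
            throw new IllegalStateException(
                "Duplicate failure notification for input " + inputId);
        }
        LOG.warning("Ignoring duplicate failure notification for input " + inputId);
        return false;
    }
}
```

With failOnDuplicate set to true a repeated notification throws and exposes 
the bug; with false it is masked but at least leaves a trace in the log.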

Of the 3 points I'd mentioned in my previous comment, is this jira addressing 
only the first one, with the rest to be addressed in separate jiras?



> Fetch failure issues in shuffle Input
> -------------------------------------
>
>                 Key: TEZ-902
>                 URL: https://issues.apache.org/jira/browse/TEZ-902
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-902.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
