[jira] [Commented] (TEZ-3984) Shuffle: Out of Band DME event sending causes errors
[ https://issues.apache.org/jira/browse/TEZ-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596774#comment-16596774 ]

Jaume M commented on TEZ-3984:
------------------------------

Yeah, good point.

> Shuffle: Out of Band DME event sending causes errors
> ----------------------------------------------------
>
>                 Key: TEZ-3984
>                 URL: https://issues.apache.org/jira/browse/TEZ-3984
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.4, 0.9.1, 0.10.0
>            Reporter: Gopal V
>            Assignee: Jaume M
>            Priority: Critical
>              Labels: correctness
>         Attachments: TEZ-3984.1.patch
>
>
> If a task Input throws an exception, the outputs are also closed in
> LogicalIOProcessorRuntimeTask.cleanup().
> Cleanup ignores all the events returned by output close; however, if any
> output tries to send an event out of band by directly calling
> outputContext.sendEvents(events), those events can reach the AM before the
> task failure is reported.
> This can cause correctness issues with shuffle, since zero-sized events can
> be sent out due to an input failure, and downstream tasks may then never
> reattempt a fetch from the valid attempt.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
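The failure mode described above can be sketched as a gate on out-of-band event sends. This is an illustrative sketch only, not the actual Tez patch; the class and method names (EventGate, markTaskFailed, delivered) are invented for the example, with sendEvents standing in for outputContext.sendEvents(events):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the idea behind the fix: once the task has failed, an output's
// out-of-band sendEvents() call must not reach the AM ahead of the failure
// report, so the gate drops (or could buffer) such events.
class EventGate {
    private final AtomicBoolean taskFailed = new AtomicBoolean(false);
    private final List<String> sentToAM = new ArrayList<>();

    void markTaskFailed() {
        taskFailed.set(true);
    }

    // Stand-in for outputContext.sendEvents(events)
    void sendEvents(List<String> events) {
        if (taskFailed.get()) {
            // Drop out-of-band events from a failing task; the AM must see
            // the failure before any zero-sized DME events.
            return;
        }
        sentToAM.addAll(events);
    }

    List<String> delivered() {
        return sentToAM;
    }
}
```

In the real code path the suppression would have to live where cleanup and the output context meet, but the ordering invariant is the same: no data-movement events after the task is known to have failed.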
[jira] [Commented] (TEZ-3984) Shuffle: Out of Band DME event sending causes errors
[ https://issues.apache.org/jira/browse/TEZ-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596751#comment-16596751 ]

Gopal V commented on TEZ-3984:
------------------------------

The patch looks good. One minor nit: the OrderedPartitionedKVOutput event lists need to be prefixed. The sorter events need to be inserted at index 0, not appended, for event-ordering reasons (no such ordering issue exists today, because generateEvents is likely to produce no events, but it is neater to see them in order in the AM).
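The "insert at 0, not append" point can be illustrated with List.addAll(int, Collection). This is only a sketch of the list operation being asked for, not the actual patch; the method and event names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: prefixing a DME event list with sorter events via
// List.addAll(int index, Collection), so earlier-generated events are
// reported to the AM first.
class EventOrdering {
    static List<String> prefix(List<String> sorterEvents,
                               List<String> dmeEvents) {
        List<String> all = new ArrayList<>(dmeEvents);
        all.addAll(0, sorterEvents);  // insert at index 0, not append
        return all;
    }
}
```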
[jira] [Commented] (TEZ-3972) Tez DAG can hang when a single task fails to fetch
[ https://issues.apache.org/jira/browse/TEZ-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596684#comment-16596684 ]

Jonathan Eagles commented on TEZ-3972:
--------------------------------------

[~kshukla], I think we may need to protect ourselves from division by zero to avoid this in race conditions. What do you think?

> Tez DAG can hang when a single task fails to fetch
> --------------------------------------------------
>
>                 Key: TEZ-3972
>                 URL: https://issues.apache.org/jira/browse/TEZ-3972
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3972.001.patch, TEZ-3972.002.patch
>
>
> Description of the hung DAG:
> A DAG with 2 vertices. The {{Map}} vertex has 22k maps; the downstream
> {{Reduce}} vertex has 1009 tasks. All tasks succeed but one, which hangs.
> This one task (attempt) is doing a local fetch from a node that (now) has a
> bad disk. It fails to fetch and reports the offending input attempt
> identifiers to the AM. However, the AM does not schedule a re-run, because
> the {{uniquefailedOutputReports}} size is 1 (since only this task attempt
> failed to fetch) and the failure fraction is not met. The denominator for
> this fraction is the total number of tasks, which causes the re-run to never
> occur. This JIRA tracks the AM side of the change to alleviate this problem.
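The division-by-zero concern raised above can be sketched as a guard on the failure-fraction check. This is a hypothetical sketch, not the actual AM code; the method and parameter names (shouldRerunSource, uniqueFailedOutputReports, maxAllowedFailureFraction) are invented for the example:

```java
// Illustrative guard: when computing the failed-fetch fraction, a zero
// denominator (e.g. a race where the downstream task count is not yet
// known) should neither throw nor silently suppress a re-run.
class FetchFailureFraction {
    static boolean shouldRerunSource(int uniqueFailedOutputReports,
                                     int totalDownstreamTasks,
                                     double maxAllowedFailureFraction) {
        if (totalDownstreamTasks <= 0) {
            // Avoid division by zero: fall back to treating any failure
            // report as significant rather than dividing by zero.
            return uniqueFailedOutputReports > 0;
        }
        double failureFraction =
            (double) uniqueFailedOutputReports / totalDownstreamTasks;
        return failureFraction > maxAllowedFailureFraction;
    }
}
```

Note that the guard alone does not fix the hang described in the issue (1/1009 still falls below any reasonable threshold); it only removes the race-condition hazard the comment points at.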
[jira] [Assigned] (TEZ-3984) Shuffle: Out of Band DME event sending causes errors
[ https://issues.apache.org/jira/browse/TEZ-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaume M reassigned TEZ-3984:
----------------------------

    Assignee: Jaume M
  Attachment: TEZ-3984.1.patch
[jira] [Updated] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown
[ https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated TEZ-3980:
----------------------------
    Fix Version/s: 0.9.2

Thanks, [~gopalv]! I committed this to branch-0.9 as well.

> ShuffleRunner: the wake loop needs to check for shutdown
> --------------------------------------------------------
>
>                 Key: TEZ-3980
>                 URL: https://issues.apache.org/jira/browse/TEZ-3980
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Gopal V
>            Assignee: Gopal V
>            Priority: Major
>             Fix For: 0.9.2, 0.10.0
>
>         Attachments: TEZ-3980.1.patch
>
>
> In the ShuffleRunner threads, there is a loop that does not terminate if the
> task threads get killed:
> {code}
> while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
>     && numCompletedInputs.get() < numInputs) {
>   inputContext.notifyProgress();
>   boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
> }
> {code}
> The wakeLoop signal does not break out of this loop; it is missing a check
> for shutdown.
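The missing shutdown check can be sketched with a condition-variable loop that also consults a shutdown flag. This is a simplified stand-in, not the actual ShuffleRunner code; the class name, the awaitWork method, and the isShutdown flag are invented for the example:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the fixed wait loop: in addition to the input-count condition,
// each wakeup checks a shutdown flag so the loop can exit when the task is
// torn down, instead of spinning forever.
class ShuffleWaitLoop {
    final Lock lock = new ReentrantLock();
    final Condition wakeLoop = lock.newCondition();
    final AtomicBoolean isShutdown = new AtomicBoolean(false);
    final AtomicInteger numCompletedInputs = new AtomicInteger(0);
    final int numInputs = 10;

    // Returns true if the loop exited because of shutdown,
    // false if all inputs completed normally.
    boolean awaitWork() throws InterruptedException {
        lock.lock();
        try {
            while (numCompletedInputs.get() < numInputs) {
                if (isShutdown.get()) {
                    return true;  // the break that TEZ-3980 adds
                }
                // Short timeout so the shutdown flag is re-checked even
                // if no signal arrives (await can also wake spuriously).
                wakeLoop.await(10, TimeUnit.MILLISECONDS);
            }
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

The key design point is that a timed await plus a re-checked flag is robust against lost signals: even if no one ever signals wakeLoop, the loop observes the shutdown within one timeout interval.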