[jira] [Commented] (TEZ-3984) Shuffle: Out of Band DME event sending causes errors

2018-08-29 Thread Jaume M (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596774#comment-16596774
 ] 

Jaume M commented on TEZ-3984:
--

Yeah, good point

> Shuffle: Out of Band DME event sending causes errors
> 
>
> Key: TEZ-3984
> URL: https://issues.apache.org/jira/browse/TEZ-3984
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.8.4, 0.9.1, 0.10.0
> Reporter: Gopal V
> Assignee: Jaume M
> Priority: Critical
> Labels: correctness
> Attachments: TEZ-3984.1.patch
>
>
> If a task Input throws an exception, the outputs are also closed in 
> LogicalIOProcessorRuntimeTask.cleanup().
> Cleanup ignores all the events returned by output close. However, if any output 
> sends an event out of band by directly calling 
> outputContext.sendEvents(events), those events can reach the AM before 
> the task failure is reported.
> This can cause correctness issues with shuffle: zero-sized events can be 
> sent out because of an input failure, and downstream tasks may never reattempt a 
> fetch from the valid attempt.
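
To make the race described above concrete, here is a minimal Java sketch. It is not Tez code: Channel, closeOutOfBand, closeInBand and cleanupAfterFailure are hypothetical names standing in for the path to the AM, an output's close(), and the cleanup path. It only illustrates why events sent directly during close escape a cleanup that drops the returned events.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class OutOfBandSketch {
    /** Hypothetical stand-in for the path that delivers events to the AM. */
    static final class Channel {
        final List<String> delivered = new ArrayList<>();
        void sendEvents(List<String> events) { delivered.addAll(events); }
    }

    /** Output that sends during close(): the event escapes even on a failing task. */
    static List<String> closeOutOfBand(Channel am) {
        am.sendEvents(Collections.singletonList("zero-sized-event"));
        return Collections.emptyList();
    }

    /** Output that only returns its events: the caller decides whether to forward them. */
    static List<String> closeInBand() {
        return Collections.singletonList("zero-sized-event");
    }

    /** Cleanup after an input failure: whatever close() returns is dropped. */
    static void cleanupAfterFailure(Channel am, boolean outOfBand) {
        List<String> ignored = outOfBand ? closeOutOfBand(am) : closeInBand();
        // 'ignored' is deliberately never forwarded to the AM.
    }

    public static void main(String[] args) {
        Channel leaky = new Channel();
        cleanupAfterFailure(leaky, true);
        Channel safe = new Channel();
        cleanupAfterFailure(safe, false);
        System.out.println("out-of-band delivered: " + leaky.delivered.size());
        System.out.println("in-band delivered: " + safe.delivered.size());
    }
}
```

In this sketch the out-of-band variant delivers one event despite the failure, while the in-band variant delivers none.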



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3984) Shuffle: Out of Band DME event sending causes errors

2018-08-29 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596751#comment-16596751
 ] 

Gopal V commented on TEZ-3984:
--

The patch looks good. Minor nit: the OrderedPartitionedKVOutput event lists 
need to be prefixed.

The sorter events need to be inserted at index 0, not appended. This is about 
event ordering. It is not a problem today, because generateEvents is likely to 
return no events, but it is neater to see the events in order in the AM.
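
A small Java sketch of the difference between appending and inserting at index 0. The class and method names here are hypothetical and are not taken from the patch; only List.addAll(int, Collection) is a real JDK call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EventOrderSketch {
    /** Hypothetical helper: returns the generated events with the sorter events placed first. */
    static <T> List<T> prefixed(List<T> sorterEvents, List<T> generatedEvents) {
        List<T> out = new ArrayList<>(generatedEvents);
        out.addAll(0, sorterEvents); // insert at index 0 instead of appending
        return out;
    }

    public static void main(String[] args) {
        List<String> generated = Arrays.asList("generated-1");
        List<String> sorter = Arrays.asList("sorter-1", "sorter-2");
        System.out.println(prefixed(sorter, generated));
        // prints [sorter-1, sorter-2, generated-1]
    }
}
```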



[jira] [Commented] (TEZ-3972) Tez DAG can hang when a single task fails to fetch

2018-08-29 Thread Jonathan Eagles (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596684#comment-16596684
 ] 

Jonathan Eagles commented on TEZ-3972:
--

[~kshukla], I think we may need to guard against division by zero here, since 
a race condition could trigger it. What do you think?

> Tez DAG can hang when a single task fails to fetch
> --
>
> Key: TEZ-3972
> URL: https://issues.apache.org/jira/browse/TEZ-3972
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.1
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: TEZ-3972.001.patch, TEZ-3972.002.patch
>
>
> Description of the hung DAG:
> A DAG with 2 vertices. The {{Map}} vertex has 22k maps; the downstream vertex 
> {{Reduce}} has 1009 tasks. All tasks succeed but one, which hangs. This one 
> task (attempt) is doing a local fetch from a node that (now) has a bad disk. 
> It fails to fetch and reports the offending input attempt identifiers to the 
> AM. However, the AM does not schedule a re-run, because 
> {{uniquefailedOutputReports}} has size 1 (only this task attempt failed 
> to fetch) and the failure fraction is not met. The denominator for this fraction 
> is the total number of tasks, so the re-run never occurs. This 
> JIRA tracks the AM side of the change to alleviate this problem.
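
The arithmetic behind the hang can be sketched in a few lines of Java. This is an illustration, not the AM's code. The method names, the 0.1 threshold, and the choice to return 0 for a zero denominator are all assumptions made for the example. The zero-denominator guard reflects the division-by-zero concern raised in the comment.

```java
final class FailureFractionSketch {
    /**
     * Fraction of unique failed-output reports over the denominator.
     * Returning 0 for a non-positive denominator is a placeholder choice
     * for this sketch, not a statement of what the AM should do.
     */
    static float failureFraction(int uniqueFailedReports, int denominator) {
        if (denominator <= 0) {
            return 0f; // guard: avoid dividing by zero
        }
        return (float) uniqueFailedReports / denominator;
    }

    /** Hypothetical decision: re-run the source only once the fraction reaches the limit. */
    static boolean shouldRerun(int uniqueFailedReports, int denominator, float maxFraction) {
        return failureFraction(uniqueFailedReports, denominator) >= maxFraction;
    }

    public static void main(String[] args) {
        float assumedMaxFraction = 0.1f; // assumed threshold, for illustration only
        // One failing reducer out of 1009: the fraction is far below the threshold.
        System.out.println(failureFraction(1, 1009));
        System.out.println(shouldRerun(1, 1009, assumedMaxFraction)); // prints false
    }
}
```

With one report and 1009 tasks in the denominator the fraction is about 0.001, so under any threshold of that order the re-run is never scheduled, which matches the hang described above.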





[jira] [Assigned] (TEZ-3984) Shuffle: Out of Band DME event sending causes errors

2018-08-29 Thread Jaume M (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaume M reassigned TEZ-3984:


  Assignee: Jaume M
Attachment: TEZ-3984.1.patch



[jira] [Updated] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown

2018-08-29 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3980:

Fix Version/s: 0.9.2

Thanks, [~gopalv]!  I committed this to branch-0.9 as well.

> ShuffleRunner: the wake loop needs to check for shutdown
> 
>
> Key: TEZ-3980
> URL: https://issues.apache.org/jira/browse/TEZ-3980
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Gopal V
> Assignee: Gopal V
> Priority: Major
> Fix For: 0.9.2, 0.10.0
>
> Attachments: TEZ-3980.1.patch
>
>
> In the ShuffleRunner threads, there's a loop which does not terminate if the 
> task threads get killed.
> {code}
>   while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
>       && numCompletedInputs.get() < numInputs) {
>     inputContext.notifyProgress();
>     boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
>   }
> {code}
> Signalling wakeLoop does not take execution out of this loop, and the loop is 
> missing a break for shutdown.
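
One way to add the missing shutdown check, sketched in self-contained Java. This is not the committed patch. The class, the isShutdown flag, the workAvailable flag and the method names are assumptions made for illustration. Only the wait-on-a-Condition pattern is taken from the snippet above.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

final class WakeLoopSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition wakeLoop = lock.newCondition();
    private final AtomicBoolean isShutdown = new AtomicBoolean(false); // hypothetical flag
    private boolean workAvailable = false; // guarded by lock

    /** Waits for work; returns true if the wait ended because of shutdown. */
    boolean awaitWork(long pollMillis) {
        lock.lock();
        try {
            while (!workAvailable) {
                if (isShutdown.get()) {
                    return true; // the break for shutdown that the original loop lacks
                }
                wakeLoop.await(pollMillis, TimeUnit.MILLISECONDS);
            }
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Marks work as available and wakes any waiter. */
    void signalWork() {
        lock.lock();
        try {
            workAvailable = true;
            wakeLoop.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Requests shutdown and wakes any waiter so it can observe the flag. */
    void shutdown() {
        isShutdown.set(true);
        lock.lock();
        try {
            wakeLoop.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

The point of the sketch is that the signal alone is not enough: the loop condition has to re-check a shutdown flag after each wake-up, otherwise the waiter goes straight back to sleep.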


