[jira] [Commented] (TEZ-3910) Single node can cause Tez job to fail during shuffle

Jonathan Turner Eagles (Jira) Mon, 30 Jun 2025 15:24:05 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987033#comment-17987033
 ]


Jonathan Turner Eagles commented on TEZ-3910:
---------------------------------------------

Another, earlier stack trace associated with this error. In this case the 
upstream host, READTIMEDOUT_HOST, was giving read timed out errors. The task 
blamed itself instead of correctly blaming the upstream.
{code:java}
 
2017-07-11 13:15:11,575 [INFO] [Fetcher_O {scope_334} #2] |HttpConnection.url|: for url=http://READTIMEDOUT_HOST:8043/mapOutput?job=******** sent hash and receievd reply 0 ms
2017-07-11 13:29:11,661 [INFO] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: Failed to read data to memory for InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1]. len=6251502, decomp=28568387. ExceptionMessage=Read timed out
2017-07-11 13:29:11,661 [WARN] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from READTIMEDOUT_HOST:8043 failed, retry it.
2017-07-11 13:32:11,763 [WARN] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to READTIMEDOUT_HOST:8043 with 1 inputs pending
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:351)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:292)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2017-07-11 13:32:11,766 [INFO] [Fetcher_O {scope_334} #2] |orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1], numUniqueHosts=3, hostFailureThreshold=3, hostFailuresCount=0, hosts crossing threshold=0, reducerFetchIssues=false
2017-07-11 13:32:11,767 [WARN] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks [InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1]]
2017-07-11 13:32:11,767 [INFO] [Fetcher_O {scope_334} #2] |orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1], numUniqueHosts=3, hostFailureThreshold=3, hostFailuresCount=1, hosts crossing threshold=0, reducerFetchIssues=false
2017-07-11 13:35:14,467 [WARN] [Fetcher_O {scope_334} #0] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to READTIMEDOUT_HOST:8043 with 1 inputs pending
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:351)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:261)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2017-07-11 13:35:14,467 [ERROR] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: scope_334: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
2017-07-11 13:35:14,468 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.Shuffle|: scope_334: Setting throwable in reportException with message [scope_334: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true] from thread [Fetcher_O {scope_334} #0
2017-07-11 13:35:14,468 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: copy(3 (spillsFetched=3) of 4. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
2017-07-11 13:35:14,468 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: Shutting down fetchers for input: scope_334, shutdown timetaken: 0 ms, hasFetcherExecutorStopped: true
2017-07-11 13:35:14,468 [INFO] [ShuffleAndMergeRunner {scope_334}] |orderedgrouped.ShuffleScheduler|: scope_334: Interrupted while waiting for host and hasBeenShutdown. Breaking out of ShuffleSchedulerCallable loop
2017-07-11 13:35:14,469 [INFO] [ShuffleAndMergeRunner {scope_334}] |orderedgrouped.ShuffleScheduler|: Shutting down FetchScheduler for input: scope_334, wasInterrupted=true
2017-07-11 13:35:14,469 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: scope_334: Already shutdown. Ignoring fetch complete
2017-07-11 13:35:14,469 [ERROR] [ShuffleAndMergeRunner {scope_334}] |orderedgrouped.Shuffle|: scope_334: ShuffleRunner failed with error
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {scope_334} #0
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:304)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:286)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: scope_334: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1021)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:762)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:379)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:261)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    ... 5 more
{code}
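
To make the final ERROR line in the log easier to read, here is a simplified sketch of how the printed flags might combine into the "shuffle is unhealthy" decision. It is not the Tez source: the class name, method signature, and the maxFailedUniqueFetches value of 5 are hypothetical, and the condition is only an assumption inferred from the flag names and the error message. It illustrates how a single failing upstream host can be enough for the task to fail itself when only one input is still pending.

```java
// Illustrative sketch only; NOT the actual ShuffleScheduler.isShuffleHealthy code.
// Parameter names mirror the fields printed in the ERROR log line.
final class ShuffleHealthSketch {

    static boolean isShuffleHealthy(int failureCounts,
                                    int pendingInputs,
                                    int maxFailedUniqueFetches, // hypothetical threshold
                                    boolean fetcherHealthy,
                                    boolean reducerProgressedEnough,
                                    boolean reducerStalled) {
        // Assumed: "too many fetch failures" means either the threshold was hit
        // or every input that is still pending has already failed.
        boolean tooManyFailures = failureCounts >= maxFailedUniqueFetches
                || failureCounts == pendingInputs;
        // Assumed: "insufficient progress" means too little progress or a long stall.
        boolean insufficientProgress = !reducerProgressedEnough || reducerStalled;
        return !(tooManyFailures && !fetcherHealthy && insufficientProgress);
    }

    public static void main(String[] args) {
        // Values from the log: failureCounts=1, pendingInputs=1, fetcherHealthy=false,
        // reducerProgressedEnough=true, reducerStalled=true.
        boolean healthy = isShuffleHealthy(1, 1, 5, false, true, true);
        System.out.println("healthy=" + healthy); // prints healthy=false
    }
}
```

Under these assumptions, the one remaining input sitting on READTIMEDOUT_HOST makes failureCounts equal pendingInputs, so the downstream task declares its own shuffle failed before the upstream attempt can be re-run.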
 

> Single node can cause Tez job to fail during shuffle
> ----------------------------------------------------
>
>                 Key: TEZ-3910
>                 URL: https://issues.apache.org/jira/browse/TEZ-3910
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3910.001.patch, TEZ-3910.002.patch, 
> TEZ-3910.003.patch, TEZ-3910.004.patch, TEZ-3910.005.patch
>
>
> There is a race where a downstream task that is running into fetch failures 
> due to bad output from the upstream task can continue to blame itself for the 
> failure before the AM can do a re-run of the upstream offending task and fix 
> the fetch failure. This causes the DAG to fail even if a single node fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
