[
https://issues.apache.org/jira/browse/TEZ-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987033#comment-17987033
]
Jonathan Turner Eagles commented on TEZ-3910:
---------------------------------------------
Another earlier stack trace associated with this error. In this case the
upstream host, READTIMEDOUT_HOST, was giving read timed out errors. The task
blamed itself instead of correctly blaming the upstream.
{code:java}
2017-07-11 13:15:11,575 [INFO] [Fetcher_O {scope_334} #2] |HttpConnection.url|: for url=http://READTIMEDOUT_HOST:8043/mapOutput?job=******** sent hash and receievd reply 0 ms
2017-07-11 13:29:11,661 [INFO] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: Failed to read data to memory for InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1]. len=6251502, decomp=28568387. ExceptionMessage=Read timed out
2017-07-11 13:29:11,661 [WARN] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from READTIMEDOUT_HOST:8043 failed, retry it.
2017-07-11 13:32:11,763 [WARN] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to READTIMEDOUT_HOST:8043 with 1 inputs pending
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:351)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:292)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2017-07-11 13:32:11,766 [INFO] [Fetcher_O {scope_334} #2] |orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1], numUniqueHosts=3, hostFailureThreshold=3, hostFailuresCount=0, hosts crossing threshold=0, reducerFetchIssues=false
2017-07-11 13:32:11,767 [WARN] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks [InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1]]
2017-07-11 13:32:11,767 [INFO] [Fetcher_O {scope_334} #2] |orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1], numUniqueHosts=3, hostFailureThreshold=3, hostFailuresCount=1, hosts crossing threshold=0, reducerFetchIssues=false
2017-07-11 13:35:14,467 [WARN] [Fetcher_O {scope_334} #0] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to READTIMEDOUT_HOST:8043 with 1 inputs pending
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:351)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:261)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2017-07-11 13:35:14,467 [ERROR] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: scope_334: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
2017-07-11 13:35:14,468 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.Shuffle|: scope_334: Setting throwable in reportException with message [scope_334: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true] from thread [Fetcher_O {scope_334} #0
2017-07-11 13:35:14,468 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: copy(3 (spillsFetched=3) of 4. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
2017-07-11 13:35:14,468 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: Shutting down fetchers for input: scope_334, shutdown timetaken: 0 ms, hasFetcherExecutorStopped: true
2017-07-11 13:35:14,468 [INFO] [ShuffleAndMergeRunner {scope_334}] |orderedgrouped.ShuffleScheduler|: scope_334: Interrupted while waiting for host and hasBeenShutdown. Breaking out of ShuffleSchedulerCallable loop
2017-07-11 13:35:14,469 [INFO] [ShuffleAndMergeRunner {scope_334}] |orderedgrouped.ShuffleScheduler|: Shutting down FetchScheduler for input: scope_334, wasInterrupted=true
2017-07-11 13:35:14,469 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: scope_334: Already shutdown. Ignoring fetch complete
2017-07-11 13:35:14,469 [ERROR] [ShuffleAndMergeRunner {scope_334}] |orderedgrouped.Shuffle|: scope_334: ShuffleRunner failed with error
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {scope_334} #0
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:304)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:286)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: scope_334: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1021)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:762)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:379)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:261)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    ... 5 more
{code}
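To make the final ERROR line easier to read, here is a minimal sketch of how a health check over the logged fields could decide that the reducer should fail itself. This is not the actual ShuffleScheduler.isShuffleHealthy code. The class name, the parameter names, the way the flags are combined, and the maxFailedUniqueFetches value of 5 are all assumptions chosen only to be consistent with the values in the log.

```java
// Hypothetical, simplified sketch; NOT the actual Tez implementation.
public class ShuffleHealthSketch {

    /**
     * Returns false when the downstream task would give up and fail itself.
     * The way the flags are combined here is an assumption based on the
     * field names printed in the ERROR line above.
     */
    public static boolean isShuffleHealthy(int failedUniqueInputs,
                                           int pendingInputs,
                                           int maxFailedUniqueFetches, // assumed threshold
                                           boolean fetcherHealthy,
                                           boolean reducerProgressedEnough,
                                           boolean reducerStalled) {
        // Either many distinct inputs have failed, or every input that is
        // still pending has failed at least once.
        boolean tooManyFailures = failedUniqueInputs >= maxFailedUniqueFetches
                || failedUniqueInputs == pendingInputs;
        boolean unhealthy = tooManyFailures
                && !fetcherHealthy
                && (!reducerProgressedEnough || reducerStalled);
        return !unhealthy;
    }

    public static void main(String[] args) {
        // Values from the log: failureCounts=1, pendingInputs=1,
        // fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true.
        boolean healthy = isShuffleHealthy(1, 1, 5, false, true, true);
        System.out.println("healthy=" + healthy); // prints "healthy=false"
    }
}
```

Under these assumptions, a single failing input that is also the last pending input (copy 3 of 4 in the log) is enough to trip the check, which matches the task blaming itself while only READTIMEDOUT_HOST was misbehaving.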
> Single node can cause Tez job to fail during shuffle
> ----------------------------------------------------
>
> Key: TEZ-3910
> URL: https://issues.apache.org/jira/browse/TEZ-3910
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.1
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: TEZ-3910.001.patch, TEZ-3910.002.patch,
> TEZ-3910.003.patch, TEZ-3910.004.patch, TEZ-3910.005.patch
>
>
> There is a race where a downstream task that is running into fetch failures
> due to bad output from the upstream task can continue to blame itself for the
> failure before the AM can do a re-run of the upstream offending task and fix
> the fetch failure. This causes the DAG to fail even if a single node fails.