[ 
https://issues.apache.org/jira/browse/TEZ-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4348:
------------------------------
    Description: 
The idea is the same as in TEZ-4336, but this issue covers the unordered codepaths.

An example of the problem is attached as 
[^org.apache.hadoop.hive.cli.TestMiniLlapCliDriver-output.txt]. 
It was discovered while I was working on a Hive ticket:
1. a qtest failed
2. there was no obvious Hive-related error
3. there were tons of messages in the logs like the one below:
{code}
2024-07-26T00:21:36,900  INFO [Fetcher_B {Map_1 -> Reducer_2} #0] impl.ShuffleManager: Map_1 -> Reducer_2: Fetch failed for src: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1721978473743_0001_8_00_000000_0_10129, spillType=0, spillId=-1] InputIdentifier: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1721978473743_0001_8_00_000000_0_10129, spillType=0, spillId=-1], connectFailed: true, local fetch: false, remote fetch failure reported as local failure: false)
{code}
4. after adding a log message to ShuffleManager, I found the following:
{code}
2024-07-25T03:28:15,352  WARN [Fetcher_B {Map_1 -> Reducer_2} #0] impl.ShuffleManager: Fetch failure
java.io.IOException: Failed to connect to http://lbodor-MBP16.local:0/mapOutput?job=job_1721903278713_0001&dag=8&reduce=0&map=attempt_1721903278713_0001_8_00_000000_0_10129, #connectionFailures=1
        at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:166) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
        at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:121) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
        at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:505) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
        at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:574) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
        at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:493) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
        at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:291) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
        at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:78) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) ~[tez-common-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) ~[guava-28.2-jre.jar:?]
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) ~[guava-28.2-jre.jar:?]
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) ~[guava-28.2-jre.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_292]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_292]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: java.net.ConnectException: Can't assign requested address (connect failed)
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_292]
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_292]
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_292]
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_292]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_292]
        at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_292]
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175) ~[?:1.8.0_292]
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) ~[?:1.8.0_292]
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ~[?:1.8.0_292]
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:1.8.0_292]
        at sun.net.www.http.HttpClient.New(HttpClient.java:339) ~[?:1.8.0_292]
        at sun.net.www.http.HttpClient.New(HttpClient.java:357) ~[?:1.8.0_292]
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226) ~[?:1.8.0_292]
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162) ~[?:1.8.0_292]
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056) ~[?:1.8.0_292]
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990) ~[?:1.8.0_292]
        at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:149) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
        ... 13 more
{code}

This eventually led to a DAG failure.

The expected behavior is:
1. log the exception, and/or
2. report the exception to the AM so that the AM can report it on DAG failure
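
The two options above could be sketched roughly as follows. This is only an illustration of the idea, not the actual Tez code: the class FetchFailureDiagnostics and the methods rootCause and describe are hypothetical names, and how the resulting string would be logged or sent to the AM is left open. The sketch only shows how the original exception and its root cause can be kept in the diagnostic text instead of being dropped.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.ConnectException;

// Hypothetical helper, not part of Tez: builds diagnostics that keep the original exception.
public class FetchFailureDiagnostics {

    // Walks the cause chain down to the innermost exception.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    // Returns a message containing the source, the root cause and the full stack trace.
    static String describe(String source, Throwable error) {
        StringWriter stack = new StringWriter();
        error.printStackTrace(new PrintWriter(stack, true));
        Throwable root = rootCause(error);
        return "Fetch failed for src: " + source
            + ", root cause: " + root.getClass().getName() + ": " + root.getMessage()
            + System.lineSeparator() + stack;
    }

    public static void main(String[] args) {
        // Simulated failure shaped like the one in the log above (messages shortened).
        IOException failure = new IOException("Failed to connect",
            new ConnectException("Can't assign requested address (connect failed)"));
        // Option 1: log the string. Option 2: pass the same string on as failure diagnostics.
        System.err.println(describe("example-source", failure));
    }
}
```

With something along these lines, the "Can't assign requested address" cause would be visible next to the "Fetch failed for src" message rather than only after adding extra logging by hand.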


> ShuffleManager should try to report the original exception
> ----------------------------------------------------------
>
>                 Key: TEZ-4348
>                 URL: https://issues.apache.org/jira/browse/TEZ-4348
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: 
> org.apache.hadoop.hive.cli.TestMiniLlapCliDriver-output.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
