Shailesh Gupta created TEZ-4460:
-----------------------------------
Summary: Read timed out in shuffle handler
Key: TEZ-4460
URL: https://issues.apache.org/jira/browse/TEZ-4460
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.10.2
Reporter: Shailesh Gupta
Here is my hypothesis, i could be wrong. Would need help in debugging it.
We are running TPC-DS 1TB scale queries (q7) on EC2 with 2 worker nodes of
M5.8xl. We are migrating tez 0.9.2 to 0.10.2. Things are working fine on tez
0.9.2 but in tez 0.10.2 , we are seeing these logs while using tez shuffle
handler.
{{2022-11-30 13:19:26627 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5]
|orderedgrouped.FetcherOrderedGrouped|: }}
{{345 2022-11-30 13:19:26628 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5]
|orderedgrouped.FetcherOrderedGrouped|: Read timed out}}
{{java.net.SocketInputStream.socketRead0(Native Method) }}
{{java.net.SocketInputStream.socketRead(SocketInputStream.java:116) }}
{{java.net.SocketInputStream.read(SocketInputStream.java:171)}}
{{java.net.SocketInputStream.read(SocketInputStream.java:141)}}
{{java.io.BufferedInputStream.read1(BufferedInputStream.java:284)}}
{{java.io.BufferedInputStream.read(BufferedInputStream.java:345)}}
{{java.io.FilterInputStream.read(FilterInputStream.java:133)}}
{{sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)}}
{{java.io.BufferedInputStream.fill(BufferedInputStream.java:246)}}
{{java.io.BufferedInputStream.read(BufferedInputStream.java:265)}}
{{java.io.DataInputStream.readByte(DataInputStream.java:265)}}
{{org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:310)}}
{{org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:331)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:447)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:286)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:186)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:198)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:61)}}
{{org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)}}
{{com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)}}
{{com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)}}
{{com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)}}
{{java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}}
{{java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}}
{{java.lang.Thread.run(Thread.java:750)]}}
{{2022-11-30 13:19:26628 [WARN] [Fetcher_O \{Map_4 -> Reducer_5} #5]
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from XXXX:13563 failed
retry it.}}
In the node manager logs, i can see and other logs
{{2022-12-01 13:52:35,198 DEBUG auxservices.ShuffleHandler: Fetcher request
verfied.
enc_str=13563/mapOutput?job=job_1669871261925_0011&dag=1&reduce=8-11&attempt_1669871261925_0011_1_02_000016_1_10005....}}
There could be some issue in closing the connections while upgrading netty
version.
The case is reproducible every time, let me know if you need more logs or want
me to check something.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)