[ https://issues.apache.org/jira/browse/TEZ-4460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor resolved TEZ-4460. ------------------------------- Resolution: Fixed > Read timed out in shuffle handler - incorrect usage of EMPTY_LAST_CONTENT and > channel write > ------------------------------------------------------------------------------------------- > > Key: TEZ-4460 > URL: https://issues.apache.org/jira/browse/TEZ-4460 > Project: Apache Tez > Issue Type: Bug > Affects Versions: 0.10.2 > Reporter: Shailesh Gupta > Assignee: László Bodor > Priority: Major > Fix For: 0.10.3 > > Attachments: nm_logs_6_45-6_53.zip, > syslog_attempt_1669954729098_0007_1_01_000003_0.gz, > syslog_attempt_1669954729098_0007_1_01_000005_0.gz > > Time Spent: 3.5h > Remaining Estimate: 0h > > We are seeing issue in tez shuffle handle in tez-0.10.2. Would need help in > debugging it. > We are running TPC-DS 1TB scale queries (q7) on EC2 with 2 worker nodes of > M5.8xl. We are migrating tez 0.9.2 to 0.10.2. Things are working fine on tez > 0.9.2 but in tez 0.10.2 , we are seeing these logs while using tez shuffle > handler. > {{2022-11-30 13:19:26627 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] > |orderedgrouped.FetcherOrderedGrouped|: }} > {{345 2022-11-30 13:19:26628 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] > |orderedgrouped.FetcherOrderedGrouped|: Read timed out}} > {{java.net.SocketInputStream.socketRead0(Native Method) }} > {{java.net.SocketInputStream.socketRead(SocketInputStream.java:116) }} > {{java.net.SocketInputStream.read(SocketInputStream.java:171)}} > {{java.net.SocketInputStream.read(SocketInputStream.java:141)}} > {{java.io.BufferedInputStream.read1(BufferedInputStream.java:284)}} > {{java.io.BufferedInputStream.read(BufferedInputStream.java:345)}} > {{java.io.FilterInputStream.read(FilterInputStream.java:133)}} > {{sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)}} > {{java.io.BufferedInputStream.fill(BufferedInputStream.java:246)}} > {{java.io.BufferedInputStream.read(BufferedInputStream.java:265)}} > {{java.io.DataInputStream.readByte(DataInputStream.java:265)}} > {{org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:310)}} > {{org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:331)}} > {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:447)}} > {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:286)}} > {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:186)}} > {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:198)}} > {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:61)}} > {{org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)}} > {{com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)}} > {{com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)}} > {{com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)}} > {{java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}} > {{java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}} > {{java.lang.Thread.run(Thread.java:750)]}} > {{2022-11-30 13:19:26628 [WARN] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] > |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from XXXX:13563 failed > retry it.}} > > In the node manager logs, i can see and other logs > > {{2022-12-01 13:52:35,198 DEBUG auxservices.ShuffleHandler: Fetcher request > verfied. > enc_str=13563/mapOutput?job=job_1669871261925_0011&dag=1&reduce=8-11&attempt_1669871261925_0011_1_02_000016_1_10005....}} > > There could be some issue in closing the connections while upgrading netty > version. > The case is reproducible every time, let me know if you need more logs or > want me to check something. > > > cc: [~abstractdog] > -- This message was sent by Atlassian Jira (v8.20.10#820010)