[ 
https://issues.apache.org/jira/browse/SPARK-41163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Kravchuk updated SPARK-41163:
------------------------------------
    Attachment: pom.xml

> Spark 3.2.2 storage.ShuffleBlockFetcherIterator and TransportResponseHandler 
> issue
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-41163
>                 URL: https://issues.apache.org/jira/browse/SPARK-41163
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, Deploy
>    Affects Versions: 3.2.2
>         Environment: * spark 3.2.2
>  * hadoop 3.1.2
>  * hive 3.1.1
>  * scala 2.12
>            Reporter: Dmitry Kravchuk
>            Priority: Major
>             Fix For: 3.2.3
>
>         Attachments: container_1668606650061_0087_01_000057.txt, pom.xml
>
>
> Hello there.
> I've built Spark 3.2.2 for my cluster, which runs Hadoop 3.1.2 and Scala 2.12 
> (pom.xml is attached).
> Build script:
>  
> {code:java}
> cd spark && \
> ./build/mvn -Pyarn -Dhadoop.version=3.1.2 -Pscala-2.12 -Phive \
>   -Phive-thriftserver -DskipTests clean package
> {code}
>  
> It was working fine, but a few applications have hit strange errors and 
> warnings from time to time.
> They always look like a lost datanode connection and shuffle read issues.
> {code:java}
> 2022-11-16 22:18:25,423 ERROR server.TransportChannelHandler: Connection to s00abd02node9.company.com/10.x.y.163:35143 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.shuffle.io.connectionTimeout if this is wrong.
> 2022-11-16 22:18:25,423 ERROR client.TransportResponseHandler: Still have 5 requests outstanding when connection from s00abd02node9.company.com/10.x.y.163:35143 is closed
> 2022-11-16 22:18:25,423 WARN netty.NettyBlockTransferService: Error while trying to get the host local dirs for [16]
> 2022-11-16 22:18:25,425 ERROR storage.ShuffleBlockFetcherIterator: Error occurred while fetching host local blocks
> {code}
> When this happens, the application is retried and fails after the second start.
> Can anybody help?
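
The first log line in the report suggests adjusting spark.shuffle.io.connectionTimeout. A minimal sketch of passing that property on the command line is shown here. The 300s value and the your-app.jar name are illustrative assumptions, not taken from the report, and nothing in the report confirms that a longer timeout resolves the failure.

```shell
# Illustrative value only; the report does not say which timeout is appropriate.
SHUFFLE_TIMEOUT="300s"

# Build the --conf flag named in the TransportChannelHandler log message.
CONF_FLAG="--conf spark.shuffle.io.connectionTimeout=${SHUFFLE_TIMEOUT}"

# Assemble and print the command instead of running it, so the sketch
# works without a cluster. your-app.jar is a hypothetical placeholder.
CMD="spark-submit ${CONF_FLAG} your-app.jar"
echo "${CMD}"
```

Printing the command first makes it easy to check the flag before submitting a real job.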



--
This message was sent by Atlassian Jira
(v8.20.10#820010)