[ https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025107#comment-15025107 ]
Michael Armbrust commented on SPARK-9328:
-----------------------------------------

[~joshrosen] is this actually a 1.6 blocker?

> Netty IO layer should implement read timeouts
> ---------------------------------------------
>
>                 Key: SPARK-9328
>                 URL: https://issues.apache.org/jira/browse/SPARK-9328
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.1, 1.3.1, 1.4.1, 1.5.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> Spark's network layer does not implement read timeouts, which may lead to
> stalls during shuffle: if a remote shuffle server stalls while responding to
> a shuffle block fetch request but does not close the socket, then the job may
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
> The tricky part of working on this will be figuring out the right place to
> add the handler and ensuring that we don't introduce performance issues by
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a
> request - it only cares whether data has been read from the socket. If your
> connection is persistent, and you only want read timeouts to fire when a
> request has been sent, you'll need to build a request / response aware
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may
> have to do something like this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
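The stall the description talks about can be reproduced with plain {{java.net}} sockets, independent of Netty: a client reading from a stalled peer blocks indefinitely unless a read timeout is set. The sketch below is purely illustrative (the {{ReadTimeoutDemo}} class and its stalled-server thread are hypothetical, not Spark code); it uses {{Socket.setSoTimeout}} to stand in for the per-read timeout that {{ReadTimeoutHandler}} would provide in a Netty pipeline.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    // Returns true if the client's read timed out instead of blocking forever.
    public static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            // The "stalled shuffle server": accepts the connection but never
            // writes a response and never closes the socket.
            Thread stalled = new Thread(() -> {
                try (Socket ignored = server.accept()) {
                    Thread.sleep(10_000);
                } catch (Exception ignored2) {
                    // connection torn down by the client; nothing to do
                }
            });
            stalled.setDaemon(true);
            stalled.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                client.setSoTimeout(timeoutMillis); // read timeout on this socket
                InputStream in = client.getInputStream();
                try {
                    in.read(); // without the timeout, this blocks indefinitely
                    return false;
                } catch (SocketTimeoutException expected) {
                    return true; // the read failed fast instead of hanging
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTimesOut(500)); // prints "true"
    }
}
{code}

With Netty the equivalent guard would be installed once per channel (e.g. {{pipeline.addFirst(new ReadTimeoutHandler(seconds))}}); the open question raised above - firing only while a request is outstanding on a persistent connection - is exactly what the plain socket-level timeout cannot express.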