[ https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-9328:
------------------------------
    Target Version/s:   (was: 1.6.0)

> Netty IO layer should implement read timeouts
> ---------------------------------------------
>
>                 Key: SPARK-9328
>                 URL: https://issues.apache.org/jira/browse/SPARK-9328
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.1, 1.3.1
>            Reporter: Josh Rosen
>            Priority: Blocker
>             Fix For: 1.4.0
>
>
> Spark's network layer does not implement read timeouts, which may lead to
> stalls during shuffle: if a remote shuffle server stalls while responding to
> a shuffle block fetch request but does not close the socket, then the job may
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
> The tricky part of working on this will be figuring out the right place to
> add the handler and ensuring that we don't introduce performance issues by
> failing to re-use sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a
> request - it only cares whether data has been read from the socket. If your
> connection is persistent, and you only want read timeouts to fire when a
> request has been sent, you'll need to build a request / response aware
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles, then we may
> have to do something like this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
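The request/response-aware timeout the quoted answer describes could be tracked with logic roughly like the following stdlib-only sketch. This is not Spark's actual implementation; the class and method names are hypothetical, and in a real Netty pipeline this bookkeeping would live in a channel handler. Unlike a plain ReadTimeoutHandler, the idle clock here only runs while at least one request is outstanding, so persistent connections that sit idle between shuffles are not torn down:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a request/response-aware read timeout.
// A plain read timeout fires whenever no bytes arrive for N ms;
// this one only treats silence as a timeout while a request is
// actually in flight.
class RequestAwareTimeout {
    private final long timeoutMillis;
    private final AtomicInteger outstanding = new AtomicInteger();
    private final AtomicLong lastActivityMillis = new AtomicLong();

    RequestAwareTimeout(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void requestSent(long nowMillis) {
        // Start the idle clock when the first in-flight request begins.
        if (outstanding.getAndIncrement() == 0) {
            lastActivityMillis.set(nowMillis);
        }
    }

    void responseReceived(long nowMillis) {
        // Any data read from the socket resets the idle clock.
        lastActivityMillis.set(nowMillis);
        outstanding.decrementAndGet();
    }

    boolean isTimedOut(long nowMillis) {
        // Idle connections with no pending requests never time out,
        // so connections can be kept alive between shuffles.
        return outstanding.get() > 0
            && nowMillis - lastActivityMillis.get() > timeoutMillis;
    }
}
```

A periodic checker (in Netty, typically a scheduled task on the event loop) would call isTimedOut and close the channel with an exception when it returns true.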