[ https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-9328:
-----------------------------------

    Assignee: Apache Spark  (was: Josh Rosen)

> Netty IO layer should implement read timeouts
> ---------------------------------------------
>
>                 Key: SPARK-9328
>                 URL: https://issues.apache.org/jira/browse/SPARK-9328
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>            Reporter: Josh Rosen
>            Assignee: Apache Spark
>
> Spark's network layer does not implement read timeouts, which may lead to stalls during shuffle: if a remote shuffle server stalls while responding to a shuffle block fetch request but does not close the socket, the job may block until an OS-level socket timeout occurs.
> I think we can fix this using Netty's ReadTimeoutHandler
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
> The tricky part will be figuring out the right place to add the handler and ensuring that we don't introduce performance issues by failing to re-use sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a request - it only cares whether data has been read from the socket. If your connection is persistent, and you only want read timeouts to fire when a request has been sent, you'll need to build a request / response aware timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may have to do something like this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
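[Editor's note] The "request / response aware timeout handler" the issue describes could be sketched as follows in plain Java. This is an illustrative stdlib sketch only, not Spark's or Netty's actual code; the class and method names (`RequestAwareTimeout`, `requestSent`, `responseReceived`) are hypothetical. The idea is to arm a timer when a request is written and disarm it when the matching response arrives, so a persistent connection that is simply idle between shuffles never trips the timeout:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a request/response-aware read timeout,
// per the StackOverflow answer quoted in the issue. In a real Netty
// pipeline this logic would live in a ChannelHandler; here it is
// reduced to a plain timer so the mechanism is easy to see.
class RequestAwareTimeout {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private final long timeoutMillis;
    // Non-null only while a request is outstanding.
    private volatile ScheduledFuture<?> pending;
    // Set by the timer task when the deadline passes with no response.
    volatile boolean timedOut = false;

    RequestAwareTimeout(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called when a request is written to the socket: arm the timer.
    void requestSent() {
        pending = scheduler.schedule(
            () -> timedOut = true, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Called when the matching response arrives: disarm the timer.
    // An idle connection with no request in flight can never time out.
    void responseReceived() {
        ScheduledFuture<?> p = pending;
        if (p != null) {
            p.cancel(false);
        }
        pending = null;
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

Unlike a bare ReadTimeoutHandler, which fires whenever the socket is silent, this design only considers the window between a request and its response, which is what would let Spark keep connections open between shuffles.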