[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-6209:
------------------------------
    Fix Version/s:     (was: 1.2.2)
                   1.2.3

> ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6209
>                 URL: https://issues.apache.org/jira/browse/SPARK-6209
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Critical
>             Fix For: 1.2.3, 1.3.1, 1.4.0
>
>
> ExecutorClassLoader does not ensure proper cleanup of network connections 
> that it opens.  If it fails to load a class, it may leak partially-consumed 
> InputStreams that are connected to the REPL's HTTP class server, causing that 
> server to exhaust its thread pool, which can cause the entire job to hang.
> Here is a simple reproduction:
> With
> {code}
> ./bin/spark-shell --master local-cluster[8,8,512] 
> {code}
> run the following command:
> {code}
> sc.parallelize(1 to 1000, 1000).map { x =>
>   try {
>     Class.forName("some.class.that.does.not.Exist")
>   } catch {
>     case e: Exception => // do nothing
>   }
>   x
> }.count()
> {code}
> This job will run 253 tasks, then will completely freeze without any errors 
> or failed tasks.
> It looks like the driver has 253 threads blocked in socketRead0() calls:
> {code}
> [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
>      253     759   14674
> {code}
> e.g.
> {code}
> "qtp1287429402-13" daemon prio=5 tid=0x00007f868a1c0000 nid=0x5b03 runnable 
> [0x00000001159bd000]
>    java.lang.Thread.State: RUNNABLE
>     at java.net.SocketInputStream.socketRead0(Native Method)
>     at java.net.SocketInputStream.read(SocketInputStream.java:152)
>     at java.net.SocketInputStream.read(SocketInputStream.java:122)
>     at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
>     at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
>     at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
>     at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
>     at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
>     at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>     at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>     at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>     at java.lang.Thread.run(Thread.java:745) 
> {code}
> Jstack on the executors shows blocking in loadClass / findClass: a single 
> thread is RUNNABLE, waiting to hear back from the driver, while the other 
> executor threads are BLOCKED on object monitor synchronization at 
> Class.forName0().
> Remotely triggering a GC on a hanging executor allows the job to progress and 
> complete more tasks before hanging again.  If I repeatedly trigger GC on all 
> of the executors, then the job runs to completion:
> {code}
> jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
> {code}
> The culprit is a {{catch}} block that ignores all exceptions and performs no 
> cleanup: 
> https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
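> For context, here is a condensed sketch of the problematic pattern (simplified 
> from the linked code; names like {{findClassLocally}}, {{readAndTransformClass}}, 
> and {{classUri}} follow the real class, but the details are approximated here):
> {code}
> def findClassLocally(name: String): Option[Class[_]] = {
>   try {
>     val pathInDirectory = name.replace('.', '/') + ".class"
>     // Opens a connection to the REPL's HTTP class server.
>     val inputStream = new URL(classUri + "/" + pathInDirectory).openStream()
>     val bytes = readAndTransformClass(name, inputStream)
>     inputStream.close()
>     Some(defineClass(name, bytes, 0, bytes.length))
>   } catch {
>     case e: Exception =>
>       // If openStream(), readAndTransformClass(), or defineClass() throws, the
>       // stream is never closed, so the class server's Jetty worker stays stuck
>       // in socketRead0() until GC on the executor finalizes the connection.
>       None
>   }
> }
> {code}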
> This bug has been present since Spark 1.0.0, but I suspect that we haven't 
> seen it before because it's pretty hard to reproduce. Triggering this error 
> requires a job with tasks that trigger ClassNotFoundExceptions yet are still 
> able to run to completion.  It also requires that executors are able to leak 
> enough open connections to exhaust the class server's Jetty thread pool 
> limit, which requires that there are a large number of tasks (253+) and 
> either a large number of executors or a very low amount of GC pressure on 
> those executors (since GC will cause the leaked connections to be closed).
> The fix here is pretty simple: add proper resource cleanup to this class.
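> A minimal sketch of that cleanup (assuming the stream handling shown above; 
> {{logError}} assumes the class mixes in Spark's Logging trait, and 
> java.io.{InputStream, IOException} are imported):
> {code}
> def findClassLocally(name: String): Option[Class[_]] = {
>   var inputStream: InputStream = null
>   try {
>     val pathInDirectory = name.replace('.', '/') + ".class"
>     inputStream = new URL(classUri + "/" + pathInDirectory).openStream()
>     val bytes = readAndTransformClass(name, inputStream)
>     Some(defineClass(name, bytes, 0, bytes.length))
>   } catch {
>     case e: Exception =>
>       // Log instead of silently swallowing the failure.
>       logError("Failed to load class " + name, e)
>       None
>   } finally {
>     // Always release the connection to the class server, on both the
>     // success and failure paths.
>     if (inputStream != null) {
>       try {
>         inputStream.close()
>       } catch {
>         case e: IOException => // ignore failures while closing
>       }
>     }
>   }
> }
> {code}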



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
