shanyu zhao created SPARK-29003:
-----------------------------------

             Summary: Spark history server startup hang due to deadlock
                 Key: SPARK-29003
                 URL: https://issues.apache.org/jira/browse/SPARK-29003
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
            Reporter: shanyu zhao


Occasionally, when starting Spark History Server, the service process hangs
before binding to the port, so Spark History Server is unusable. One has to
kill the process and start it again. A simple bash script that stops and
starts Spark History Server reproduces the problem approximately 10% of the
time.

The problem is caused by a deadlock in java.nio.file.FileSystems.getDefault().
This is what I collected with jstack:
{code:java}
"log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x00007fca90028800 nid=0x6e8 in Object.wait() [0x00007fcaa9471000]
   java.lang.Thread.State: RUNNABLE
        at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
        ...
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        - locked <0x00000000aaac1d40> (a java.lang.Runtime)
        ...
        at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)

"main" #1 prio=5 os_prio=0 tid=0x00007fcad8016800 nid=0x6d8 waiting for monitor entry [0x00007fcae146c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.lang.Runtime.loadLibrary0(Runtime.java:862)
        - waiting to lock <0x00000000aaac1d40> (a java.lang.Runtime)
        ...
        at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
        at java.io.File.toPath(File.java:2234)
        - locked <0x000000008699bb68> (a java.io.File)
        ...
        at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365)
{code}
Basically, the "main" thread and the "log-replay-executor-0" thread call
java.nio.file.FileSystems.getDefault() simultaneously and deadlock.

This is similar to the reported JDK bug:

[https://bugs.openjdk.java.net/browse/JDK-8037567]
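One plausible reading of the trace is a classic lock cycle: "log-replay-executor-0" holds the java.lang.Runtime monitor and waits for the default file system to finish initializing, while "main" is inside that initialization and waits for the Runtime monitor. The sketch below is only a simplified model of that cycle, not the JDK code: the two ReentrantLock objects are hypothetical stand-ins for the Runtime monitor and the initialization in progress, and the class and method names are made up for illustration. A timed tryLock replaces the blocking acquire so the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class LockCycleSketch {
    // Hypothetical stand-ins for the two resources seen in the trace.
    static final ReentrantLock runtimeMonitor = new ReentrantLock();
    static final ReentrantLock fileSystemInit = new ReentrantLock();

    /** Runs both threads and returns how many failed to get their second lock. */
    static int run() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger stuck = new AtomicInteger();

        // Models the log replay thread: Runtime monitor first, then file system init.
        Thread logReplay = new Thread(
            () -> acquireInOrder(runtimeMonitor, fileSystemInit, bothHoldFirst, bothTried, stuck),
            "log-replay-stand-in");
        // Models the main thread: file system init first, then Runtime monitor.
        Thread mainLike = new Thread(
            () -> acquireInOrder(fileSystemInit, runtimeMonitor, bothHoldFirst, bothTried, stuck),
            "main-stand-in");

        logReplay.start();
        mainLike.start();
        try {
            logReplay.join();
            mainLike.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stuck.get();
    }

    static void acquireInOrder(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                               AtomicInteger stuck) {
        first.lock();
        try {
            // Wait until the other thread also holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // A plain second.lock() here would block forever; the timeout keeps the demo finite.
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                stuck.incrementAndGet();
            }
            // Keep holding the first lock until both attempts have finished.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads that could not proceed: " + run());
    }
}
```

Both threads time out because each one holds the lock the other needs. In the real process there is no timeout, so the same cycle shows up as a permanent hang.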

The problem is that during Spark History Server startup, there are two things 
happening simultaneously that call into java.nio.file.FileSystems.getDefault():

1) start jetty server
2) start ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel.
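The sequential ordering proposed above could look roughly like the following sketch. It is not the Spark code: startWebServer and replayEventLogs are hypothetical stand-ins for starting Jetty and starting the history provider, and each simply touches FileSystems.getDefault() as the real steps do according to the trace. The point is only that the first call completes on the starting thread before any background thread is allowed to reach the second.

```java
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SequentialStartupSketch {
    static final List<String> events = Collections.synchronizedList(new ArrayList<>());

    // Hypothetical stand-in for starting the Jetty server.
    static void startWebServer() {
        FileSystems.getDefault();
        events.add("web-server-started");
    }

    // Hypothetical stand-in for the provider's background log replay.
    static void replayEventLogs() {
        FileSystems.getDefault();
        events.add("log-replay-done");
    }

    /** Runs the two startup steps one after the other and returns the observed order. */
    static List<String> start() {
        events.clear();
        // Step 1 finishes on the calling thread before any replay thread exists.
        startWebServer();

        // Step 2 starts only afterwards, so the two calls can no longer overlap.
        ExecutorService replayPool = Executors.newSingleThreadExecutor();
        try {
            Runnable replay = SequentialStartupSketch::replayEventLogs;
            // Waiting on the result is only here to make the demo deterministic.
            replayPool.submit(replay).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            replayPool.shutdown();
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(start());
    }
}
```

The trade-off is a slightly longer startup, since the log replay no longer overlaps with server startup.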

 



