[ 
https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29003:
----------------------------------
    Affects Version/s: 2.3.4

> Spark history server startup hang due to deadlock
> -------------------------------------------------
>
>                 Key: SPARK-29003
>                 URL: https://issues.apache.org/jira/browse/SPARK-29003
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.4, 2.4.4
>            Reporter: shanyu zhao
>            Priority: Major
>         Attachments: sparkhistory-jstack.log
>
>
> Occasionally, when starting Spark History Server, the service process hangs 
> before binding to the port, so Spark History Server is not usable. One has 
> to kill the process and start it again. With a simple bash script that 
> repeatedly stops and starts Spark History Server, the problem reproduces 
> approximately 10% of the time.
> The hang is caused by a deadlock in java.nio.file.FileSystems.getDefault(). 
> This is what I collected with jstack:
> {code:java}
> "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x00007fca90028800 
> nid=0x6e8 in Object.wait() [0x00007fcaa9471000]
>     java.lang.Thread.State: RUNNABLE 
>     at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
>     ... 
>     at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked 
> <0x00000000aaac1d40> (a java.lang.Runtime) 
>     ... 
>     at 
> org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)
> "main" #1 prio=5 os_prio=0 tid=0x00007fcad8016800 nid=0x6d8 waiting for 
> monitor entry [0x00007fcae146c000]
>     java.lang.Thread.State: BLOCKED (on object monitor) 
>     at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock 
> <0x00000000aaac1d40> (a java.lang.Runtime) 
>     ... 
>     at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
>     at java.io.File.toPath(File.java:2234) - locked <0x000000008699bb68> (a 
> java.io.File) 
>     ... 
>     at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code}
> Basically, the "main" thread and the "log-replay-executor-0" thread call 
> java.nio.file.FileSystems.getDefault() simultaneously and deadlock. 
> This is similar to the reported JDK bug:
> [https://bugs.openjdk.java.net/browse/JDK-8037567]
> The problem is that during Spark History Server startup, two things happen 
> simultaneously that both call into 
> java.nio.file.FileSystems.getDefault():
> 1) start jetty server
> 2) start ApplicationHistoryProvider (which reads files from HDFS)
> We should do these two things sequentially instead of in parallel.
>  
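
The sequential ordering proposed above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Spark change: the class SequentialStartupSketch and the methods startWebServer, startHistoryProvider, and runStartup are hypothetical stand-ins for the Jetty and ApplicationHistoryProvider startup paths. It assumes that touching FileSystems.getDefault() once on a single thread, and only then starting the background work, avoids two threads racing through the default file system's first-time initialization.

```java
import java.io.File;
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SequentialStartupSketch {
    // Records the order in which the startup steps complete.
    static final List<String> events =
        Collections.synchronizedList(new ArrayList<>());

    // Hypothetical stand-in for starting the web server on the main thread.
    // File.toPath() goes through FileSystems.getDefault(), as in the trace.
    static void startWebServer() {
        new File(".").toPath();
        events.add("web-server-started");
    }

    // Hypothetical stand-in for the history provider's background work.
    // The caller blocks until the background task has finished.
    static void startHistoryProvider(ExecutorService pool) throws Exception {
        Future<?> task = pool.submit(() -> {
            FileSystems.getDefault();
            events.add("provider-started");
        });
        task.get(30, TimeUnit.SECONDS);
    }

    // Runs the steps one after another instead of in parallel.
    static List<String> runStartup() throws Exception {
        events.clear();
        // Initialize the default file system while only one thread is active.
        FileSystems.getDefault();
        events.add("filesystem-initialized");
        startWebServer();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // The background thread starts only after the web server is up.
            startHistoryProvider(pool);
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runStartup());
    }
}
```

The sketch trades a little startup latency for a fixed ordering: the background executor is not handed any work until the main thread has finished its own file system access.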



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
