[ https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-29003:
----------------------------------
    Affects Version/s: 2.3.4

> Spark history server startup hang due to deadlock
> -------------------------------------------------
>
>                 Key: SPARK-29003
>                 URL: https://issues.apache.org/jira/browse/SPARK-29003
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.4, 2.4.4
>            Reporter: shanyu zhao
>            Priority: Major
>         Attachments: sparkhistory-jstack.log
>
>
> Occasionally, when starting Spark History Server, the service process
> hangs before binding to the port, so Spark History Server is not usable.
> One has to kill the process and start it again. A simple bash script that
> stops and starts Spark History Server in a loop reproduces the problem
> approximately 10% of the time.
> The problem is a deadlock in java.nio.file.FileSystems.getDefault().
> This is what I collected with jstack:
> {code:java}
> "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x00007fca90028800 nid=0x6e8 in Object.wait() [0x00007fcaa9471000]
>    java.lang.Thread.State: RUNNABLE
>         at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
>         ...
>         at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>         - locked <0x00000000aaac1d40> (a java.lang.Runtime)
>         ...
>         at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)
> "main" #1 prio=5 os_prio=0 tid=0x00007fcad8016800 nid=0x6d8 waiting for monitor entry [0x00007fcae146c000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.lang.Runtime.loadLibrary0(Runtime.java:862)
>         - waiting to lock <0x00000000aaac1d40> (a java.lang.Runtime)
>         ...
>         at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
>         at java.io.File.toPath(File.java:2234)
>         - locked <0x000000008699bb68> (a java.io.File)
>         ...
>         at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365)
> {code}
> Basically, the "main" thread and the "log-replay-executor-0" thread called
> java.nio.file.FileSystems.getDefault() simultaneously and deadlocked.
> This is similar to the reported JDK bug:
> [https://bugs.openjdk.java.net/browse/JDK-8037567]
> The problem is that during Spark History Server startup, two things happen
> simultaneously that call into java.nio.file.FileSystems.getDefault():
> 1) start jetty server
> 2) start ApplicationHistoryProvider (which reads files from HDFS)
> We should do these two things sequentially instead of in parallel.
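
The sequencing proposed in the description can be sketched as follows. This is a minimal illustration in plain Java, not the Spark History Server code and not necessarily the fix that was applied: the class SequentialStartup and the methods startWebServer and replayLogs are hypothetical stand-ins for the Jetty startup and the history provider's log replay. The only real APIs used are java.nio.file.FileSystems.getDefault() and java.util.concurrent. The idea is that the first call to FileSystems.getDefault() completes on the main thread before any background thread is started, so the two threads never race through the first-time initialization.

```java
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SequentialStartup {
    // Records the order in which the two startup steps ran.
    static final List<String> events =
        Collections.synchronizedList(new ArrayList<>());

    // Hypothetical stand-in for starting the web UI (Jetty in the report).
    // It touches FileSystems.getDefault(), as the "main" thread does in the jstack.
    static void startWebServer() {
        FileSystems.getDefault();
        events.add("web-server-started");
    }

    // Hypothetical stand-in for the history provider's log replay.
    // It also touches FileSystems.getDefault(), as "log-replay-executor-0" does.
    static void replayLogs() {
        FileSystems.getDefault();
        events.add("log-replay-started");
    }

    // Runs step 1 to completion on the calling thread, and only then
    // hands step 2 to a background thread.
    static List<String> run() {
        events.clear();
        startWebServer();

        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable task = SequentialStartup::replayLogs;
            // Waiting on the future is only for this demo, so the recorded
            // order can be inspected; a real server would not block here.
            pool.submit(task).get(30, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("log replay did not complete", e);
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The sketch only demonstrates the ordering. Whether sequencing alone is sufficient in Spark depends on which code paths reach FileSystems.getDefault() during startup, which the report does not enumerate.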