[ 
https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-29003:
--------------------------------
    Description: 
Occasionally when starting Spark History Server, the service process hangs 
before binding to the port, so Spark History Server is not usable. One has to 
kill the process and start it again. A simple bash script that repeatedly stops 
and starts Spark History Server reproduces this problem approximately 10% of 
the time.

The problem is caused by a deadlock in java.nio.file.FileSystems.getDefault(). 
This is what I collected with jstack:
{code:java}
"log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x00007fca90028800 
nid=0x6e8 in Object.wait() [0x00007fcaa9471000]
    java.lang.Thread.State: RUNNABLE 
    at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
    ... 
    at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked 
<0x00000000aaac1d40> (a java.lang.Runtime) 
    ... 
    at 
org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)

"main" #1 prio=5 os_prio=0 tid=0x00007fcad8016800 nid=0x6d8 waiting for monitor 
entry [0x00007fcae146c000]
    java.lang.Thread.State: BLOCKED (on object monitor) 
    at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock 
<0x00000000aaac1d40> (a java.lang.Runtime) 
    ... 
    at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
    at java.io.File.toPath(File.java:2234) - locked <0x000000008699bb68> (a 
java.io.File) 
    ... 
    at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code}
Basically, the "main" thread and the "log-replay-executor-0" thread call 
java.nio.file.FileSystems.getDefault() simultaneously and deadlock. 
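
As a rough illustration of the shape of this hang, the sketch below makes two threads take two locks in opposite order and then asks the JVM to report the deadlock. It is only an analogy: the locks involved in the real trace are JDK internals, and the names runtimeLock, initLock, demo-main and demo-log-replay are hypothetical stand-ins, not anything taken from Spark or the JDK.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class LockOrderInversionDemo {
    // Hypothetical stand-ins for the two locks seen in the jstack output.
    static final Object runtimeLock = new Object();
    static final Object initLock = new Object();

    // Starts a daemon thread that takes `first`, waits until the other
    // thread also holds its first lock, and then tries to take `second`.
    static Thread lockInOrder(String name, Object first, Object second,
                              CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Never reached: the other thread holds `second`.
                }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit despite the stuck threads
        t.start();
        return t;
    }

    // Returns how many threads the JVM reports as monitor-deadlocked.
    static int countDeadlockedThreads() throws InterruptedException {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        lockInOrder("demo-main", initLock, runtimeLock, bothHoldFirst);
        lockInOrder("demo-log-replay", runtimeLock, initLock, bothHoldFirst);

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            long[] ids = threads.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            Thread.sleep(50);
        }
        return 0;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

Each thread ends up holding one lock while waiting for the other, which is the same circular wait the two stack traces above show.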

This is similar to the reported JDK bug:

[https://bugs.openjdk.java.net/browse/JDK-8037567]

The problem is that during Spark History Server startup, there are two things 
happening simultaneously that call into java.nio.file.FileSystems.getDefault():

1) start jetty server
2) start ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel.
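
A minimal sketch of the sequential ordering proposed above, assuming the two startup steps can be modelled as plain methods. This is not the actual Spark code or patch: startWebServer and replayLogs are hypothetical stand-ins for starting Jetty and for the provider's log replay. The point is only that the first call to java.nio.file.FileSystems.getDefault() completes on the main thread before any background thread can reach it.

```java
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SequentialStartup {
    // Records the order in which the two startup steps finish.
    static final List<String> events =
            Collections.synchronizedList(new ArrayList<>());

    // Hypothetical stand-in for starting the Jetty server on "main".
    static void startWebServer() {
        FileSystems.getDefault(); // default file system initialized here first
        events.add("web-server-started");
    }

    // Hypothetical stand-in for the provider's background log replay.
    static void replayLogs() {
        FileSystems.getDefault(); // already initialized, nothing to race on
        events.add("logs-replayed");
    }

    static List<String> startSequentially() throws Exception {
        events.clear();
        // Step 1 runs to completion on the calling thread.
        startWebServer();
        // Step 2 is handed to a background thread only after step 1 is done.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> replay = executor.submit(SequentialStartup::replayLogs);
            replay.get();
        } finally {
            executor.shutdown();
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(startSequentially());
    }
}
```

The wait on replay.get() is only there so the example finishes deterministically; what matters for avoiding the hang is that the background work is submitted after the first step has returned.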

 


> Spark history server startup hang due to deadlock
> -------------------------------------------------
>
>                 Key: SPARK-29003
>                 URL: https://issues.apache.org/jira/browse/SPARK-29003
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: shanyu zhao
>            Priority: Major
>
> Occasionally when starting Spark History Server, the service process hangs 
> before binding to the port, so Spark History Server is not usable. One has 
> to kill the process and start it again. A simple bash script that repeatedly 
> stops and starts Spark History Server reproduces this problem approximately 
> 10% of the time.
> The problem is caused by a deadlock in 
> java.nio.file.FileSystems.getDefault(). 
> This is what I collected with jstack:
> {code:java}
> "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x00007fca90028800 
> nid=0x6e8 in Object.wait() [0x00007fcaa9471000]
>     java.lang.Thread.State: RUNNABLE 
>     at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
>     ... 
>     at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked 
> <0x00000000aaac1d40> (a java.lang.Runtime) 
>     ... 
>     at 
> org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)
> "main" #1 prio=5 os_prio=0 tid=0x00007fcad8016800 nid=0x6d8 waiting for 
> monitor entry [0x00007fcae146c000]
>     java.lang.Thread.State: BLOCKED (on object monitor) 
>     at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock 
> <0x00000000aaac1d40> (a java.lang.Runtime) 
>     ... 
>     at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
>     at java.io.File.toPath(File.java:2234) - locked <0x000000008699bb68> (a 
> java.io.File) 
>     ... 
>     at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code}
> Basically, the "main" thread and the "log-replay-executor-0" thread call 
> java.nio.file.FileSystems.getDefault() simultaneously and deadlock. 
> This is similar to the reported JDK bug:
> [https://bugs.openjdk.java.net/browse/JDK-8037567]
> The problem is that during Spark History Server startup, there are two things 
> happening simultaneously that call into 
> java.nio.file.FileSystems.getDefault():
> 1) start jetty server
> 2) start ApplicationHistoryProvider (which reads files from HDFS)
> We should do these two things sequentially instead of in parallel.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
