[jira] [Commented] (SPARK-29003) Spark history server startup hang due to deadlock

2019-09-06 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924581#comment-16924581
 ] 

Jungtaek Lim commented on SPARK-29003:
--

Thanks for providing the jstack output. This looks like a known JDK issue, but 
since the fix only landed in a much later JDK version, I agree we may need to 
apply a workaround in Spark.

[https://bugs.openjdk.java.net/browse/JDK-8194653]
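
One possible shape for such a workaround, as a minimal sketch rather than the 
change Spark adopted: touch FileSystems.getDefault() once on a single thread 
before any concurrent startup work begins, so the one-time initialization 
behind it completes without contention. The class and method names 
(FileSystemWarmUp, warmUp, warmUpThenRunConcurrently) are hypothetical, and 
whether this fully avoids the JDK issue is an assumption.

```java
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Spark.
final class FileSystemWarmUp {
    private FileSystemWarmUp() {}

    /** Forces initialization of the default file system on the calling thread. */
    static FileSystem warmUp() {
        return FileSystems.getDefault();
    }

    /**
     * Warms up on the calling thread first, then lets worker threads call
     * FileSystems.getDefault() concurrently. Returns true if every worker
     * saw the same, already-initialized instance.
     */
    static boolean warmUpThenRunConcurrently(int workers) throws Exception {
        FileSystem expected = warmUp();
        Callable<FileSystem> task = FileSystems::getDefault;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<FileSystem>> results = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                results.add(pool.submit(task));
            }
            for (Future<FileSystem> result : results) {
                // Bounded wait so a hang surfaces as a TimeoutException.
                if (result.get(10, TimeUnit.SECONDS) != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(warmUpThenRunConcurrently(2));
    }
}
```

The bounded `get` is a deliberate choice: if initialization ever did hang, the 
caller would see a timeout instead of blocking forever.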

> Spark history server startup hang due to deadlock
> -
>
> Key: SPARK-29003
> URL: https://issues.apache.org/jira/browse/SPARK-29003
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: shanyu zhao
> Priority: Major
> Attachments: sparkhistory-jstack.log
>
>
> Occasionally when starting Spark History Server, the service process hangs 
> before binding to the port, so Spark History Server is unusable. One has to 
> kill the process and start it again. A simple bash script that repeatedly 
> stops and starts Spark History Server reproduces the problem approximately 
> 10% of the time.
> The problem is a deadlock caused by java.nio.file.FileSystems.getDefault(). 
> This is what I collected with jstack:
> {code:java}
> "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 nid=0x6e8 in Object.wait() [0x7fcaa9471000]
>    java.lang.Thread.State: RUNNABLE
>     at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
>     ...
>     at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>     - locked <0xaaac1d40> (a java.lang.Runtime)
>     ...
>     at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)
>
> "main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for monitor entry [0x7fcae146c000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at java.lang.Runtime.loadLibrary0(Runtime.java:862)
>     - waiting to lock <0xaaac1d40> (a java.lang.Runtime)
>     ...
>     at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
>     at java.io.File.toPath(File.java:2234)
>     - locked <0x8699bb68> (a java.io.File)
>     ...
>     at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365)
> {code}
> Basically, the "main" thread and the "log-replay-executor-0" thread call 
> java.nio.file.FileSystems.getDefault() simultaneously and deadlock. 
> This is similar to the reported JDK bug:
> [https://bugs.openjdk.java.net/browse/JDK-8037567]
> The problem is that during Spark History Server startup, two things happen 
> simultaneously that both call into 
> java.nio.file.FileSystems.getDefault():
> 1) start jetty server
> 2) start ApplicationHistoryProvider (which reads files from HDFS)
> We should do these two things sequentially instead of in parallel.
>  
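
The sequential ordering proposed above could be sketched as follows. This is an 
illustration under assumptions, not the actual History Server code: 
SequentialStartup and its two Runnable parameters are hypothetical stand-ins 
for starting Jetty and starting the ApplicationHistoryProvider.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Spark.
final class SequentialStartup {
    private SequentialStartup() {}

    /**
     * Runs the first step to completion on the calling thread, and only then
     * hands the second step to a background thread. The returned Future
     * completes when the second step has finished.
     */
    static Future<?> startSequentially(Runnable startWebServer, Runnable startProvider) {
        // Step 1: finishes before any background work is scheduled.
        startWebServer.run();

        // Step 2: submitted only after step 1 has returned.
        ExecutorService background = Executors.newSingleThreadExecutor();
        Future<?> providerStarted = background.submit(startProvider);
        // No further tasks are accepted; the submitted one still runs.
        background.shutdown();
        return providerStarted;
    }

    public static void main(String[] args) throws Exception {
        startSequentially(
            () -> System.out.println("web server started"),
            () -> System.out.println("history provider started")).get();
    }
}
```

Because the second step is submitted only after the first returns, the two 
steps can never be inside their first FileSystems.getDefault() call at the same 
time, at the cost of a slightly longer startup.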



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29003) Spark history server startup hang due to deadlock

2019-09-06 Thread shanyu zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924567#comment-16924567
 ] 

shanyu zhao commented on SPARK-29003:
-

Please see the full jstack attached.




[jira] [Commented] (SPARK-29003) Spark history server startup hang due to deadlock

2019-09-05 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923877#comment-16923877
 ] 

Jungtaek Lim commented on SPARK-29003:
--

Could you provide the full jstack output? Unless you have modified Spark, I 
guess there should be no lines that need redacting. The full output would make 
the whole picture much clearer.
