[ 
https://issues.apache.org/jira/browse/SPARK-21991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216529#comment-16216529
 ] 

Andrew Ash commented on SPARK-21991:
------------------------------------

Thanks for debugging and diagnosing this [~nivox]! I'm seeing the same issue 
right now on one of my Spark clusters, so I'm interested in getting your fix 
into mainline Spark for my users.

Have you deployed the change from your linked PR in a live setting, and has it 
fixed the issue for you?

> [LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine 
> has very high load
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21991
>                 URL: https://issues.apache.org/jira/browse/SPARK-21991
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 2.0.2, 2.1.0, 2.1.1, 2.2.0
>         Environment: Single node machine running Ubuntu 16.04.2 LTS 
> (4.4.0-79-generic)
> YARN 2.7.2
> Spark 2.0.2
>            Reporter: Andrea Zito
>            Priority: Minor
>
> The way the _LauncherServer_ _acceptConnections_ thread schedules client 
> timeouts can cause the thread to die non-deterministically with the 
> following exception when the machine is under very high load:
> {noformat}
> Exception in thread "LauncherServer-1" java.lang.IllegalStateException: Task already scheduled or cancelled
>         at java.util.Timer.sched(Timer.java:401)
>         at java.util.Timer.schedule(Timer.java:193)
>         at org.apache.spark.launcher.LauncherServer.acceptConnections(LauncherServer.java:249)
>         at org.apache.spark.launcher.LauncherServer.access$000(LauncherServer.java:80)
>         at org.apache.spark.launcher.LauncherServer$1.run(LauncherServer.java:143)
> {noformat}
> The issue is related to the ordering of actions that the _acceptConnections_ 
> thread uses to handle a client connection:
> # create timeout action
> # create client thread
> # start client thread
> # schedule timeout action
> Under normal conditions the timeout action is scheduled before the client 
> thread has a chance to start. However, if the machine is under very high 
> load, the client thread can receive CPU time before the timeout action gets 
> scheduled.
> If this happens, the client thread cancels the timeout action (which has 
> not yet been scheduled) and goes on. As soon as the _acceptConnections_ 
> thread gets the CPU back, it tries to schedule the timeout action (which 
> has already been canceled), thus raising the exception.
> Changing the order in which the client thread gets started and the timeout 
> gets scheduled seems to be sufficient to fix this issue.
> As stated above, the issue is non-deterministic. I hit it multiple times 
> on a single-node machine while submitting a large number of short jobs 
> sequentially, but I couldn't easily create a test that reproduces it. 
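
For anyone following along, the ordering problem in the description can be sketched in isolation with plain java.util.Timer. This is not the LauncherServer code and not the change from the linked PR. The class and method names below are hypothetical, and the race is forced by joining the client thread before scheduling, which stands in for the high-load case where the client thread gets CPU time first.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical, standalone sketch; none of these names come from Spark.
class TimeoutOrderingSketch {

    // Stand-in for the per-connection timeout action.
    static TimerTask newTimeoutTask() {
        return new TimerTask() {
            @Override
            public void run() {
                // A real server would close the idle client connection here.
            }
        };
    }

    // Starts the thread and waits for it to finish, so the interleaving is fixed.
    static void runToCompletion(Thread t) {
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    // Reported order: start the client thread, then schedule the timeout.
    // Returns true if Timer.schedule rejects the already-cancelled task.
    static boolean startThenScheduleFails() {
        Timer timer = new Timer("sketch-timer", true);
        try {
            TimerTask timeout = newTimeoutTask();
            // The "client" cancels its timeout as soon as it runs.
            runToCompletion(new Thread(timeout::cancel));
            try {
                timer.schedule(timeout, 10_000L);
                return false;
            } catch (IllegalStateException e) {
                return true; // task was cancelled before it was scheduled
            }
        } finally {
            timer.cancel();
        }
    }

    // Proposed order: schedule the timeout, then start the client thread.
    // Returns true if no IllegalStateException is raised.
    static boolean scheduleThenStartSucceeds() {
        Timer timer = new Timer("sketch-timer", true);
        try {
            TimerTask timeout = newTimeoutTask();
            timer.schedule(timeout, 10_000L);
            // Cancelling a task that is already scheduled is always allowed.
            runToCompletion(new Thread(timeout::cancel));
            return true;
        } catch (IllegalStateException e) {
            return false;
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println("start-then-schedule fails: " + startThenScheduleFails());
        System.out.println("schedule-then-start succeeds: " + scheduleThenStartSucceeds());
    }
}
```

The first method hits the exception because a TimerTask that has been cancelled can no longer be passed to Timer.schedule. The second method avoids it because the task is already scheduled by the time the client thread can cancel it.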


