[ https://issues.apache.org/jira/browse/SPARK-21991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-21991:
------------------------------------

    Assignee: Apache Spark

> [LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine has very high load
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21991
>                 URL: https://issues.apache.org/jira/browse/SPARK-21991
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 2.0.2, 2.1.0, 2.1.1, 2.2.0
>         Environment: Single node machine running Ubuntu 16.04.2 LTS (4.4.0-79-generic)
> YARN 2.7.2
> Spark 2.0.2
>            Reporter: Andrea Zito
>            Assignee: Apache Spark
>            Priority: Minor
>
> The way the _LauncherServer_ _acceptConnections_ thread schedules client
> timeouts causes the thread to die (non-deterministically) with the following
> exception if the machine is under very high load:
> {noformat}
> Exception in thread "LauncherServer-1" java.lang.IllegalStateException: Task already scheduled or cancelled
>         at java.util.Timer.sched(Timer.java:401)
>         at java.util.Timer.schedule(Timer.java:193)
>         at org.apache.spark.launcher.LauncherServer.acceptConnections(LauncherServer.java:249)
>         at org.apache.spark.launcher.LauncherServer.access$000(LauncherServer.java:80)
>         at org.apache.spark.launcher.LauncherServer$1.run(LauncherServer.java:143)
> {noformat}
> The issue is related to the order of the actions that the _acceptConnections_
> thread performs to handle a client connection:
> # create timeout action
> # create client thread
> # start client thread
> # schedule timeout action
> Under normal conditions, the timeout action is scheduled before the client
> thread has a chance to start. If the machine is under very high load,
> however, the client thread can receive CPU time before the timeout action
> gets scheduled.
> If this happens, the client thread cancels the timeout action (which has not
> yet been scheduled) and continues. As soon as the _acceptConnections_ thread
> gets the CPU back, it tries to schedule the timeout action (which has already
> been cancelled), which raises the exception.
> Changing the order in which the client thread is started and the timeout is
> scheduled seems to be sufficient to fix this issue.
> As stated above, the issue is non-deterministic. I hit it multiple times on a
> single-node machine while submitting a large number of short jobs
> sequentially, but I couldn't easily create a test that reproduces it.
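
The race and the suggested reordering can be illustrated with plain java.util.Timer. What follows is only a minimal sketch, not the actual LauncherServer code: the class TimeoutOrderingSketch, the method handleClient and the 10-second delay are hypothetical names and values chosen for illustration, and a join() on the client thread stands in for the high load that lets the client run before the timeout is scheduled.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical stand-in for the accept loop; this is NOT the real LauncherServer.
class TimeoutOrderingSketch {

    /**
     * Handles one simulated client. Returns true if the timeout could be
     * scheduled, false if Timer rejected it with IllegalStateException.
     */
    static boolean handleClient(Timer timer, boolean scheduleBeforeStart) {
        TimerTask timeout = new TimerTask() {
            @Override
            public void run() {
                // A real server would close the idle connection here.
            }
        };
        // The simulated client cancels its timeout as soon as it runs.
        Thread client = new Thread(() -> timeout.cancel());
        try {
            if (scheduleBeforeStart) {
                // Reordered: schedule first, so a later cancel() is harmless.
                timer.schedule(timeout, 10_000L);
                client.start();
                join(client);
            } else {
                // Reported order; join() forces the unlucky interleaving in
                // which the client cancels the task before schedule() is called.
                client.start();
                join(client);
                timer.schedule(timeout, 10_000L);
            }
            return true;
        } catch (IllegalStateException e) {
            // Timer refuses a task that was cancelled before being scheduled.
            return false;
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Timer timer = new Timer("timeouts", true); // daemon timer thread
        System.out.println("start, then schedule: " + handleClient(timer, false));
        System.out.println("schedule, then start: " + handleClient(timer, true));
        timer.cancel();
    }
}
```

With the interleaving forced this way, the first call reports false (the Timer throws IllegalStateException) and the second reports true. In a real server the failing interleaving only occurs occasionally, which matches the non-deterministic behaviour described in the report.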