[ https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shixiong Zhu updated SPARK-19617: --------------------------------- Description: The streaming thread in StreamExecution uses the following ways to check if it should exit: - Catch an InterruptException. - `StreamExecution.state` is TERMINATED. when starting and stopping a query quickly, the above two checks may both fail. - Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084] and swallow InterruptException - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252] changes the state from `TERMINATED` to `ACTIVE`. If the above cases both happen, the query will hang forever. was: Saw the following exception in some test log: {code} 17/02/14 21:20:10.987 stream execution thread for this_query [id = 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining on: Thread[Thread-48,5,main] java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1249) at java.lang.Thread.join(Thread.java:1323) at org.apache.hadoop.util.Shell.joinThread(Shell.java:626) at org.apache.hadoop.util.Shell.runCommand(Shell.java:577) at org.apache.hadoop.util.Shell.run(Shell.java:479) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773) at org.apache.hadoop.util.Shell.execCommand(Shell.java:866) at org.apache.hadoop.util.Shell.execCommand(Shell.java:849) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733) at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385) at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.<init>(HDFSMetadataLog.scala:75) at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.<init>(CompactibleFileStreamLog.scala:46) at org.apache.spark.sql.execution.streaming.FileStreamSourceLog.<init>(FileStreamSourceLog.scala:36) at org.apache.spark.sql.execution.streaming.FileStreamSource.<init>(FileStreamSource.scala:59) at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257) at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141) at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191) {code} This is the cause of some test timeout failures on Jenkins. > Fix the race condition when starting and stopping a query quickly > ----------------------------------------------------------------- > > Key: SPARK-19617 > URL: https://issues.apache.org/jira/browse/SPARK-19617 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.0.2, 2.1.0 > Reporter: Shixiong Zhu > Assignee: Shixiong Zhu > > The streaming thread in StreamExecution uses the following ways to check if > it should exit: > - Catch an InterruptException. > - `StreamExecution.state` is TERMINATED. > when starting and stopping a query quickly, the above two checks may both > fail. > - Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084] and > swallow InterruptException > - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then > [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252] > changes the state from `TERMINATED` to `ACTIVE`. > If the above cases both happen, the query will hang forever. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org