[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258955#comment-15258955
 ] 

Apache Spark commented on SPARK-13693:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12712

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258867#comment-15258867
 ] 

Josh Rosen commented on SPARK-13693:


I think that I've spotted the race:

If the stopping thread enters the synchronized block in 
{{CheckpointWriter.stop()}}, it will block the writer thread, who needs to 
synchronize on the same lock to access {{fs}}.

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258856#comment-15258856
 ] 

Josh Rosen commented on SPARK-13693:


Yep, there's definitely a race condition here:

{code}
16/04/26 11:14:04.451 block-manager-slave-async-thread-pool-0 INFO 
BlockManager: Removing RDD 89
16/04/26 11:14:04.451 JobGenerator INFO JobGenerator: Checkpointing graph for 
time 7000 ms
16/04/26 11:14:04.451 JobGenerator INFO DStreamGraph: Updating checkpoint data 
for time 7000 ms
16/04/26 11:14:04.451 JobGenerator INFO DStreamGraph: Updated checkpoint data 
for time 7000 ms
16/04/26 11:14:04.452 JobGenerator INFO CheckpointWriter: Submitted checkpoint 
of time 7000 ms writer queue
16/04/26 11:14:04.452 pool-1787-thread-1 INFO CheckpointWriter: Saving 
checkpoint for time 7000 ms to file 
'file:/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/streaming/checkpoint/spark-dcb9046e-7dc0-440b-a406-f02883f3ae3b/checkpoint-7000'
16/04/26 11:14:14.452 pool-1-thread-1-ScalaTest-running-MapWithStateSuite INFO 
CheckpointWriter: CheckpointWriter executor terminated ? false, waited for 
1 ms.
16/04/26 11:14:14.452 pool-1-thread-1-ScalaTest-running-MapWithStateSuite INFO 
JobGenerator: Stopped JobGenerator
16/04/26 11:14:14.452 pool-1-thread-1-ScalaTest-running-MapWithStateSuite INFO 
JobScheduler: Stopped JobScheduler
16/04/26 11:14:14.453 pool-1-thread-1-ScalaTest-running-MapWithStateSuite INFO 
StreamingContext: StreamingContext stopped successfully
16/04/26 11:14:14.453 pool-1-thread-1-ScalaTest-running-MapWithStateSuite INFO 
MapWithStateSuite: 

= FINISHED o.a.s.streaming.MapWithStateSuite: 'mapWithState - basic 
operations with advanced API' =

16/04/26 11:14:14.781 pool-1787-thread-1 WARN CheckpointWriter: Error in 
attempt 1 of writing checkpoint to 
file:/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/streaming/checkpoint/spark-dcb9046e-7dc0-440b-a406-f02883f3ae3b/checkpoint-7000
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.util.Shell.runCommand(Shell.java:541)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656)
at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490)
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462)
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
at 
org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:224)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/04/26 11:14:14.781 pool-1787-thread-1 WARN CheckpointWriter: Could not write 
checkpoint for time 7000 ms to file 
file:/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/streaming/checkpoint/spark-dcb9046e-7dc0-440b-a406-f02883f3ae3b/checkpoint-7000'
16/04/26 11:14:14.786 dispatcher-event-loop-7 INFO 
MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
{code}

It looks like the CheckpointWriter doesn't finish writing before we return from 
the shutdown. There's code to wait but the wait duration is limited to 10 
seconds, which doesn't seem to be enough. From CheckpointWriter:

{code}
def stop(): Unit = synchronized {
if (stopped) return

executor.shutdown()
val startTime = System.currentTimeMillis()
val terminated = executor.awaitTermination(10, 
java.util.concurrent.TimeUnit.SECONDS)
if (!terminated) {
  executor.shutdownNow()
}
val endTime = System.currentTimeMillis()
logInfo("CheckpointWriter executor terminated ? " + terminated +
  ", waited for " + (endTime - startTime) + " ms.")
stopped = true
  }
{code}

I wonder if the shutdown process is causing the CheckpointWriter thread to get 
blocked / locked in a way that prevents it from finishing in time. This is a 
unit test with tiny data, so there's no way that the checkpoint should take 10 
seconds to write.

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
>  

[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258809#comment-15258809
 ] 

Josh Rosen commented on SPARK-13693:


I was able to reproduce some of this flakiness locally and noticed that adding 
a {{Thread.sleep(1000)}} before deleting the checkpoint directory seemed to fix 
things. Based on this, my guess is that the existing code is still prone to a 
race condition.

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256054#comment-15256054
 ] 

Sean Owen commented on SPARK-13693:
---

Weird, because this file hasn't otherwise changed since the problem reappeared. 
It could have been triggered by something else -- or maybe was triggered by 
something else all along. One solution is to ignore this exception, since 
there's not much point failing the test over this.

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-24 Thread Hemant Bhanawat (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255881#comment-15255881
 ] 

Hemant Bhanawat commented on SPARK-13693:
-

Latest Jenkins builds are failing with this issue. See:
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2863
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2865

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181303#comment-15181303
 ] 

Apache Spark commented on SPARK-13693:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11531

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org