[ https://issues.apache.org/jira/browse/SPARK-17624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das reassigned SPARK-17624:
-------------------------------------

    Assignee: Tathagata Das

> Flaky test? StateStoreSuite maintenance
> ---------------------------------------
>
>                 Key: SPARK-17624
>                 URL: https://issues.apache.org/jira/browse/SPARK-17624
>             Project: Spark
>          Issue Type: Test
>          Components: Tests
>    Affects Versions: 2.0.1, 2.1.0
>            Reporter: Adam Roberts
>            Assignee: Tathagata Das
>            Priority: Minor
>
> I've noticed this test failing consistently (25x in a row) on a two-core 
> machine but not on an eight-core machine.
> If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 
> being the default in Spark), the test reliably passes. We can also gain 
> reliability by setting the master to anything other than just "local".
> Is there a reason spark.rpc.numRetries is set to 1?
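> For illustration, this is the kind of change I mean (just a sketch of the 
> test's SparkConf wiring; the exact place this is configured in StateStoreSuite 
> may differ):
> {code}
> import org.apache.spark.SparkConf
> 
> // Sketch only: raise the retry count the test currently pins to 1, and/or use
> // an explicit local[n] master instead of plain "local".
> val conf = new SparkConf()
>   .setMaster("local[2]")             // anything other than plain "local" also made the test reliable
>   .setAppName("StateStoreSuite")
>   .set("spark.rpc.numRetries", "2")  // test currently uses 1; Spark's default is 3
> {code}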
> I see this failure is also mentioned here, so it's been flaky for a while: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html
> If we run without the "quietly" code, so we get debug info:
> {code}
> 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error 
> sending message [message = 
> VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)]
>  in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>         at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>         at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>         at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>         at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
>         at scala.Option.map(Option.scala:146)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197)
>         at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:785)
> Caused by: org.apache.spark.SparkException: Could not find 
> StateStoreCoordinator.
>         at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
>         at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:129)
>         at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:225)
>         at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:508)
>         at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>         ... 19 more
> 16:26:15.217 WARN org.apache.spark.sql.execution.streaming.state.StateStore: 
> Error managing StateStore[id = (op=0, part=0), dir = 
> /home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb/0/0],
>  stopping management thread
> - maintenance *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 636 times 
> over 10.009220005000001 seconds. Last failure message: 
> StateStoreSuite.this.fileExists(provider, 1L, false) was true earliest file 
> not deleted. (StateStoreSuite.scala:396)
> Run completed in 19 seconds, 181 milliseconds.
> Total number of tests run: 9
> Suites: completed 2, aborted 0
> Tests: succeeded 8, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {code}
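> For reference, the assertion that times out is roughly an eventually block 
> along these lines; this is only a sketch, fileExists is the suite's own helper 
> (the third argument looks like a snapshot flag, but that's my guess):
> {code}
> import org.scalatest.concurrent.Eventually._
> import org.scalatest.time.SpanSugar._
> 
> // Sketch of the failing check around StateStoreSuite.scala:396: the maintenance
> // thread should eventually delete the earliest delta file, but with
> // spark.rpc.numRetries = 1 the coordinator lookup fails and cleanup never happens.
> eventually(timeout(10.seconds)) {
>   assert(!fileExists(provider, 1L, false), "earliest file not deleted")
> }
> {code}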
> With local[2] (presumably what .setMaster("local") resolves to on a two-core 
> box?) the test reliably passes, and the same holds for local[1] and local[4]. 
> For example:
> {code}
> StateStoreSuite:
> 16:17:34.379 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> - get, put, remove, commit, and all data iterator
> - updates iterator with all combos of updates and removes
> - cancel
> - getStore with unexpected versions
> - snapshotting
> - cleaning
> - corrupted file handling
> - StateStore.get
> In the maint test
> In eventually timeout 10
> Done assert 1
> Now doing assert 2
> Done assert 2
> 16:17:38.207 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error 
> sending message [message = 
> VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-24d82fd9-252e-444c-8951-c4e3dadd854f/닟述뮰⮤ꂟ-4a0e3477-59ca-4444-9181-8937e5f66d0c,0,0),driver)]
>  in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>         at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>         at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>         at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>         at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
>         at scala.Option.map(Option.scala:146)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197)
>         at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197)
>         at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:785)
> Caused by: org.apache.spark.SparkException: Could not find 
> StateStoreCoordinator.
>         at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
>         at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:129)
>         at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:225)
>         at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:508)
>         at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>         ... 19 more
> 16:17:38.209 WARN org.apache.spark.sql.execution.streaming.state.StateStore: 
> Error managing StateStore[id = (op=0, part=0), dir = 
> /home/aroberts/Spark-DK/sql/core/target/tmp/spark-24d82fd9-252e-444c-8951-c4e3dadd854f/닟述뮰⮤ꂟ-4a0e3477-59ca-4444-9181-8937e5f66d0c/0/0],
>  stopping management thread
> - maintenance
> Run completed in 9 seconds, 317 milliseconds.
> Total number of tests run: 9
> Suites: completed 2, aborted 0
> Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
> All tests passed.
> {code}
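> As for why a single retry is so fragile here: the askWithRetry frames in the 
> stack traces above retry the coordinator lookup up to spark.rpc.numRetries 
> times before giving up. Very roughly (a simplified sketch, not the real 
> RpcEndpointRef code):
> {code}
> // Simplified sketch of askWithRetry-style behaviour: with maxRetries = 1 a single
> // transient "Could not find StateStoreCoordinator" failure is fatal, whereas 2+
> // attempts can ride out the race that shows up on slower (two-core) machines.
> def askWithRetrySketch[T](ask: () => T, maxRetries: Int, retryWaitMs: Long): T = {
>   var attempts = 0
>   var lastError: Throwable = null
>   while (attempts < maxRetries) {
>     attempts += 1
>     try {
>       return ask()
>     } catch {
>       case e: Exception =>
>         lastError = e
>         Thread.sleep(retryWaitMs)
>     }
>   }
>   throw new RuntimeException(s"Error sending message in $attempts attempts", lastError)
> }
> {code}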


