Hi, I have noticed some strange behavior of Spark core in combination with MLlib. Running my pipeline results in an RDD. Calling count() on this RDD returns 160055; calling count() again immediately afterwards returns 160044, and so on. The RDD seems to be unstable.
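To illustrate the kind of instability I mean, here is a plain-Python analogy (not Spark, and not my actual pipeline): an RDD is lazily recomputed on each action, so if any upstream step is non-deterministic, two count() calls can see different data. The names and numbers below are made up for illustration:

```python
import random

# A lazy "dataset": like an RDD, it is re-evaluated from scratch every
# time an action runs over it.
def lazy_pipeline():
    # Non-deterministic step with no fixed seed, standing in for a
    # transformation whose output differs across recomputations.
    return (x for x in range(100000) if random.random() < 0.5)

# Two "actions" over the same logical pipeline.
count1 = sum(1 for _ in lazy_pipeline())
count2 = sum(1 for _ in lazy_pipeline())

# count1 and count2 will almost certainly differ, mirroring the
# unstable RDD.count() behavior described above. In Spark, caching
# (rdd.cache() before the first action) would pin one materialization.
```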
How can that be? Do you have an explanation, or guidance for further investigation? I have been investigating for three days now and can't isolate the bug. Unfortunately, I can't provide a minimal working example using only Spark; at the moment I am trying to reproduce the bug using only the Spark API so I can hand it over to someone more experienced.

I noticed this behavior while investigating SPARK-5480. Building a graph and calculating the transitive closure on such an unstable RDD results in an IndexOutOfBoundsException: -1. My first suspicion is that org.apache.spark.mllib.rdd.RDDFunctions.sliding causes the problem; replacing my algorithm that uses the sliding window makes the problem go away.

The bug only occurs on large data sets; on small ones the pipeline works fine. That makes it hard to investigate, because every run takes several minutes. Generated data does not trigger the bug either. I haven't opened a Jira ticket yet because I can't tell how to reproduce it.

I'm running Spark 1.3.1 in standalone mode with HDFS on a 10-node cluster.

Thanks for your advice,
Niklas