NoSuchElementException: key not found
Hi,

I was working on a simple task (running locally): reading a file (35 MB) with about 200 features and training a random forest with 5 trees of depth 5. While saving the result with:

    predictions.select("VisitNumber", "probability")
      .write.format("json") // tried different formats
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .save("finalResult2")

I get an error:

    java.util.NoSuchElementException: key not found: -2.379675967804967E-16

(Stack trace below.) Just for your info, I was not getting this error earlier; it started some time ago and I am not able to get rid of it.

My Spark config is simple:

    val conf = new SparkConf()
      .setAppName("walmart")
      .setMaster("local[2]")

== Physical Plan ==
Project [VisitNumber#52,UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)) AS features#81,UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59))) AS indexedFeatures#167,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59 AS rawPrediction#168,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59) AS probability#169,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59) AS prediction#170,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)) AS predictedLabel#171]
 SortBasedAggregate(key=[VisitNumber#52], functions=[(ConcatenateString(DepartmentDescription#56),mode=Final,isDistinct=false)], output=[VisitNumber#52,Agg-DepartmentDescription#59])
  ConvertToSafe
   TungstenSort [VisitNumber#52 ASC], false, 0
    TungstenExchange hashpartitioning(VisitNumber#52)
     ConvertToUnsafe
      SortBasedAggregate(key=[VisitNumber#52], functions=[(ConcatenateString(DepartmentDescription#56),mode=Partial,isDistinct=false)], output=[VisitNumber#52,concatenate#65])
       ConvertToSafe
        TungstenSort [VisitNumber#52 ASC], false, 0
         TungstenProject [VisitNumber#52,DepartmentDescription#56]
          Scan CsvRelation(src/main/resources/test.csv,true,,,",null,#,PERMISSIVE,COMMONS,false,false,null,UTF-8,true)[VisitNumber#52,Weekday#53,Upc#54L,ScanCount#55,DepartmentDescription#56,FinelineNumber#57]

# STACK TRACE #
java.util.NoSuchElementException: key not found: -2.379675967804967E-16
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:58)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:58)
  at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:308)
  at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:307)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
  at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:307)
  at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:301)
  at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:343)
  at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:343)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
  at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
  at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:241)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
  at org.apache.spark.scheduler.Task.run(Task.scala:88)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
15/11/10 11:35:05 ERROR DefaultWriterContainer: Task attempt attempt_201511101135_0068_m_00_0 aborted.
15/11/10 11:35:05 ERROR
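For context, here is a minimal sketch (not the poster's code; column names, the feature columns, and the train/test split are assumptions) of the kind of Spark 1.5 ML pipeline the physical plan and stack trace point at. The failing lookup happens inside the VectorIndexerModel transform UDF (VectorIndexer.scala:307-308), which suggests a feature value that was never seen while fitting the indexer is missing from its category map.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
    import org.apache.spark.sql.DataFrame

    // featureDf: the aggregated DataFrame with a numeric "label" column (assumed, not shown in the thread)
    def trainAndPredict(featureDf: DataFrame): DataFrame = {
      val assembler = new VectorAssembler()
        .setInputCols(Array("feat1", "feat2"))     // hypothetical raw feature columns
        .setOutputCol("features")
      val indexer = new VectorIndexer()
        .setInputCol("features")
        .setOutputCol("indexedFeatures")
        .setMaxCategories(10)
      val rf = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("indexedFeatures")
        .setNumTrees(5)
        .setMaxDepth(5)

      // If the indexer is fitted on one slice of the data and later applied to rows containing
      // a feature value its category map has never seen (for example a tiny floating-point
      // residue such as -2.379675967804967E-16 produced upstream), transform throws
      // java.util.NoSuchElementException: key not found: <value>.
      val Array(train, test) = featureDf.randomSplit(Array(0.7, 0.3))
      val model = new Pipeline().setStages(Array(assembler, indexer, rf)).fit(train)
      model.transform(test)
    }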
Re: NoSuchElementException: key not found
Any suggestions, anyone? Using version 1.5.1.

Regards,
Ankush Khanna

On Nov 10, 2015, at 11:37 AM, Ankush Khanna wrote:
> Hi, I was working on a simple task (running locally): reading a file (35 MB) with about 200 features and training a random forest with 5 trees of depth 5. While saving the result with predictions.select("VisitNumber", "probability").write.format("json").mode(SaveMode.Overwrite).option("header", "true").save("finalResult2") I get an error:
> java.util.NoSuchElementException: key not found: -2.379675967804967E-16
> ...
Re: NoSuchElementException: key not found when changing the window length and interval in Spark Streaming
Hi,
I get exactly the same problem here. Have you found the cause?
Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchElementException-key-not-found-when-changing-the-window-lenght-and-interval-in-Spark-Streaming-tp9010p9283.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
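For readers following the subject line, here is a minimal sketch (the input source, batch interval, and durations are assumptions, not taken from this thread) of the kind of windowed operation being changed. Both the window length and the slide interval must be multiples of the batch interval.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations on older Spark versions

    val conf = new SparkConf().setAppName("window-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))        // 2-second batches (assumed)
    val lines = ssc.socketTextStream("localhost", 9999)     // hypothetical input source
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)) // window length 30 s, slide interval 10 s
    counts.print()
    ssc.start()
    ssc.awaitTermination()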
Re: NoSuchElementException: key not found when changing the window length and interval in Spark Streaming
This bug has been fixed. Either use the master branch of Spark, or maybe wait a few days for Spark 1.0.1 to be released (voting has successfully closed).

TD

On Thu, Jul 10, 2014 at 2:33 AM, richiesgr richie...@gmail.com wrote:
> Hi,
> I get exactly the same problem here. Have you found the cause?
> Thanks
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchElementException-key-not-found-when-changing-the-window-lenght-and-interval-in-Spark-Streaming-tp9010p9283.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NoSuchElementException: key not found
Hi Tathagata,

I'm seeing the same issue on a load run overnight with Kafka streaming (6,000 msgs/sec) and a 500 ms batch size. Again, it is occasional and only happens after a few hours, I believe. I'm using updateStateByKey.

Any comment will be very welcome.

Thanks,
Rod

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchElementException-key-not-found-tp6743p7157.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
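Since updateStateByKey is mentioned, here is a minimal sketch of that pattern (this is not Rod's job; the stream name, state type, and checkpoint path are assumptions). Stateful DStream operations require a checkpoint directory.

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations on older Spark versions
    import org.apache.spark.streaming.dstream.DStream

    // wordStream: a DStream[(String, Int)] built from the Kafka input elsewhere (assumed)
    def runningTotals(ssc: StreamingContext, wordStream: DStream[(String, Int)]): DStream[(String, Long)] = {
      ssc.checkpoint("hdfs:///tmp/checkpoints")             // hypothetical checkpoint directory
      def update(newValues: Seq[Int], state: Option[Long]): Option[Long] =
        Some(state.getOrElse(0L) + newValues.sum)
      wordStream.updateStateByKey(update)                   // per-key running totals across batches
    }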
Re: NoSuchElementException: key not found
I think I know what is going on! This is probably a race condition in the DAGScheduler. I have added a JIRA for this; the fix is not trivial though: https://issues.apache.org/jira/browse/SPARK-2002

A not-so-good workaround for now would be to not use coalesced RDDs, which avoids the race condition.

TD

On Tue, Jun 3, 2014 at 10:09 AM, Michael Chang m...@tellapart.com wrote:
> I only had the warning-level logs, unfortunately. There were no other references to 32855 (except a repeated stack trace, I believe). I'm using Spark 0.9.1.
>
> On Mon, Jun 2, 2014 at 5:50 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
>> Do you have the info-level logs of the application? Can you grep for the value 32855 to find any references to it? Also, what version of Spark are you using (so that I can match the stack trace, it does not seem to match Spark 1.0)?
>>
>> TD
>>
>> On Mon, Jun 2, 2014 at 3:27 PM, Michael Chang m...@tellapart.com wrote:
>>> Hi all,
>>> I'm seeing a random exception kill my Spark Streaming job. Here's the stack trace:
>>> java.util.NoSuchElementException: key not found: 32855
>>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>>> ...
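To make the workaround concrete, here is a hedged sketch (the stream name and partition count are assumptions) of what avoiding coalesced RDDs can look like: both DStream.repartition and RDD.coalesce/repartition build CoalescedRDD partitions under the hood, and computing those partitions is where the SPARK-2002 race is hit.

    import org.apache.spark.streaming.dstream.DStream

    def persistWithoutCoalescing(stream: DStream[String]): Unit = {
      // val rebalanced = stream.repartition(8)  // builds a CoalescedRDD every batch -- avoid until the fix lands
      val rebalanced = stream                     // workaround: keep the receiver's original partitioning
      rebalanced.foreachRDD { rdd =>
        // rdd.coalesce(1) here would reintroduce a CoalescedRDD as well
        rdd.foreachPartition(partition => partition.foreach(println))
      }
    }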
Re: NoSuchElementException: key not found
Hi Tathagata,

Thanks for your help! By not using coalesced RDDs, do you mean not repartitioning my DStream?

Thanks,
Mike

On Tue, Jun 3, 2014 at 12:03 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
> I think I know what is going on! This is probably a race condition in the DAGScheduler. I have added a JIRA for this; the fix is not trivial though: https://issues.apache.org/jira/browse/SPARK-2002
> A not-so-good workaround for now would be to not use coalesced RDDs, which avoids the race condition.
> TD
>
> On Tue, Jun 3, 2014 at 10:09 AM, Michael Chang m...@tellapart.com wrote:
>> I only had the warning-level logs, unfortunately. There were no other references to 32855 (except a repeated stack trace, I believe). I'm using Spark 0.9.1.
>> ...
Re: NoSuchElementException: key not found
I am not sure what DStream operations you are using, but some operation is internally creating CoalescedRDDs, and that is what is causing the race condition. I might be able to help if you can tell me which DStream operations you are using.

TD

On Tue, Jun 3, 2014 at 4:54 PM, Michael Chang m...@tellapart.com wrote:
> Hi Tathagata,
> Thanks for your help! By not using coalesced RDDs, do you mean not repartitioning my DStream?
> Thanks,
> Mike
>
> On Tue, Jun 3, 2014 at 12:03 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
>> I think I know what is going on! This is probably a race condition in the DAGScheduler. I have added a JIRA for this; the fix is not trivial though: https://issues.apache.org/jira/browse/SPARK-2002
>> A not-so-good workaround for now would be to not use coalesced RDDs, which avoids the race condition.
>> TD
>> ...
NoSuchElementException: key not found
Hi all,

I'm seeing a random exception kill my Spark Streaming job. Here's the stack trace:

java.util.NoSuchElementException: key not found: 32855
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:58)
  at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
  at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:211)
  at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1072)
  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:716)
  at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:172)
  at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:189)
  at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:188)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:351)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
  at org.apache.spark.rdd.PartitionCoalescer$LocationIterator.init(CoalescedRDD.scala:183)
  at org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:234)
  at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:333)
  at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:81)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
  at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:31)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
  at org.apache.spark.rdd.RDD.take(RDD.scala:830)
  at org.apache.spark.api.java.JavaRDDLike$class.take(JavaRDDLike.scala:337)
  at org.apache.spark.api.java.JavaRDD.take(JavaRDD.scala:27)
  at com.tellapart.manifolds.spark.ManifoldsUtil$PersistToKafkaFunction.call(ManifoldsUtil.java:87)
  at com.tellapart.manifolds.spark.ManifoldsUtil$PersistToKafkaFunction.call(ManifoldsUtil.java:53)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:270)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:270)
  at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:520)
  at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:520)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)

It doesn't seem to happen consistently, but I have no idea what causes it. Has anyone seen this before? The PersistToKafkaFunction here is just trying to write the elements of an RDD
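For what it's worth, here is a rough Scala sketch (names are assumptions; the original PersistToKafkaFunction is Java and not shown) of the pattern visible in the stack trace: foreachRDD calls take() on each batch RDD, and take() computes the RDD's partitions on the driver, which is where the coalesced lineage reaches DAGScheduler.getCacheLocs and fails.

    import org.apache.spark.streaming.dstream.DStream

    def persistSample(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        val sample = rdd.take(10)                  // forces partition computation on the driver
        sample.foreach(record => sendToKafka(record))
      }

    def sendToKafka(record: String): Unit = ???    // hypothetical stand-in for the actual Kafka write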
Re: NoSuchElementException: key not found
Do you have the info-level logs of the application? Can you grep for the value 32855 to find any references to it? Also, what version of Spark are you using (so that I can match the stack trace, it does not seem to match Spark 1.0)?

TD

On Mon, Jun 2, 2014 at 3:27 PM, Michael Chang m...@tellapart.com wrote:
> Hi all,
> I'm seeing a random exception kill my Spark Streaming job. Here's the stack trace:
> java.util.NoSuchElementException: key not found: 32855
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> ...