NoSuchElementException: key not found

2015-11-10 Thread Ankush Khanna

Hi,

I was working on a simple task (running locally): just reading a file (35 MB) 
with about 200 features and building a random forest with 5 trees of depth 5. 
While saving the result with:
predictions.select("VisitNumber", "probability")
   .write.format("json") // tried different formats
   .mode(SaveMode.Overwrite)
   .option("header", "true")
   .save("finalResult2")

I get an error: java.util.NoSuchElementException: key not found: 
-2.379675967804967E-16 (Stack trace below)

Just for your info, I was not getting this error earlier; it started some time 
ago and I am not able to get rid of it. 
My Spark config is simple: 
val conf = new SparkConf()
  .setAppName("walmart")
  .setMaster("local[2]")
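
For context, the rest of the pipeline looks roughly like this. This is only a minimal sketch: the aggregation step, the label handling, and values such as maxCategories are placeholders from memory, not the exact code (the exception below is thrown inside the VectorIndexerModel transform).

// Minimal sketch of the pipeline (Spark 1.5.x ML API). Column names other than
// VisitNumber/DepartmentDescription and all parameter values are placeholders.
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorIndexer

val sc = new SparkContext(conf)              // conf as above
val sqlContext = new SQLContext(sc)

// CSV scan, as in the CsvRelation node of the plan below (spark-csv package)
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("src/main/resources/test.csv")

// ... aggregation per VisitNumber and assembly of the "features" vector and a
// "label" column (this is where the Agg-DepartmentDescription / UDF chain in
// the plan comes from); omitted here ...
val assembled: DataFrame = ???

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")           // VectorIndexerModel is what throws below
  .setMaxCategories(10)                      // placeholder value

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(5)
  .setMaxDepth(5)

val model = new Pipeline().setStages(Array(featureIndexer, rf)).fit(assembled)
val predictions = model.transform(assembled)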

== Physical Plan ==
Project 
[VisitNumber#52,UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)) AS 
features#81,UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59))) AS 
indexedFeatures#167,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59
 AS 
rawPrediction#168,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)
 AS 
probability#169,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)
 AS 
prediction#170,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59))
 AS predictedLabel#171]
   SortBasedAggregate(key=[VisitNumber#52], functions=   [  
(ConcatenateString(DepartmentDescription#56),mode=Final,isDistinct=false)], 
output=[VisitNumber#52,Agg-DepartmentDescription#59])
ConvertToSafe
TungstenSort [VisitNumber#52 ASC], false, 0
TungstenExchange hashpartitioning(VisitNumber#52)
ConvertToUnsafe
SortBasedAggregate(key=[VisitNumber#52], 
functions=[(ConcatenateString(DepartmentDescription#56),mode=Partial,isDistinct=false)],
 output=[VisitNumber#52,concatenate#65])
ConvertToSafe
TungstenSort [VisitNumber#52 ASC], false, 0
TungstenProject [VisitNumber#52,DepartmentDescription#56]
Scan 
CsvRelation(src/main/resources/test.csv,true,,,",null,#,PERMISSIVE,COMMONS,false,false,null,UTF-8,true)[VisitNumber#52,Weekday#53,Upc#54L,ScanCount#55,DepartmentDescription#56,FinelineNumber#57]


# STACK TRACE #
java.util.NoSuchElementException: key not found: -2.379675967804967E-16
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:308)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:307)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:307)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:301)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:343)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:343)
at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:241)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/11/10 11:35:05 ERROR DefaultWriterContainer: Task attempt 
attempt_201511101135_0068_m_00_0 aborted.
15/11/10 11:35:05 ERROR 

Re: NoSuchElementException: key not found

2015-11-10 Thread Ankush Khanna

Any suggestions, anyone?
Using version 1.5.1.

Regards
Ankush Khanna


Re: NoSuchElementException: key not found when changing the window length and interval in Spark Streaming

2014-07-10 Thread richiesgr
Hi

I get exactly the same problem here. Have you found the cause?
Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchElementException-key-not-found-when-changing-the-window-lenght-and-interval-in-Spark-Streaming-tp9010p9283.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NoSuchElementException: key not found when changing the window length and interval in Spark Streaming

2014-07-10 Thread Tathagata Das
This bug has been fixed. Either use the master branch of Spark, or maybe
wait a few days for Spark 1.0.1 to be released (voting has successfully
closed).

TD


On Thu, Jul 10, 2014 at 2:33 AM, richiesgr richie...@gmail.com wrote:

 Hi

 I get exactly the same problem here. Have you found the cause?
 Thanks






Re: NoSuchElementException: key not found

2014-06-06 Thread RodrigoB
Hi Tathagata,

I'm seeing the same issue on an overnight load run with Kafka streaming (6000
msgs/sec) and a 500 ms batch size. Again, it is occasional and, I believe, only
happens after a few hours.

I'm using updateStateByKey.
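
The relevant part of the job is roughly the following (a sketch only, not our actual code; conf, the Kafka parameters and keyOf are placeholders):

import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch of the setup described above: 500 ms batches, Kafka input, stateful counts.
val ssc = new StreamingContext(conf, Milliseconds(500))
ssc.checkpoint("/tmp/checkpoints")        // updateStateByKey needs checkpointing
val events = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
val totals = events
  .map { case (_, msg) => (keyOf(msg), 1L) }
  .updateStateByKey[Long]((values, state) => Some(state.getOrElse(0L) + values.sum))
totals.print()
ssc.start()
ssc.awaitTermination()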

Any comment will be very welcome.

Thanks,
Rod



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchElementException-key-not-found-tp6743p7157.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NoSuchElementException: key not found

2014-06-03 Thread Tathagata Das
I think I know what is going on! This is probably a race condition in the
DAGScheduler. I have added a JIRA for this. The fix is not trivial though.

https://issues.apache.org/jira/browse/SPARK-2002

A not-so-good workaround for now would be to not use coalesced RDDs, which
avoids the race condition.
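
Concretely, the idea is to avoid the operations that build CoalescedRDDs until the fix is in. A rough sketch (inputStream, parse and publish are placeholders, not from your code):

// Workaround sketch: keep the partitioning the receiver gives you and skip
// repartition()/coalesce(), both of which build a CoalescedRDD internally.
val processed = inputStream.map(parse)     // per-record processing as before
// processed.repartition(8)                // <- avoid for now
processed.foreachRDD { rdd =>
  rdd.take(100).foreach(publish)           // publish is a placeholder sink
}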

TD


On Tue, Jun 3, 2014 at 10:09 AM, Michael Chang m...@tellapart.com wrote:

 I only had the warning level logs, unfortunately.  There were no other
 references of 32855 (except a repeated stack trace, I believe).  I'm using
 Spark 0.9.1


 On Mon, Jun 2, 2014 at 5:50 PM, Tathagata Das tathagata.das1...@gmail.com
  wrote:

 Do you have the info level logs of the application? Can you grep the
 value 32855 to find any references to it? Also, what version of Spark are you
 using (so that I can match the stack trace; it does not seem to match
 Spark 1.0)?

 TD



Re: NoSuchElementException: key not found

2014-06-03 Thread Michael Chang
Hi Tathagata,

Thanks for your help! By not using coalesced RDDs, do you mean not
repartitioning my DStream?

Thanks,
Mike




On Tue, Jun 3, 2014 at 12:03 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 I think I know what is going on! This is probably a race condition in the
 DAGScheduler. I have added a JIRA for this. The fix is not trivial though.

 https://issues.apache.org/jira/browse/SPARK-2002

 A not-so-good workaround for now would be to not use coalesced RDDs, which
 avoids the race condition.

 TD


 On Tue, Jun 3, 2014 at 10:09 AM, Michael Chang m...@tellapart.com wrote:

 I only had the warning level logs, unfortunately.  There were no other
 references of 32855 (except a repeated stack trace, I believe).  I'm using
 Spark 0.9.1



Re: NoSuchElementException: key not found

2014-06-03 Thread Tathagata Das
I am not sure what DStream operations you are using, but some operation is
internally creating CoalescedRDDs. That is causing the race condition. I might
be able to help if you can tell me which DStream operations you are using.
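
For example (sketch only, with a placeholder stream), any of these puts a CoalescedRDD into the lineage, which you can confirm with rdd.toDebugString:

// Each of these introduces a CoalescedRDD under the hood:
stream.repartition(4)                                 // DStream.repartition
stream.transform(rdd => rdd.coalesce(4))              // coalesce inside transform
stream.foreachRDD(rdd => rdd.coalesce(4).take(10))    // coalesce inside foreachRDD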

TD


On Tue, Jun 3, 2014 at 4:54 PM, Michael Chang m...@tellapart.com wrote:

 Hi Tathagata,

 Thanks for your help! By not using coalesced RDDs, do you mean not
 repartitioning my DStream?

 Thanks,
 Mike




 On Tue, Jun 3, 2014 at 12:03 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 I think I know what is going on! This is probably a race condition in the
 DAGScheduler. I have added a JIRA for this. The fix is not trivial though.

 https://issues.apache.org/jira/browse/SPARK-2002

 A not-so-good workaround for now would be to not use coalesced RDDs, which
 avoids the race condition.

 TD



NoSuchElementException: key not found

2014-06-02 Thread Michael Chang
Hi all,

Seeing a random exception kill my Spark Streaming job. Here's a stack
trace:

java.util.NoSuchElementException: key not found: 32855
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at
org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:211)
at
org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1072)
at
org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:716)
at
org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:172)
at
org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:189)
at
org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:188)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:351)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at
org.apache.spark.rdd.PartitionCoalescer$LocationIterator.init(CoalescedRDD.scala:183)
at
org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:234)
at
org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:333)
at
org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:81)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at
org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:31)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.RDD.take(RDD.scala:830)
at
org.apache.spark.api.java.JavaRDDLike$class.take(JavaRDDLike.scala:337)
at org.apache.spark.api.java.JavaRDD.take(JavaRDD.scala:27)
at
com.tellapart.manifolds.spark.ManifoldsUtil$PersistToKafkaFunction.call(ManifoldsUtil.java:87)
at
com.tellapart.manifolds.spark.ManifoldsUtil$PersistToKafkaFunction.call(ManifoldsUtil.java:53)
at
org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:270)
at
org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:270)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:520)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:520)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

It doesn't seem to happen consistently, but I have no idea what causes it. Has
anyone seen this before? The PersistToKafkaFunction here is just trying to
write the elements in an RDD 
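
(For reference, the structure is roughly the Scala equivalent of the following; the actual code is Java and internal, so sendToKafka here is just a placeholder:)

// Rough Scala equivalent of the job structure around PersistToKafkaFunction.
stream.foreachRDD { rdd =>
  // take() forces partition computation; that is where the CoalescedRDD in the
  // trace above is built and where the DAGScheduler lookup fails
  val batch = rdd.take(100)
  batch.foreach(record => sendToKafka(record))
}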

Re: NoSuchElementException: key not found

2014-06-02 Thread Tathagata Das
Do you have the info level logs of the application? Can you grep the value
32855 to find any references to it? Also, what version of Spark are you using
(so that I can match the stack trace; it does not seem to match Spark 1.0)?

TD

