It says TID 478548 on host 172.18.152.36 failed with java.lang.ArrayIndexOutOfBoundsException. Can you try putting a try { } catch around all the operations that you are doing on the DStream? That way corrupt data or a missing field will not stop the entire application.
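A minimal sketch of what I mean, assuming your records are comma-separated strings from Kafka (the names and field layout here are made up):

    import org.apache.spark.streaming.dstream.DStream

    // `lines` is your DStream[String] coming from Kafka (hypothetical name)
    def safeParse(lines: DStream[String]): DStream[(String, Double)] =
      lines.flatMap { line =>
        try {
          val fields = line.split(",")
          // fields(1) is where a truncated record would throw the
          // ArrayIndexOutOfBoundsException you are seeing
          Some((fields(0), fields(1).toDouble))
        } catch {
          case e: Exception =>
            None // log and drop the corrupt record instead of failing the task
        }
      }

Dropping (and logging) the bad record is usually better than letting the task fail four times and abort the whole stage.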
Thanks
Best Regards

On Fri, Oct 31, 2014 at 10:09 AM, sivarani <whitefeathers...@gmail.com> wrote:

> The problem is simple:
>
> I want to stream data 24/7, do some calculations, and save the result in a
> csv/json file so that I can use it for visualization with dc.js/d3.js.
>
> I opted for Spark Streaming on a YARN cluster with Kafka and tried running
> it 24/7.
>
> I use groupByKey and updateStateByKey to keep the computed historical data.
>
> Initially streaming works fine, but after a few hours I get:
>
> 14/10/30 23:48:49 ERROR TaskSetManager: Task 2485162.0:3 failed 4 times;
> aborting job
> 14/10/30 23:48:50 ERROR JobScheduler: Error running job streaming job
> 1414692270000 ms.1
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 2485162.0:3 failed 4 times, most recent failure: Exception failure in TID
> 478548 on host 172.18.152.36: java.lang.ArrayIndexOutOfBoundsException
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>
> I guess it is due to groupByKey and updateStateByKey; I tried
> groupByKey(100) to increase the number of partitions.
>
> Also, when data is in state, say at the 10th second 1,000 records are in
> state and at the 100th second 20,000 records are in state, of which 19,000
> records are no longer being updated: how do I remove them from the state?
> updateStateByKey can return None, but how and when do I do that? How will I
> know when to return None, and how do I save the data before returning None?
>
> I also tried not sending any data for a few hours, but the web UI shows the
> application as FINISHED:
>
> app-20141030203943-0000  NewApp  0  6.0 GB  2014/10/30 20:39:43  hadoop  FINISHED  4.2 h
>
> This confuses me: the code calls awaitTermination, and I did not terminate
> the task. Will streaming stop if no data is received for a significant
> amount of time? Is there any documentation on how long Spark will run when
> no data is streamed?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Issue-not-running-24-7-tp17791.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
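On the updateStateByKey question above: the update function is invoked for every key in the state on every batch, including keys that received no new values in that batch, and returning None from it removes the key from the state. A minimal sketch, assuming the state per key is a running count plus a last-seen timestamp and that you evict keys idle for an hour (all names and the threshold are assumptions):

    // `pairs` is a hypothetical DStream[(String, Long)] of (key, value)
    // State per key: (runningCount, lastSeenMs)
    val updateFunc = (newValues: Seq[Long], state: Option[(Long, Long)]) => {
      val now = System.currentTimeMillis()
      state match {
        case Some((count, lastSeen)) if newValues.isEmpty =>
          if (now - lastSeen > 60 * 60 * 1000) None // idle for 1h: evict the key
          else Some((count, lastSeen))              // keep the state untouched
        case Some((count, _)) =>
          Some((count + newValues.sum, now))        // fold new values into the state
        case None =>
          Some((newValues.sum, now))                // first time this key is seen
      }
    }
    val stateStream = pairs.updateStateByKey[(Long, Long)](updateFunc)

If you need the data saved before a key is evicted, write the state stream out on every batch (e.g. with saveAsTextFiles or foreachRDD) rather than doing the write inside the update function, which should stay side-effect free.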