The trace says TID 478548 on host 172.18.152.36 failed with a java.lang.ArrayIndexOutOfBoundsException.
Can you try putting a try { } catch around all those operations that you are
doing on the DStream? That way corrupt data or a missing field will not stop
the entire application.
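
Something along these lines, for example (just a sketch; the Kafka stream name
and the comma parsing are placeholders, since I don't know what your operations
look like):

// Wrap the per-record work in try/catch so one corrupt record is skipped
// instead of failing the task and, after 4 retries, the whole job.
// "kafkaStream" is assumed to be the (key, message) DStream from KafkaUtils.
val pairs = kafkaStream.flatMap { case (_, line) =>
  try {
    val fields = line.split(",")
    // A short or corrupt line would throw ArrayIndexOutOfBoundsException here
    Some((fields(0), fields(1).toLong))
  } catch {
    case e: Exception =>
      // optionally log the bad record, then drop it
      None
  }
}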

Thanks
Best Regards

On Fri, Oct 31, 2014 at 10:09 AM, sivarani <whitefeathers...@gmail.com>
wrote:

> The problem is simple
>
> I want to stream data 24/7, do some calculations, and save the result in a
> csv/json file so that I can use it for visualization with dc.js/d3.js.
>
> I opted for Spark Streaming on a YARN cluster with Kafka and tried running it
> 24/7.
>
> I am using groupByKey and updateStateByKey to keep the computed historical data.
>
> Initially streaming works fine, but after a few hours I am getting
>
> 14/10/30 23:48:49 ERROR TaskSetManager: Task 2485162.0:3 failed 4 times;
> aborting job
> 14/10/30 23:48:50 ERROR JobScheduler: Error running job streaming job
> 1414692270000 ms.1
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 2485162.0:3 failed 4 times, most recent failure: Exception failure in TID
> 478548 on host 172.18.152.36: java.lang.ArrayIndexOutOfBoundsException
>
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>
> I guess it's due to the groupByKey and updateStateByKey. I tried groupByKey(100)
> to increase the number of partitions.
>
> Also, data keeps accumulating in the state: say at the 10th second 1,000 records
> are in state, and by the 100th second 20,000 records are in state, of which
> 19,000 are no longer being updated. How do I remove them from the state?
> I know updateStateByKey can return None to drop a key, but how and when do I
> do that, how will I know when to send None, and how do I save the data before
> setting it to None?
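
On the question above about removing entries from the state: returning None from
the update function drops that key. A rough sketch, assuming the state keeps a
last-updated timestamp (the Agg case class, the 10-minute threshold and
"yourPairStream" are placeholders to adapt to your data):

// State per key: a running total plus when it was last updated.
case class Agg(total: Long, lastUpdated: Long)

val staleAfterMs = 10 * 60 * 1000L  // assumed idle threshold

def updateFunc(newValues: Seq[Long], state: Option[Agg]): Option[Agg] = {
  val now = System.currentTimeMillis()
  if (newValues.nonEmpty) {
    // New data for this key: update the total and the timestamp.
    Some(Agg(state.map(_.total).getOrElse(0L) + newValues.sum, now))
  } else state match {
    // No new data, but not idle for too long yet: keep the entry.
    case Some(s) if now - s.lastUpdated <= staleAfterMs => Some(s)
    // Idle too long: this is where you would persist the final value
    // (e.g. append it to your CSV/JSON output) and then return None,
    // which removes the key from the state.
    case _ => None
  }
}

// updateStateByKey requires checkpointing (ssc.checkpoint(...)) to be enabled.
val historical = yourPairStream.updateStateByKey(updateFunc _)

Whatever you write out for visualization has to happen before None is returned,
since the key is gone from the state afterwards.
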
>
> I also tried not sending any data for a few hours, but when I check the web UI
> I see the application marked FINISHED:
>
> app-20141030203943-0000  NewApp  0  6.0 GB  2014/10/30 20:39:43  hadoop  FINISHED  4.2 h
>
> This confuses me. The code calls awaitTermination and I did not terminate the
> application myself. Will streaming stop if no data is received for a
> significant amount of time? Is there any documentation on how long Spark will
> keep running when no data is streamed?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Issue-not-running-24-7-tp17791.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
