FW: Spark streaming - failed recovery from checkpoint

2015-11-02 Thread Adrian Tanase
Re-posting here, didn’t get any feedback on the dev list. Has anyone experienced corrupted checkpoints recently? Thanks! -adrian From: Adrian Tanase Date: Thursday, October 29, 2015 at 1:38 PM To: "d...@spark.apache.org" Subject: Spark streaming -

Re: Rule Engine for Spark

2015-11-04 Thread Adrian Tanase
Another way to do it is to extract your filters as SQL code and load it in a transform – which allows you to change the filters at runtime. Inside the transform you could apply the filters by going RDD -> DF -> SQL -> RDD. Lastly, depending on how complex your filters are, you could skip SQL an
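A minimal sketch of the RDD -> DF -> SQL -> RDD pattern described above, assuming Spark 1.x APIs (SQLContext, registerTempTable) and a hypothetical `Event` case class and `loadFilterSql` helper supplied by the application:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Event(user: String, score: Int)

def filtered(events: DStream[Event], loadFilterSql: () => String): DStream[Event] =
  events.transform { rdd: RDD[Event] =>
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    rdd.toDF().registerTempTable("events")
    // loadFilterSql() is re-evaluated on the driver every micro-batch, so the
    // filter can change at runtime, e.g. "SELECT * FROM events WHERE score > 10"
    sqlContext.sql(loadFilterSql())
      .rdd
      .map(r => Event(r.getAs[String]("user"), r.getAs[Int]("score")))
  }
```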

Re: Why some executors are lazy?

2015-11-04 Thread Adrian Tanase
If some of the operations required involve shuffling and partitioning, it might mean that the data set is skewed to specific partitions which will create hot spotting on certain executors. -adrian From: Khaled Ammar Date: Tuesday, November 3, 2015 at 11:43 PM To: "user@spark.apache.org
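A hypothetical helper for diagnosing the skew described above: given per-partition record counts (obtainable via something like `rdd.mapPartitions(it => Iterator(it.size.toLong)).collect()`), flag partitions holding far more than their fair share:

```scala
// Returns the indices of partitions whose record count exceeds
// `factor` times the mean partition size (a simple skew heuristic).
def skewedPartitions(counts: Seq[Long], factor: Double = 2.0): Seq[Int] = {
  val mean = counts.sum.toDouble / counts.size
  counts.zipWithIndex.collect { case (c, i) if c > factor * mean => i }
}
```

Executors owning the flagged partitions are the likely hot spots; repartitioning on a better-distributed key is the usual fix.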

Custom application.conf on spark executor nodes?

2015-11-04 Thread Adrian Tanase
Hi guys, I’m trying to deploy a custom application.conf that contains various app specific entries and config overrides for akka and spray. The file is successfully loaded by the driver by submitting with --driver-class-path – at least for the application specific config. I haven’t yet managed
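One possible sketch for getting the file onto the executors as well, set programmatically here (the properties are the spark-submit equivalents; the path is illustrative). `spark.files` ships the file into each executor's working directory, and putting that directory on the executor classpath lets Typesafe Config's `ConfigFactory.load()` find it:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("custom-config-demo")
  // ship application.conf with the job; lands in each executor's working dir
  .set("spark.files", "/path/to/application.conf")
  // make the working dir visible to the executor classloader
  .set("spark.executor.extraClassPath", "./")
```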

Re: Spark Streaming data checkpoint performance

2015-11-04 Thread Adrian Tanase
Nice! Thanks for sharing, I wasn’t aware of the new API. Left some comments on the JIRA and design doc. -adrian From: Shixiong Zhu Date: Tuesday, November 3, 2015 at 3:32 AM To: Thúy Hằng Lê Cc: Adrian Tanase, "user@spark.apache.org" Subject: Re:

Re: Why some executors are lazy?

2015-11-04 Thread Adrian Tanase
* The scheduler might decide to take advantage of free cores in the cluster and schedule an off-node processing and you can control how long it waits through the spark.locality.wait settings Hope this helps, -adrian From: Khaled Ammar Date: Wednesday, November 4, 2015 at 4:03 PM To: Adrian Tanase
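The locality wait mentioned above can be tuned via real Spark settings; the values below are only illustrative:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // how long the scheduler waits for a data-local slot before
  // downgrading to a less local (e.g. off-node) one
  .set("spark.locality.wait", "10s")
  // can also be overridden per locality level
  .set("spark.locality.wait.node", "3s")
```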

Re: Scheduling Spark process

2015-11-05 Thread Adrian Tanase
You should also specify how you’re planning to query or “publish” the data. I would consider a combination of: - spark streaming job that ingests the raw events in real time, validates, pre-process and saves to stable storage - stable storage could be HDFS/parquet or a database optimized for ti

Re: How to use data from Database and reload every hour

2015-11-05 Thread Adrian Tanase
You should look at .transform – it’s a powerful transformation (sic) that allows you to dynamically load resources and it gets executed in every micro batch. Re-broadcasting something should be possible from inside transform as that code is executed on the driver but it’s still a controversial
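A hedged sketch of the re-broadcast pattern described above: the closure passed to `transform` runs on the driver every micro-batch, so a broadcast variable can be refreshed there. `loadLookup` and the staleness check are hypothetical app-side pieces:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

@volatile var lookup: Broadcast[Map[String, String]] = null

def enrich(lines: DStream[String], loadLookup: () => Map[String, String]): DStream[String] =
  lines.transform { rdd =>
    // driver-side: decide whether to refresh, e.g. once per hour
    if (lookup == null /* || isStale(lookup) */) {
      if (lookup != null) lookup.unpersist()
      lookup = rdd.sparkContext.broadcast(loadLookup())
    }
    val local = lookup // capture a stable reference for the executor closure
    rdd.map(line => local.value.getOrElse(line, line))
  }
```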

Re: How to unpersist a DStream in Spark Streaming

2015-11-06 Thread Adrian Tanase
Do we have any guarantees on the maximum duration? I've seen RDDs kept around for 7-10 minutes on batches of 20 secs and checkpoint of 100 secs. No windows, just updateStateByKey. It's not a memory issue but on checkpoint recovery it goes back to Kafka for 10 minutes of data, any idea why? -ad
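A minimal sketch relating the durations discussed above, using the real StreamingContext/DStream APIs (the values and paths are illustrative). With a 20s batch and a 100s state-checkpoint interval, RDDs and the Kafka ranges backing them can be retained well past one batch; `remember()` makes the retention floor explicit:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

def configure(ssc: StreamingContext, state: DStream[(String, Long)]): Unit = {
  ssc.checkpoint("hdfs:///checkpoints/myapp") // metadata + state checkpoints
  ssc.remember(Seconds(120))                  // keep generated RDDs at least 2 min
  state.checkpoint(Seconds(100))              // updateStateByKey checkpoint interval
}
```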

Re: Dynamic Allocation & Spark Streaming

2015-11-06 Thread Adrian Tanase
You can register a streaming listener – in the BatchInfo you’ll find a lot of stats (including count of received records) that you can base your logic on: https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/StreamingListener.scala http://spark.
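A sketch of the listener approach, built on the real StreamingListener API linked above; the zero-record check is just an illustration of a signal a dynamic-allocation policy might act on:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class IdleDetector extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // BatchInfo also exposes scheduling delay, processing time, etc.
    if (info.numRecords == 0)
      println(s"Idle batch at ${info.batchTime}; could scale down here")
  }
}

// register on the context before ssc.start():
// ssc.addStreamingListener(new IdleDetector())
```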

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
Can you be a bit more specific about what “blow up” means? Also what do you mean by “messed up” brokers? Imbalance? Broker(s) dead? We’re also using the direct consumer and so far nothing dramatic happened: - on READ it automatically reads from backups if leader is dead (machine gone) - or READ i

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
I’ve seen this before during an extreme outage on the cluster, where the kafka offsets checkpointed by the directstreamRdd were bigger than what kafka reported. The checkpoint was therefore corrupted. I don’t know the root cause but since I was stressing the cluster during a reliability test I c

Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread Adrian Tanase
I don’t think this answers your question but here’s how you would evaluate the model in realtime in a streaming app https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html Maybe you can find a way to extract portions of MLLib and run them out
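A hedged sketch of scoring an MLlib model inside a streaming app, in the spirit of the linked Databricks example. It assumes a pre-trained `KMeansModel` available on the driver; the one-dimensional featurization is a placeholder for real feature extraction (e.g. hashed term frequencies):

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.dstream.DStream

def classify(tweets: DStream[String], model: KMeansModel): DStream[(String, Int)] =
  tweets.map { text =>
    // placeholder featurization; a real app would vectorize the text properly
    val features = Vectors.dense(text.length.toDouble)
    (text, model.predict(features)) // cluster id for this tweet
  }
```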
