Running in local mode as SQL engine - what to optimize?

2016-09-29 Thread RodrigoB
Hi all, For several reasons which I won't elaborate (yet), we're using Spark on local mode as an in memory SQL engine for data we're retrieving from Cassandra, execute SQL queries and return to the client - so no cluster, no worker nodes. I'm well aware local mode has always been considered a

Re: Spark on Apache Ingnite?

2016-01-11 Thread RodrigoB
Although I haven't work explicitly with either, they do seem to differ in design and consequently in usage scenarios. Ignite is claimed to be a pure in-memory distributed database. With Ignite, updating existing keys is something that is self-managed comparing with Tachyon. In Tachyon once a

Re: Scala 2.11 and Akka 2.4.0

2015-12-07 Thread RodrigoB
Hi Manas, Thanks for the reply. I've done that. The problem lies with Spark + akka 2.4.0 build. Seems the maven shader plugin is altering some class files and breaking the Akka runtime. Seems the Spark build on Scala 2.11 using SBT is broken. I'm getting build errors using sbt due to the issues

Spark Streaming - Minimizing batch interval

2015-03-25 Thread RodrigoB
I've been given a feature requirement that means processing events on a latency lower than 0.25ms. Meaning I would have to make sure that Spark streaming gets new events from the messaging layer within that period of time. Would anyone have achieve such numbers using a Spark cluster? Or would

Re: Spark Streaming checkpoint recovery causes IO re-execution

2015-01-20 Thread RodrigoB
Hi Hannes, Good to know I'm not alone on the boat. Sorry about not posting back, I haven't gone in a while onto the user list. It's on my agenda to get over this issue. Will be very important for our recovery implementation. I have done an internal proof of concept but without any conclusions

Re: Low Level Kafka Consumer for Spark

2014-12-02 Thread RodrigoB
Hi Dibyendu,What are your thoughts on keeping this solution (or not), considering that Spark Streaming v1.2 will have built-in recoverability of the received data?https://issues.apache.org/jira/browse/SPARK-1647I'm concerned about the complexity of this solution with regards the added complexity

Dynamically switching Nr of allocated core

2014-11-03 Thread RodrigoB
Hi all, I can't seem to find a clear answer on the documentation. Does the standalone cluster support dynamic assigment of nr of allocated cores to an application once another app stops? I'm aware that we can have core sharding if we use Mesos between active applications depending on the nr of

Re: NullPointerException on reading checkpoint files

2014-09-23 Thread RodrigoB
Hi TD, This is actually an important requirement (recovery of shared variables) for us as we need to spread some referential data across the Spark nodes on application startup. I just bumped into this issue on Spark version 1.0.1. I assume the latest one also doesn't include this capability. Are

Re: RDD data checkpoint cleaning

2014-09-23 Thread RodrigoB
Hi TD, tnks for getting back on this. Yes that's what I was experiencing - data checkpoints were being recovered from considerable time before the last data checkpoint, probably since the beginning of the first writes, would have to confirm. I have some development on this though. These results

Re: RDD data checkpoint cleaning

2014-09-23 Thread RodrigoB
Just a follow-up. Just to make sure about the RDDs not being cleaned up, I just replayed the app both on the windows remote laptop and then on the linux machine and at the same time was observing the RDD folders in HDFS. Confirming the observed behavior: running on the laptop I could see the

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-31 Thread RodrigoB
Hi Yana, You are correct. What needs to be added is that besides RDDs being checkpointed, metadata which represents execution of computations are also checkpointed in Spark Streaming. Upon driver recovery, the last batches (the ones already executed and the ones that should have been executed

Re: Low Level Kafka Consumer for Spark

2014-08-31 Thread RodrigoB
Just a comment on the recovery part. Is it correct to say that currently Spark Streaming recovery design does not consider re-computations (upon metadata lineage recovery) that depend on blocks of data of the received stream? https://issues.apache.org/jira/browse/SPARK-1647 Just to illustrate a

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread RodrigoB
Hi Yana, The fact is that the DB writing is happening on the node level and not on Spark level. One of the benefits of distributed computing nature of Spark is enabling IO distribution as well. For example, is much faster to have the nodes to write to Cassandra instead of having them all

Re: Low Level Kafka Consumer for Spark

2014-08-25 Thread RodrigoB
Hi Dibyendu, My colleague has taken a look at the spark kafka consumer github you have provided and started experimenting. We found that somehow when Spark has a failure after a data checkpoint, the expected re-computations correspondent to the metadata checkpoints are not recovered so we loose

Re: Running a task once on each executor

2014-08-11 Thread RodrigoB
Hi Christopher, I am also in the need of having a single function call on the node level. Your suggestion makes sense as a solution to the requirement, but still feels like a workaround, this check will get called on every row...Also having static members and methods created specially on a

Re: streaming window not behaving as advertised (v1.0.1)

2014-08-01 Thread RodrigoB
Hi TD, I've also been fighting this issue only to find the exact same solution you are suggesting. Too bad I didn't find either the post or the issue sooner. I'm using a 1 second batch with N amount of kafka events (1 to 1 with the state objects) per batch and only calling the updatestatebykey

Re: UpdatestateByKey assumptions

2014-07-29 Thread RodrigoB
Just an update on this, Looking into Spark logs seems that some partitions are not found and recomputed. Gives the impression that those are related with the delayed updatestatebykey calls. I'm seeing something like: log line 1 - Partition rdd_132_1 not found, computing it log line N -

Re: Cassandra driver Spark question

2014-07-16 Thread RodrigoB
Tnks to both for the comments and the debugging suggestion, I will try to use. Regarding you comment, yes I do agree the current solution was not efficient but for using the saveToCassandra method I need an RDD thus the paralelize method. I finally got direct by Piotr to use the

Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call a myDStream.foreachRDD so that I can collect each RDD in the driver app runtime and inside I do something like this: myDStream.foreachRDD(rdd = { var someCol = Seq[MyType]() foreach(kv ={

Re: Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi Luis, Yes it's actually an ouput of the previous RDD. Have you ever used the Cassandra Spark Driver on the driver app? I believe these limitations go around that - it's designed to save RDDs from the nodes. tnks, Rod -- View this message in context:

RE: range partitioner with updateStateByKey

2014-06-06 Thread RodrigoB
Hi TD, I have the same question: I need the workers to process using arrival order since it's updating a state based on previous one. tnks in advance. Rod -- View this message in context:

Re: NoSuchElementException: key not found

2014-06-06 Thread RodrigoB
Hi Tathagata, Im seeing the same issue on a load run over night with Kafka streaming (6000 mgs/sec) and 500millisec batch size. Again occasional and only happening after a few hours I believe Im using updateStateByKey. Any comment will be very welcome. tnks, Rod -- View this message in