Hi all,
For several reasons which I won't elaborate on (yet), we're using Spark in
local mode as an in-memory SQL engine: we retrieve data from Cassandra,
execute SQL queries, and return the results to the client - so no cluster, no
worker nodes. I'm well aware local mode has always been considered a
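For concreteness, the shape of it is roughly this (a minimal sketch assuming
Spark 1.x with the DataStax spark-cassandra-connector; the keyspace, table
and Event type are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.datastax.spark.connector._

case class Event(id: String, value: Double)

// Local mode only: no cluster, no worker nodes.
val conf = new SparkConf().setMaster("local[*]").setAppName("in-memory-sql")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Pull the rows we need from Cassandra into memory...
val events = sc.cassandraTable[Event]("my_keyspace", "events")

// ...expose them to SQL and answer the client's query from the local JVM.
sqlContext.createDataFrame(events).registerTempTable("events")
val result = sqlContext.sql("SELECT id, SUM(value) FROM events GROUP BY id").collect()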
Although I haven't worked explicitly with either, they do seem to differ in
design and consequently in usage scenarios.
Ignite is claimed to be a pure in-memory distributed database.
With Ignite, updating existing keys is something that is self-managed,
compared with Tachyon. In Tachyon once a
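To make the contrast concrete, in Ignite overwriting a key is just another
put on the same key (a rough sketch against the Ignite Java API from Scala;
the cache name and values are made up):

import org.apache.ignite.Ignition

// Start an Ignite node in this JVM and get (or create) a key-value cache.
val ignite = Ignition.start()
val cache = ignite.getOrCreateCache[String, Int]("myCache")

// Updating an existing key is managed in place by Ignite:
cache.put("counter", 1)
cache.put("counter", 2) // overwrites the previous value for the same key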
Hi Manas,
Thanks for the reply. I've done that. The problem lies with the Spark + Akka
2.4.0 build. It seems the Maven Shade plugin is altering some class files and
breaking the Akka runtime.
It seems the Spark build on Scala 2.11 using SBT is broken. I'm getting
build errors using sbt due to the issues
I've been given a feature requirement that means processing events with a
latency lower than 0.25ms.
That means I would have to make sure that Spark Streaming gets new events
from the messaging layer within that period of time. Has anyone achieved
such numbers using a Spark cluster? Or would
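For reference, the smallest latency knob on the Spark Streaming side is the
batch interval, and even an aggressive setting is still micro-batching, not
per-event delivery (a sketch; the numbers are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("latency-probe")
// Even an aggressive micro-batch interval sits orders of magnitude above
// 0.25ms, before counting scheduling and receiver hand-off overhead.
val ssc = new StreamingContext(conf, Milliseconds(50))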
Hi Hannes,
Good to know I'm not alone in this boat. Sorry about not posting back; I
haven't been on the user list in a while.
It's on my agenda to get past this issue. It will be very important for our
recovery implementation. I have done an internal proof of concept but
without any conclusions
Hi Dibyendu,
What are your thoughts on keeping this solution (or not), considering that
Spark Streaming v1.2 will have built-in recoverability of the received data?
https://issues.apache.org/jira/browse/SPARK-1647
I'm concerned about this solution with regard to the added
complexity
Hi all,
I can't seem to find a clear answer in the documentation.
Does the standalone cluster support dynamic assignment of the number of
allocated cores to an application once another app stops?
I'm aware that we can have core sharing if we use Mesos between active
applications depending on the number of
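The closest setting I've found is the static per-application cap below (a
sketch; the master URL and values are made up) - what I can't confirm is
whether cores freed by a finished app get re-assigned to a running one:

import org.apache.spark.SparkConf

// Standalone mode: cap the cores this application may take. In the default
// spread-out mode an app grabs up to this many cores when it starts.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("app-a")
  .set("spark.cores.max", "8")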
Hi TD,
This is actually an important requirement (recovery of shared variables) for
us as we need to spread some referential data across the Spark nodes on
application startup. I just bumped into this issue on Spark version 1.0.1. I
assume the latest one also doesn't include this capability. Are
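For context, the shared variables in question are broadcasts of referential
data, along these lines (a sketch; the map contents are made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("ref-data"))

// Referential data shipped to every node at application startup. After a
// driver restart from checkpoint, this broadcast is what we'd need recovered.
val refData = sc.broadcast(Map("PT" -> "Portugal", "DE" -> "Germany"))

val countries = sc.parallelize(Seq("PT", "DE")).map(code => refData.value(code)).collect()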
Hi TD, thanks for getting back on this.
Yes, that's what I was experiencing - data checkpoints were being recovered
from a point considerably before the last data checkpoint, probably since the
beginning of the first writes; I would have to confirm. I have made some
progress on this though.
These results
Just a follow-up.
To make sure about the RDDs not being cleaned up, I replayed the app both on
the remote Windows laptop and then on the Linux machine, and at the same time
observed the RDD folders in HDFS.
Confirming the observed behavior: running on the laptop I could see the
Hi Yana,
You are correct. What needs to be added is that besides RDDs being
checkpointed, the metadata representing the execution of computations is
also checkpointed in Spark Streaming.
Upon driver recovery, the last batches (the ones already executed and the
ones that should have been executed
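Concretely, the recovery path goes through the checkpoint directory via the
standard getOrCreate pattern (a sketch; the path is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/myapp" // illustrative path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("recoverable"), Seconds(1))
  ssc.checkpoint(checkpointDir) // both metadata and data checkpoints land here
  // ... define the input streams and computations here ...
  ssc
}

// First run: builds a fresh context. After a driver failure: rebuilds the
// context from the checkpointed metadata and re-schedules the last batches.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)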
Just a comment on the recovery part.
Is it correct to say that the current Spark Streaming recovery design does
not account for re-computations (upon metadata lineage recovery) that depend
on blocks of data from the received stream?
https://issues.apache.org/jira/browse/SPARK-1647
Just to illustrate a
Hi Yana,
The fact is that the DB writing is happening at the node level and not at
the Spark level. One of the benefits of the distributed computing nature of
Spark is enabling IO distribution as well. For example, it is much faster to
have the nodes write to Cassandra instead of having them all
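In code, the difference is that with the spark-cassandra-connector the write
happens inside the partitions, on whichever node holds them (a sketch;
keyspace, table and data are made up):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val sc = new SparkContext(new SparkConf().setAppName("distributed-writes"))

// Each partition is written to Cassandra by the executor that holds it,
// so the write IO is spread across the nodes instead of funnelled through
// a single writer.
val rows = sc.parallelize(Seq(("a", 1), ("b", 2)))
rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("key", "value"))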
Hi Dibyendu,
My colleague has taken a look at the Spark Kafka consumer GitHub project you
provided and started experimenting.
We found that somehow, when Spark has a failure after a data checkpoint, the
expected re-computations corresponding to the metadata checkpoints are not
recovered, so we lose
Hi Christopher,
I am also in need of having a single function call at the node level.
Your suggestion makes sense as a solution to the requirement, but it still
feels like a workaround; this check will get called on every row... Also,
having static members and methods created specially on a
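For the record, the workaround being discussed amounts to a per-JVM
singleton, something like this (an illustrative sketch; MyResource is a
made-up stand-in for whatever must be set up once per node):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical per-node resource.
class MyResource { def handle(row: (String, Int)): Unit = () }

object NodeLocal {
  // A lazy val in an object is initialized at most once per JVM, i.e.
  // roughly once per executor - instead of a check running on every row.
  lazy val resource = new MyResource
}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("node-level-init"))
sc.parallelize(Seq(("a", 1), ("b", 2))).foreach(row => NodeLocal.resource.handle(row))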
Hi TD,
I've also been fighting this issue, only to find the exact same solution you
are suggesting.
Too bad I didn't find either the post or the issue sooner.
I'm using a 1-second batch with N Kafka events (1-to-1 with the
state objects) per batch and only calling updateStateByKey
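For reference, my usage is essentially the textbook updateStateByKey pattern
(a sketch; the state is simplified to a counter, and the socket source
stands in for the Kafka receiver):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("state-app")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///checkpoints/state-app") // required by updateStateByKey

// One state object per key, updated from the new events in each batch.
def updateFn(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(state.getOrElse(0) + newValues.sum)

// Stand-in source; in my case this is the Kafka receiver stream.
val pairs = ssc.socketTextStream("localhost", 9999).map(e => (e, 1))
val state = pairs.updateStateByKey(updateFn _)
state.print()

ssc.start()
ssc.awaitTermination()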
Just an update on this:
Looking into the Spark logs, it seems that some partitions are not found and
are recomputed. This gives the impression that they are related to the
delayed updateStateByKey calls.
I'm seeing something like:
log line 1 - Partition rdd_132_1 not found, computing it
log line N -
Thanks to both for the comments and the debugging suggestion; I will try it.
Regarding your comment, yes, I do agree the current solution was not
efficient, but to use the saveToCassandra method I need an RDD, hence the
parallelize method. I finally got directed by Piotr to use the
Hi all,
I am currently trying to save to Cassandra after some Spark Streaming
computation.
I call myDStream.foreachRDD so that I can collect each RDD in the driver app
at runtime, and inside I do something like this:
myDStream.foreachRDD(rdd => {
  var someCol = Seq[MyType]()
  rdd.collect().foreach(kv => {
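Filling in the shape of that snippet: the pattern described above - collect
into a driver-side collection, then parallelize so saveToCassandra has an
RDD to work with - would look roughly like this (myDStream and MyType are
from the snippet; the keyspace and table names are made up, and MyType is
assumed to be a case class the connector can map):

import com.datastax.spark.connector._

myDStream.foreachRDD { rdd =>
  // Accumulate the batch's records driver-side...
  var someCol = Seq[MyType]()
  rdd.collect().foreach { kv =>
    someCol = someCol :+ kv
  }
  // ...then go back to an RDD, since saveToCassandra needs one.
  rdd.sparkContext.parallelize(someCol).saveToCassandra("my_keyspace", "my_table")
}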
Hi Luis,
Yes, it's actually an output of the previous RDD.
Have you ever used the Cassandra Spark driver in the driver app? I believe
these limitations revolve around that - it's designed to save RDDs from the
nodes.
tnks,
Rod
Hi TD,
I have the same question: I need the workers to process in arrival order,
since it's updating a state based on the previous one.
Thanks in advance.
Rod
Hi Tathagata,
I'm seeing the same issue on an overnight load run with Kafka streaming
(6000 msgs/sec) and a 500 ms batch size. Again, it's occasional and only
happening after a few hours, I believe.
I'm using updateStateByKey.
Any comment will be very welcome.
tnks,
Rod