This is actually very tricky, as there are two pretty big challenges that
need to be solved.
(i) Checkpointing for broadcast variables: Unlike RDDs, broadcast variables
don't have checkpointing support (that is, you cannot write the content of a
broadcast variable to HDFS and recover it automatically when needed).
(ii) Remembering the checkpoint info of the broadcast vars used in every
batch, and recovering those vars from the checkpoint info. This also has to
be exposed in the API in such a way that all the checkpointing/recovering
can be done by Spark Streaming seamlessly, without the user's knowledge.
I have some thoughts on it, but nothing concrete yet. The first, that is,
broadcast checkpointing, should be straightforward, and may be rewarding
outside streaming.
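Until something like that exists, one possible workaround is to not recover
the broadcast at all, but to re-create it lazily on the driver after a
restart. The sketch below is only an illustration of that idea, not an
existing Spark Streaming feature: BroadcastHolder and load_reference_data
are hypothetical names, and the only Spark call assumed is
SparkContext.broadcast(value).

```python
import threading


class BroadcastHolder:
    """Hypothetical helper: lazily (re)creates one broadcast per driver process."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, sc, load_reference_data):
        # sc is assumed to be a SparkContext; only sc.broadcast() is used.
        # load_reference_data is a user-supplied function (an assumption here)
        # that reloads the referential data from its original source.
        if cls._instance is None:
            with cls._lock:
                # Re-check under the lock so the data is loaded only once.
                if cls._instance is None:
                    cls._instance = sc.broadcast(load_reference_data())
        return cls._instance

    @classmethod
    def reset(cls):
        # Drop the cached handle so the next get() builds a fresh broadcast.
        with cls._lock:
            cls._instance = None
```

The idea would be to call BroadcastHolder.get(rdd.context, loader) inside
foreachRDD or transform, so that a driver restarted from a checkpoint
rebuilds the broadcast on first use instead of trying to deserialize a
stale one. This does nothing for challenge (i); the data is reloaded from
its source, not from a checkpoint.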
TD
On Tue, Sep 23, 2014 at 4:22 PM, RodrigoB rodrigo.boav...@aspect.com
wrote:
Hi TD,
This is actually an important requirement (recovery of shared variables)
for us, as we need to spread some referential data across the Spark nodes
on application startup. I just bumped into this issue on Spark version
1.0.1. I assume the latest one also doesn't include this capability. Are
there any plans to do so?
If not, could you give me your opinion on how difficult it would be to
implement this? If it's nothing too complex, I could consider contributing
on that level.
BTW, regarding recovery, I have posted a topic on which I would very much
appreciate your comments:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-data-checkpoint-cleaning-td14847.html
Thanks,
Rod