Re: Spark Streaming Checkpointing solutions

2015-07-21 Thread Dean Wampler
TD's Spark Summit talk offers suggestions (
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/).
He recommends using HDFS, because you get the three-way replication it
offers, albeit with extra overhead. I believe the driver doesn't need
access to the checkpointing directory, e.g., if you're running in
client mode, but all the cluster nodes would need to see it to recover
a lost stage, which might be restarted on a different node. Hence, I
would think NFS could work, if all nodes have the same mount, although
there would be a lot of network overhead. In some situations, a
high-performance file system appliance, e.g., NAS, could suffice.

My $0.02,
dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Tue, Jul 21, 2015 at 10:43 AM, Emmanuel fortin.emman...@gmail.com
wrote:

 Hi,

 I'm working on a Spark Streaming application and I would like to know which
 storage is best for checkpointing.

 For testing purposes we're using NFS shared between the worker, the master and
 the driver program (in client mode), but we have some issues with the
 CheckpointWriter (1 dedicated thread). *My understanding is that NFS is not
 a good candidate for this usage.*

 1. What is the best solution for checkpointing, and what are the
 alternatives?

 2. Do the checkpoint directories need to be shared by the driver
 application and the workers too?

 Thanks for your replies



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Checkpointing-solutions-tp23932.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Streaming Checkpointing solutions

2015-07-21 Thread Emmanuel Fortin
Thank you for your reply. I will consider HDFS for the checkpoint storage.
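
For reference, pointing a streaming application at an HDFS checkpoint
directory might look roughly like the sketch below. It is only an
illustration under stated assumptions: the namenode address, the path, the
application name, the batch interval and the socket source are all
hypothetical placeholders, not details from this thread. It uses
ssc.checkpoint to set the directory and StreamingContext.getOrCreate to
rebuild the context from checkpoint data after a restart.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Hypothetical HDFS location; substitute your own namenode and path.
  val checkpointDir = "hdfs://namenode:8020/user/spark/checkpoints/my-app"

  // Builds a fresh context. getOrCreate calls this only when no usable
  // checkpoint data is found in checkpointDir.
  def createContext(): StreamingContext = {
    // The master is left unset here on the assumption that the job is
    // launched with spark-submit, which supplies it.
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc = new StreamingContext(conf, Seconds(10)) // assumed 10 s batches
    ssc.checkpoint(checkpointDir)

    // Placeholder pipeline: count the lines received from a hypothetical
    // socket source in each batch and print the result.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise start fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this pattern the driver itself reads the checkpoint data on restart,
so the path has to be reachable from wherever the driver runs as well as
from the cluster nodes.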