Re: checkpointing without streaming?
You can use *SparkContext.checkpointFile()*. Note, however, that the checkpoint file contains Java-serialized data, so if your data types change between writing and reading the checkpoint for any reason (a Spark version change, recompiled code, etc.), you may not be able to read it back. So use it carefully :)

On Thu, May 18, 2017 at 12:18 AM, Neelesh Sambhajiche <sambhajicheneel...@gmail.com> wrote:
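A minimal sketch of the idea above. The checkpoint directory path is an assumption, and note that in a number of Spark releases *SparkContext.checkpointFile* is not public API (it is package-private), so treat the second half as the shape of the approach rather than guaranteed API — check accessibility in your version first:

```scala
// Run 1: compute and checkpoint the RDD (written as Java-serialized data).
sc.setCheckpointDir("hdfs:///tmp/checkpoints")          // assumed path
val y = sc.parallelize(List(1, 2, 3, 4), 2).map(_ * 2)
y.checkpoint()
y.count()   // an action forces the checkpoint to actually be written

// Run 2 (a later application): read the checkpointed data back.
// The element type (Int here) must match what was written, or
// deserialization will fail, as noted above. The rdd-* subdirectory
// name is a placeholder for whatever run 1 produced.
// val z = sc.checkpointFile[Int]("hdfs:///tmp/checkpoints/<app-id>/rdd-<id>")
// z.count()
```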
Re: checkpointing without streaming?
That is exactly what we are currently doing - storing it in a CSV file. However, since checkpointing permanently writes to disk, if we use checkpointing along with saving the RDD to a text file, the data gets stored twice on disk. That is why I was looking for a way to read the checkpointed data in a different program.

On Wed, May 17, 2017 at 12:59 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

*Regards,*
*Neelesh Sambhajiche*
Mobile: 8058437181
*Birla Institute of Technology & Science,* Pilani
Pilani Campus, Rajasthan 333 031, INDIA
Re: checkpointing without streaming?
Why not just save the RDD to a proper file? Text file, sequence file - many options. Then it's standard to read it back in a different program.

On Wed, May 17, 2017 at 12:01 AM, neelesh.sa <sambhajicheneel...@gmail.com> wrote:
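One way to sketch this suggestion - the paths are assumptions. *saveAsObjectFile* keeps the element type (via Java serialization), while *saveAsTextFile* is human-readable but needs re-parsing on read:

```scala
// First application: save the RDD explicitly instead of relying on
// checkpoint files.
val y = sc.parallelize(List(1, 2, 3, 4), 2).map(_ * 2)
y.saveAsTextFile("hdfs:///data/doubled-text")    // plain text, one element per line
y.saveAsObjectFile("hdfs:///data/doubled-obj")   // Java-serialized, keeps the Int type

// Second application: standard reads, no checkpoint machinery involved.
val fromText = sc.textFile("hdfs:///data/doubled-text").map(_.toInt)
val fromObj  = sc.objectFile[Int]("hdfs:///data/doubled-obj")
```

Unlike reading a checkpoint directory, these are public, stable APIs intended exactly for handing data between applications.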
Re: checkpointing without streaming?
Is it possible to checkpoint an RDD in one run of my application and use the saved RDD in the next run of my application?

For example, with the following code:

val x = List(1, 2, 3, 4)
val y = sc.parallelize(x, 2).map(c => c * 2)
y.checkpoint
y.count

Is it possible to read the checkpointed RDD in another application?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/checkpointing-without-streaming-tp4541p28691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
checkpointing without streaming?
I'm trying to understand when I would want to checkpoint an RDD rather than just persist it to disk. Every reference I can find to checkpoint relates to Spark Streaming, but the method is defined in the core Spark library, not in Streaming. Does it exist solely for streaming, or are there circumstances unrelated to streaming in which I might want to checkpoint... and if so, what are they?

Thanks,
Diana
Re: checkpointing without streaming?
Checkpointing clears dependencies. You might need to checkpoint to cut a long lineage in iterative algorithms.

-Xiangrui

On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
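A sketch of the iterative case described here - the checkpoint directory and the interval of 10 are arbitrary choices. Without periodic checkpoints, every iteration extends the lineage, and a long enough loop can eventually blow the stack when the plan is traversed:

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // assumed path

var rdd = sc.parallelize(1 to 1000, 4)
for (i <- 1 to 100) {
  rdd = rdd.map(_ + 1)        // each iteration adds a step to the lineage
  if (i % 10 == 0) {
    rdd.cache()               // avoid recomputing when the checkpoint is written
    rdd.checkpoint()          // truncate the lineage every 10 iterations
    rdd.count()               // an action materializes the checkpoint
  }
}
```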
Re: checkpointing without streaming?
When might that be necessary or useful? Presumably I can persist and replicate my RDD to avoid recomputation, if that's my goal. What advantage does checkpointing provide over disk persistence with replication?

On Mon, Apr 21, 2014 at 2:42 PM, Xiangrui Meng <men...@gmail.com> wrote:
Re: checkpointing without streaming?
Diana, that is a good question.

When you persist an RDD, the system still remembers the whole lineage of parent RDDs that created it. If an executor fails and the persisted data is lost (both local-disk and in-memory data will be lost), the lineage is used to recreate the RDD. The longer the lineage, the more recomputation the system has to do on failure, and hence the higher the recovery time. So it is not a good idea to have a very long lineage, as it leads to all sorts of problems, like the one Xiangrui pointed to.

Checkpointing an RDD actually saves the RDD's data to HDFS and removes the pointers to the parent RDDs (since the data can be regenerated just by reading the HDFS file). So that RDD's data does not need to be recomputed when a worker fails - just re-read. In fact, the data is also retained across driver restarts, as it lives in HDFS.

RDD.checkpoint was introduced with streaming because streaming is an obvious use case where the lineage grows infinitely long (for stateful computations where each result depends on all the previously received data). However, checkpointing is useful for any long-running RDD computation, and I know that people have used RDD.checkpoint() independently of streaming.

TD

On Mon, Apr 21, 2014 at 1:10 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Persist doesn't cut lineage. You might run into a StackOverflow problem with a long lineage. See https://spark-project.atlassian.net/browse/SPARK-1006 for an example.
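The lineage truncation described in this thread can be observed directly with *RDD.toDebugString*; the checkpoint directory is an assumption, and the exact debug output varies by Spark version, so none is shown:

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // assumed path

val y = sc.parallelize(1 to 4, 2).map(_ * 2).map(_ + 1)
println(y.toDebugString)   // shows the full chain of parent RDDs

y.checkpoint()
y.count()                  // an action triggers the checkpoint write
println(y.toDebugString)   // the parent chain is now replaced by a
                           // checkpoint RDD that reads from HDFS
```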