Re: checkpointing without streaming?

2017-05-18 Thread Tathagata Das
You can use *SparkContext.checkpointFile()*. However, note that the checkpoint
file contains Java-serialized data. So if your data types change between
writing and reading the checkpoint file for whatever reason (Spark version
change, your code was recompiled, etc.), you may not be able to read from the
checkpoint. So use it carefully :)
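A minimal sketch of the two runs. The checkpoint directory and the rdd-<id> path here are illustrative, and note that checkpointFile() is not publicly accessible in every Spark version, so check yours:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run 1: write the checkpoint.
val sc = new SparkContext(new SparkConf().setAppName("writer").setMaster("local[2]"))
sc.setCheckpointDir("hdfs:///tmp/my-checkpoints")  // must be set before checkpoint()
val doubled = sc.parallelize(List(1, 2, 3, 4), 2).map(_ * 2)
doubled.checkpoint()
doubled.count()  // checkpointing happens when the RDD is first materialized
// Data is written under hdfs:///tmp/my-checkpoints/<app-uuid>/rdd-<id>

// Run 2, in a separate application: point checkpointFile at that rdd-<id>
// directory. You must locate the exact path yourself (e.g. by listing the
// checkpoint directory), and the element type must match what was written.
val restored = sc.checkpointFile[Int]("hdfs:///tmp/my-checkpoints/<app-uuid>/rdd-2")
restored.collect()
```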






Re: checkpointing without streaming?

2017-05-18 Thread Neelesh Sambhajiche
That is exactly what we are currently doing - storing it in a CSV file.
However, since checkpointing permanently writes to disk, using checkpointing
along with saving the RDD to a text file stores the data twice on disk. That
is why I was looking for a way to read the checkpointed data from a different
program.



-- 


*Regards,*
*Neelesh Sambhajiche*
*Mobile: 8058437181*
*Birla Institute of Technology & Science,* Pilani
Pilani Campus, Rajasthan 333 031, INDIA


Re: checkpointing without streaming?

2017-05-17 Thread Tathagata Das
Why not just save the RDD to a proper file? Text file, sequence file, many
options. Then it's standard to read it back in a different program.
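A sketch of that approach, with illustrative HDFS paths - save in one application, read back in the next:

```scala
// Assuming an existing SparkContext `sc`; paths are illustrative.

// In the writing application:
val y = sc.parallelize(List(1, 2, 3, 4), 2).map(_ * 2)
y.saveAsTextFile("hdfs:///data/doubled-text")    // human-readable, needs re-parsing
y.saveAsObjectFile("hdfs:///data/doubled-obj")   // Java serialization, keeps the type

// In a later, separate application:
val fromText = sc.textFile("hdfs:///data/doubled-text").map(_.toInt)
val fromObj  = sc.objectFile[Int]("hdfs:///data/doubled-obj")
```

Note that saveAsObjectFile has the same caveat as checkpoint files: it uses Java serialization, so it is sensitive to class and Spark version changes, while text files are version-proof at the cost of re-parsing.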

>


Re: checkpointing without streaming?

2017-05-17 Thread neelesh.sa
Is it possible to checkpoint an RDD in one run of my application and use the
saved RDD in the next run of my application?

For example, with the following code:
val x = List(1, 2, 3, 4)
val y = sc.parallelize(x, 2).map(c => c * 2)
y.checkpoint()
y.count()

Is it possible to read the checkpointed RDD in another application?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/checkpointing-without-streaming-tp4541p28691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



checkpointing without streaming?

2014-04-21 Thread Diana Carroll
I'm trying to understand when I would want to checkpoint an RDD rather than
just persist to disk.

Every reference I can find to checkpoint relates to Spark Streaming.  But
the method is defined in the core Spark library, not in Streaming.

Does it exist solely for streaming, or are there circumstances unrelated to
streaming in which I might want to checkpoint... and if so, like what?

Thanks,
Diana


Re: checkpointing without streaming?

2014-04-21 Thread Xiangrui Meng
Checkpoint clears dependencies. You might need checkpoint to cut a
long lineage in iterative algorithms. -Xiangrui
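A sketch of that pattern, with a placeholder update step, checkpointing every few iterations so the lineage stays short:

```scala
// Assuming an existing SparkContext `sc`; the directory path is illustrative.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

var ranks = sc.parallelize(1 to 1000).map(i => (i, 1.0))
for (iter <- 1 to 100) {
  ranks = ranks.mapValues(_ * 0.85 + 0.15)  // placeholder for a real update step
  if (iter % 10 == 0) {
    ranks.checkpoint()  // cut the lineage every 10 iterations
    ranks.count()       // force materialization so the checkpoint is written
  }
}
```

Without the periodic checkpoint, each iteration adds another layer to the lineage, and after enough iterations recovery (or even serializing tasks) can hit stack-overflow errors.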



Re: checkpointing without streaming?

2014-04-21 Thread Diana Carroll
When might that be necessary or useful?  Presumably I can persist and
replicate my RDD to avoid re-computation, if that's my goal.  What advantage
does checkpointing provide over disk persistence with replication?





Re: checkpointing without streaming?

2014-04-21 Thread Tathagata Das
Diana, that is a good question.

When you persist an RDD, the system still remembers the whole lineage of
parent RDDs that created that RDD. If one of the executors fails and the
persisted data is lost (both local-disk and in-memory data will be lost), then
the lineage is used to recreate the RDD. The longer the lineage, the more
recomputation the system has to do in case of failure, and hence the higher
the recovery time. So it's not a good idea to have a very long lineage, as it
leads to all sorts of problems, like the one Xiangrui pointed to.

Checkpointing an RDD actually saves the RDD data to HDFS and removes the
pointers to the parent RDDs (as the data can be regenerated just by reading
from the HDFS file). So that RDD's data does not need to be recomputed when a
worker fails, just re-read. In fact, the data is also retained across driver
restarts, as it is in HDFS.

RDD.checkpoint() was introduced with streaming because streaming is an obvious
use case where the lineage will grow infinitely long (for stateful
computations where each result depends on all the previously received
data). However, checkpointing is useful for any long-running RDD
computation, and I know that people have used RDD.checkpoint() independent
of streaming.
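You can see the lineage truncation with toDebugString (a local checkpoint directory here, just for illustration):

```scala
// Assuming an existing SparkContext `sc`; a local path for illustration only.
sc.setCheckpointDir("/tmp/checkpoints")

val rdd = sc.parallelize(1 to 10).map(_ + 1).filter(_ % 2 == 0)
println(rdd.toDebugString)  // shows the full parent lineage before checkpointing

rdd.checkpoint()
rdd.count()                 // materializes the RDD and writes the checkpoint

println(rdd.toDebugString)  // lineage now bottoms out at a CheckpointRDD,
                            // not at the original parallelize
```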

TD


On Mon, Apr 21, 2014 at 1:10 PM, Xiangrui Meng men...@gmail.com wrote:

 Persist doesn't cut lineage. You might run into StackOverflow problem
 with a long lineage. See
 https://spark-project.atlassian.net/browse/SPARK-1006 for example.
