Re: How can i remove the need for calling cache

2017-08-02 Thread jeff saremi
Thanks Vadim. Yes, this is a good option for us. Thanks.

Re: How can i remove the need for calling cache

2017-08-02 Thread Vadim Semenov
So if you just save an RDD to HDFS via `saveAsSequenceFile`, you would have to create a new RDD that reads that data; this way you'll avoid recomputing the RDD, but may lose time on saving/loading. Exactly the same thing happens with `checkpoint`: `checkpoint` is just a convenient method that gives you the same save-and-reload behavior without managing the files yourself.
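A minimal sketch of this manual save-and-reload pattern (the `saveAsObjectFile`/`objectFile` pair, the path, and the sample data are illustrative assumptions; the `saveAsSequenceFile` variant mentioned above additionally requires a key-value RDD of Writable-compatible types):
```
val sc: SparkContext // declaration-style sketch, as in the thread's own snippets

val myrdd = sc.parallelize(1 to 1000000) // stand-in for an expensive RDD

// Action: runs the DAG once and writes the result to HDFS
myrdd.saveAsObjectFile("hdfs:///tmp/myrdd-saved")

// The re-read RDD's lineage is just "read these files", so using it in
// several operations never re-runs the original computation
val reloaded = sc.objectFile[Int]("hdfs:///tmp/myrdd-saved")
reloaded.map(_ * 2).count()
reloaded.filter(_ % 2 == 0).count()
```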

Re: How can i remove the need for calling cache

2017-08-02 Thread Suzen, Mehmet
On 3 August 2017 at 03:00, Vadim Semenov wrote: > `saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it just saves data to some destination. Yes, that's what I thought, so the statement "…otherwise saving it on a file will require recomputation" …

Re: How can i remove the need for calling cache

2017-08-02 Thread Vadim Semenov
`saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it just saves data to some destination. `cache`/`persist` allow you to cache data and keep the DAG, so if an executor that holds the data goes down, Spark is still able to recalculate the missing partitions from the lineage.
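A short sketch of that behavior (assuming a `myrdd` like the one in the checkpoint snippet later in the thread; the storage level is an illustrative choice):
```
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val myrdd: RDD[Int] // declaration-style sketch, as in the thread's own snippets

myrdd.persist(StorageLevel.MEMORY_AND_DISK) // cache the blocks, keep the lineage
myrdd.count() // first action computes the RDD and populates the cache
myrdd.count() // served from the cache; if an executor died in between, only
              // its lost partitions are recomputed from the retained DAG
```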

Re: How can i remove the need for calling cache

2017-08-02 Thread Suzen, Mehmet
On 3 August 2017 at 01:05, jeff saremi wrote: > Vadim: This is from the Mastering Spark book: "It is strongly recommended that a checkpointed RDD is persisted in memory, otherwise saving it on a file will require recomputation." Is this really true? I had the…

Re: How can i remove the need for calling cache

2017-08-02 Thread jeff saremi
That's what I was hoping for.

Re: How can i remove the need for calling cache

2017-08-01 Thread jeff saremi
Thanks Mark. I'll examine the status more carefully to observe this.

Re: How can i remove the need for calling cache

2017-08-01 Thread jeff saremi
Thanks Vadim. I'll try that.

Re: How can i remove the need for calling cache

2017-08-01 Thread Vadim Semenov
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
myrdd.checkpoint()

val result1 = myrdd.map(op1(_))
result1.count() // Will save `myrdd` to HDFS and do map(op1…

val result2 = myrdd.map(op2(_))
result2.count() // Will load `myrdd` from HDFS instead of recomputing it, and do map(op2…
```
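A follow-up sketch connecting this to the book quote discussed above (assuming the same `sc` and `myrdd`): Spark writes the checkpoint in a separate job that runs after the first action, so persisting the RDD first lets that job read the cached blocks instead of re-running the whole lineage.
```
import org.apache.spark.storage.StorageLevel

myrdd.persist(StorageLevel.MEMORY_AND_DISK) // keep the computed partitions around
myrdd.checkpoint()                          // mark for checkpointing

myrdd.count() // runs the DAG once; the checkpoint job then writes the
              // already-cached blocks instead of recomputing the RDD
```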

Re: How can i remove the need for calling cache

2017-08-01 Thread Mark Hamstra
Very likely, much of the potential duplication is already being avoided even without calling cache/persist. When running the above code without `myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at least one of them you will likely see that many Stages are marked as "skipped", meaning their output already exists (e.g. as shuffle files from an earlier Job) and is reused rather than recomputed.
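A small sketch of what produces those skipped Stages (the RDD and operations are illustrative assumptions): the second action reuses the shuffle files written by the first, so the map-side stage doesn't run again.
```
import org.apache.spark.rdd.RDD

val myrdd: RDD[Int] // declaration-style sketch, as in the thread's own snippets

val pairs  = myrdd.map(x => (x % 100, 1))
val counts = pairs.reduceByKey(_ + _) // introduces a shuffle boundary

counts.count()   // Job 1: both stages run
counts.collect() // Job 2: the map-side stage shows as "skipped" in the web UI,
                 // because its shuffle output from Job 1 is reused
```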

Re: How can i remove the need for calling cache

2017-08-01 Thread lucas.g...@gmail.com
Hi Jeff, that looks sane to me. Do you have additional details?

How can i remove the need for calling cache

2017-08-01 Thread jeff saremi
Calling cache/persist fails all our jobs (I have posted 2 threads on this), and we're giving up hope of finding a solution. So I'd like to find a workaround: if I save an RDD to HDFS and read it back, can I use it in more than one operation? Example (using cache), reconstructed below:
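A hypothetical reconstruction of the truncated example (`expensiveTransform`, `op1`, `op2`, and the input path are illustrative assumptions, following the declaration-style sketch of the checkpoint snippet above):
```
val sc: SparkContext

// do a whole bunch of expensive work
val myrdd = sc.textFile("hdfs:///data/input").map(expensiveTransform)
myrdd.cache()

val result1 = myrdd.map(op1(_)); result1.count()
val result2 = myrdd.map(op2(_)); result2.count()
```
The workaround being asked about would replace `myrdd.cache()` with a save to HDFS followed by a re-read, as in the save-and-reload sketch earlier in the thread.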