You can keep a long-running Spark context alive in several ways, which
ensures your data stays cached in memory. Clients can then access the RDDs
through a REST API that you expose. See the Spark Job Server, which does
something similar via a feature called Named RDDs:

Using Named RDDs

Named RDDs are a way to easily share RDDs among jobs. Using this facility,
computed RDDs can be cached with a given name and later retrieved. To
use this feature, the SparkJob needs to mix in NamedRddSupport:
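
Roughly like this (a sketch of the Job Server API from memory, so do check
the current README; the job name, RDD name, and HDFS path are placeholders):

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

    object UserSessionsJob extends SparkJob with NamedRddSupport {

      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        SparkJobValid

      override def runJob(sc: SparkContext, config: Config): Any = {
        // Compute and cache the RDD under a name the first time; any later
        // job submitted to the same long-running context gets the cached copy.
        val sessions = this.namedRdds.getOrElseCreate("user-sessions", {
          sc.textFile("hdfs:///data/sessions").cache()
        })
        sessions.count()
      }
    }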

Alternatively, if you use the Spark Thrift Server, any cached
DataFrames/RDDs will be available to all of its clients until the server
is shut down.
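
For example, any JDBC client can populate or hit the shared cache. An
untested sketch: it assumes the Hive JDBC driver on the classpath, a Thrift
Server on the default port, and a placeholder table called events:

    import java.sql.DriverManager

    // All Thrift Server clients share one SparkContext, so a table cached
    // by this session is served from memory to every other session too.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "hive", "")
    val stmt = conn.createStatement()
    stmt.execute("CACHE TABLE events")
    val rs = stmt.executeQuery("SELECT count(*) FROM events")
    while (rs.next()) println(rs.getLong(1))
    conn.close()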

If you want to support key-value lookups, you might want to use IndexedRDD
<https://github.com/amplab/spark-indexedrdd>.
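
It gives you point lookups and functional updates over a cached, indexed
RDD, roughly like this (a sketch adapted from its README, assuming a
spark-shell sc):

    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

    // Index a pair RDD keyed by Long and cache it in memory.
    val pairs = sc.parallelize((1L to 1000000L).map(x => (x, x * 2)))
    val indexed = IndexedRDD(pairs).cache()

    indexed.get(42L)                    // fast point lookup without a full scan
    val updated = indexed.put(42L, 0L)  // returns a new, updated IndexedRDD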

Finally, while it is not the same as sharing RDDs, Tachyon can cache the
underlying HDFS blocks, so repeated reads of the same input avoid disk I/O.
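
One common pattern is to write an RDD through Tachyon so a completely
separate application can re-read it at memory speed. A sketch only: the
host/port assume a default Tachyon deployment, the path is a placeholder,
and sc is a spark-shell context:

    // In the producing application:
    val rdd = sc.parallelize(Seq(("a", 1L), ("b", 2L)))  // stands in for real data
    rdd.saveAsObjectFile("tachyon://tachyon-master:19998/shared/sessions")

    // In a later, separate application:
    val restored = sc.objectFile[(String, Long)]("tachyon://tachyon-master:19998/shared/sessions")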

Deenar

*Think Reactive Ltd*
deenar.toras...@thinkreactive.co.uk
07714140812



On 6 November 2015 at 05:56, r7raul1...@163.com <r7raul1...@163.com> wrote:

> You can try
> http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Archival_Storage_SSD__Memory
> Hive temp tables use this feature to speed up jobs:
> https://issues.apache.org/jira/browse/HIVE-7313
>
> ------------------------------
> r7raul1...@163.com
>
>
> *From:* Christian <engr...@gmail.com>
> *Date:* 2015-11-06 13:50
> *To:* Deepak Sharma <deepakmc...@gmail.com>
> *CC:* user <user@spark.apache.org>
> *Subject:* Re: Spark RDD cache persistence
> I've never had this need and I've never done it. There are options that
> allow this. For example, I know there are web apps out there that work like
> the Spark REPL. One of these, I think, is called Zeppelin. I've never used
> them, but I've seen them demoed. There is also Tachyon, which Spark
> supports. Hopefully, that gives you a place to start.
> On Thu, Nov 5, 2015 at 9:21 PM Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Thanks Christian.
>> So is there any inbuilt mechanism in Spark, or API integration with other
>> in-memory cache products such as Redis, to load the RDD into these systems
>> upon program exit?
>> What's the best approach to a long-lived RDD cache?
>> Thanks
>>
>>
>> Deepak
>> On 6 Nov 2015 8:34 am, "Christian" <engr...@gmail.com> wrote:
>>
>>> The cache gets cleared out when the job finishes. I am not aware of a
>>> way to keep the cache around between jobs. You could save it as an object
>>> file to disk and load that object file in your next job for speed.
>>> On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma <deepakmc...@gmail.com>
>>> wrote:
>>>
>>>> Hi All
>>>> I am confused about RDD persistence in the cache.
>>>> If I cache an RDD, will it stay in memory even after the Spark program
>>>> that created it completes execution?
>>>> If not, how can I guarantee that the RDD persists in the cache even
>>>> after the program finishes execution?
>>>>
>>>> Thanks
>>>>
>>>>
>>>> Deepak
>>>>
>>>
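
A minimal sketch of the object-file approach Christian describes above (the
paths and element type are placeholders, and sc is a spark-shell context):

    import org.apache.spark.rdd.RDD

    // Job 1: persist the computed RDD before the application exits.
    val pairs: RDD[(String, Long)] =
      sc.parallelize(Seq(("a", 1L), ("b", 2L)))  // stands in for an expensive computation
    pairs.saveAsObjectFile("hdfs:///cache/pairs")

    // Job 2 (a separate application): reload instead of recomputing.
    val reloaded = sc.objectFile[(String, Long)]("hdfs:///cache/pairs")
    println(reloaded.count())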
