Re: Re: Spark RDD cache persistence

2015-12-09 Thread Calvin Jia
Hi Deepak,

For persistence across Spark jobs, you can store and access the RDDs in
Tachyon. Tachyon is backed by ramdisk, so it gives you in-memory
performance similar to what you get within a single Spark job.

For more information, you can take a look at the docs on Tachyon-Spark
integration:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
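
As a minimal sketch, sharing an RDD across two applications through Tachyon
could look like the following. It assumes a Tachyon master at
tachyon-master:19998 (a placeholder host) and the Tachyon client jar on
Spark's classpath so that tachyon:// URIs resolve:

import org.apache.spark.{SparkConf, SparkContext}

object TachyonHandoff {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tachyon-handoff"))

    // Job 1: write the RDD into Tachyon's ramdisk-backed storage.
    val numbers = sc.parallelize(1 to 1000000).map(_.toString)
    numbers.saveAsTextFile("tachyon://tachyon-master:19998/shared/numbers")

    // Job 2 (a later, separate application) reads it back at memory speed.
    val restored = sc.textFile("tachyon://tachyon-master:19998/shared/numbers")
    println(restored.count())

    sc.stop()
  }
}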

Hope this helps,
Calvin

On Thu, Nov 5, 2015 at 10:29 PM, Deenar Toraskar wrote:

> You can have a long-running Spark context in several ways. This will
> ensure your data stays cached in memory, and clients can access the RDDs
> through a REST API that you expose. See the Spark Job Server, which does
> something similar via a feature called Named RDDs.
>
> Using Named RDDs
>
> Named RDDs are a way to easily share RDDs among jobs. Using this facility,
> computed RDDs can be cached with a given name and later retrieved. To use
> this feature, the SparkJob needs to mix in NamedRddSupport, as in the
> sketch below:
>
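> As a rough illustration only (the package and method names follow the
> spark-jobserver documentation of this period; treat the exact API as an
> assumption to check against your version), two jobs sharing an RDD by
> name might look like:
>
> import com.typesafe.config.Config
> import org.apache.spark.SparkContext
> import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}
>
> // First job: compute an RDD and cache it under a name for later jobs
> // running in the same long-lived context.
> object CacheRddJob extends SparkJob with NamedRddSupport {
>   override def validate(sc: SparkContext, config: Config): SparkJobValidation =
>     SparkJobValid
>
>   override def runJob(sc: SparkContext, config: Config): Any = {
>     val rdd = sc.parallelize(1 to 1000).map(_ * 2)
>     namedRdds.update("doubled", rdd) // cache and register the RDD by name
>     rdd.count()
>   }
> }
>
> // Second job: retrieve the named RDD without recomputing it.
> object ReadRddJob extends SparkJob with NamedRddSupport {
>   override def validate(sc: SparkContext, config: Config): SparkJobValidation =
>     SparkJobValid
>
>   override def runJob(sc: SparkContext, config: Config): Any = {
>     val rdd = namedRdds.get[Int]("doubled")
>       .getOrElse(sys.error("RDD 'doubled' is not cached"))
>     rdd.sum()
>   }
> }
>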
> Alternatively, if you use the Spark Thrift Server, any cached
> DataFrames/RDDs will be available to all clients of Spark via the Thrift
> Server until it is shut down.
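>
> For example, a JDBC client can pin a result set in the Thrift Server's
> memory with CACHE TABLE, after which every other client sees it. In this
> sketch, the server address and the events table are placeholders:
>
> import java.sql.DriverManager
>
> object ThriftServerCacheExample {
>   def main(args: Array[String]): Unit = {
>     // The Spark Thrift Server speaks the HiveServer2 protocol.
>     Class.forName("org.apache.hive.jdbc.HiveDriver")
>     val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "user", "")
>     val stmt = conn.createStatement()
>     // CACHE TABLE materializes the query and keeps it in the server's
>     // memory until the server shuts down or the table is uncached.
>     stmt.execute(
>       "CACHE TABLE cached_events AS SELECT * FROM events WHERE dt = '2015-11-05'")
>     val rs = stmt.executeQuery("SELECT COUNT(*) FROM cached_events")
>     while (rs.next()) println(rs.getLong(1))
>     conn.close()
>   }
> }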
>
> If you want to support key-value lookups, you might want to use IndexedRDD
> 
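> A rough sketch with the AMPLab spark-indexedrdd package (the import paths
> and method signatures are from its early 0.x releases and may differ in
> yours):
>
> import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
> import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._
> import org.apache.spark.{SparkConf, SparkContext}
>
> object IndexedRddLookup {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("indexedrdd-lookup"))
>     val pairs = sc.parallelize((1L to 1000L).map(x => (x, x * x)))
>
>     // Build the index once and cache it; lookups then avoid full scans.
>     val indexed = IndexedRDD(pairs).cache()
>     println(indexed.get(42L)) // Some(1764)
>
>     // put returns a new IndexedRDD sharing most structure with the old one.
>     val updated = indexed.put(42L, -1L)
>     println(updated.get(42L)) // Some(-1)
>
>     sc.stop()
>   }
> }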
>
> Finally, while not the same as sharing RDDs, Tachyon can cache the
> underlying HDFS blocks.
>
> Deenar
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
> On 6 November 2015 at 05:56, r7raul1...@163.com wrote:
>
>> You can try HDFS archival storage (SSD and memory):
>> http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Archival_Storage_SSD__Memory
>> Hive temp tables use this feature to speed up jobs:
>> https://issues.apache.org/jira/browse/HIVE-7313
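>>
>> As an illustration, pinning a scratch directory onto memory-backed
>> storage might look like this sketch (the path is a placeholder, and it
>> assumes fs.defaultFS points at HDFS; LAZY_PERSIST writes replicas to RAM
>> disk first and flushes them lazily to disk, per the archival storage docs):
>>
>> import org.apache.hadoop.conf.Configuration
>> import org.apache.hadoop.fs.{FileSystem, Path}
>> import org.apache.hadoop.hdfs.DistributedFileSystem
>>
>> object MemoryStoragePolicy {
>>   def main(args: Array[String]): Unit = {
>>     // Programmatic equivalent of the setStoragePolicy admin command.
>>     val fs = FileSystem.get(new Configuration()).asInstanceOf[DistributedFileSystem]
>>     fs.setStoragePolicy(new Path("/tmp/hive_scratch"), "LAZY_PERSIST")
>>   }
>> }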
>>
>> --
>> r7raul1...@163.com
>>
>>
>> *From:* Christian 
>> *Date:* 2015-11-06 13:50
>> *To:* Deepak Sharma 
>> *CC:* user 
>> *Subject:* Re: Spark RDD cache persistence
>> I've never had this need and I've never done it, but there are options
>> that allow it. For example, there are web apps out there that work like
>> the Spark REPL; one of them is called Zeppelin. I've never used them, but
>> I've seen them demoed. There is also Tachyon, which Spark supports.
>> Hopefully that gives you a place to start.
>> On Thu, Nov 5, 2015 at 9:21 PM Deepak Sharma wrote:
>>
>>> Thanks Christian.
>>> So is there any built-in mechanism in Spark, or API integration with
>>> other in-memory cache products such as Redis, to load the RDD into those
>>> systems upon program exit?
>>> What's the best approach to a long-lived RDD cache?
>>> Thanks
>>>
>>>
>>> Deepak
>>> On 6 Nov 2015 8:34 am, "Christian" wrote:
>>>
The cache gets cleared out when the job finishes; I am not aware of a way
to keep it around between jobs. You could save the RDD as an object file to
disk at the end of one job and load that object file at the start of the
next, for speed.
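
For illustration, that hand-off might look like this sketch (the HDFS path
is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

object ObjectFileHandoff {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("object-file-handoff"))

    // End of job 1: spill the expensive RDD to durable storage.
    val expensive = sc.parallelize(1 to 100).map(i => (i, i.toString))
    expensive.saveAsObjectFile("hdfs:///tmp/expensive_rdd")

    // Start of job 2: reload it instead of recomputing (the element type
    // must be supplied explicitly).
    val reloaded = sc.objectFile[(Int, String)]("hdfs:///tmp/expensive_rdd")
    println(reloaded.count())

    sc.stop()
  }
}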
On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma wrote:

> Hi All
> I am confused about RDD persistence in the cache.
> If I cache an RDD, will it stay in memory even after the Spark program
> that created it completes execution?
> If not, how can I guarantee that the RDD is persisted in the cache even
> after the program finishes execution?
>
> Thanks
>
>
> Deepak
>

>

