Re: Spark RDD cache persistence
The cache gets cleared out when the job finishes. I am not aware of a way to keep the cache around between jobs. You could save it as an object file to disk and load it as an object file on your next job for speed.

On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma wrote:
> Hi All,
> I am confused about RDD persistence in cache.
> If I cache an RDD, is it going to stay in memory even after the Spark
> program that created it completes execution?
> If not, how can I guarantee that the RDD stays cached even after the
> program finishes execution?
>
> Thanks,
> Deepak
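A minimal sketch of that save-and-reload approach, assuming an existing SparkContext `sc`; the HDFS path is illustrative:

```scala
// End of job 1: serialize the cached data to disk instead of recomputing later.
val squares = sc.parallelize(1 to 1000000).map(i => (i, i.toLong * i))
squares.saveAsObjectFile("hdfs:///tmp/squares.obj")

// Start of job 2: reload the serialized RDD and cache it for this job's lifetime.
val reloaded = sc.objectFile[(Int, Long)]("hdfs:///tmp/squares.obj")
reloaded.cache()
```

This trades recomputation for deserialization cost; the data survives between applications, but the in-memory cache itself still dies with each SparkContext.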
Re: Spark RDD cache persistence
Thanks Christian.
So is there any built-in mechanism in Spark, or an API integration with other in-memory cache products such as Redis, to load the RDD into those systems on program exit? What's the best approach to a long-lived RDD cache?

Thanks,
Deepak

On 6 Nov 2015 8:34 am, "Christian" wrote:
> The cache gets cleared out when the job finishes. I am not aware of a way
> to keep the cache around between jobs. You could save it as an object file
> to disk and load it as an object file on your next job for speed.
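There is no built-in Redis integration in Spark itself; a common do-it-yourself pattern is to push the RDD's contents out with `foreachPartition` using a client library such as Jedis before the application exits. A hypothetical sketch (host, port, and key scheme are assumptions):

```scala
import redis.clients.jedis.Jedis

// Write each partition's records into Redis. Opening one connection per
// partition (inside the closure) avoids trying to serialize the client.
rdd.foreachPartition { partition =>
  val jedis = new Jedis("localhost", 6379) // example Redis endpoint
  partition.foreach { case (k, v) =>
    jedis.set(k.toString, v.toString)
  }
  jedis.close()
}
```

Note this hands Redis the raw key-value data, not the RDD abstraction; the next job reads Redis directly rather than recovering an RDD.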
Re: Spark RDD cache persistence
I've never had this need and I've never done it, but there are options that allow it. For example, there are web apps out there that work like the Spark REPL; I think one of them is called Zeppelin. I've never used them, but I've seen them demoed. There is also Tachyon, which Spark supports. Hopefully that gives you a place to start.

On Thu, Nov 5, 2015 at 9:21 PM Deepak Sharma wrote:
> Thanks Christian.
> So is there any built-in mechanism in Spark, or an API integration with
> other in-memory cache products such as Redis, to load the RDD into those
> systems on program exit? What's the best approach to a long-lived RDD
> cache?
>
> Thanks,
> Deepak
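On the Tachyon option: in Spark 1.x, persisting with `StorageLevel.OFF_HEAP` stores blocks in Tachyon rather than on the executor heap, provided the external block store is configured. A sketch, with the configuration property name and path as assumptions to check against your Spark version:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes Tachyon is running and Spark points at it
// (e.g. spark.externalBlockStore.url=tachyon://master:19998 in Spark 1.5+).
val lines = sc.textFile("hdfs:///data/input.txt")
lines.persist(StorageLevel.OFF_HEAP)
```

OFF_HEAP blocks survive executor crashes because they live in Tachyon's memory, but they are still dropped when the application's SparkContext shuts down, so this alone does not give cross-job caching.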
Re: Re: Spark RDD cache persistence
You can try HDFS archival storage (SSD & memory):
http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Archival_Storage_SSD__Memory
Hive uses this feature for temporary tables to speed up jobs:
https://issues.apache.org/jira/browse/HIVE-7313

r7raul1...@163.com

From: Christian
Date: 2015-11-06 13:50
To: Deepak Sharma
CC: user
Subject: Re: Spark RDD cache persistence
> I've never had this need and I've never done it, but there are options
> that allow it. For example, there are web apps out there that work like
> the Spark REPL; I think one of them is called Zeppelin. I've never used
> them, but I've seen them demoed. There is also Tachyon, which Spark
> supports. Hopefully that gives you a place to start.
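For reference, the memory-backed policy from that page is applied per-path with the `hdfs storagepolicies` command; a sketch, with the path chosen as an example:

```shell
# Requires Hadoop 2.6+ with a RAM_DISK volume in dfs.datanode.data.dir.
# LAZY_PERSIST writes replicas to memory first and flushes to disk lazily.
hdfs storagepolicies -setStoragePolicy -path /tmp/hive_scratch -policy LAZY_PERSIST

# Confirm the policy on the path.
hdfs storagepolicies -getStoragePolicy -path /tmp/hive_scratch
```

Files written under such a path (e.g. Spark output or Hive scratch data) then get near-memory read speed on the next job without Spark-level caching.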
Re: Re: Spark RDD cache persistence
You can have a long-running Spark context in several fashions. This will ensure your data stays cached in memory, and clients can access the RDDs through a REST API that you expose. See the Spark Job Server, which does something similar with a feature called Named RDDs:

Named RDDs are a way to easily share RDDs among jobs. Using this facility, computed RDDs can be cached with a given name and later retrieved. To use this feature, the SparkJob needs to mix in NamedRddSupport.

Alternatively, if you use the Spark Thrift Server, any cached DataFrames/RDDs will be available to all clients of Spark via the Thrift Server until it is shut down.

If you want to support key-value lookups, you might want to use IndexedRDD
<https://github.com/amplab/spark-indexedrdd>.

Finally, while not the same as sharing RDDs, Tachyon can cache the underlying HDFS blocks.

Deenar

*Think Reactive Ltd*
deenar.toras...@thinkreactive.co.uk
07714140812

On 6 November 2015 at 05:56, r7raul1...@163.com wrote:
> You can try HDFS archival storage (SSD & memory):
> http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Archival_Storage_SSD__Memory
> Hive uses this feature for temporary tables to speed up jobs:
> https://issues.apache.org/jira/browse/HIVE-7313
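The NamedRddSupport mixin mentioned above might look roughly like this in a Spark Job Server job; the method names follow the spark-jobserver docs, but treat the exact signatures and the input path as assumptions:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object CacheAndShareJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Cache an RDD under a name; it outlives this job for as long as the
    // job server's long-running SparkContext stays up.
    val words = sc.textFile("hdfs:///data/corpus.txt").flatMap(_.split("\\s+"))
    namedRdds.update("corpus-words", words.cache())

    // A later job submitted to the same context can fetch it by name.
    namedRdds.get[String]("corpus-words").map(_.count()).getOrElse(0L)
  }
}
```

The key design point is that the RDD's lifetime is tied to the shared, long-running context, not to any one submitted job.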
Re: Re: Spark RDD cache persistence
Hi Deepak,

For persistence across Spark jobs, you can store and access the RDDs in Tachyon. Tachyon works with a ramdisk, which gives you in-memory performance similar to what you get within a Spark job.

For more information, take a look at the docs on Tachyon-Spark integration:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html

Hope this helps,
Calvin

On Thu, Nov 5, 2015 at 10:29 PM, Deenar Toraskar wrote:
> You can have a long-running Spark context in several fashions. This will
> ensure your data stays cached in memory, and clients can access the RDDs
> through a REST API that you expose. See the Spark Job Server, which does
> something similar with a feature called Named RDDs.
>
> Alternatively, if you use the Spark Thrift Server, any cached
> DataFrames/RDDs will be available to all clients of Spark via the Thrift
> Server until it is shut down.
>
> If you want to support key-value lookups, you might want to use IndexedRDD
> <https://github.com/amplab/spark-indexedrdd>.
>
> Finally, while not the same as sharing RDDs, Tachyon can cache the
> underlying HDFS blocks.
>
> Deenar
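Per the Tachyon-Spark docs linked above, sharing between applications works through `tachyon://` URIs; a sketch, with the master host name and paths as placeholders:

```scala
// Application 1: materialize results into Tachyon's in-memory file system.
val result = sc.textFile("hdfs:///data/input.txt").filter(_.nonEmpty)
result.saveAsTextFile("tachyon://tachyonmaster:19998/shared/result")

// Application 2 (a completely separate Spark job, possibly much later):
// read the same data back at memory speed, no recomputation needed.
val shared = sc.textFile("tachyon://tachyonmaster:19998/shared/result")
```

Unlike `cache()`, the data here lives in Tachyon's process, so it survives the exit of the Spark application that wrote it.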