Re: Use Case of mutable RDD - any ideas around will help.
[Moving to user@] This would typically be accomplished with a union() operation. You can't mutate an RDD in place, but you can create a new RDD with a union(), which is an inexpensive operation.

On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur wrote:
> Hi,
>
> We have a use case where we plan to keep a SparkContext alive in a
> server and run queries against it. The issue is that we have continuously
> flowing data that comes in batches of constant duration (say, 1 hour).
> We want to exploit SchemaRDD and its benefits of columnar caching and
> compression. Is there a way I can append a new (uncached) batch to the
> older (cached) batch without losing the older data from the cache and
> without re-caching the whole dataset?
>
> Thanks and Regards,
>
> Archit Thakur.
> Sr Software Developer,
> Guavus, Inc.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
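To make the suggestion concrete, here is a minimal sketch of the union() approach in Spark 1.1-era Scala. All names (sqlContext, "events", newBatch) are illustrative assumptions, not from the thread:

```scala
// Hedged sketch: append an hourly batch by building a new SchemaRDD with
// unionAll(), then re-registering and caching it. union is cheap because it
// only concatenates partitions; no shuffle is involved.
val existing = sqlContext.table("events")      // previously cached SchemaRDD
val combined = existing.unionAll(newBatch)     // newBatch: the fresh hour's SchemaRDD
combined.registerTempTable("events")           // re-register under the same name
sqlContext.cacheTable("events")                // cache the combined table
```

Note that this creates a new cached table rather than mutating the old one, which is exactly the duplication concern raised in the replies below.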
Re: Use Case of mutable RDD - any ideas around will help.
Hi Patrick,

What if all the data has to be kept in cache at all times? If applying union results in a new RDD, then caching that would keep both the older data and the new data in memory, duplicating the data.

Below is what I understood from your comment:

sqlContext.cacheTable("existingTable") // caches the RDD; SchemaRDD uses columnar compression
existingRDD.union(newRDD).registerAsTable("newTable")
sqlContext.cacheTable("newTable") // duplicated data

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Use Case of mutable RDD - any ideas around will help.
You can always use sqlContext.uncacheTable to uncache the old table.

On Fri, Sep 12, 2014 at 10:33 AM, pankaj.arora wrote:
> What if all the data has to be kept in cache at all times? If applying
> union results in a new RDD, then caching that would keep both the older
> data and the new data in memory, duplicating the data.
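As a sketch of this workaround (table names are hypothetical, and the explicit count() to force materialization is my assumption, not something the thread prescribes):

```scala
// Hedged sketch: cache the union-ed table first, force it to materialize,
// and only then uncache the old table so queries never hit a cold cache.
existingRDD.union(newRDD).registerTempTable("allBatches")
sqlContext.cacheTable("allBatches")
sqlContext.table("allBatches").count()   // force materialization of the new cache
sqlContext.uncacheTable("oldBatches")    // now safe to drop the old copy
```

The drawback, raised in the next message, is that both copies coexist in memory until the uncache step runs.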
Re: Use Case of mutable RDD - any ideas around will help.
I think I should elaborate on the use case a little more.

We have a UI dashboard whose response time is quite fast because all the data is cached. Users query data by time range, and new data is always coming into the system at a predefined frequency, let's say every hour.

As you said, I can uncache tables, but that will drop all the data from memory. I cannot afford to lose my cache even for a short interval: all queries from the UI will be slow until the cache loads again, and UI response time needs to be predictable and fast enough that users do not get irritated.

Also, I cannot keep two copies of the data in memory (until the new RDD materializes), as that would exceed the total memory available in the system.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14111.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Use Case of mutable RDD - any ideas around will help.
A little code snippet:

line1: cacheTable(existingRDDTableName)
line2: // some operations that materialize the existingRDD dataset
line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
line4: cacheTable(new_existingRDDTableName)
line5: // some operation that materializes the new existingRDD

What we expect is that line4, rather than caching both existingRDDTableName and new_existingRDDTableName, should cache only new_existingRDDTableName. But we cannot explicitly uncache existingRDDTableName, because we want the union to use the cached existingRDDTableName; since evaluation is lazy, new_existingRDDTableName could be materialized later, and by then we can't have lost existingRDDTableName from the cache.

What if we keep the same name for the new table?

cacheTable(existingRDDTableName)
existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
cacheTable(existingRDDTableName) // might not be needed again

Would both of our requirements be satisfied: that the union uses existingRDDTableName from the cache, and that the data is not duplicated in the cache but somehow appended to the older cached table?

Thanks and Regards,

Archit Thakur.
Sr Software Developer,
Guavus, Inc.

On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora wrote:
> I cannot afford to lose my cache even for a short interval. Also, I cannot
> keep two copies of the data in memory (until the new RDD materializes),
> as that would exceed the total memory available in the system.
Re: Use Case of mutable RDD - any ideas around will help.
Ah, I see. So basically what you need is something like cache write-through support, which exists in Shark but is not implemented in Spark SQL yet. In Shark, when inserting data into a table that has already been cached, the newly inserted data is automatically cached and “union”-ed with the existing table content. SPARK-1671 <https://issues.apache.org/jira/browse/SPARK-1671> was created to track this feature. We’ll work on it.

Currently, as a workaround, instead of doing the union at the RDD level, you may try caching the new table, unioning it with the old table, and then querying the union-ed table. The drawbacks are higher code complexity and that you end up with lots of temporary tables, but the performance should be reasonable.

On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur wrote:
> What we expect is that line4, rather than caching both
> existingRDDTableName and new_existingRDDTableName, should cache only
> new_existingRDDTableName.
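One way to keep the number of temporary tables bounded under this workaround is to periodically consolidate them. A hedged sketch, with hypothetical table names and a count() to force materialization (my assumption):

```scala
// Periodically fold the accumulated per-batch cached tables into one
// cached table, then drop the parts to reclaim memory.
val day = sqlContext.table("batch_04").unionAll(sqlContext.table("batch_05"))
day.registerTempTable("day_1")
sqlContext.cacheTable("day_1")
sqlContext.table("day_1").count()     // materialize before dropping the parts
sqlContext.uncacheTable("batch_04")
sqlContext.uncacheTable("batch_05")
```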
Re: Use Case of mutable RDD - any ideas around will help.
SPARK-1671 looks really promising.

Note that even right now, you don't need to uncache the existing table. You can do something like this:

newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))

When you use table(), it returns the cached representation, so the union executes much faster. However, there is some unknown slowdown; it's not quite as fast as what you would expect.

On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian wrote:
> Currently, as a workaround, instead of doing the union at the RDD level,
> you may try caching the new table, unioning it with the old table, and
> then querying the union-ed table.
Re: Use Case of mutable RDD - any ideas around will help.
The unknown slowdown might be addressed by
https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd

On Sun, Sep 14, 2014 at 10:40 PM, Evan Chan wrote:
> When you use table(), it returns the cached representation, so the union
> executes much faster. However, there is some unknown slowdown; it's not
> quite as fast as what you would expect.
Re: Use Case of mutable RDD - any ideas around will help.
Sweet, that's probably it. Too bad it didn't seem to make 1.1?

On Wed, Sep 17, 2014 at 5:32 PM, Michael Armbrust wrote:
> The unknown slowdown might be addressed by
> https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd