SPARK-1671 looks really promising.

Note that, even right now, you don't need to un-cache the existing
table. You can do something like this:

// Register the newly arrived data as a separate temp table and cache it.
newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
// table() resolves to the cached representation, so the union reads from cache.
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))

When you use "table", it returns the cached representation, so the
union executes much faster.
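
To actually serve queries from the result, you could then register the
unioned SchemaRDD under a new name (the name here is just illustrative):

unionedRdd.registerTempTable("unioned")
sqlContext.sql("SELECT count(*) FROM unioned").collect()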

However, there is some unexplained slowdown; it's not quite as fast as
you would expect.

On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
> Ah, I see. So basically what you need is something like cache write-through
> support, which exists in Shark but is not implemented in Spark SQL yet. In
> Shark, when inserting data into a table that has already been cached, the
> newly inserted data is automatically cached and “union”-ed with the
> existing table content. SPARK-1671 was created to track this feature. We’ll
> work on that.
>
> Currently, as a workaround, instead of doing the union at the RDD level, you
> may try caching the new table, unioning it with the old table, and then
> querying the unioned table. The drawbacks are higher code complexity and
> ending up with lots of temporary tables, but the performance should be
> reasonable.
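>
> A rough sketch of one round of this workaround (all the names here are just
> illustrative, and newBatch is assumed to be a SchemaRDD):
>
> newBatch.registerTempTable("batch")
> sqlContext.cacheTable("batch")
> val unioned = sqlContext.table("current").unionAll(sqlContext.table("batch"))
> unioned.registerTempTable("current_v2")
> sqlContext.cacheTable("current_v2")
> sqlContext.sql("SELECT count(*) FROM current_v2").collect() // materialize the new cache
> sqlContext.uncacheTable("current") // old caches no longer needed once current_v2 is built
> sqlContext.uncacheTable("batch")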
>
>
> On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur <archit279tha...@gmail.com>
> wrote:
>>
>> Little code snippet:
>>
>> line1: cacheTable(existingRDDTableName)
>> line2: // some operations that materialize the existingRDD dataset
>> line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
>> line4: cacheTable(new_existingRDDTableName)
>> line5: // some operation that materializes new_existingRDDTableName
>>
>> Now, what we expect is that in line4, rather than caching both
>> existingRDDTableName and new_existingRDDTableName, it should cache only
>> new_existingRDDTableName. But we cannot explicitly uncache
>> existingRDDTableName, because we want the union to use the cached
>> existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName
>> may be materialized later, and until then we can't lose
>> existingRDDTableName from the cache.
>>
>> What if we keep the same name for the new table?
>>
>> So:
>>
>> cacheTable(existingRDDTableName)
>> existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
>> cacheTable(existingRDDTableName) // might not be needed again
>>
>> Would both of our requirements then be satisfied: the union uses the cached
>> existingRDDTableName, and the data is not duplicated in the cache but is,
>> in effect, appended to the older cached table?
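>>
>> One way to actually test this (assuming newRDD is a SchemaRDD with the
>> same schema, and that sqlContext.isCached is available in our Spark
>> version; unionAll is used so the schema is preserved):
>>
>> sqlContext.cacheTable("t") // "t" is an illustrative table name
>> sqlContext.table("t").count() // materializes the cache
>> sqlContext.table("t").unionAll(newRDD).registerTempTable("t") // same name
>> println(sqlContext.isCached("t")) // is the name still considered cached?
>> sqlContext.table("t").count() // compare timing with the first count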
>>
>> Thanks and Regards,
>>
>>
>> Archit Thakur.
>> Sr Software Developer,
>> Guavus, Inc.
>>
>> On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora
>> <pankajarora.n...@gmail.com> wrote:
>>>
>>> I think I should elaborate on the use case a little more.
>>>
>>> So we have a UI dashboard whose response time is quite fast, as all the
>>> data is cached. Users query data based on a time range, and there is
>>> always new data coming into the system at a predefined frequency, let's
>>> say every hour.
>>>
>>> As you said, I can uncache tables, but that basically drops all the data
>>> from memory. I cannot afford to lose my cache even for a short interval,
>>> because all queries from the UI would be slow until the cache loads
>>> again. UI response time needs to be predictable, and it should be fast
>>> enough that users don't get irritated.
>>>
>>> Also, I cannot keep two copies of the data in memory (until the new RDD
>>> materializes), as that would exceed the total memory available in the
>>> system.
>>>
>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
