Ah, I see. So basically what you need is something like cache write-through
support, which exists in Shark but is not implemented in Spark SQL yet. In
Shark, when inserting data into a table that has already been cached, the
newly inserted data is automatically cached and “union”-ed with the
existing table content. SPARK-1671
<https://issues.apache.org/jira/browse/SPARK-1671> was created to track
this feature. We’ll work on that.
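
To make the write-through idea concrete, here is a minimal application-level sketch of the behavior described above. This is not Spark code; `CachedTable` and its methods are hypothetical names used purely for illustration of what write-through caching means for a cached table.

```python
# Hypothetical model of cache write-through: inserting into an
# already-cached table appends the new rows directly to the cached
# content, so readers always see old and new data together without
# the cache ever being dropped and rebuilt.
class CachedTable:
    def __init__(self, rows):
        self._cached = list(rows)  # the materialized, "cached" content

    def insert(self, new_rows):
        # Write-through: new rows go straight into the cache.
        self._cached.extend(new_rows)

    def scan(self):
        # Queries read the cached content, including appended rows.
        return list(self._cached)


t = CachedTable([1, 2, 3])
t.insert([4, 5])
print(t.scan())  # [1, 2, 3, 4, 5]
```

With write-through support, the insert itself keeps the cache current, which is exactly what the workaround below has to emulate by hand.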

Currently, as a workaround, instead of doing the union at the RDD level, you
may try caching the new table, unioning it with the old table, and then
querying the unioned table. The drawbacks are higher code complexity and
lots of temporary tables, but the performance should be reasonable.
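
The shape of that workaround can be sketched outside Spark as follows. This is only an application-level model, assuming each hourly batch is cached as its own temporary table and queries scan the union of all batches; `BatchedCache` and its method names are made up for the example.

```python
# Model of the workaround: every new batch becomes its own cached
# "temporary table", and queries run over the union of all cached
# batches. The old cache is never dropped, at the cost of an
# ever-growing list of temporary tables.
class BatchedCache:
    def __init__(self):
        self._batches = []  # one (name, rows) entry per cached temporary table

    def cache_batch(self, name, rows):
        # Cache a new batch alongside the existing ones (no uncache).
        self._batches.append((name, list(rows)))

    def query_union(self, predicate):
        # Scan the union of all cached batches.
        return [r for _, rows in self._batches for r in rows if predicate(r)]


c = BatchedCache()
c.cache_batch("events_0900", [10, 25])
c.cache_batch("events_1000", [40])
print(c.query_union(lambda r: r > 20))  # [25, 40]
```

Because the old batches stay cached the whole time, queries never hit an empty cache; the cleanup of accumulated temporary tables is left to the application.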

On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur <archit279tha...@gmail.com>
wrote:

> Little code snippet:
>
> line1: cacheTable(existingRDDTableName)
> line2: // some operations which will materialize the existingRDD dataset
> line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
> line4: cacheTable(new_existingRDDTableName)
> line5: // some operations which will materialize new_existingRDD
>
> Now, what we expect is that in line4, rather than caching both
> existingRDDTableName and new_existingRDDTableName, it should cache only
> new_existingRDDTableName. But we cannot explicitly uncache
> existingRDDTableName, because we want the union to use the cached
> existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName
> could be materialized later, and by then we can't afford to lose
> existingRDDTableName from the cache.
>
> What if we keep the same name for the new table?
>
> so, cacheTable(existingRDDTableName)
> existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
> cacheTable(existingRDDTableName) //might not be needed again.
>
> Will both of our cases be satisfied, i.e. it uses existingRDDTableName from
> the cache for the union and doesn't duplicate the data in the cache, but
> somehow appends to the older cached table?
>
> Thanks and Regards,
>
>
> Archit Thakur.
> Sr Software Developer,
> Guavus, Inc.
>
> On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora <pankajarora.n...@gmail.com
> > wrote:
>
>> I think I should elaborate the use case a little more.
>>
>> We have a UI dashboard whose response time is quite fast because all the
>> data is cached. Users query data based on a time range, and there is
>> always new data coming into the system at a predefined frequency, let's
>> say every hour.
>>
>> As you said, I can uncache tables, but that will basically drop all the
>> data from memory. I cannot afford to lose my cache even for a short
>> interval, as all queries from the UI will be slow until the cache loads
>> again. UI response time needs to be predictable and fast enough that
>> users do not get irritated.
>>
>> Also, I cannot keep two copies of the data (until the new RDD
>> materializes) in memory, as that would surpass the total memory available
>> in the system.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14112.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
