Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Patrick Wendell
[moving to user@]

This would typically be accomplished with a union() operation. You
can't mutate an RDD in place, but you can create a new RDD with
union(), which is an inexpensive operation.
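A minimal sketch of that pattern (the batch contents, app name, and variable names here are illustrative, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionDemo {
  def main(args: Array[String]): Unit = {
    // A local master is used only for illustration.
    val sc = new SparkContext(
      new SparkConf().setAppName("union-demo").setMaster("local[*]"))
    try {
      val olderBatch = sc.parallelize(Seq(1, 2, 3)) // the already-cached data
      val newBatch   = sc.parallelize(Seq(4, 5))    // the batch that just arrived
      // union() builds a new RDD over both lineages; neither input is mutated,
      // and nothing is copied or recomputed until an action runs.
      val combined = olderBatch.union(newBatch)
      assert(combined.count() == 5)
    } finally {
      sc.stop()
    }
  }
}
```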

On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur wrote:
> Hi,
>
> We have a use case where we are planning to keep a SparkContext alive in a
> server and run queries against it. The issue is that we have continuously
> flowing data that comes in batches of constant duration (say, 1 hour). We
> want to exploit SchemaRDD and its benefits of columnar caching and
> compression. Is there a way to append the new (uncached) batch to the
> older (cached) batch without losing the older data from the cache and
> re-caching the whole dataset?
>
> Thanks and Regards,
>
>
> Archit Thakur.
> Sr Software Developer,
> Guavus, Inc.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread pankaj.arora
Hi Patrick,

What if all the data has to be kept in cache at all times? If applying union
results in a new RDD, then caching that RDD would keep both the older data
and the new data in memory, duplicating the data.

Below is what I understood from your comment:

sqlContext.cacheTable("existingTable") // caches the RDD as a SchemaRDD,
using columnar compression

existingRDD.union(newRDD).registerAsTable("newTable")

sqlContext.cacheTable("newTable") // duplicated data



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Cheng Lian
You can always use sqlContext.uncacheTable to uncache the old table.
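For example (the table name is illustrative; assumes an active `SQLContext` named `sqlContext`):

```scala
// Frees the cached columnar data for the old batch; the table registration
// itself is untouched, so it can still be queried (uncached) or re-cached.
sqlContext.uncacheTable("oldBatchTable")
```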



Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread pankaj arora
I think I should elaborate on the use case a little more.

We have a UI dashboard whose response time is quite fast, because all the
data is cached. Users query data by time range, and new data is always
coming into the system at a predefined frequency, let's say every hour.

As you said, I can uncache tables, but that will basically drop all the
data from memory. I cannot afford to lose my cache even for a short
interval, because all queries from the UI would get slow until the cache
loads again. UI response time needs to be predictable, and should be fast
enough that users do not get irritated.

Also, I cannot keep two copies of the data in memory (until the new RDD
materializes), as that would surpass the total memory available in the
system.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14111.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Archit Thakur
A little code snippet:

line1: cacheTable(existingRDDTableName)
line2: // some operations that will materialize the existingRDD dataset
line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
line4: cacheTable(new_existingRDDTableName)
line5: // some operation that will materialize the new, unioned dataset

Now, what we expect is that line 4, rather than caching both
existingRDDTableName and new_existingRDDTableName, caches only
new_existingRDDTableName. But we cannot explicitly uncache
existingRDDTableName, because we want the union to use the cached
existingRDDTableName; since evaluation is lazy, new_existingRDDTableName
could be materialized later, and until then we can't lose
existingRDDTableName from the cache.

What if we keep the same name for the new table?

so, cacheTable(existingRDDTableName)
existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
cacheTable(existingRDDTableName) // might not be needed again

Would both of our requirements be satisfied, i.e. the union uses
existingRDDTableName from the cache, and the data is not duplicated in the
cache but somehow appended to the older cached table?

Thanks and Regards,


Archit Thakur.
Sr Software Developer,
Guavus, Inc.



Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Cheng Lian
Ah, I see. So basically what you need is something like cache write-through
support, which exists in Shark but is not implemented in Spark SQL yet. In
Shark, when inserting data into a table that has already been cached, the
newly inserted data is automatically cached and unioned with the existing
table content. SPARK-1671
<https://issues.apache.org/jira/browse/SPARK-1671> was created to track
this feature. We'll work on that.

Currently, as a workaround, instead of doing the union at the RDD level, you
can try caching the new table, unioning it with the old table, and then
querying the unioned table. The drawback is higher code complexity, and you
end up with lots of temporary tables, but the performance should be
reasonable.
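A sketch of that workaround, using the Spark 1.x SchemaRDD API discussed in the thread. The table and RDD names are illustrative, and it assumes an active `SQLContext` named `sqlContext` plus a SchemaRDD `newBatchRdd` for the batch that just arrived:

```scala
// Register and cache only the newly arrived batch; the old batch's cache
// is never dropped.
newBatchRdd.registerTempTable("batch_hour_02")
sqlContext.cacheTable("batch_hour_02")

// Queries over the union read both inputs from their cached columnar form.
val all = sqlContext.table("batch_hour_01")
  .unionAll(sqlContext.table("batch_hour_02"))
all.registerTempTable("all_batches")

sqlContext.sql("SELECT COUNT(*) FROM all_batches")
```

The cost is bookkeeping: each hourly batch adds another temporary table, so in practice the union would be rebuilt (or batches periodically compacted) as new data arrives.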



Re: Use Case of mutable RDD - any ideas around will help.

2014-09-14 Thread Evan Chan
SPARK-1671 looks really promising.

Note that even right now, you don't need to un-cache the existing
table.   You can do something like this:

newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))

When you use table(), it returns the cached representation, so the
union executes much faster.

However, there is some unknown slowdown; it's not quite as fast as
you would expect.





Re: Use Case of mutable RDD - any ideas around will help.

2014-09-17 Thread Michael Armbrust
The unknown slowdown might be addressed by
https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd

On Sun, Sep 14, 2014 at 10:40 PM, Evan Chan  wrote:

> SPARK-1671 looks really promising.
>
> Note that even right now, you don't need to un-cache the existing
> table.   You can do something like this:
>
> newAdditionRdd.registerTempTable("table2")
> sqlContext.cacheTable("table2")
> val unionedRdd =
> sqlContext.table("table1").unionAll(sqlContext.table("table2"))
>
> When you use "table", it will return you the cached representation, so
> that the union executes much faster.
>
> However, there is some unknown slowdown, it's not quite as fast as
> what you would expect.
>
> On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian  wrote:
> > Ah, I see. So basically what you need is something like cache write
> through
> > support which exists in Shark but not implemented in Spark SQL yet. In
> > Shark, when inserting data into a table that has already been cached, the
> > newly inserted data will be automatically cached and “union”-ed with the
> > existing table content. SPARK-1671 was created to track this feature.
> We’ll
> > work on that.
> >
> > Currently, as a workaround, instead of doing union at the RDD level, you
> may
> > try cache the new table, union it with the old table and then query the
> > union-ed table. The drawbacks is higher code complexity and you end up
> with
> > lots of temporary tables. But the performance should be reasonable.
> >
> >
> > On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur <
> archit279tha...@gmail.com>
> > wrote:
> >>
> >> LittleCode snippet:
> >>
> >> line1: cacheTable(existingRDDTableName)
> >> line2: //some operations which will materialize existingRDD dataset.
> >> line3:
> existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
> >> line4: cacheTable(new_existingRDDTableName)
> >> line5: //some operation that will materialize new _existingRDD.
> >>
> >> now, what we expect is in line4 rather than caching both
> >> existingRDDTableName and new_existingRDDTableName, it should cache only
> >> new_existingRDDTableName. but we cannot explicitly uncache
> >> existingRDDTableName because we want the union to use the cached
> >> existingRDDTableName. since being lazy new_existingRDDTableName could be
> >> materialized later and by then we cant lose existingRDDTableName from
> cache.
> >>
> >> What if keep the same name of the new table
> >>
> >> so, cacheTable(existingRDDTableName)
> >> existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
> >> cacheTable(existingRDDTableName) //might not be needed again.
> >>
> >> Will our both cases be satisfied, that it uses existingRDDTableName from
> >> cache for union and dont duplicate the data in the cache but somehow,
> append
> >> to the older cacheTable.
> >>
> >> Thanks and Regards,
> >>
> >>
> >> Archit Thakur.
> >> Sr Software Developer,
> >> Guavus, Inc.
> >>
> >> On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora
> >>  wrote:
> >>>
> >>> I think i should elaborate usecase little more.
> >>>
> >>> So we have UI dashboard whose response time is quite fast as all the
> data
> >>> is
> >>> cached. Users query data based on time range and also there is always
> new
> >>> data coming into the system at predefined frequency lets say 1 hour.
> >>>
> >>> As you said i can uncache tables it will basically drop all data from
> >>> memory.
> >>> I cannot afford losing my cache even for short interval. As all queries
> >>> from
> >>> UI will get slow till the time cache loads again. UI response time
> needs
> >>> to
> >>> be predictable and shoudl be fast enough so that user does not get
> >>> irritated.
> >>>
> >>> Also i cannot keep two copies of data(till newrdd materialize) into
> >>> memory
> >>> as it will surpass total available memory in system.
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>>
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14112.html
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Use Case of mutable RDD - any ideas around will help.

2014-09-18 Thread Evan Chan
Sweet, that's probably it.   Too bad it didn't seem to make 1.1?
