Re: Incremental Updates to an RDD

2013-12-10 Thread Christopher Nguyen
Wes, it depends on what you mean by "sliding window" as it relates to an RDD:

   1. Some operation over multiple rows of data within a single, large RDD,
   for which the operations must be applied in temporal order. This may be
   the case where you're computing a running average over historical
   time-based data.
   2. Some operation over multiple rows of data within a single, large RDD,
   for which the operations may be run in parallel, even out of order. This
   may be the case where your RDD represents a two-dimensional geospatial map
   and you're computing something (e.g., population average) over a grid.
   3. Some operation on data streaming in, over a fixed-size window, and
   you would like the representation of that windowed data to be an RDD.

For #1 and #2, there's only one "static" RDD and the task is largely
bookkeeping: tracking which window you're working on when, and dealing with
partition boundaries (*mapPartitions* or *mapPartitionsWithIndex* would be
a useful interface here, as it lets you see multiple rows at a time and
know which partition you're working with at any given time).
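
As a rough illustration of the #1/#2 bookkeeping, here is a minimal sketch
that assumes the rows within each partition are already in temporal order
(the names and sizes are illustrative, not a definitive implementation):

    import org.apache.spark.SparkContext

    // Time-ordered values, split across 8 partitions.
    val sc = new SparkContext("local[4]", "windowed-average")
    val values = sc.parallelize((1 to 1000).map(_.toDouble), 8)
    val windowSize = 5

    // Running average over each window, computed within each partition.
    // Windows that straddle a partition boundary still need extra handling.
    val windowedAverages = values.mapPartitionsWithIndex { (partIdx, iter) =>
      iter.toSeq                      // materialize this partition's rows
        .sliding(windowSize)          // consecutive windows of rows
        .map(w => (partIdx, w.sum / w.size))
    }

    windowedAverages.take(5).foreach(println)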

For #3, that's what Spark Streaming does. It does so by introducing a
higher-level concept, the DStream, which is a sequence of RDDs, each
holding the data from one batch interval. Given that a DStream is a
collection of RDDs, the windowing management task simply involves
maintaining which RDDs are contained in that sequence.
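
For concreteness, a minimal Spark Streaming sketch of #3 (the host, port,
and durations are illustrative, and it assumes a text stream is being fed
to that socket):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Batch interval of 1 second; keep the last 30 seconds of data,
    // re-evaluated every 10 seconds.
    val ssc = new StreamingContext("local[4]", "sliding-window", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val windowed = lines.window(Seconds(30), Seconds(10))
    windowed.count().print()

    ssc.start()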

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen





Re: Incremental Updates to an RDD

2013-12-10 Thread Wes Mitchell
So, does that mean that if I want to do a sliding window, then I have to,
in some fashion,
build a stream from the RDD, push a new value on the head, filter out the
oldest value, and
re-persist as an RDD?
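
For reference, that manual bookkeeping would look roughly like the hedged
sketch below (all names are illustrative; Christopher's reply above points
to DStreams as the cleaner way to get this behavior):

    import org.apache.spark.rdd.RDD

    // Keep (timestampMillis, value) pairs: union in the newest batch,
    // filter out anything older than the window, and persist the result
    // as the "current" window RDD.
    def advanceWindow(current: RDD[(Long, Double)],
                      newBatch: RDD[(Long, Double)],
                      now: Long,
                      windowMillis: Long): RDD[(Long, Double)] = {
      val updated = current.union(newBatch)
                           .filter { case (ts, _) => ts >= now - windowMillis }
      updated.persist()    // re-persist the new window
      current.unpersist()  // release the old one
      updated
    }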






Re: Incremental Updates to an RDD

2013-12-09 Thread Christopher Nguyen
Kyle, many of your design goals are ones we share. Indeed, it's
interesting that you separate "resilient" from RDD; I've suggested there
should be ways to boost performance if you're willing to give up some or
all of the "R" guarantees.

We haven't started looking into this yet due to other priorities. If
someone with similar design goals wants to get started that'd be great.

To be sure, a semi-shortcut to what you want may be found by looking at
Tachyon. It's fairly early days for Tachyon so I don't know what its actual
behavior would be under transactional loads.

Sent while mobile. Pls excuse typos etc.


Re: Incremental Updates to an RDD

2013-12-09 Thread Kyle Ellrott
I'd like to use Spark as an analytical stack; the only difference is that I
would like to find the best way to connect it to a dataset that I'm actively
working on. Perhaps saying 'updates to an RDD' is a bit of a loaded term; I
don't need the 'resilient', just a distributed data set.
Right now, the best way I can think of doing that is to work with the data
in a distributed system like HBase, then, when I want to do my analytics,
use the HadoopInputFormat readers to transfer the data from the HBase
system into Spark. Of course, I then have the overhead of
serialization/deserialization and network transfer before I can even start
my calculations. If I already held the dataset in the Spark processes, I
could start calculations immediately.
So is there a 'better' way to manage a distributed data set, which would
then serve as an input to Spark RDDs?

Kyle
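
For reference, the HBase-to-Spark hand-off described above looks roughly
like the sketch below. It assumes HBase's TableInputFormat is on the
classpath; the table name and SparkContext settings are illustrative.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.SparkContext

    // Read an HBase table into an RDD of (row key, Result) pairs.
    val sc = new SparkContext("local[4]", "hbase-read")
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")

    val hbaseRdd = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println("rows: " + hbaseRdd.count())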






Re: Incremental Updates to an RDD

2013-12-06 Thread Christopher Nguyen
Kyle, the fundamental contract of a Spark RDD is that it is immutable. This
follows the paradigm where data is (functionally) transformed into other
data, rather than mutated. This allows these systems to make certain
assumptions and guarantees that otherwise they wouldn't be able to.
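
As a toy sketch of that contract (the names here are hypothetical), an
"update" is expressed as a transformation that derives a new RDD rather
than mutating the old one:

    // `sc` is an existing SparkContext (e.g., as provided by the Spark shell).
    val users = sc.parallelize(Seq(("alice", 1), ("bob", 2)))

    // There is no users.update("bob", 3); instead, derive a new RDD with
    // the change applied and leave the original untouched.
    val updatedUsers = users.map {
      case ("bob", _) => ("bob", 3)
      case other      => other
    }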

Now we've been able to get mutative behavior with RDDs---for fun,
almost---but that's implementation dependent and may break at any time.

It turns out this behavior is quite appropriate for the analytic stack,
where you typically apply the same transform/operator to all data. You're
finding that transactional systems are the exact opposite, where you
typically apply a different operation to individual pieces of the data.
Incidentally this is also the dichotomy between column- and row-based
storage being optimal for each respective pattern.

Spark is intended for the analytic stack. To use Spark as the persistence
layer of a transaction system is going to be very awkward. I know there are
some vendors who position their in-memory databases as good for both OLTP
and OLAP use cases, but when you talk to them in depth they will readily
admit that it's really optimal for one and not the other.

If you want to make a project out of making a special Spark RDD that
supports this behavior, it might be interesting. But there will be no
simple shortcuts to get there from here.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen





Incremental Updates to an RDD

2013-12-06 Thread Kyle Ellrott
I'm trying to figure out if I can use an RDD to back an interactive
server. One of the requirements would be to have incremental updates to
elements in the RDD, i.e., transforms that change/add/delete a single
element in the RDD.
It seems pretty drastic to do a full RDD filter to remove a single element,
or to union the RDD with another RDD of size 1 to add an element. (Or is
it?) Is there an efficient way to do this in Spark? Are there any examples
of this kind of usage?

Thank you,
Kyle
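
For reference, the two operations described above would look roughly like
this sketch (illustrative data; note that both produce a brand-new RDD
rather than modifying the original):

    // `sc` is an existing SparkContext (e.g., as provided by the Spark shell).
    val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // "Delete" one element: a full filter pass over the RDD.
    val withoutThree = data.filter(_ != 3)

    // "Add" one element: union with a size-1 RDD.
    val withSix = data.union(sc.parallelize(Seq(6)))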