PS: Just to clarify my statement:

>>Unlike the feared RDD operations on the driver, it's my understanding
that these DStream ops on the driver are merely creating an execution plan
for each RDD.

With "feared RDD operations on the driver" I meant to contrast an RDD
action like rdd.collect, which would pull all of the RDD's data to the
driver, with dstream.foreachRDD(rdd => rdd.op), for which the documentation
says "it runs on the driver". Yet all that actually appears to run on the
driver is the scheduling of 'op' on that RDD, just as happens for every
other RDD operation
(thanks to Sean for the clarification)
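
To make that concrete, here is a minimal sketch of what runs where. The
DStream of Ints, the isEven predicate and the object wrapper are made up
purely for illustration:

import org.apache.spark.streaming.dstream.DStream

object WhereThingsRun {
  // Made-up predicate, just to have something to schedule.
  def isEven(n: Int): Boolean = n % 2 == 0

  def illustrate(numbers: DStream[Int]): Unit = {
    numbers.foreachRDD { rdd =>
      // This closure runs on the driver, but all it does is schedule work:
      val evens = rdd.filter(isEven)  // lazy: extends the lineage, moves no data
      // val all = rdd.collect()      // the feared case: pulls the whole batch to the driver
      evens.foreach(n => println(n))  // action: the driver submits a job, executors run it
    }
  }
}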

So, not to move focus away from the original question:

In Spark Streaming, would it be better to call foreachRDD early in a
pipeline, or instead to apply as many DStream transformations as possible
before going into the foreachRDD call?

Between these two pieces of code, which would be preferred from a
performance perspective, and why:

- Early foreachRDD:

dstream.foreachRDD { rdd =>
  val records = rdd.map(elem => record(elem))
  targets.foreach { target =>
    records.filter(record => isTarget(target, record)).writeToCassandra(target, table)
  }
}

- As many DStream transformations as possible before foreachRDD:

val recordStream = dstream.map(elem => record(elem))
targets.foreach { target =>
  recordStream.filter(record => isTarget(target, record))
    .foreachRDD(_.writeToCassandra(target, table))
}

?
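
A side note, whichever variant wins: in both versions 'records' is
traversed once per target, so the map would be recomputed #targets times
unless the RDD is cached. A sketch of the first variant with caching added
(same placeholder names as in the snippets above, and assuming the mapped
batch fits in the cache):

dstream.foreachRDD { rdd =>
  val records = rdd.map(elem => record(elem)).cache()  // materialize the mapped batch once
  targets.foreach { target =>
    records.filter(record => isTarget(target, record)).writeToCassandra(target, table)
  }
  records.unpersist()  // drop the cached batch once every target has been written
}

The second variant would presumably want the equivalent
recordStream.cache(). Also worth noting: the second variant registers one
foreachRDD output operation per target with the StreamingContext, while the
first registers a single one, so even if the scheduled work ends up
equivalent, the job granularity differs.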

kr, Gerard.



On Wed, Oct 22, 2014 at 2:12 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Thanks Matt,
>
> Unlike the feared RDD operations on the driver, it's my understanding that
> these DStream ops on the driver are merely creating an execution plan for
> each RDD.
> My question still remains: is it better to call foreachRDD early in the
> process, or to apply as many DStream transformations as possible before
> going into the foreachRDD call?
>
> Maybe this will require some empirical testing specific to each
> implementation?
>
> -kr, Gerard.
>
>
> On Mon, Oct 20, 2014 at 5:07 PM, Matt Narrell <matt.narr...@gmail.com>
> wrote:
>
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>>
>> foreachRDD is executed on the driver…
>>
>> mn
>>
>> On Oct 20, 2014, at 3:07 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
>>
>> Pinging TD -- I'm sure you know :-)
>>
>> -kr, Gerard.
>>
>> On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas <gerard.m...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We have been implementing several Spark Streaming jobs that are
>>> basically processing data and inserting it into Cassandra, sorting it among
>>> different keyspaces.
>>>
>>> We've been following the pattern:
>>>
>>> dstream.foreachRDD { rdd =>
>>>   val records = rdd.map(elem => record(elem))
>>>   targets.foreach { target =>
>>>     records.filter(record => isTarget(target, record)).writeToCassandra(target, table)
>>>   }
>>> }
>>>
>>> I've been wondering whether there would be a performance difference in
>>> transforming the DStream instead of transforming the RDDs within the
>>> DStream, with regard to how the transformations get scheduled.
>>>
>>> Instead of the RDD-centric computation, I could transform the DStream
>>> until the last step, where I need an RDD to store.
>>> For example, the previous transformation could be written as:
>>>
>>> val recordStream = dstream.map(elem => record(elem))
>>> targets.foreach { target =>
>>>   recordStream.filter(record => isTarget(target, record))
>>>     .foreachRDD(_.writeToCassandra(target, table))
>>> }
>>>
>>> Would there be a difference in execution and/or performance? What would
>>> be the preferred way to do this?
>>>
>>> Bonus question: Is there a better (more performant) way to sort the data
>>> into different "buckets" than filtering the whole collection once per
>>> bucket?
>>>
>>> thanks, Gerard.
>>>
>>>
>>
>>
>
