Hi Bill,

I just found it odd that one would use parallel threads to 'filter', since
filter is lazy in Spark: multithreading has no effect unless the action that
triggers execution of the lineage containing that filter is itself run on a
separate thread. One would need very specific reasons/requirements to do
that, beyond 'not traversing the data twice'. The request for the code was
only to help verify that.

-kr, Gerard.

On Thu, May 7, 2015 at 7:26 PM, Bill Q <bill.q....@gmail.com> wrote:

> The multi-threading code in Scala is quite simple and you can google it
> pretty easily. We used the Future framework. You can also use Akka.
>
> @Evo My concerns about the filtering solution are: 1. Will rdd2.filter run
> before rdd1.filter finishes? 2. We have to traverse the RDD twice. Any comments?
>
>
>
> On Thursday, May 7, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
>> Scala is a language; Spark is an OO/Functional, Distributed Framework
>> facilitating Parallel Programming in a distributed environment
>>
>>
>>
>> Any “Scala parallelism” occurs within the Parallel Model imposed by the
>> Spark OO Framework – i.e. it is limited in what it can achieve in terms of
>> influencing the Spark Framework's behavior – that is the nature of
>> programming with/for frameworks
>>
>>
>>
>> When RDD1 and RDD2 are partitioned and different Actions are applied to
>> them, this will result in Parallel Pipelines / DAGs within the Spark Framework
>>
>> RDD1 = RDD.filter()
>>
>> RDD2 = RDD.filter()
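The two-filter pattern above can be combined with a single tagging pass to
address the "traversing the data twice" concern. A minimal local sketch, with
plain Scala collections standing in for RDDs and hypothetical TypeA/TypeB
record types; the Spark analogue would be rdd.map(tag).cache() followed by
one filter per type, so the expensive source is computed only once:

```scala
object SplitOnce {
  // Hypothetical record types for the two kinds of output.
  sealed trait Rec
  final case class TypeA(v: Int) extends Rec
  final case class TypeB(v: Int) extends Rec

  // Classify every element in one pass, then split the tagged result.
  // partition traverses the tagged collection once and returns the
  // elements matching the predicate and the rest, in that order.
  def splitOnce(xs: Seq[Int]): (Seq[Rec], Seq[Rec]) = {
    val tagged: Seq[Rec] = xs.map(x => if (x % 2 == 0) TypeA(x) else TypeB(x))
    tagged.partition(_.isInstanceOf[TypeA])
  }
}
```

With an RDD, caching the tagged result is what prevents the lineage from
being recomputed for each of the two filters.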
>>
>>
>>
>>
>>
>> *From:* Bill Q [mailto:bill.q....@gmail.com]
>> *Sent:* Thursday, May 7, 2015 4:55 PM
>> *To:* Evo Eftimov
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Map one RDD into two RDD
>>
>>
>>
>> Thanks for the replies. We decided to use concurrency in Scala to run the
>> two mappings over the same source RDD in parallel. So far, it seems to be
>> working. Any comments?
>>
>> On Wednesday, May 6, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>
>> RDD1 = RDD.filter()
>>
>> RDD2 = RDD.filter()
>>
>>
>>
>> *From:* Bill Q [mailto:bill.q....@gmail.com]
>> *Sent:* Tuesday, May 5, 2015 10:42 PM
>> *To:* user@spark.apache.org
>> *Subject:* Map one RDD into two RDD
>>
>>
>>
>> Hi all,
>>
>> I have a large RDD to which I map a function. Based on the nature of
>> each record in the input RDD, I generate two types of data, and I would
>> like to save each type into its own RDD. But I can't seem to find an
>> efficient way to do it. Any suggestions?
>>
>>
>>
>> Many thanks.
>>
>>
>>
>>
>>
>> Bill
>>
>>
>>
>> --
>>
>> Many thanks.
>>
>> Bill
>>
>>
>>
>>
>>
>> --
>>
>> Many thanks.
>>
>> Bill
>>
>>
>>
>
>
> --
> Many thanks.
>
>
> Bill
>
>
