Do as Evo suggested:

rdd1 = rdd.filter
rdd2 = rdd.filter

On 9 May 2015 05:19, "anshu shukla" <anshushuk...@gmail.com> wrote:
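The suggested two-filter split can be sketched on a plain Scala collection (the data and the even/odd predicate are hypothetical stand-ins for the real record types); the same two `filter` calls apply unchanged to a Spark RDD, where one would typically call `rdd.cache()` first so the shared source is not recomputed by each filter:

```scala
// Hypothetical records: even values stand in for one record "type",
// odd values for the other.
val rdd = Seq(1, 2, 3, 4, 5)

// Two independent filters over the same source, as Evo suggested.
// On a real Spark RDD, rdd.cache() before these lines would keep the
// source materialized so the two filters don't both recompute it.
val rdd1 = rdd.filter(_ % 2 == 0)
val rdd2 = rdd.filter(_ % 2 != 0)
```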
> Any update on the above mail? And can anyone tell me the logic: I have to
> filter tweets and submit tweets with a particular #hashtag1 to SparkSQL
> databases, while tweets with #hashtag2 are passed to a sentiment analysis
> phase. "The problem is how to split the input data into two streams using
> hashtags."
>
> On Fri, May 8, 2015 at 2:42 AM, anshu shukla <anshushuk...@gmail.com> wrote:
>
>> One of the best discussions on the mailing list :-) ... Please help me in
>> concluding.
>>
>> The whole discussion concludes that:
>>
>> 1. The framework does not support increasing the parallelism of a task
>> through any built-in function.
>> 2. The user has to manually write logic to filter the output of an
>> upstream node in the DAG in order to manage the input to downstream nodes
>> (like shuffle grouping etc. in Storm).
>> 3. If we want to increase the level of parallelism of the Twitter
>> streaming spout to get a higher-rate DStream of tweets (to increase the
>> input rate), how is that possible?
>>
>> val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
>>
>> On Fri, May 8, 2015 at 2:16 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>
>>> 1. Will rdd2.filter run before rdd1.filter finishes?
>>>
>>> YES
>>>
>>> 2. We have to traverse the rdd twice. Any comments?
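The hashtag split asked about above can be sketched on a plain Scala collection (the sample tweets and hashtag names are hypothetical); on the real `DStream` from `TwitterUtils.createStream`, the same two `filter` calls would route one stream toward SparkSQL and the other toward sentiment analysis:

```scala
// Hypothetical tweets; in the real job these come from the Twitter DStream.
val tweets = Seq(
  "great talk today #hashtag1",
  "long queue at the airport #hashtag2",
  "double-tagged post #hashtag1 #hashtag2"
)

// Two filters split one input into two streams by hashtag.
val forSparkSql  = tweets.filter(_.contains("#hashtag1")) // -> SparkSQL
val forSentiment = tweets.filter(_.contains("#hashtag2")) // -> sentiment analysis
```

On the receiver-parallelism question, the usual Spark Streaming pattern (per the streaming guide) is to create several input streams and `union` them, rather than trying to parallelize a single receiver.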
>>> You can invoke filter, or any other transformation/function, as many
>>> times as you like.
>>>
>>> PS: you have to study/learn the parallel programming model of an OO
>>> framework like Spark. In any OO framework, a lot of behavior is hidden/
>>> encapsulated by the framework, and the client code gets invoked at
>>> specific points in the flow of control/data via callback functions.
>>>
>>> That is why code like RDD.filter(), RDD.filter() may look "sequential"
>>> to you, but it is not.
>>>
>>> *From:* Bill Q [mailto:bill.q....@gmail.com]
>>> *Sent:* Thursday, May 7, 2015 6:27 PM
>>> *To:* Evo Eftimov
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Map one RDD into two RDD
>>>
>>> The multi-threading code in Scala is quite simple and you can google it
>>> pretty easily. We used the Future framework. You could also use Akka.
>>>
>>> @Evo My concerns with the filtering solution are: 1. Will rdd2.filter
>>> run before rdd1.filter finishes? 2. We have to traverse the rdd twice.
>>> Any comments?
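The Future-based concurrency Bill mentions can be sketched with the Scala standard library alone (the two mapping functions and the data are hypothetical stand-ins for the real per-record logic run over the shared source):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-ins for the two mappings applied to the shared source.
def toTypeA(xs: Seq[Int]): Seq[Int] = xs.map(_ * 2)
def toTypeB(xs: Seq[Int]): Seq[Int] = xs.map(_ + 1)

val source = Seq(1, 2, 3)

// Both futures start immediately; neither waits for the other to finish,
// which is the concurrency pattern described above (Akka would be similar).
val fa = Future(toTypeA(source))
val fb = Future(toTypeB(source))

val resultA = Await.result(fa, 10.seconds)
val resultB = Await.result(fb, 10.seconds)
```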
>>> On Thursday, May 7, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>>
>>> Scala is a language; Spark is an OO/functional, distributed framework
>>> facilitating parallel programming in a distributed environment.
>>>
>>> Any "Scala parallelism" occurs within the parallel model imposed by the
>>> Spark OO framework, i.e. it is limited in terms of how it can influence
>>> the Spark framework's behavior. That is the nature of programming
>>> with/for frameworks.
>>>
>>> When RDD1 and RDD2 are partitioned and different actions are applied to
>>> them, this results in parallel pipelines/DAGs within the Spark framework:
>>>
>>> RDD1 = RDD.filter()
>>> RDD2 = RDD.filter()
>>>
>>> *From:* Bill Q [mailto:bill.q....@gmail.com]
>>> *Sent:* Thursday, May 7, 2015 4:55 PM
>>> *To:* Evo Eftimov
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Map one RDD into two RDD
>>>
>>> Thanks for the replies. We decided to use concurrency in Scala to do the
>>> two mappings over the same source RDD in parallel. So far, it seems to
>>> be working. Any comments?
>>>
>>> On Wednesday, May 6, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>>
>>> RDD1 = RDD.filter()
>>> RDD2 = RDD.filter()
>>>
>>> *From:* Bill Q [mailto:bill.q....@gmail.com <bill.q....@gmail.com>]
>>> *Sent:* Tuesday, May 5, 2015 10:42 PM
>>> *To:* user@spark.apache.org
>>> *Subject:* Map one RDD into two RDD
>>>
>>> Hi all,
>>>
>>> I have a large RDD that I map a function over. Based on the nature of
>>> each record in the input RDD, I generate two types of data, and I would
>>> like to save each type into its own RDD. But I can't seem to find an
>>> efficient way to do it. Any suggestions?
>>>
>>> Many thanks.
>>>
>>> Bill
>>>
>>> --
>>> Many thanks.
>>> Bill
>>
>> --
>> Thanks & Regards,
>> Anshu Shukla
>
> --
> Thanks & Regards,
> Anshu Shukla