Do as Evo suggested:

rdd1 = rdd.filter
rdd2 = rdd.filter

On 9 May 2015 05:19, "anshu shukla" <anshushuk...@gmail.com> wrote:
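The suggested two-filter split can be sketched on a plain Scala collection (the data and the even/odd predicate are hypothetical stand-ins for the real record types); the same two `filter` calls apply unchanged to a Spark RDD, where one would typically call `rdd.cache()` first so the shared source is not recomputed by each filter:

```scala
// Hypothetical records: even values stand in for one record "type",
// odd values for the other.
val rdd = Seq(1, 2, 3, 4, 5)

// Two independent filters over the same source, as Evo suggested.
// On a real Spark RDD, rdd.cache() before these lines would keep the
// source materialized so the two filters don't both recompute it.
val rdd1 = rdd.filter(_ % 2 == 0)
val rdd2 = rdd.filter(_ % 2 != 0)
```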
> Any update on the above mail? And can anyone tell me the logic: I have to
> filter tweets and submit tweets with a particular #hashtag1 to SparkSQL
> databases, while tweets with #hashtag2 are passed to a sentiment analysis
> phase. "The problem is how to split the input data into two streams using
> hashtags."
>
> On Fri, May 8, 2015 at 2:42 AM, anshu shukla <anshushuk...@gmail.com> wrote:
>
>> One of the best discussions on the mailing list :-) ... Please help me in
>> concluding.
>>
>> The whole discussion concludes that:
>>
>> 1. The framework does not support increasing the parallelism of a task
>> through any built-in function.
>> 2. The user has to manually write logic to filter the output of an
>> upstream node in the DAG in order to manage the input to downstream nodes
>> (like shuffle grouping etc. in Storm).
>> 3. If we want to increase the level of parallelism of the Twitter
>> streaming spout to get a higher-rate DStream of tweets (to increase the
>> input rate), how is that possible?
>>
>> val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
>>
>> On Fri, May 8, 2015 at 2:16 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>
>>> 1. Will rdd2.filter run before rdd1.filter finishes?
>>>
>>> YES
>>>
>>> 2. We have to traverse the rdd twice. Any comments?
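The hashtag split asked about above can be sketched on a plain Scala collection (the sample tweets and hashtag names are hypothetical); on the real `DStream` from `TwitterUtils.createStream`, the same two `filter` calls would route one stream toward SparkSQL and the other toward sentiment analysis:

```scala
// Hypothetical tweets; in the real job these come from the Twitter DStream.
val tweets = Seq(
  "great talk today #hashtag1",
  "long queue at the airport #hashtag2",
  "double-tagged post #hashtag1 #hashtag2"
)

// Two filters split one input into two streams by hashtag.
val forSparkSql  = tweets.filter(_.contains("#hashtag1")) // -> SparkSQL
val forSentiment = tweets.filter(_.contains("#hashtag2")) // -> sentiment analysis
```

On the receiver-parallelism question, the usual Spark Streaming pattern (per the streaming guide) is to create several input streams and `union` them, rather than trying to parallelize a single receiver.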
>>> You can invoke filter, or any other transformation/function, as many
>>> times as you like.
>>>
>>> PS: you have to study/learn the parallel programming model of an OO
>>> framework like Spark. In any OO framework, a lot of behavior is hidden/
>>> encapsulated by the framework, and the client code gets invoked at
>>> specific points in the flow of control/data via callback functions.
>>>
>>> That is why code like RDD.filter(), RDD.filter() may look "sequential"
>>> to you, but it is not.
>>>
>>> *From:* Bill Q [mailto:bill.q....@gmail.com]
>>> *Sent:* Thursday, May 7, 2015 6:27 PM
>>> *To:* Evo Eftimov
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Map one RDD into two RDD
>>>
>>> The multi-threading code in Scala is quite simple and you can google it
>>> pretty easily. We used the Future framework. You could also use Akka.
>>>
>>> @Evo My concerns with the filtering solution are: 1. Will rdd2.filter
>>> run before rdd1.filter finishes? 2. We have to traverse the rdd twice.
>>> Any comments?
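The Future-based concurrency Bill mentions can be sketched with the Scala standard library alone (the two mapping functions and the data are hypothetical stand-ins for the real per-record logic run over the shared source):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-ins for the two mappings applied to the shared source.
def toTypeA(xs: Seq[Int]): Seq[Int] = xs.map(_ * 2)
def toTypeB(xs: Seq[Int]): Seq[Int] = xs.map(_ + 1)

val source = Seq(1, 2, 3)

// Both futures start immediately; neither waits for the other to finish,
// which is the concurrency pattern described above (Akka would be similar).
val fa = Future(toTypeA(source))
val fb = Future(toTypeB(source))

val resultA = Await.result(fa, 10.seconds)
val resultB = Await.result(fb, 10.seconds)
```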
>>> On Thursday, May 7, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>>
>>> Scala is a language; Spark is an OO/functional, distributed framework
>>> facilitating parallel programming in a distributed environment.
>>>
>>> Any "Scala parallelism" occurs within the parallel model imposed by the
>>> Spark OO framework, i.e. it is limited in terms of how it can influence
>>> the Spark framework's behavior. That is the nature of programming
>>> with/for frameworks.
>>>
>>> When RDD1 and RDD2 are partitioned and different actions are applied to
>>> them, this results in parallel pipelines/DAGs within the Spark framework:
>>>
>>> RDD1 = RDD.filter()
>>> RDD2 = RDD.filter()
>>>
>>> *From:* Bill Q [mailto:bill.q....@gmail.com]
>>> *Sent:* Thursday, May 7, 2015 4:55 PM
>>> *To:* Evo Eftimov
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Map one RDD into two RDD
>>>
>>> Thanks for the replies. We decided to use concurrency in Scala to do the
>>> two mappings over the same source RDD in parallel. So far, it seems to
>>> be working. Any comments?
>>>
>>> On Wednesday, May 6, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>>
>>> RDD1 = RDD.filter()
>>> RDD2 = RDD.filter()
>>>
>>> *From:* Bill Q [mailto:bill.q....@gmail.com <bill.q....@gmail.com>]
>>> *Sent:* Tuesday, May 5, 2015 10:42 PM
>>> *To:* user@spark.apache.org
>>> *Subject:* Map one RDD into two RDD
>>>
>>> Hi all,
>>>
>>> I have a large RDD that I map a function over. Based on the nature of
>>> each record in the input RDD, I generate two types of data, and I would
>>> like to save each type into its own RDD. But I can't seem to find an
>>> efficient way to do it. Any suggestions?
>>>
>>> Many thanks.
>>>
>>> Bill
>>>
>>> --
>>> Many thanks.
>>> Bill
>>
>> --
>> Thanks & Regards,
>> Anshu Shukla
>
> --
> Thanks & Regards,
> Anshu Shukla