Re: Map one RDD into two RDD
Any update to the above mail? Can anyone suggest the logic? I have to filter tweets and submit tweets with a particular #hashtag1 to SparkSQL databases, while tweets with #hashtag2 will be passed to a sentiment-analysis phase. The problem is how to split the input data into two streams using the hashtags.

--
Thanks & Regards,
Anshu Shukla
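Anshu's split-by-hashtag question can be sketched as two filters over the same source. Below is a minimal local-collection analogue, assuming tweets arrive as plain strings and the hashtag names are placeholders; with Spark Streaming the same two filter calls would be applied to the DStream returned by TwitterUtils.createStream.

```scala
// Local analogue of splitting one stream of tweets into two by hashtag.
// With Spark Streaming the same idea would be:
//   val forSql       = tweetStream.filter(_.contains("#hashtag1"))
//   val forSentiment = tweetStream.filter(_.contains("#hashtag2"))
val tweets = Seq(
  "spark is great #hashtag1",
  "storm vs spark #hashtag2",
  "both tags #hashtag1 #hashtag2",
  "no tag at all"
)

val forSql       = tweets.filter(_.contains("#hashtag1")) // to the SparkSQL sink
val forSentiment = tweets.filter(_.contains("#hashtag2")) // to the sentiment phase
```

Note that a tweet carrying both hashtags lands in both outputs, which is usually what is wanted for this kind of routing.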
Re: Map one RDD into two RDD
Do as Evo suggested: rdd1 = rdd.filter(...), rdd2 = rdd.filter(...).
Re: Map one RDD into two RDD
The multi-threading code in Scala is quite simple and you can google it pretty easily. We used the Future framework. You can use Akka also.

@Evo My concerns with the filtering solution are:
1. Will rdd2.filter run before rdd1.filter finishes?
2. We have to traverse the RDD twice.
Any comments?

--
Many thanks.
Bill
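On Bill's second concern (traversing the data twice): a local Scala collection can produce both outputs in a single pass via partition. This is only an illustration of the single-pass semantics he is after; the RDD API has no built-in two-output split, which is why the usual Spark answer is to cache() the source and filter it twice.

```scala
// Single-pass split of one collection into two by a per-record predicate.
// Spark RDDs have no direct equivalent; there the source is typically
// cache()d and filtered twice, with each filter reading the cached data.
val records = Seq(1, 2, 3, 4, 5, 6)

val (evens, odds) = records.partition(_ % 2 == 0)
```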
Re: Map one RDD into two RDD
Hi Bill,

I just found it weird that one would use parallel threads to 'filter', as filter is lazy in Spark: multithreading would have no effect unless the action triggering the execution of the lineage containing such a filter is itself executed on a separate thread. One must have very specific reasons/requirements to do that, beyond 'not traversing the data twice'. The request for the code was only to help check that.

-kr, Gerard.
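Gerard's point, that filter is lazy and extra threads only matter when the triggering actions themselves run on separate threads, can be sketched with plain Scala Futures. The two reductions below stand in for two Spark jobs (e.g. two saveAsTextFile calls) submitted concurrently against a shared, cached source; this is a local sketch, not Spark code.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// The shared source (in Spark this would be a cached RDD).
val source = Vector.range(1, 101)

// Each Future wraps the whole pipeline ending in the "action" (sum).
// In Spark only the final action triggers work, so it is the action
// that must be submitted from its own thread.
val evenSum = Future(source.filter(_ % 2 == 0).sum)
val oddSum  = Future(source.filter(_ % 2 != 0).sum)

val results = Await.result(Future.sequence(Seq(evenSum, oddSum)), 10.seconds)
```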
Re: Map one RDD into two RDD
One of the best discussions on the mailing list :-) Please help me conclude. The whole discussion concludes that:

1. The framework does not support increasing the parallelism of a task just through some built-in function.
2. The user has to manually write logic to filter the output of an upstream node in the DAG in order to manage the input to the downstream nodes (like shuffle grouping etc. in Storm).
3. If we want to increase the level of parallelism of the Twitter streaming spout to get a higher rate of the DStream of tweets (i.e. to increase the input rate), how is that possible?

val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)

On Fri, May 8, 2015 at 2:16 AM, Evo Eftimov evo.efti...@isecc.com wrote:

1. Will rdd2.filter run before rdd1.filter finish? YES
2. We have to traverse rdd twice. Any comments? You can invoke filter, or whatever other transformation/function, many times.

Ps: you have to study/learn the parallel programming model of an OO framework like Spark. In any OO framework lots of behavior is hidden/encapsulated by the framework, and the client code gets invoked at specific points in the flow of control/data via callback functions. That's why stuff like RDD.filter(), RDD.filter() may look "sequential" to you, but it is not.

--
Thanks & Regards,
Anshu Shukla
Re: Map one RDD into two RDD
Thanks for the replies. We decided to use concurrency in Scala to do the two mappings using the same source RDD in parallel. So far, it seems to be working. Any comments?

--
Many thanks.
Bill
Re: Map one RDD into two RDD
Hi Bill,

Could you show a snippet of code to illustrate your choice?

-Gerard.
RE: Map one RDD into two RDD
Scala is a language; Spark is an OO/functional, distributed framework facilitating parallel programming in a distributed environment. Any "Scala parallelism" occurs within the parallel model imposed by the Spark framework, i.e. it is limited in what it can achieve in terms of influencing the Spark framework's behavior; that is the nature of programming with/for frameworks. When RDD1 and RDD2 are partitioned and different actions are applied to them, this will result in parallel pipelines / DAGs within the Spark framework:

RDD1 = RDD.filter()
RDD2 = RDD.filter()
RE: Map one RDD into two RDD
RDD1 = RDD.filter()
RDD2 = RDD.filter()
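Applied to Bill's original "two types of data" case, Evo's recipe becomes: classify each record once in a map, then pull each type out with a filter. Below is a local-collection sketch, where Record and classify are hypothetical placeholders for the real record type and classification logic; in Spark you would cache() the tagged RDD so the two filters do not recompute the classification.

```scala
// Tag each record with its type once, then extract each type separately.
// In Spark this would look like:
//   val tagged = rdd.map(r => (classify(r), r)).cache()
//   val rdd1   = tagged.filter(_._1 == "A").values
final case class Record(id: Int, kind: String)

def classify(r: Record): String = r.kind // placeholder classification logic

val data   = Seq(Record(1, "A"), Record(2, "B"), Record(3, "A"))
val tagged = data.map(r => (classify(r), r))

val typeA = tagged.collect { case ("A", r) => r }
val typeB = tagged.collect { case ("B", r) => r }
```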
Re: Map one RDD into two RDD
Have you looked at RDD#randomSplit() (as an example)?

Cheers

On Tue, May 5, 2015 at 2:42 PM, Bill Q bill.q@gmail.com wrote:

Hi all,
I have a large RDD that I map a function over. Based on the nature of each record in the input RDD, I will generate two types of data. I would like to save each type into its own RDD, but I can't seem to find an efficient way to do it. Any suggestions?

Many thanks.
Bill
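For reference, randomSplit divides an RDD by random weights rather than by record content, so it suits sampling (e.g. train/test splits) more than Bill's type-based split. The sketch below is a local approximation of those semantics, not Spark's exact implementation.

```scala
import scala.util.Random

// Local approximation of RDD.randomSplit: each element falls into one bucket
// with probability proportional to that bucket's weight.
def randomSplit[T](xs: Seq[T], weights: Seq[Double], seed: Long): Seq[Seq[T]] = {
  val rng        = new Random(seed)
  val cumulative = weights.scanLeft(0.0)(_ + _).tail.map(_ / weights.sum)
  val builders   = Vector.fill(weights.length)(Seq.newBuilder[T])
  for (x <- xs) {
    val i = cumulative.indexWhere(rng.nextDouble() <= _)
    builders(if (i >= 0) i else weights.length - 1) += x
  }
  builders.map(_.result())
}

val parts = randomSplit(1 to 100, Seq(0.7, 0.3), seed = 42L)
```

Every element lands in exactly one bucket, so the split partitions the input even though the bucket sizes are only approximately proportional to the weights.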