Re: Map one RDD into two RDD

2015-05-08 Thread anshu shukla
Any update on the mail above?
Can anyone suggest the logic: I have to filter tweets and submit tweets
 with a particular #hashtag1 to SparkSQL databases, while tweets with
 #hashtag2 are passed to a sentiment-analysis phase. The problem is how to
split the input data into two streams based on hashtags.
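A minimal sketch of the split, following the two-filter pattern suggested earlier in the thread. It assumes Spark Streaming with the twitter connector; the hashtag strings, `Utils.getAuth`, and the downstream sinks are placeholders, not a definitive implementation:

```scala
import org.apache.spark.streaming.twitter.TwitterUtils

// One source stream, two lazy filters: each branch sees only its hashtag.
val tweets = TwitterUtils.createStream(ssc, Utils.getAuth)

val sqlBound       = tweets.filter(_.getText.contains("#hashtag1"))
val sentimentBound = tweets.filter(_.getText.contains("#hashtag2"))

sqlBound.foreachRDD { rdd =>
  // e.g. convert rdd to a DataFrame and write it out via Spark SQL
}
sentimentBound.foreachRDD { rdd =>
  // e.g. feed rdd into the sentiment-analysis stage
}
```

Both filters read the same source DStream, so no extra shuffle is needed; each branch simply drops the records it does not care about.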

-- 
Thanks & Regards,
Anshu Shukla


Re: Map one RDD into two RDD

2015-05-08 Thread ayan guha
Do as Evo suggested: rdd1 = rdd.filter(...), rdd2 = rdd.filter(...)



Re: Map one RDD into two RDD

2015-05-07 Thread Bill Q
The multi-threading code in Scala is quite simple and you can Google it
pretty easily. We used the Future framework; you could use Akka as well.

@Evo My concerns with the filtering solution are: 1. Will rdd2.filter run before
rdd1.filter finishes? 2. We have to traverse the RDD twice. Any comments?
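A sketch of the Future-based approach described above: wrap each *action* in a Future so the two Spark jobs are submitted from separate threads. `rdd`, `isTypeA`, `isTypeB`, and the output paths are hypothetical placeholders:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Cache the source so neither job recomputes (or re-reads) the input.
val source = rdd.cache()

// Each Future submits an independent job; Spark's scheduler can run them
// concurrently, subject to available executor cores.
val f1 = Future { source.filter(r => isTypeA(r)).saveAsTextFile("out/typeA") }
val f2 = Future { source.filter(r => isTypeB(r)).saveAsTextFile("out/typeB") }

Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)
```

Note this only helps if the cluster has spare capacity; with a single job already saturating the executors, concurrent submission gains little.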



-- 
Many thanks.


Bill


Re: Map one RDD into two RDD

2015-05-07 Thread Gerard Maas
Hi Bill,

I just found it weird that one would use parallel threads to 'filter':
filter is lazy in Spark, so multithreading has no effect unless the action
triggering the execution of the lineage containing that filter is itself
executed on a separate thread. One must have very specific
reasons/requirements to do that, beyond 'not traversing the data twice'.
The request for the code was only to help check that.

-kr, Gerard.
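To illustrate the laziness point above: the filter calls themselves return immediately and merely record lineage, so threading *them* buys nothing. `source`, `isTypeA`, and `isTypeB` are placeholders:

```scala
// Transformations only record lineage -- no cluster work happens here,
// so wrapping these lines in threads is pointless.
val rdd1 = source.filter(r => isTypeA(r))
val rdd2 = source.filter(r => isTypeB(r))

// Actions are what trigger jobs. Called sequentially on one thread,
// job 2 starts only after job 1 finishes; to overlap them, the *actions*
// must be invoked from separate threads.
val n1 = rdd1.count()
val n2 = rdd2.count()
```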





Re: Map one RDD into two RDD

2015-05-07 Thread anshu shukla
One of the best discussions on this mailing list :-) ... Please help me
conclude --

The whole discussion concludes that:

1. The framework does not support increasing the parallelism of a task
through any built-in function.
2. The user has to manually write logic to filter the output of an upstream
node in the DAG in order to manage the input to downstream nodes (like
shuffle grouping etc. in Storm).
3. If we want to increase the level of parallelism of the Twitter streaming
receiver to *get a higher-rate DStream of tweets (i.e. to increase the
input rate), how is that possible?*

  *val tweetStream = **TwitterUtils.createStream(ssc, Utils.getAuth)*
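One possible sketch for point 3, assuming Spark Streaming's usual receiver model: create several receiver streams and union them. `Utils.getAuth` is the thread's own helper; the receiver count is a placeholder:

```scala
import org.apache.spark.streaming.twitter.TwitterUtils

// Each createStream call allocates its own receiver (occupying one
// executor core); union merges them into a single DStream.
val numReceivers = 4
val streams = (1 to numReceivers).map { _ =>
  TwitterUtils.createStream(ssc, Utils.getAuth)
}
val tweetStream = ssc.union(streams)
```

Caveat: multiple connections to Twitter's public sample endpoint may deliver overlapping tweets, so the effective rate gain depends on the source; the pattern itself is the standard way to scale receiver-side ingestion.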



On Fri, May 8, 2015 at 2:16 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 1. Will rdd2.filter run before rdd1.filter finish?



 YES



 2. We have to traverse rdd twice. Any comments?



 You can invoke filter or whatever other transformation/function as many
 times as you like.

 PS: you have to study/learn the parallel programming model of an OO
 framework like Spark – in any OO framework, a lot of behavior is hidden/
 encapsulated by the framework, and the client code gets invoked at specific
 points in the flow of control/data via callback functions.



 That's why stuff like RDD.filter(), RDD.filter() may look "sequential" to
 you, but it is not.







-- 
Thanks & Regards,
Anshu Shukla


Re: Map one RDD into two RDD

2015-05-07 Thread Bill Q
Thanks for the replies. We decided to use concurrency in Scala to do the
two mappings using the same source RDD in parallel. So far, it seems to be
working. Any comments?


-- 
Many thanks.


Bill


Re: Map one RDD into two RDD

2015-05-07 Thread Gerard Maas
Hi Bill,

Could you show a snippet of code to illustrate your choice?

-Gerard.





RE: Map one RDD into two RDD

2015-05-07 Thread Evo Eftimov
Scala is a language; Spark is an OO/functional, distributed framework
facilitating parallel programming in a distributed environment.

 

Any “Scala parallelism” occurs within the parallel model imposed by the Spark
framework – i.e. it is limited in what it can achieve in terms of
influencing the Spark framework's behavior – that is the nature of programming
with/for frameworks.

 

When RDD1 and RDD2 are partitioned and different actions are applied to them,
this results in parallel pipelines / DAGs within the Spark framework:

RDD1 = RDD.filter()

RDD2 = RDD.filter()
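Making that pattern concrete — two RDDs derived from one source with complementary predicates, cached so the second job does not recompute the input. The `Record` type and predicate are hypothetical, for illustration only:

```scala
// Hypothetical record type and predicate.
case class Record(kind: String, payload: String)
def isTypeA(r: Record): Boolean = r.kind == "A"

// cache() keeps the source in memory so that whichever action runs
// second reuses the materialized partitions instead of re-reading them.
val source = inputRdd.cache()

val rdd1 = source.filter(r => isTypeA(r))   // type-A records
val rdd2 = source.filter(r => !isTypeA(r))  // everything else
```

The "traversing twice" cost largely disappears once the source is cached: the second filter scans in-memory blocks rather than recomputing the lineage.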

 

 


 



RE: Map one RDD into two RDD

2015-05-06 Thread Evo Eftimov
RDD1 = RDD.filter()

RDD2 = RDD.filter()

 


 



Re: Map one RDD into two RDD

2015-05-05 Thread Ted Yu
Have you looked at RDD#randomSplit() (as an example)?

Cheers
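For reference, a brief sketch of that suggestion; the weights, seed, and `inputRdd` are placeholders. Note randomSplit partitions *randomly* by weight, so it suits cases like train/test splits rather than content-based routing (where the filter approach fits better):

```scala
// Split one RDD into weighted random pieces in a single lineage.
val Array(part1, part2) = inputRdd.randomSplit(Array(0.8, 0.2), seed = 42L)
```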

On Tue, May 5, 2015 at 2:42 PM, Bill Q bill.q@gmail.com wrote:

 Hi all,
 I have a large RDD that I map a function to it. Based on the nature of
 each record in the input RDD, I will generate two types of data. I would
 like to save each type into its own RDD. But I can't seem to find an
 efficient way to do it. Any suggestions?

 Many thanks.


 Bill


 --
 Many thanks.


 Bill