Re: randomSplit instead of a huge map reduce ?

2015-02-21 Thread Krishna Sankar
   - Divide and conquer with reduceByKey (as Ashish mentioned, with each pair
   as the key) would work - this looks like a map-reduce-with-combiners
   problem. I think reduceByKey would use combiners while aggregateByKey
   wouldn't.
   - Could we optimize this further by using combineByKey directly?
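
For reference, a minimal Java sketch of what the combineByKey variant might
look like, assuming a JavaPairRDD<Tuple2<String, String>, Integer> of
(pair, 1) records has already been built (the variable names are illustrative):

// count occurrences per (word, word) pair; Integer partial sums act as the combiners
JavaPairRDD<Tuple2<String, String>, Integer> counts = pairs.combineByKey(
    v -> v,               // createCombiner: start the count from the first 1
    (c, v) -> c + v,      // mergeValue: fold another 1 into the running count
    (c1, c2) -> c1 + c2); // mergeCombiners: merge partial counts across partitions

For a plain sum, reduceByKey((a, b) -> a + b) expands to essentially the same
thing, since reduceByKey is implemented on top of combineByKey.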

Cheers
k/



randomSplit instead of a huge map reduce ?

2015-02-20 Thread shlomib
Hi,

I am new to Spark and I think I missed something very basic.

I have the following use case (I use Java and run Spark locally on my
laptop):


I have a JavaRDD<String[]>

- The RDD contains around 72,000 arrays of strings (String[])

- Each array contains 80 words (on average).


What I want to do is to convert each array into a new array/list of pairs,
for example:

Input: String[] words = ['a', 'b', 'c']

Output: List[String, String] pairs = [('a', 'b'), ('a', 'c'), ('b', 'c')]

and then I want to count the number of times each pair appeared, so my final
output should be something like:

Output: List[String, String, Integer] result = [('a', 'b', 3), ('a', 'c', 8),
('b', 'c', 10)]
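
For concreteness, the intended computation could be expressed like the minimal,
self-contained sketch below (illustrative only: the class name, the toy input
and the local master are made up, it is written against the Spark 1.x Java API,
where the pair function returns an Iterable, and reduceByKey is used here for
the counting step):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PairCounts {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("pair-counts").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Toy stand-in for the real 72,000-element JavaRDD<String[]>
    JavaRDD<String[]> arrays = sc.parallelize(Arrays.asList(
        new String[]{"a", "b", "c"},
        new String[]{"a", "c"}));

    // Emit each pair from each array once, with a count of 1
    JavaPairRDD<Tuple2<String, String>, Integer> ones = arrays.flatMapToPair(words -> {
      List<Tuple2<Tuple2<String, String>, Integer>> out = new ArrayList<>();
      for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) {
          out.add(new Tuple2<>(new Tuple2<>(words[i], words[j]), 1));
        }
      }
      return out; // the Spark 1.x pair function returns an Iterable
    });

    // reduceByKey pre-aggregates counts on the map side, so the ~230M raw
    // pairs are never all held in memory at the same time
    JavaPairRDD<Tuple2<String, String>, Integer> counts = ones.reduceByKey((a, b) -> a + b);

    counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
    sc.stop();
  }
}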


The problem:

Since each array contains around 80 words, each one produces around 3,200
pairs, so after “mapping” my entire RDD I get 3,200 * 72,000 = *230,400,000*
pairs to reduce, which requires way too much memory.

(I know I have only around *20,000,000* unique pairs!)

I already modified my code and used 'mapPartitions' instead of 'map'. It
definitely improved the performance, but I still feel I'm doing something
completely wrong.


I was wondering whether this is the right 'Spark way' to solve this kind of
problem, or whether I should instead split my original RDD into smaller parts
(using randomSplit), iterate over each part, aggregate its results into a
running result RDD (using 'union'), and move on to the next part.
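
Roughly, that idea would look something like the sketch below (illustrative
only: the 8-way split is arbitrary, toPairs() is a hypothetical helper
containing the same pair-generation loop as the sketch above, and the imports
are the same as in that sketch):

double[] weights = new double[8];
Arrays.fill(weights, 1.0 / 8); // randomSplit normalizes the weights
JavaRDD<String[]>[] parts = arrays.randomSplit(weights);

JavaPairRDD<Tuple2<String, String>, Integer> total = null;
for (JavaRDD<String[]> part : parts) {
  JavaPairRDD<Tuple2<String, String>, Integer> partOnes =
      part.flatMapToPair(words -> toPairs(words)); // toPairs(): hypothetical helper
  JavaPairRDD<Tuple2<String, String>, Integer> partial =
      partOnes.reduceByKey((a, b) -> a + b);
  total = (total == null) ? partial : total.union(partial);
}
// the partial counts from the different parts still need one final merge
JavaPairRDD<Tuple2<String, String>, Integer> merged = total.reduceByKey((a, b) -> a + b);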


Can anyone please explain to me which solution is better?


Thank you very much,

Shlomi.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/randomSplit-instead-of-a-huge-map-reduce-tp21744.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: randomSplit instead of a huge map reduce ?

2015-02-20 Thread Ashish Rangole
Is there a check you can put in place to avoid creating pairs that aren't in
your set of 20M known pairs? Additionally, once you have converted your arrays
to pairs, you can do aggregateByKey with each pair as the key.
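
A minimal, illustrative sketch of how both suggestions could look in the Java
API. Here 'knownPairs' is a hypothetical in-memory Set of the ~20M valid pairs,
broadcast to the executors (which assumes it fits in executor memory); 'arrays'
and 'sc' are the RDD and context from the original post; imports follow the
earlier sketch plus org.apache.spark.broadcast.Broadcast and java.util.Set.

// Broadcast the set of valid pairs so every task can filter against it locally
Broadcast<Set<Tuple2<String, String>>> validPairs = sc.broadcast(knownPairs);

JavaPairRDD<Tuple2<String, String>, Integer> ones = arrays.flatMapToPair(words -> {
  List<Tuple2<Tuple2<String, String>, Integer>> out = new ArrayList<>();
  for (int i = 0; i < words.length; i++) {
    for (int j = i + 1; j < words.length; j++) {
      Tuple2<String, String> pair = new Tuple2<>(words[i], words[j]);
      if (validPairs.value().contains(pair)) { // skip pairs outside the known set
        out.add(new Tuple2<>(pair, 1));
      }
    }
  }
  return out;
});

// aggregateByKey with a zero value of 0, adding within partitions (seqFunc)
// and across partitions (combFunc)
JavaPairRDD<Tuple2<String, String>, Integer> counts =
    ones.aggregateByKey(0, (acc, v) -> acc + v, (a1, a2) -> a1 + a2);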