Hi,

I am new to Spark and I think I missed something very basic.

I have the following use case (I use Java and run Spark locally on my
laptop):


I have a JavaRDD<String[]>:

- The RDD contains around 72,000 string arrays (String[]).

- Each array contains around 80 words on average.


What I want to do is convert each array into a list of all the pairs of
words it contains, for example:

Input: String[] words = ['a', 'b', 'c']

Output: List<Tuple2<String, String>> pairs = [('a', 'b'), ('a', 'c'), ('b', 'c')]

and then I want to count the number of times each pair appeared, so my final
output should be something like:

Output: List<Tuple3<String, String, Integer>> result = [('a', 'b', 3), ('a', 'c',
8), ('b', 'c', 10)]
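
In code, the transformation I mean looks roughly like this (a minimal
sketch against the Spark 1.x Java API; 'wordArrays' and 'countPairs' are
just placeholder names, not my actual code):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

// Count co-occurring word pairs; 'wordArrays' stands for the
// JavaRDD<String[]> described above.
static JavaPairRDD<Tuple2<String, String>, Integer> countPairs(JavaRDD<String[]> wordArrays) {
    return wordArrays
        .flatMapToPair(words -> {
            // Emit every pair (words[i], words[j]) with i < j, tagged with a count of 1.
            List<Tuple2<Tuple2<String, String>, Integer>> pairs = new ArrayList<>();
            for (int i = 0; i < words.length; i++) {
                for (int j = i + 1; j < words.length; j++) {
                    pairs.add(new Tuple2<>(new Tuple2<>(words[i], words[j]), 1));
                }
            }
            return pairs; // Spark 1.x expects an Iterable; on Spark 2.x return pairs.iterator()
        })
        .reduceByKey((a, b) -> a + b); // sum the 1s for each unique pair
}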


The problem:

Since each array contains around 80 words, each one yields around 3,200
pairs (80 choose 2 = 3,160), so after “mapping” my entire RDD I get 3,200 *
72,000 = *230,400,000* pairs to reduce, which requires way too much memory.

(I know I have only around *20,000,000* unique pairs!)

I already modified my code to use 'mapPartitions' instead of 'map'. That
definitely improved performance, but I still feel I'm doing something
completely wrong.
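
Concretely, my mapPartitions version looks roughly like this (again a
sketch, not my actual code; same imports as the first snippet, and
'pairsOf' is a hypothetical helper holding the nested pair loop from
above):

// Hypothetical helper: all pairs (words[i], words[j]) with i < j, each with a count of 1.
static List<Tuple2<Tuple2<String, String>, Integer>> pairsOf(String[] words) {
    List<Tuple2<Tuple2<String, String>, Integer>> pairs = new ArrayList<>();
    for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) {
            pairs.add(new Tuple2<>(new Tuple2<>(words[i], words[j]), 1));
        }
    }
    return pairs;
}

static JavaPairRDD<Tuple2<String, String>, Integer> countPairsByPartition(JavaRDD<String[]> wordArrays) {
    return wordArrays
        .mapPartitionsToPair(iter -> {
            // One function call per partition instead of per element...
            List<Tuple2<Tuple2<String, String>, Integer>> out = new ArrayList<>();
            while (iter.hasNext()) {
                out.addAll(pairsOf(iter.next()));
            }
            return out; // ...but this still buffers a whole partition's pairs in memory at once
        })
        .reduceByKey((a, b) -> a + b);
}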


I was wondering whether this is the right 'Spark way' to solve this kind of
problem, or whether I should instead split my original RDD into smaller
parts (using randomSplit), iterate over the parts, aggregate each part's
counts into a running result RDD (using 'union'), and then move on to the
next part, as in the sketch below.
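
The alternative I have in mind, in sketch form (the chunk count, the seed,
and the reuse of the hypothetical 'countPairs' helper from above are all
placeholders):

// Split into 10 roughly equal chunks, count pairs per chunk, and merge the partial counts.
double[] weights = new double[10];
java.util.Arrays.fill(weights, 1.0 / weights.length);
JavaRDD<String[]>[] parts = wordArrays.randomSplit(weights, 42L);

JavaPairRDD<Tuple2<String, String>, Integer> total = null;
for (JavaRDD<String[]> part : parts) {
    JavaPairRDD<Tuple2<String, String>, Integer> partial = countPairs(part);
    total = (total == null) ? partial : total.union(partial);
}
// union() keeps duplicate keys from different chunks, so merge them with one final reduce.
JavaPairRDD<Tuple2<String, String>, Integer> result = total.reduceByKey((a, b) -> a + b);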


Can anyone please explain which approach is better?


Thank you very much,

Shlomi.



