Re: randomSplit instead of a huge map reduce ?
- Divide and conquer with reduceByKey (each pair being the key, as Ashish mentioned) would work; this looks like a classic MapReduce-with-combiners problem.
- reduceByKey uses map-side combining, and so does aggregateByKey: both are implemented on top of combineByKey.
- Could we optimize this further by using combineByKey directly?

Cheers
k/
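A minimal sketch of the reduceByKey approach, assuming the Spark 1.x-era Java API used in the original post; the class and method names here are illustrative, not from the thread:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    import scala.Tuple2;

    public class WordPairCounts {
        // words: the JavaRDD<String[]> from the original post (~72,000 arrays).
        static JavaPairRDD<Tuple2<String, String>, Integer> countPairs(JavaRDD<String[]> words) {
            return words
                // Emit one ((a, b), 1) record per unordered word pair in each array.
                // Note: in Spark 1.x this lambda returns an Iterable; Spark 2.x expects an Iterator.
                .flatMapToPair(arr -> {
                    List<Tuple2<Tuple2<String, String>, Integer>> out = new ArrayList<>();
                    for (int i = 0; i < arr.length; i++) {
                        for (int j = i + 1; j < arr.length; j++) {
                            out.add(new Tuple2<>(new Tuple2<>(arr[i], arr[j]), 1));
                        }
                    }
                    return out;
                })
                // reduceByKey combines map-side, so each partition emits at most
                // one record per distinct pair before the shuffle.
                .reduceByKey((a, b) -> a + b);
        }
    }

Because reduceByKey (like aggregateByKey and combineByKey) combines on the map side, each partition ships at most one record per distinct pair, which is what keeps the shuffle near the ~20M unique pairs rather than the ~230M raw ones.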
randomSplit instead of a huge map reduce ?
Hi,

I am new to Spark and I think I missed something very basic. I have the following use case (I use Java and run Spark locally on my laptop):

I have a JavaRDD<String[]>:
- The RDD contains around 72,000 arrays of strings (String[]).
- Each array contains 80 words (on average).

What I want to do is convert each array into a new array/list of pairs, for example:

Input:  String[] words = ['a', 'b', 'c']
Output: List[(String, String)] pairs = [('a', 'b'), ('a', 'c'), ('b', 'c')]

and then count the number of times each pair appears, so my final output should be something like:

Output: List[(String, String, Integer)] result = [('a', 'b', 3), ('a', 'c', 8), ('b', 'c', 10)]

The problem: since each array contains around 80 words, it produces around 3,200 pairs per array (80 * 79 / 2 = 3,160), so after "mapping" my entire RDD I get 3,200 * 72,000 = *230,400,000* pairs to reduce, which requires way too much memory. (I know I have only around *20,000,000* unique pairs!)

I already modified my code to use 'mapPartitions' instead of 'map'. It definitely improved the performance, but I still feel I'm doing something completely wrong.

I was wondering whether this is the right 'Spark way' to solve this kind of problem, or whether I should instead split my original RDD into smaller parts (using randomSplit), iterate over each part, aggregate the results into some result RDD (using 'union'), and move on to the next part.

Can anyone please tell me which solution is better?

Thank you very much,
Shlomi.
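One way to read the 'mapPartitions' change described above is as per-partition pre-aggregation: collapse duplicate pairs into a local HashMap before anything is shuffled. A sketch under that assumption (Spark 1.x Java API; class and method names are illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    import scala.Tuple2;

    public class PartitionPreAggregate {
        static JavaPairRDD<Tuple2<String, String>, Integer> countPairs(JavaRDD<String[]> words) {
            return words
                .mapPartitionsToPair(arrays -> {
                    // Collapse duplicate pairs within this partition into one count each.
                    Map<Tuple2<String, String>, Integer> local = new HashMap<>();
                    while (arrays.hasNext()) {
                        String[] arr = arrays.next();
                        for (int i = 0; i < arr.length; i++) {
                            for (int j = i + 1; j < arr.length; j++) {
                                local.merge(new Tuple2<>(arr[i], arr[j]), 1, Integer::sum);
                            }
                        }
                    }
                    // Spark 1.x expects an Iterable here; Spark 2.x would need .iterator().
                    List<Tuple2<Tuple2<String, String>, Integer>> out = new ArrayList<>();
                    for (Map.Entry<Tuple2<String, String>, Integer> e : local.entrySet()) {
                        out.add(new Tuple2<>(e.getKey(), e.getValue()));
                    }
                    return out;
                })
                // Merge the partition-local counts across partitions.
                .reduceByKey((a, b) -> a + b);
        }
    }

With at most ~20M unique pairs, each partition's HashMap stays bounded by the number of distinct pairs it sees, rather than the full ~230M raw records.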
Re: randomSplit instead of a huge map reduce ?
Is there a check you can put in place to avoid creating pairs that aren't in your set of 20M unique pairs? Additionally, once you have your arrays converted to pairs, you can do aggregateByKey with each pair being the key.
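For reference, a minimal sketch of the aggregateByKey variant described above, assuming the records have already been mapped to ((a, b), 1) pairs; Spark 1.x Java API, illustrative names:

    import org.apache.spark.api.java.JavaPairRDD;

    import scala.Tuple2;

    public class AggregateByKeyCounts {
        // pairs: ((a, b), 1) records produced from the word arrays.
        // aggregateByKey takes a zero value plus separate within-partition and
        // cross-partition merge functions; for plain counting both are addition.
        static JavaPairRDD<Tuple2<String, String>, Integer> count(
                JavaPairRDD<Tuple2<String, String>, Integer> pairs) {
            return pairs.aggregateByKey(
                    0,                            // zero value per key
                    (acc, one) -> acc + one,      // fold a record into the partition-local accumulator
                    (acc1, acc2) -> acc1 + acc2); // merge accumulators across partitions
        }
    }

aggregateByKey is the more general tool since its accumulator type can differ from the value type; for a plain count, reduceByKey((a, b) -> a + b) is equivalent and simpler.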