I still don't follow how Spark partitions data in a multi-node environment. Is there a document on how Spark does partitioning of data? For example, in the word-count example, how does Spark distribute the words to multiple nodes?
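For reference, Spark's default partitioner for keyed operations is hash-based: a key's partition is its `hashCode()` taken modulo the number of partitions (kept non-negative). The sketch below is a minimal standalone illustration of that idea in plain Java, not Spark code; it shows why every occurrence of the same word is routed to the same partition, so all of its counts can be reduced on one node.

```java
import java.util.*;

public class HashPartitionSketch {
    // Sketch of hash partitioning: the key's partition is
    // hashCode() mod numPartitions, adjusted to be non-negative.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        String[] words = {"the", "quick", "brown", "fox", "the"};
        Map<Integer, List<String>> partitions = new HashMap<>();
        for (String w : words) {
            partitions.computeIfAbsent(partitionFor(w, numPartitions),
                                       p -> new ArrayList<>()).add(w);
        }
        // Identical words always hash to the same partition, so both
        // occurrences of "the" end up together.
        System.out.println(partitions);
    }
}
```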
On Fri, Mar 13, 2015 at 3:01 PM, Tathagata Das <t...@databricks.com> wrote:

> If you want to access the keys in an RDD that is partitioned by key, then
> you can use RDD.mapPartitions(), which gives you access to the whole
> partition as an Iterator<key, value>. You have the option of maintaining
> the partitioning information or not by setting the preservesPartitioning
> flag in mapPartitions (see docs). But use it at your own risk: if you
> modify the keys and yet preserve partitioning, the partitioning no longer
> makes sense, since the hashes of the keys have changed.
>
> TD
>
> On Fri, Mar 13, 2015 at 2:26 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>
>> I am trying to look for documentation on partitioning, which I can't
>> seem to find. I am looking at Spark Streaming and was wondering how it
>> partitions RDDs in a multi-node environment. Where are the keys defined
>> that are used for partitioning? For instance, in the example below the
>> keys seem to be implicit:
>>
>> Which one is the key and which one is the value? Or is it called a
>> flatMap because there are no keys?
>>
>> // Split each line into words
>> JavaDStream<String> words = lines.flatMap(
>>     new FlatMapFunction<String, String>() {
>>         @Override public Iterable<String> call(String x) {
>>             return Arrays.asList(x.split(" "));
>>         }
>>     });
>>
>> And are keys available inside Function2 in case they are required for a
>> given use case?
>>
>> JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
>>     new Function2<Integer, Integer, Integer>() {
>>         @Override public Integer call(Integer i1, Integer i2) throws Exception {
>>             return i1 + i2;
>>         }
>>     });
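On the question of where the keys come from: flatMap has no keys at all, it just produces a stream of values. Keys only exist once you build a pair DStream (in the word-count example, a mapToPair step between the two snippets above turns each word into a (word, 1) pair), and reduceByKey then groups pairs by that key before applying the Function2. The sketch below simulates that flow in plain Java with no Spark dependency, purely as an illustration; the merge step plays the same role as `i1 + i2` in the Function2 above.

```java
import java.util.*;

public class ReduceByKeySketch {
    // Simulation of reduceByKey for word count: group (word, 1) pairs by
    // key, folding each group's values with integer addition -- the same
    // role Function2 plays in the Spark snippet above.
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum); // i1 + i2
        }
        return counts;
    }

    public static void main(String[] args) {
        String line = "to be or not to be";
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // flatMap stage: split the line into words (no keys yet).
        for (String w : line.split(" ")) {
            // mapToPair stage: the word becomes the key, 1 the value.
            pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        System.out.println(reduceByKey(pairs));
    }
}
```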