I still don't follow how Spark partitions data in a multi-node
environment. Is there a document on how Spark does partitioning of data?
For example, in the word count example, how does Spark distribute words to
multiple nodes?
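
My current understanding is that the shuffle behind reduceByKey defaults to
hash partitioning, i.e. each key is assigned a partition from its hash
code, roughly like the sketch below (the numPartitions value here is
hypothetical). Is that right?

// Roughly what Spark's default HashPartitioner computes: a non-negative
// modulus of the key's hashCode, so every occurrence of the same word
// is routed to the same partition (and hence the same node).
int numPartitions = 4; // hypothetical value; the default is configurable
String word = "spark";
int partition = (word.hashCode() % numPartitions + numPartitions) % numPartitions;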

On Fri, Mar 13, 2015 at 3:01 PM, Tathagata Das <t...@databricks.com> wrote:

> If you want to access the keys in an RDD that is partitioned by key, then
> you can use RDD.mapPartitions(), which gives you access to the whole
> partition as an Iterator<Tuple2<key, value>>. You have the option of
> maintaining or discarding the partitioning information by setting the
> preservesPartitioning flag in mapPartitions (see docs). But use it at your
> own risk: if you modify the keys and yet preserve partitioning, the
> partitioning no longer makes sense, because the hashes of the keys have
> changed.
>
> TD
>
>
>
> On Fri, Mar 13, 2015 at 2:26 PM, Mohit Anchlia <mohitanch...@gmail.com>
> wrote:
>
>> I am trying to find documentation on partitioning, which I can't seem to
>> locate. I am looking at Spark Streaming and was wondering how it
>> partitions RDDs in a multi-node environment. Where are the keys defined
>> that are used for partitioning? For instance, in the example below the
>> keys seem to be implicit:
>>
>> Which one is the key and which one is the value? Or is it called a
>> flatMap because there are no keys?
>>
>> // Split each line into words
>> JavaDStream<String> words = lines.flatMap(
>>   new FlatMapFunction<String, String>() {
>>     @Override public Iterable<String> call(String x) {
>>       return Arrays.asList(x.split(" "));
>>     }
>>   });
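>>
>> (For context, in the same example the pairs stream used below presumably
>> comes from a mapToPair step like this, which is where the key and value
>> become explicit: the word is the key and 1 is the value.)
>>
>> JavaPairDStream<String, Integer> pairs = words.mapToPair(
>>   new PairFunction<String, String, Integer>() {
>>     @Override public Tuple2<String, Integer> call(String s) {
>>       // Key = the word itself, value = an initial count of 1.
>>       return new Tuple2<>(s, 1);
>>     }
>>   });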
>>
>>
>> And are keys available inside Function2, in case they're required for a
>> given use case?
>>
>>
>> JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
>>   new Function2<Integer, Integer, Integer>() {
>>     @Override public Integer call(Integer i1, Integer i2) throws Exception {
>>       return i1 + i2;
>>     }
>>   });
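>>
>> (If not, would the usual workaround be to carry the key inside the value,
>> as in the sketch below, so that the reduce function can see it?)
>>
>> JavaPairDStream<String, Tuple2<String, Integer>> withKey = pairs.mapToPair(
>>   new PairFunction<Tuple2<String, Integer>,
>>                    String, Tuple2<String, Integer>>() {
>>     @Override public Tuple2<String, Tuple2<String, Integer>> call(
>>         Tuple2<String, Integer> kv) {
>>       // Duplicate the key into the value.
>>       return new Tuple2<>(kv._1(), new Tuple2<>(kv._1(), kv._2()));
>>     }
>>   });
>>
>> JavaPairDStream<String, Tuple2<String, Integer>> keyedCounts =
>>   withKey.reduceByKey(
>>     new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>,
>>                   Tuple2<String, Integer>>() {
>>       @Override public Tuple2<String, Integer> call(
>>           Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
>>         // a._1() is the key, visible here only because it was packed
>>         // into the value above.
>>         return new Tuple2<>(a._1(), a._2() + b._2());
>>       }
>>     });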
>>
>
