If I'm understanding you correctly, there are several ways you could do that.
Here's one, continuing from the previous example:

// rdd26: RDD[String] split by first letter into 26 partitions

val range = 0 to ('M' - 'A')   // partition indices 0 through 12, i.e. letters A-M

rdd26.mapPartitionsWithIndex { (idx, itr) =>
  if (range.contains(idx)) itr.map(_.toUpperCase) else itr
}
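
To eyeball which strings were upper-cased, you can tag each element with its
partition index, the same trick from Mark's message below (a sketch):

rdd26.mapPartitionsWithIndex { (idx, itr) =>
  if (range.contains(idx)) itr.map(_.toUpperCase) else itr
}.mapPartitionsWithIndex((idx, itr) => itr.map(s => (idx, s)))
 .collect.foreach(println)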

On Tue, Jan 28, 2014 at 12:52 PM, David Thomas <dt5434...@gmail.com> wrote:

> Thank you! That helps.
>
> A follow-up question on this: how can I apply a function to only a subset
> of this RDD? Let's say I need all strings starting in the range 'A' - 'M'
> to have toUpperCase applied, without touching the rest. Is that possible
> without running an 'if' condition on all the partitions in the cluster?
>
>
>
> On Tue, Jan 28, 2014 at 1:20 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> scala> import org.apache.spark.RangePartitioner
>>
>> scala> val rdd = sc.parallelize(List("apple", "Ball", "cat", "dog",
>> "Elephant", "fox", "gas", "horse", "index", "jet", "kitsch", "long",
>> "moon", "Neptune", "ooze", "Pen", "quiet", "rose", "sun", "talk",
>> "umbrella", "voice", "Walrus", "xeon", "Yam", "zebra"))
>>
>> scala> rdd.keyBy(s => s(0).toUpper)
>> res0: org.apache.spark.rdd.RDD[(Char, String)] = MappedRDD[1] at keyBy at
>> <console>:15
>>
>> scala> res0.partitionBy(new RangePartitioner[Char, String](26,
>> res0)).values
>> res2: org.apache.spark.rdd.RDD[String] = MappedRDD[5] at values at
>> <console>:18
>>
>> scala> res2.mapPartitionsWithIndex((idx, itr) => itr.map(s => (idx,
>> s))).collect.foreach(println)
>>
>> (0,apple)
>> (1,Ball)
>> (2,cat)
>> (3,dog)
>> (4,Elephant)
>> (5,fox)
>> (6,gas)
>> (7,horse)
>> (8,index)
>> (9,jet)
>> (10,kitsch)
>> (11,long)
>> (12,moon)
>> (13,Neptune)
>> (14,ooze)
>> (15,Pen)
>> (16,quiet)
>> (17,rose)
>> (18,sun)
>> (19,talk)
>> (20,umbrella)
>> (21,voice)
>> (22,Walrus)
>> (23,xeon)
>> (24,Yam)
>> (25,zebra)
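>>
>> If you just want the per-partition counts rather than the full listing,
>> something like this should work in the same session (an untested sketch):
>>
>> res2.mapPartitions(itr => Iterator(itr.size)).collect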
>>
>>
>>
>> On Tue, Jan 28, 2014 at 11:48 AM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> If you do something like:
>>>
>>> rdd.map{ str => (str.take(1), str) }
>>>
>>> you will have an RDD[(String, String)] where the key is the first
>>> character of the string. Now when you perform an operation that uses
>>> partitioning (e.g. reduceByKey), all the strings starting with A will end
>>> up in the same reduce task, all the strings starting with B in another,
>>> and so on. Note that you may not be able to enforce that each *machine*
>>> gets a different letter, but in most cases that doesn't particularly
>>> matter as long as you get the "all values for a given key go to the same
>>> reducer" behaviour.
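>>>
>>> If you really do need letter i pinned to partition i deterministically, a
>>> custom Partitioner is one way (a minimal sketch; FirstLetterPartitioner is
>>> a made-up name, and it assumes non-empty alphabetic keys):
>>>
>>> import org.apache.spark.Partitioner
>>>
>>> class FirstLetterPartitioner extends Partitioner {
>>>   def numPartitions: Int = 26
>>>   // 'A'/'a' -> 0, 'B'/'b' -> 1, ..., 'Z'/'z' -> 25
>>>   def getPartition(key: Any): Int = key.toString.head.toUpper - 'A'
>>> }
>>>
>>> rdd.map { str => (str.take(1), str) }.partitionBy(new FirstLetterPartitioner)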
>>>
>>> Perhaps if you expand on your use case we can provide more detailed
>>> assistance.
>>>
>>>
>>> On Tue, Jan 28, 2014 at 9:35 PM, David Thomas <dt5434...@gmail.com> wrote:
>>>
>>>> Let's say I have an RDD of Strings and there are 26 machines in the
>>>> cluster. How can I repartition the RDD in such a way that all strings
>>>> starting with A get collected on machine1, B on machine2, and so on?
>>>>
>>>>
>>>
>>
>