Update: I'm now using this ghetto function to partition the RDD I get back when I call textFile() on a gzipped file:
# Python 2.6
def partitionRDD(rdd, numPartitions):
    counter = {'a': 0}  # mutable dict so the nested closure can update it
    def count_up(x):
        counter['a'] += 1
        return counter['a']  # sequential key for each element
    return (rdd.keyBy(count_up)
               .partitionBy(numPartitions)
               .map(lambda (count, data): data))  # strip the temporary keys

If there's supposed to be a built-in Spark method to do this, I'd love to learn more about it.

Nick


On Tue, Apr 1, 2014 at 7:59 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Hmm, doing help(rdd) in PySpark doesn't show a method called repartition().
> Calling rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.
>
> The approach I'm going with to partition my MappedRDD is to key it by a
> random int, and then partition it.
>
> So something like:
>
> from random import randint
>
> rdd = sc.textFile('s3n://gzipped_file_brah.gz')  # rdd has 1 partition; minSplits is not actionable due to gzip
> keyed_rdd = rdd.keyBy(lambda x: randint(1, 100))  # key the RDD so we can partition it
> partitioned_rdd = keyed_rdd.partitionBy(10)  # rdd now has 10 partitions
>
> Are you saying I don't have to do this?
>
> Nick
>
>
> On Tue, Apr 1, 2014 at 7:38 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> Hm, yeah, the docs are not clear on this one. The function you're looking
>> for to change the number of partitions on any ol' RDD is repartition(),
>> which is available in master but for some reason doesn't seem to show up
>> in the latest docs. Sorry about that; I also didn't realize partitionBy()
>> had this behavior from reading the Python docs (though it is consistent
>> with the Scala API, just more type-safe there).
>>
>>
>> On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>>
>>> Just an FYI, it's not obvious from the docs
>>> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>
>>> that the following code should fail:
>>>
>>> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>>> a._jrdd.splits().size()
>>> a.count()
>>> b = a.partitionBy(5)
>>> b._jrdd.splits().size()
>>> b.count()
>>>
>>> I figured out from the example that if I generated a key by doing this
>>>
>>> b = a.map(lambda x: (x, x)).partitionBy(5)
>>>
>>> then all would be well.
>>>
>>> In other words, partitionBy() only works on RDDs of tuples. Is that
>>> correct?
>>>
>>> Nick
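
For reference, on a build where RDD.repartition() is exposed in PySpark (per Aaron above, it's in master), the whole workaround should collapse to a one-liner. A minimal sketch, assuming the method takes the target partition count like its Scala counterpart:

rdd = sc.textFile('s3n://gzipped_file_brah.gz')  # still 1 partition, since gzip isn't splittable
rdd = rdd.repartition(10)  # shuffles into 10 partitions; no keying needed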