Hmm, doing help(rdd) in PySpark doesn't show a method called repartition().
Calling rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.

The approach I'm going with to repartition my MappedRDD is to key each
element by a random int and then partition by key.

So something like:

from random import randint

rdd = sc.textFile('s3n://gzipped_file_brah.gz')  # rdd has 1 partition; minSplits has no effect because gzip isn't splittable
keyed_rdd = rdd.keyBy(lambda x: randint(1, 100)) # key the RDD so we can partition it
partitioned_rdd = keyed_rdd.partitionBy(10)      # partitioned_rdd has 10 partitions
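
To sanity-check the result and strip the random keys back off, I'm planning
on roughly the following (just a sketch; the _jrdd.splits().size() trick is
the same one from my earlier message for reading the partition count on
0.9.0):

partitioned_rdd._jrdd.splits().size()   # expect 10
values_rdd = partitioned_rdd.values()   # drop the synthetic random keys
values_rdd.count()                      # should match rdd.count()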

Are you saying I don't have to do this?
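
If repartition() behaves like its Scala counterpart, I'd expect all of the
above to collapse into a single call along these lines (hypothetical for me,
since the method isn't on RDD in the 0.9.0 PySpark I have):

repartitioned_rdd = rdd.repartition(10)  # reshuffle the single gzip partition into 10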

Nick



On Tue, Apr 1, 2014 at 7:38 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Hm, yeah, the docs are not clear on this one. The function you're looking
> for to change the number of partitions on any ol' RDD is "repartition()",
> which is available in master but for some reason doesn't seem to show up in
> the latest docs. Sorry about that, I also didn't realize partitionBy() had
> this behavior from reading the Python docs (though it is consistent with
> the Scala API, just more type-safe there).
>
>
> On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Just an FYI, it's not obvious from the docs
>> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>
>> that the following code should fail:
>>
>> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>> a._jrdd.splits().size()
>> a.count()
>> b = a.partitionBy(5)
>> b._jrdd.splits().size()
>> b.count()
>>
>> I figured out from the example that if I generated a key by doing this
>>
>> b = a.map(lambda x: (x, x)).partitionBy(5)
>>
>> then all would be well.
>>
>> In other words, partitionBy() only works on RDDs of tuples. Is that
>> correct?
>>
>> Nick
>>
>>
>
>
