Hmm, doing help(rdd) in PySpark doesn't show a method called repartition(). Trying rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.
The approach I'm going with to partition my MappedRDD is to key it by a
random int, and then partition it. So something like:

rdd = sc.textFile('s3n://gzipped_file_brah.gz')  # rdd has 1 partition; minSplits is not actionable due to gzip
keyed_rdd = rdd.keyBy(lambda x: randint(1,100))  # we key the RDD so we can partition it
partitioned_rdd = keyed_rdd.partitionBy(10)      # rdd has 10 partitions

Are you saying I don't have to do this?

Nick


On Tue, Apr 1, 2014 at 7:38 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Hm, yeah, the docs are not clear on this one. The function you're looking
> for to change the number of partitions on any ol' RDD is "repartition()",
> which is available in master but for some reason doesn't seem to show up
> in the latest docs. Sorry about that, I also didn't realize partitionBy()
> had this behavior from reading the Python docs (though it is consistent
> with the Scala API, just more type-safe there).
>
>
> On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Just an FYI, it's not obvious from the docs
>> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>
>> that the following code should fail:
>>
>> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>> a._jrdd.splits().size()
>> a.count()
>> b = a.partitionBy(5)
>> b._jrdd.splits().size()
>> b.count()
>>
>> I figured out from the example that if I generated a key by doing this
>>
>> b = a.map(lambda x: (x, x)).partitionBy(5)
>>
>> then all would be well.
>>
>> In other words, partitionBy() only works on RDDs of tuples. Is that
>> correct?
>>
>> Nick
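
For reference, here is a minimal, self-contained sketch of the random-key workaround discussed above, assuming PySpark 0.9.x; the "local" master, the app name, and the final map that strips the temporary keys are illustrative additions, and the S3 path and bucket counts are taken from the thread:

from random import randint
from pyspark import SparkContext

sc = SparkContext("local", "repartition-workaround")  # assumed local setup for illustration

# A single gzipped file loads as one partition (gzip is not splittable).
rdd = sc.textFile('s3n://gzipped_file_brah.gz')

# Key each record by a random int so partitionBy() has (key, value) pairs to
# hash, spread the pairs across 10 partitions, then drop the temporary keys.
repartitioned = (rdd.keyBy(lambda line: randint(1, 100))
                    .partitionBy(10)
                    .map(lambda kv: kv[1]))

print(repartitioned._jrdd.splits().size())  # expect 10

Once repartition() is available (it exists in master per Aaron's reply), rdd.repartition(10) should replace the keyBy/partitionBy/map chain.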