Update: I'm now using this ghetto function to partition the RDD I get back when I call textFile() on a gzipped file:
# Python 2.6
def partitionRDD(rdd, numPartitions):
    counter = {'a': 0}  # mutable dict so the nested closure can update it
    def count_up(x):
        counter['a'] += 1
        return counter['a']  # sequential key for each element
    return (rdd.keyBy(count_up)
               .partitionBy(numPartitions)
               .map(lambda (count, data): data))  # strip the temporary keys

If there's supposed to be a built-in Spark method to do this, I'd love to learn more about it.

Nick


On Tue, Apr 1, 2014 at 7:59 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Hmm, doing help(rdd) in PySpark doesn't show a method called repartition().
> Calling rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.
>
> The approach I'm going with to partition my MappedRDD is to key it by a
> random int, and then partition it.
>
> So something like:
>
> from random import randint
>
> rdd = sc.textFile('s3n://gzipped_file_brah.gz')  # rdd has 1 partition; minSplits is not actionable due to gzip
> keyed_rdd = rdd.keyBy(lambda x: randint(1, 100))  # key the RDD so we can partition it
> partitioned_rdd = keyed_rdd.partitionBy(10)  # rdd now has 10 partitions
>
> Are you saying I don't have to do this?
>
> Nick
>
>
> On Tue, Apr 1, 2014 at 7:38 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> Hm, yeah, the docs are not clear on this one. The function you're looking
>> for to change the number of partitions on any ol' RDD is repartition(),
>> which is available in master but for some reason doesn't seem to show up
>> in the latest docs. Sorry about that; I also didn't realize partitionBy()
>> had this behavior from reading the Python docs (though it is consistent
>> with the Scala API, just more type-safe there).
>>
>>
>> On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>>
>>> Just an FYI, it's not obvious from the docs
>>> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>
>>> that the following code should fail:
>>>
>>> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>>> a._jrdd.splits().size()
>>> a.count()
>>> b = a.partitionBy(5)
>>> b._jrdd.splits().size()
>>> b.count()
>>>
>>> I figured out from the example that if I generated a key by doing this
>>>
>>> b = a.map(lambda x: (x, x)).partitionBy(5)
>>>
>>> then all would be well.
>>>
>>> In other words, partitionBy() only works on RDDs of tuples. Is that
>>> correct?
>>>
>>> Nick
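
For reference, on a build where RDD.repartition() is exposed in PySpark (per Aaron above, it's in master), the whole workaround should collapse to a one-liner. A minimal sketch, assuming the method takes the target partition count like its Scala counterpart:

rdd = sc.textFile('s3n://gzipped_file_brah.gz')  # still 1 partition, since gzip isn't splittable
rdd = rdd.repartition(10)  # shuffles into 10 partitions; no keying needed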