I want to avoid the small-files problem when using Spark without having to
manually calibrate a `repartition` at the end of each Spark application I
write, since the amount of data passing through sadly isn't all that
predictable. We're reading from and writing to HDFS.

I know other tools like Pig can set the number of reducers and thus the
number of output partitions for you based on the size of the input data,
but I want to know if anyone else has a better way to do this with Spark's
primitives.

Right now we have an OK solution, but it is starting to break down. We cache
our output RDD at the end of the application's flow, then map over it once
more to guess how big it will be when pickled and gzipped (we're in
PySpark), and then compute a partition count from a target partition size.
The problem is that we want to work with datasets bigger than what will
comfortably fit in the cache. Just spitballing here, but what would be
amazing is the ability to ask Spark how big it thinks each partition might
be, or the ability to pass an accumulator to `repartition` whose value
wouldn't be read until the prior stage had finished, or the ability to just
have Spark repartition to a target partition size for us.
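For reference, here's roughly what the current hack looks like. This is a
simplified sketch, not our real code: `build_output_rdd`, the target
partition size, and the output path are all placeholders.

```python
import gzip
import pickle

from pyspark import SparkContext

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # illustrative target size per output file


def estimated_compressed_size(record):
    # Guess one record's on-disk footprint by pickling and gzipping it.
    return len(gzip.compress(pickle.dumps(record)))


sc = SparkContext(appName="repartition-by-size-sketch")

# build_output_rdd() stands in for whatever the application's real flow produces.
output = build_output_rdd(sc)
output.cache()

# One extra pass to sum the per-record size estimates, then derive a partition count.
total_bytes = output.map(estimated_compressed_size).sum()
num_partitions = max(1, int(total_bytes // TARGET_PARTITION_BYTES))

output.repartition(num_partitions).saveAsTextFile("hdfs:///path/to/output")
```

This works as long as the output fits in the cache; once it doesn't, the
sizing pass forces a recomputation of the whole flow, which is where it
starts to break down.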

Thanks for any help you can give me!
