I want to avoid the small-files problem in Spark without having to manually calibrate a `repartition` at the end of every Spark application I write, since the amount of data passing through sadly isn't very predictable. We're reading from and writing to HDFS.
I know other tools like Pig can set the number of reducers, and thus the number of output partitions, based on the size of the input data, but I'd like to know if anyone has a better way to do this with Spark's primitives.

Right now we have an okay solution, but it's starting to break down. We cache our output RDD at the end of the application's flow, map over it once more to estimate how big it will be once pickled and gzipped (we're on PySpark), and then compute a partition count from a target partition size and repartition to it. The problem is that we want to work with datasets bigger than what will comfortably fit in the cache.

Just spitballing here, but any of these would be amazing: the ability to ask Spark how big it thinks each partition might be; the ability to pass an accumulator to `repartition` whose value wouldn't be read until the prior stage had finished; or the ability to have Spark repartition to a target partition size for us. Thanks for any help you can give me!
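For concreteness, our current workaround looks roughly like the sketch below. The target size and helper names are mine, not anything standard; the per-record gzip call is only a rough proxy for the real output cost, since the actual writer compresses whole partitions, not individual records.

```python
import gzip
import pickle

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # hypothetical target: ~128 MB per output file

def pickled_gzipped_size(record):
    # Approximate on-disk cost of one record: pickle it, gzip it, count bytes.
    # Gzipping records one at a time loses cross-record compression, so this
    # tends to overestimate the real file size.
    return len(gzip.compress(pickle.dumps(record)))

def repartition_to_target(rdd):
    # Cache because we traverse the data twice: once to size it, once to write.
    # This is exactly the step that breaks down when the dataset outgrows the cache.
    rdd.cache()
    total_bytes = rdd.map(pickled_gzipped_size).sum()
    num_parts = max(1, -(-total_bytes // TARGET_PARTITION_BYTES))  # ceiling division
    return rdd.repartition(num_parts)
```

The size-estimation pass is the expensive part; everything after it is just a `repartition` with a computed partition count.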