Hello, I want to write a dataframe to ORC or Parquet files with a set average file size, while preserving the dataframe's sorting. The task: I receive a dataframe I know nothing about as an argument, and I need to write ORC or Parquet files of roughly constant size, with minimal variance.

Most repartitioning methods work with either a number of partitions or a maximum number of rows per file. Both can be derived from the average row size, which can be estimated by sampling the dataframe, serializing the sample with compression, and measuring its size. This is of course not exact, since compression of the columnar form behaves differently on a sample than on the full data, but as an estimate it might just do.
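For example, the estimation step could look roughly like this (a minimal sketch; the function name, sample fraction, and temporary path are placeholders I made up, not an existing API):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

// Estimate the average on-disk row size by writing a compressed sample
// and measuring the resulting files.
def estimateAvgRowBytes(df: DataFrame,
                        sampleFraction: Double = 0.01,
                        tmpPath: String = "/tmp/row_size_probe"): Double = {
  val spark = df.sparkSession
  val sample = df.sample(withReplacement = false, fraction = sampleFraction)
  val sampleRows = sample.count()
  require(sampleRows > 0, "empty sample; increase sampleFraction")

  // Write the sample with the same format/compression as the final output
  sample.write.mode("overwrite").option("compression", "snappy").parquet(tmpPath)

  val path = new Path(tmpPath)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  val bytes = fs.getContentSummary(path).getLength
  fs.delete(path, true)

  bytes.toDouble / sampleRows
}
```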
As for the repartitioning itself, I have tried:

- df.repartition - Can split into equally sized partitions, but because of hash partitioning the sorting is not preserved.
- df.repartitionByRange.sortWithinPartitions - Can preserve sorting if the original sort keys are known (I might not know them), although some say repartitionByRange does not always preserve sorting. And if the keys are not uniformly distributed, the file sizes will vary a lot.
- df.coalesce - Sorting seems to be preserved, although some say that is not guaranteed in every Spark version. Partition sizes may also vary a lot, and it can only decrease the number of partitions.
- df.write.option("maxRecordsPerFile", 10000) - Not sure about sorting preservation. There also still seems to be a small-files problem, since there is no minimum on records per file. Coalescing to one partition and then using maxRecordsPerFile won't work, since the data might not fit in one partition.

What I am trying to solve looks like a complex bin packing problem, which should also parallelize. As a simple approach, I thought I could count the rows in each partition and build a new dataframe out of "tasks" along these lines: say I want 100 MB files and the average row size is 2 MB (just for example), which means 50 rows per file/partition.

[repartition.png]

But I am not quite sure this is the best way of doing it (a rough sketch of the idea is at the end of the post). Also, a lot of optimization of task and partition locality would be needed to reduce network load.

Are there any built-in features or libraries that might help me solve this? Or has anyone had the same issue?

Thanks in advance,
Danil
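Here is the row-count approach sketched in code (my own untested illustration, not an existing API: repackByRowCount and IdPartitioner are names I made up, and the re-sort by a carried global index is my assumption about how to keep the original order through the shuffle, since fetch order across map tasks is not guaranteed):

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.DataFrame

// Trivial partitioner: the key already is the target partition id.
class IdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

def repackByRowCount(df: DataFrame, rowsPerFile: Long): DataFrame = {
  val spark = df.sparkSession
  val rdd = df.rdd.cache() // traversed twice: once to count, once to repack

  // Rows per input partition, in partition order
  val counts = rdd
    .mapPartitionsWithIndex((i, it) => Iterator((i, it.size.toLong)))
    .collect()
    .sortBy(_._1)
    .map(_._2)

  // Global offset of the first row of each input partition
  val offsets = counts.scanLeft(0L)(_ + _)
  val numFiles = math.max(1, math.ceil(offsets.last.toDouble / rowsPerFile).toInt)

  // Tag every row with a target file id derived from its global index,
  // shuffle by that id, then restore order inside each target partition.
  // The in-memory sort is fine here since a target partition is ~one file.
  val repacked = rdd
    .mapPartitionsWithIndex { (i, it) =>
      var globalIdx = offsets(i)
      it.map { row =>
        val tagged = ((globalIdx / rowsPerFile).toInt, (globalIdx, row))
        globalIdx += 1
        tagged
      }
    }
    .partitionBy(new IdPartitioner(numFiles))
    .mapPartitions(_.map(_._2).toArray.sortBy(_._1).iterator.map(_._2))

  spark.createDataFrame(repacked, df.schema)
}
```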