I think you're right about the cause. The current estimate is what is buffered in memory, so it includes all of the intermediate data for the last page before it is finalized and compressed.
We could probably get a better estimate by using the amount of buffered data and how large other pages in a column were after fully encoding and compressing. So if you have 5 pages compressed and buffered, and another 1000 values, use the compression ratio of the 5 pages to estimate the final size. We'd probably want to account for some overhead for the header. And we'd want to separate the amount of buffered data from our row-group size estimate, which are currently the same thing. (A rough sketch of this idea is below the quoted message.)

rb

On Thu, Jun 21, 2018 at 1:17 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Hi All,
>
> One of our customers faced the following issue. parquet.block.size is
> configured to 128M. (parquet.writer.max-padding is left with the default
> 8M.) On average, 7 row groups are generated in one block, with sizes
> ~74M, ~16M, ~12M, ~9M, ~7M, ~5M, and ~4M. By increasing the padding to
> e.g. 60M, only one row group per block is written, but that wastes disk
> space. Investigating the logs, it turns out that parquet-mr thinks the
> row group is already close to 128M, so it writes the first one, then
> realizes there is still space left before reaching the block size, and
> so on:
>
> INFO hadoop.InternalParquetRecordWriter: mem size 134,673,545 > 134,217,728: flushing 484,972 records to disk.
> INFO hadoop.InternalParquetRecordWriter: mem size 59,814,120 > 59,814,925: flushing 99,030 records to disk.
> INFO hadoop.InternalParquetRecordWriter: mem size 43,396,192 > 43,397,248: flushing 71,848 records to disk.
> ...
>
> My idea about the root cause is that there are many dictionary-encoded
> columns where the value variance is low. When we approximate the
> row-group size, there are pages which are still open (not encoded yet).
> If these pages are dictionary encoded, we calculate with 4-byte values
> for the dictionary indexes. But if the variance is low, RLE and bit
> packing will decrease the size of these pages dramatically.
>
> What do you guys think? Are we able to make the approximation a bit
> better? Do we have some properties that can solve this issue?
>
> Thanks a lot,
> Gabor

--
Ryan Blue
Software Engineer
Netflix
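To make the idea above concrete, here is a minimal, hypothetical sketch of the ratio-based estimate. None of these names (ColumnSizeEstimator, HEADER_OVERHEAD_BYTES, etc.) are actual parquet-mr code; the point is only to show how already-flushed pages could calibrate the estimate for still-open pages, and how buffered size could be tracked separately from the row-group size estimate:

    // Hypothetical sketch, not actual parquet-mr code.
    class ColumnSizeEstimator {
      private long flushedCompressedBytes = 0; // pages already encoded and compressed
      private long flushedRawBytes = 0;        // raw size of those same pages before encoding
      private long bufferedRawBytes = 0;       // data still sitting in open (unencoded) pages
      private static final long HEADER_OVERHEAD_BYTES = 64; // rough per-page header allowance (assumption)

      void onPageFlushed(long rawSize, long compressedSize) {
        flushedRawBytes += rawSize;
        flushedCompressedBytes += compressedSize;
      }

      void onValueBuffered(long rawSize) {
        bufferedRawBytes += rawSize;
      }

      // Estimate of the column's final on-disk size, for the row-group size check.
      long estimatedFinalSize() {
        if (flushedRawBytes == 0) {
          // No flushed pages yet to calibrate against: fall back to the raw buffered size.
          return bufferedRawBytes + HEADER_OVERHEAD_BYTES;
        }
        double ratio = (double) flushedCompressedBytes / flushedRawBytes;
        return flushedCompressedBytes
            + (long) (bufferedRawBytes * ratio)
            + HEADER_OVERHEAD_BYTES;
      }

      // Memory pressure is a separate question: this is what is actually held in buffers.
      long bufferedSize() {
        return flushedCompressedBytes + bufferedRawBytes;
      }
    }

The idea would be that the row-group size check compares estimatedFinalSize() against parquet.block.size, while memory accounting keeps using bufferedSize(), instead of both relying on the same number as they do today.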