cxxiii commented on issue #3645: URL: https://github.com/apache/amoro/issues/3645#issuecomment-3048909972
> Could you further explain why the Parquet writer might misestimate the target file size?

The estimation error of the Parquet writer's file size mainly stems from the impact of compression. The estimated file size consists of two parts: the data already written to disk, which is compressed and therefore measured accurately, and the data still buffered in memory. The in-memory buffer includes, for each column, both the uncompressed data in the column store (not yet flushed to the page store) and the compressed data already in the page store. Since most of the buffered data is counted in its uncompressed form, the estimated file size is often larger than the actual file size.

Additionally, files with more columns are more prone to estimation errors. For files with fewer columns, a flush to the page store is usually triggered before the flush to disk (which happens when the specified row group size, 128 MB by default, is reached), either because the buffered content of a single column exceeds the specified page size (1 MB by default) or because the number of buffered rows exceeds pageRowCountLimit (2000 rows by default). Since data written to the page store is compressed, files with fewer columns have a smaller proportion of uncompressed data in memory, resulting in smaller estimation errors.
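To make the mechanism concrete, here is a minimal sketch (not the actual Parquet code; all names and numbers are illustrative assumptions) of how the in-progress estimate combines compressed and uncompressed components, and of the two conditions that trigger a page-store flush:

```java
// Hedged sketch: models why the writer's in-progress size estimate
// overshoots the final compressed file size. Not Parquet's real API.
public class ParquetSizeEstimateSketch {

    // Estimate = compressed bytes already on disk
    //          + compressed bytes buffered in the page store
    //          + UNCOMPRESSED bytes still in the column store.
    // The last term is what inflates the estimate.
    static long estimate(long flushedCompressedBytes,
                         long pageStoreCompressedBytes,
                         long columnStoreUncompressedBytes) {
        return flushedCompressedBytes
             + pageStoreCompressedBytes
             + columnStoreUncompressedBytes;
    }

    // A page-store flush fires when either default threshold is hit,
    // which is why few-column files keep less uncompressed data buffered.
    static boolean shouldFlushPage(long columnBufferedBytes, long bufferedRows) {
        final long PAGE_SIZE = 1L << 20;        // 1 MB default page size
        final long PAGE_ROW_COUNT_LIMIT = 2000; // default pageRowCountLimit
        return columnBufferedBytes >= PAGE_SIZE
            || bufferedRows >= PAGE_ROW_COUNT_LIMIT;
    }

    public static void main(String[] args) {
        long flushed = 64L << 20;     // 64 MB of row groups already on disk
        long pageStore = 8L << 20;    // 8 MB of compressed pages in memory
        long columnStore = 32L << 20; // 32 MB of raw column-store buffers

        long est = estimate(flushed, pageStore, columnStore);
        // Assume a hypothetical 4:1 compression ratio on the raw buffers:
        long actual = flushed + pageStore + columnStore / 4;
        System.out.println("estimated=" + est + ", eventual=" + actual);

        System.out.println("flush on size:  " + shouldFlushPage(2L << 20, 100));
        System.out.println("flush on rows:  " + shouldFlushPage(100_000, 5000));
    }
}
```

Under these assumed numbers the estimate exceeds the eventual size by the not-yet-applied compression savings on the column-store buffers; with many columns, each column's buffer stays below both flush thresholds longer, so this uncompressed term dominates.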
