cxxiii commented on issue #3645:
URL: https://github.com/apache/amoro/issues/3645#issuecomment-3048909972

   > Could you further explain why the Parquet writer might misestimate the 
target file size?
   
   The estimation error in the Parquet writer's file size mainly stems from compression. The estimated size consists of two parts: the data already written to disk, which is compressed and therefore measured accurately, and the data still buffered in memory. The in-memory buffer includes both the uncompressed data in each column's column store (not yet flushed to the page store) and the compressed data already sitting in the page store. Because most of the buffered data is counted in its uncompressed form, the estimated file size is often larger than the actual file size.
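   To make the composition concrete, here is a minimal sketch (hypothetical names, not Amoro or parquet-mr code) of how such an in-progress estimate adds up, and why it overshoots once a compression ratio is applied to the buffered column-store data:

   ```java
   // Hypothetical sketch of a Parquet-style in-progress size estimate.
   public class SizeEstimateSketch {
       static long estimate(long flushedCompressedBytes,
                            long bufferedUncompressedBytes,
                            long bufferedCompressedPageBytes) {
           // Row groups already on disk are compressed, so their size is exact.
           // The column-store buffer is counted uncompressed, inflating the estimate.
           return flushedCompressedBytes + bufferedUncompressedBytes + bufferedCompressedPageBytes;
       }

       public static void main(String[] args) {
           long estimated = estimate(64L << 20, 32L << 20, 8L << 20);
           // Assume, for illustration, a 4:1 compression ratio on the buffered data.
           long actual = (64L << 20) + ((32L << 20) / 4) + (8L << 20);
           System.out.println("estimated=" + estimated + " actual=" + actual);
       }
   }
   ```

   Under that assumed ratio the estimate counts 32 MB of buffered data that will shrink to 8 MB on disk, so the writer believes the file is 24 MB larger than it will actually be.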
   Additionally, files with more columns are more prone to estimation errors. For files with fewer columns, a flush to the page store is usually triggered before the flush to disk (which happens when the row group size, 128 MB by default, is reached), either because a single column's buffered content exceeds the page size (1 MB by default) or because the number of buffered rows exceeds pageRowCountLimit (2000 rows by default). Since data written to the page store is compressed, files with fewer columns keep a smaller proportion of uncompressed data in memory, resulting in smaller estimation errors.
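   The two triggers can be sketched as a single predicate (again a hypothetical illustration with the thresholds mentioned above, not the actual parquet-mr logic):

   ```java
   // Hypothetical sketch of the per-column page-flush decision.
   public class PageFlushSketch {
       static boolean shouldFlushPage(long bufferedBytes, int bufferedRows,
                                      long pageSizeBytes, int pageRowCountLimit) {
           // A page is closed (and compressed into the page store) when either
           // the buffered bytes exceed the page size or the row limit is hit.
           return bufferedBytes >= pageSizeBytes || bufferedRows >= pageRowCountLimit;
       }

       public static void main(String[] args) {
           // Wide file: each of many columns receives few bytes per row, so
           // neither threshold trips before the 128 MB row-group flush.
           System.out.println(shouldFlushPage(200_000, 1500, 1L << 20, 2000));   // false
           // Narrow file: a single column fills its 1 MB page quickly.
           System.out.println(shouldFlushPage(1_200_000, 1500, 1L << 20, 2000)); // true
       }
   }
   ```

   In the wide-file case above, uncompressed data accumulates across many column stores without crossing either per-column threshold, which is exactly the situation where the estimate drifts furthest from the final size.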
   

