Thanks to openinx for opening this discussion.
One thing to note: the current approach faces a problem. Because of some
optimization mechanisms, when writing a large amount of duplicate data,
there will be some deviation between the estimated size and the actual size.
However, when cached data is flush
> As their widths are not the same, I think we may need to use an average
width minus the batch.size (which is row count actually).
@Kyle, sorry, I mistyped the word before. I meant "need an average width
multiplied by the batch.size".
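To make the corrected idea concrete, here is a minimal sketch of estimating the writer's current byte size as flushed bytes plus (average row width so far) multiplied by the number of buffered rows. The class and method names (`OrcSizeEstimator`, `recordFlush`, `estimatedBytes`) are hypothetical illustrations, not actual Iceberg or ORC APIs:

```java
// Hypothetical sketch: estimate in-progress ORC file size as
// flushedBytes + avgRowWidth * bufferedRowCount.
public class OrcSizeEstimator {
    private long flushedBytes = 0;  // bytes actually written to the file so far
    private long flushedRows = 0;   // rows already flushed to disk
    private long bufferedRows = 0;  // rows sitting in the in-memory batch

    // Called when the writer flushes a batch of rows to the file.
    public void recordFlush(long rows, long bytes) {
        flushedRows += rows;
        flushedBytes += bytes;
        bufferedRows = 0;
    }

    // Called for each row appended to the in-memory batch.
    public void recordRow() {
        bufferedRows++;
    }

    // Estimated total size: known flushed bytes plus the buffered rows
    // priced at the average width of the rows flushed so far.
    public long estimatedBytes() {
        if (flushedRows == 0) {
            return 0; // no flushed data yet to derive an average width from
        }
        double avgWidth = (double) flushedBytes / flushedRows;
        return flushedBytes + (long) (avgWidth * bufferedRows);
    }
}
```

As noted above, duplicate-heavy data can skew this estimate, since encoding optimizations make the true on-disk width smaller than the historical average.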
On Fri, Mar 4, 2022 at 1:29 PM liwei li wrote:
Hi Iceberg dev,
As we all know, in the current Apache Iceberg write path, the ORC file
writer cannot simply roll over to a new file once its byte size reaches the
expected threshold. The core reason we haven't supported this before is
the lack of a correct approach to estimate the byte size from