Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread liwei li
Thanks to openinx for opening this discussion. One thing to note: the current approach faces a problem. Because of some optimization mechanisms, when writing a large amount of duplicate data, there will be some deviation between the estimated size and the actual size. However, when cached data is flush

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread OpenInx
> As their widths are not the same, I think we may need to use an average width minus the batch.size (which is actually the row count). @Kyle, sorry, I mistyped the word before. I meant "need an average width multiplied by the batch.size". On Fri, Mar 4, 2022 at 1:29 PM liwei li wrote: > Thanks to
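The corrected formula in this message (average row width multiplied by batch.size, plus whatever has already been flushed) can be sketched as below. This is a minimal illustration of the estimation idea, not the actual Iceberg or ORC API; the class and method names are hypothetical.

```java
// Hypothetical sketch: estimate the total byte size of an unclosed ORC writer
// as (bytes already flushed) + (average row width derived from flushed data)
// * (rows still buffered in memory, i.e. the batch.size / row count).
public class OrcSizeEstimate {
    static long estimateTotalBytes(long flushedBytes, long flushedRows, long bufferedRows) {
        if (flushedRows == 0) {
            // Nothing flushed yet, so no observed average width; fall back to
            // an arbitrary assumed width per row (placeholder value).
            long assumedAvgWidth = 64;
            return bufferedRows * assumedAvgWidth;
        }
        // Average serialized width per row, observed from flushed data.
        long avgRowWidth = flushedBytes / flushedRows;
        return flushedBytes + avgRowWidth * bufferedRows;
    }

    public static void main(String[] args) {
        // 10 MB flushed over 100k rows -> 100 bytes/row average;
        // 20k buffered rows add an estimated 2 MB.
        long est = estimateTotalBytes(10_000_000L, 100_000L, 20_000L);
        System.out.println(est); // prints 12000000
    }
}
```

Note this also illustrates the deviation liwei li mentions: with highly duplicate data, encoding makes the true flushed width much smaller than the in-memory width, so the estimate drifts until the next flush.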

[DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread OpenInx
Hi Iceberg dev, As we all know, in our current Apache Iceberg write path, the ORC file writer cannot simply roll over to a new file once its byte size reaches the expected threshold. The core reason we haven't supported this is the lack of a correct approach to estimate the byte size from
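The roll-over behavior this thread proposes can be sketched as follows: after each written batch, compare the writer's estimated size against a target file size and start a new file once the threshold is exceeded. The constant and method names here are illustrative assumptions, not the Iceberg API.

```java
// Sketch of threshold-based roll-over, assuming a per-batch byte estimate is
// available (the very capability the thread is discussing for ORC).
public class RollingWriterSketch {
    static final long TARGET_FILE_SIZE = 128L * 1024 * 1024; // e.g. a 128 MB target

    // Given estimated byte sizes for a stream of batches, return how many
    // files a rolling writer would produce.
    static int countFiles(long[] batchByteEstimates) {
        int files = 1;
        long currentFileBytes = 0;
        for (long batch : batchByteEstimates) {
            if (currentFileBytes > 0 && currentFileBytes + batch > TARGET_FILE_SIZE) {
                files++;              // close the current file, open a new one
                currentFileBytes = 0;
            }
            currentFileBytes += batch;
        }
        return files;
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        long[] batches = {100 * mb, 100 * mb, 100 * mb}; // three 100 MB batches
        System.out.println(countFiles(batches)); // prints 3
    }
}
```

The sketch makes the dependency explicit: without a trustworthy byte estimate for the unclosed writer, the threshold check has nothing reliable to compare against, which is exactly the gap OpenInx describes.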