Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-07 Thread OpenInx
Thanks Dongjoon & Yiqun for the quick PR adding the `estimateMemory` API. Thanks also to Yiqun & Owen for your points; I think you are right. So a more accurate estimation method may be to multiply batch.size by the average width of the data type, and then multiply it by the compression rate,
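
The estimate proposed in this message can be sketched as below. This is only an illustration, not the Iceberg or ORC implementation: the function name, its parameters, and the sample numbers are hypothetical, and the compression rate is assumed to be given as compressed size divided by uncompressed size.

```python
import math


def estimate_buffered_bytes(row_count, avg_row_width_bytes, compression_rate):
    """Roughly estimate the on-disk size of rows still buffered in a batch.

    Hypothetical helper: row_count plays the role of batch.size,
    avg_row_width_bytes is the average uncompressed width of one row, and
    compression_rate is assumed to be compressed bytes / uncompressed bytes.
    """
    if row_count < 0 or avg_row_width_bytes < 0:
        raise ValueError("row_count and avg_row_width_bytes must be non-negative")
    if compression_rate <= 0:
        raise ValueError("compression_rate must be positive")
    # rows * average width gives the uncompressed size; scale by the rate.
    return math.ceil(row_count * avg_row_width_bytes * compression_rate)


# Example with made-up numbers: 1000 buffered rows of about 64 bytes each,
# assumed to compress to a quarter of their raw size.
estimated = estimate_buffered_bytes(1000, 64, 0.25)
```

The result is only as good as the assumed width and compression rate, so it would be added to the accurate bytes-on-disk figure for stripes that are already flushed rather than used on its own.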

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Owen O'Malley
At the stripe boundaries, the bytes-on-disk statistics are accurate. For a stripe that is in flight, the size is going to be an estimate, because the dictionaries can't be compressed until the stripe is flushed. The memory usage will be a significant overestimate, because it includes buffers that are

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Dongjoon Hyun
The following is merged for Apache ORC 1.7.4.
ORC-1123 Add `estimationMemory` method for writer
According to the Apache ORC milestone, it will be released on May 15th.
https://github.com/apache/orc/milestones
Bests, Dongjoon.
On 2022/03/04 13:11:15 Yiqun Zhang wrote:
> Hi Openinx
>
> Thank

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Yiqun Zhang
Hi Openinx, thank you for initiating this discussion. I think we can get the `TypeDescription` from the writer; from it we know the column types and, more precisely, the maximum length of each varchar/char column. This will help us estimate the average width. Also, I agree with your
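
The idea of deriving a width from the schema can be sketched as below. This is a rough illustration and does not use the ORC `TypeDescription` API: the schema is modelled as a plain list of (type name, maximum length) pairs, the per-type widths in the table are assumed values, and each varchar/char character is assumed to take one byte.

```python
# Assumed uncompressed widths in bytes for fixed-width types (illustrative only).
FIXED_WIDTHS = {"boolean": 1, "int": 4, "bigint": 8, "float": 4, "double": 8}


def estimate_row_width(columns):
    """Sum an assumed per-column width over a hypothetical schema description.

    columns is a list of (type_name, max_length) pairs; max_length is only
    used for char/varchar and is None for every other type.
    """
    total = 0
    for type_name, max_length in columns:
        if type_name in ("char", "varchar"):
            if max_length is None:
                raise ValueError(f"{type_name} column needs a maximum length")
            # The declared maximum is an upper bound, so this overestimates
            # columns whose values are usually shorter.
            total += max_length
        elif type_name in FIXED_WIDTHS:
            total += FIXED_WIDTHS[type_name]
        else:
            raise ValueError(f"no assumed width for type {type_name!r}")
    return total


# Example: an int column, a bigint column, and a varchar(20) column.
row_width = estimate_row_width([("int", None), ("bigint", None), ("varchar", 20)])
```

Because the declared maximum length bounds the width from above, a schema-based figure like this tends to run high for string columns.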

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread OpenInx
> As their widths are not the same, I think we may need to use an average width minus the batch.size (which is row count actually).
@Kyle, sorry, I mistyped the word before. I meant "need an average width multiplied by the batch.size".
On Fri, Mar 4, 2022 at 1:29 PM liwei li wrote:
> Thanks

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread liwei li
Thanks to openinx for opening this discussion. One thing to note: the current approach has a problem. Because of some optimization mechanisms, when a large amount of duplicate data is written, the estimated size deviates from the actual size. However, when cached data is

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread Kyle Bendickson
Hi Openinx. Thanks for bringing this to our attention. And many thanks to hiliwei for their willingness to tackle big problems and little problems. I wanted to say that I think almost anything that’s relatively close would most likely be better than the current situation (where the feature is

[DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread OpenInx
Hi Iceberg dev, As we all know, in the current Apache Iceberg write path, the ORC file writer cannot simply roll over to a new file once its byte size reaches the expected threshold. The core reason we have not supported this so far is the lack of a correct approach to estimate the byte size