Thanks Dongjoon & Yiqun for the quick PR for adding the `estimateMemory`
API.
Also, thanks to Yiqun & Owen for your points; I think you are right. So a
more accurate estimation method may be to multiply batch.size by the
average width of the data type, and then multiply it by the compression
rate, w
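The estimation described above (batch.size multiplied by the average type width, multiplied by the compression rate) could be sketched roughly as follows. The width table and the 0.5 compression rate below are illustrative assumptions, not values from ORC:

```java
import java.util.Map;

// Hypothetical sketch of the proposed estimate:
// estimatedBytes = batch.size * sum(average column widths) * compression rate.
// The width table and the 0.5 ratio are assumed for illustration only.
public class BatchSizeEstimate {
    // Assumed average widths (bytes) per type; not ORC's real numbers.
    static final Map<String, Integer> AVG_WIDTH = Map.of(
        "boolean", 1, "int", 4, "bigint", 8, "double", 8, "string", 20);

    static long estimate(int batchSize, String[] columnTypes, double compressionRate) {
        long rowWidth = 0;
        for (String type : columnTypes) {
            rowWidth += AVG_WIDTH.getOrDefault(type, 8); // assumed default width
        }
        return (long) (batchSize * rowWidth * compressionRate);
    }

    public static void main(String[] args) {
        // 1024 rows of <bigint, string>, assuming a 0.5 compression rate
        System.out.println(estimate(1024, new String[]{"bigint", "string"}, 0.5));
        // → 14336
    }
}
```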
At the stripe boundaries, the bytes-on-disk statistics are accurate. A
stripe that is in flight is going to be an estimate, because the
dictionaries can't be compressed until the stripe is flushed. The memory
usage will be a significant overestimate, because it includes buffers that
are allocated
The following is merged for Apache ORC 1.7.4.
ORC-1123 Add `estimateMemory` method for writer
According to the Apache ORC milestone, it will be released on May 15th.
https://github.com/apache/orc/milestones
Bests,
Dongjoon.
On 2022/03/04 13:11:15 Yiqun Zhang wrote:
> Hi Openinx
>
> Thank you
Hi Openinx
Thank you for initiating this discussion. I think we can get the
`TypeDescription` from the writer and in the `TypeDescription` we know which
types and more precisely the maximum length of the varchar/char. This will help
us to estimate the average width.
Also, I agree with your sug
> As their widths are not the same, I think we may need to use an average
width minus the batch.size (which is row count actually).
@Kyle, sorry, I mistyped the word before. I meant "need an average width
multiplied by the batch.size".
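The idea above of deriving an average width from the `TypeDescription` (using the declared max length for varchar/char) could look roughly like this. The category names, widths, and the half-full assumption for varchar/char are illustrative, not ORC's actual API:

```java
// Hypothetical sketch: map a column's type category to an assumed average
// width in bytes, using the declared max length for char/varchar as an
// upper bound (assumed half-full on average). Not ORC's real TypeDescription.
public class AvgWidth {
    static int avgWidth(String category, int maxLength) {
        switch (category) {
            case "boolean":
            case "tinyint":   return 1;
            case "smallint":  return 2;
            case "int":
            case "float":
            case "date":      return 4;
            case "bigint":
            case "double":
            case "timestamp": return 8;
            case "char":
            case "varchar":   return Math.max(1, maxLength / 2); // assume half-full
            default:          return 16; // assumed default for string/binary/etc.
        }
    }

    public static void main(String[] args) {
        System.out.println(avgWidth("varchar", 64)); // → 32
        System.out.println(avgWidth("bigint", 0));   // → 8
    }
}
```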
On Fri, Mar 4, 2022 at 1:29 PM liwei li wrote:
> Thanks to
Thanks to openinx for opening this discussion.
One thing to note: the current approach faces a problem. Because of some
optimization mechanisms, when writing a large amount of duplicate data
there will be some deviation between the estimated and the actual size.
However, when cached data is flush
Hi Openinx.
Thanks for bringing this to our attention. And many thanks to hiliwei for
their willingness to tackle big problems and little problems.
I wanted to say that I think almost anything that's relatively close would
most likely be better than the current situation (where the feature is
disabled).
Hi Iceberg dev
As we all know, in our current Apache Iceberg write path, the ORC file
writer cannot simply roll over to a new file once its byte size reaches the
expected threshold. The core reason we haven't supported this before is
the lack of a correct approach to estimate the byte size from
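A minimal sketch of the roll-over behavior described above, with a stand-in size estimator. The fixed bytes-per-row estimator and the threshold are assumptions for illustration; a real implementation would consult the ORC writer's size estimate instead:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: roll over to a new output file once the estimated
// byte size crosses a target threshold. The flat 100-bytes-per-row figure
// stands in for a real per-writer size estimate.
public class RollingWriter {
    static final long TARGET_FILE_BYTES = 1000; // assumed threshold
    static final long EST_BYTES_PER_ROW = 100;  // stand-in estimator

    final List<Long> closedFileRowCounts = new ArrayList<>();
    long rowsInCurrentFile = 0;

    void write(Object row) {
        rowsInCurrentFile++;
        // Roll over when the estimated size reaches the target.
        if (rowsInCurrentFile * EST_BYTES_PER_ROW >= TARGET_FILE_BYTES) {
            closedFileRowCounts.add(rowsInCurrentFile);
            rowsInCurrentFile = 0; // start a new file
        }
    }

    public static void main(String[] args) {
        RollingWriter w = new RollingWriter();
        for (int i = 0; i < 25; i++) w.write(i);
        System.out.println(w.closedFileRowCounts); // → [10, 10]
        System.out.println(w.rowsInCurrentFile);   // → 5
    }
}
```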