At stripe boundaries, the bytes-on-disk statistics are accurate. For a stripe that is in flight, the number is going to be an estimate, because the dictionaries can't be compressed until the stripe is flushed. The memory usage will be a significant overestimate, because it includes buffers that have been allocated but not used yet.
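
Concretely, something along these lines (a sketch only, assuming ORC 1.7.4+, where Writer exposes getStripes() and estimateMemory(); WriterSizeBounds is an illustrative name, not an ORC class):

    import java.io.IOException;
    import java.util.List;
    import org.apache.orc.StripeInformation;
    import org.apache.orc.Writer;

    final class WriterSizeBounds {
      // Exact at stripe boundaries: only closed (compressed and flushed)
      // stripes appear in getStripes(), so offset + length of the last
      // stripe covers everything persisted so far.
      static long bytesOnDisk(Writer writer) throws IOException {
        List<StripeInformation> stripes = writer.getStripes();
        if (stripes.isEmpty()) {
          return 0;
        }
        StripeInformation last = stripes.get(stripes.size() - 1);
        return last.getOffset() + last.getLength();
      }

      // Over-estimate of the in-flight stripe: includes buffers that
      // have been allocated but not used yet.
      static long inFlightEstimate(Writer writer) {
        return writer.estimateMemory();
      }
    }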
.. Owen

On Fri, Mar 4, 2022 at 5:23 PM Dongjoon Hyun <dongj...@apache.org> wrote:
> The following is merged for Apache ORC 1.7.4.
>
> ORC-1123 Add `estimateMemory` method for writer
>
> According to the Apache ORC milestone, it will be released on May 15th.
>
> https://github.com/apache/orc/milestones
>
> Bests,
> Dongjoon.
>
> On 2022/03/04 13:11:15 Yiqun Zhang wrote:
> > Hi Openinx
> >
> > Thank you for initiating this discussion. I think we can get the
> > `TypeDescription` from the writer, and from the `TypeDescription` we
> > know which types are present and, more precisely, the maximum length
> > of each varchar/char. This will help us estimate the average width.
> >
> > Also, I agree with your suggestion. I will make a PR later to add the
> > `estimateMemory` public method for Writer.
> >
> > On 2022/03/04 04:01:04 OpenInx wrote:
> > > Hi Iceberg dev
> > >
> > > As we all know, in our current Apache Iceberg write path, the ORC
> > > file writer cannot just roll over to a new file once its byte size
> > > reaches the expected threshold. The core reason we haven't supported
> > > this before is the lack of a correct approach to estimate the byte
> > > size of an unclosed ORC writer.
> > >
> > > In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is
> > > trying to propose an estimation approach to fix this fundamentally
> > > (it also enables all those ORC writer unit tests that we had
> > > intentionally disabled before).
> > >
> > > The approach is: if a file is still unclosed, estimate its size in
> > > three steps (PR:
> > > https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
> > > ):
> > >
> > > 1. Size of data that has been written to stripes. The value is
> > > obtained by summing the offset and length of the writer's last
> > > stripe.
> > > 2. Size of data that has been submitted to the writer but has not
> > > been written to a stripe. When creating OrcFileAppender, the
> > > treeWriter is obtained through reflection, and its estimateMemory
> > > method is used to estimate how much memory is in use.
> > > 3. Size of data that has not been submitted to the writer, that is,
> > > the buffer. The buffer's maximum default size is used here.
> > >
> > > My feeling is:
> > >
> > > For the file-persisted bytes, I think using the last stripe's offset
> > > plus its length should be correct. For the memory-encoded batch
> > > vector, TreeWriter#estimateMemory should be okay. But for the batch
> > > vector whose rows have not yet been flushed to encoded memory, using
> > > batch.size alone isn't correct, because the rows can be of any data
> > > type, such as Integer, Long, Timestamp, String, etc. As their widths
> > > are not the same, I think we need to multiply an average width by
> > > batch.size (which is actually a row count).
> > >
> > > Another thing is the `TreeWriter#estimateMemory` method: the current
> > > `org.apache.orc.Writer` doesn't expose the `TreeWriter` field or the
> > > `estimateMemory` method publicly, so I suggest publishing a PR to
> > > the Apache ORC project to expose those interfaces in
> > > `org.apache.orc.Writer` (see:
> > > https://github.com/apache/iceberg/pull/3784/files#r819238427 ).
> > >
> > > I'd like to invite the Iceberg dev community to evaluate the current
> > > approach. Is there any other concern from the ORC experts' side?
> > >
> > > Thanks.
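
For concreteness, the three steps above might look roughly like this in code (a sketch only: UnclosedFileSizeEstimator, bufferedRows, and avgRowWidth are illustrative names, not the PR's actual code, and it assumes the public Writer API from ORC 1.7.4 rather than reflection on the internal TreeWriter):

    import java.io.IOException;
    import java.util.List;
    import org.apache.orc.StripeInformation;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    final class UnclosedFileSizeEstimator {
      static long estimate(Writer writer, int bufferedRows) throws IOException {
        // Step 1: bytes already flushed to closed stripes, i.e. the last
        // stripe's offset plus its length.
        long persisted = 0;
        List<StripeInformation> stripes = writer.getStripes();
        if (!stripes.isEmpty()) {
          StripeInformation last = stripes.get(stripes.size() - 1);
          persisted = last.getOffset() + last.getLength();
        }

        // Step 2: rows submitted to the writer but not yet flushed to a
        // stripe; estimateMemory() is public since ORC 1.7.4 (ORC-1123).
        long inMemory = writer.estimateMemory();

        // Step 3: rows still sitting in the VectorizedRowBatch. batch.size
        // is a row count, so multiply it by an average row width derived
        // from the schema instead of using it directly.
        long buffered = (long) bufferedRows * avgRowWidth(writer.getSchema());

        return persisted + inMemory + buffered;
      }

      // Crude stand-in: 8 bytes per top-level column. A real version would
      // walk the TypeDescription and use type-specific widths, e.g.
      // getMaxLength() for char/varchar columns.
      private static long avgRowWidth(TypeDescription schema) {
        List<TypeDescription> children = schema.getChildren();
        return children == null ? 8L : 8L * children.size();
      }
    }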