At stripe boundaries, the bytes-on-disk statistics are accurate. For a stripe that is in flight, the number is going to be an estimate, because the dictionaries can't be compressed until the stripe is flushed. The memory usage will be a significant overestimate, because it includes buffers that have been allocated but not used yet.
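
Concretely, something along these lines (a sketch only, assuming ORC 1.7.4+, where Writer exposes getStripes() and estimateMemory(); WriterSizeBounds is an illustrative name, not an ORC class):

    import java.io.IOException;
    import java.util.List;
    import org.apache.orc.StripeInformation;
    import org.apache.orc.Writer;

    final class WriterSizeBounds {
      // Exact at stripe boundaries: only closed (compressed and flushed)
      // stripes appear in getStripes(), so offset + length of the last
      // stripe covers everything persisted so far.
      static long bytesOnDisk(Writer writer) throws IOException {
        List<StripeInformation> stripes = writer.getStripes();
        if (stripes.isEmpty()) {
          return 0;
        }
        StripeInformation last = stripes.get(stripes.size() - 1);
        return last.getOffset() + last.getLength();
      }

      // Over-estimate of the in-flight stripe: includes buffers that
      // have been allocated but not used yet.
      static long inFlightEstimate(Writer writer) {
        return writer.estimateMemory();
      }
    }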
.. Owen

On Fri, Mar 4, 2022 at 5:23 PM Dongjoon Hyun <dongj...@apache.org> wrote:
> The following is merged for Apache ORC 1.7.4.
>
> ORC-1123 Add `estimateMemory` method for writer
>
> According to the Apache ORC milestone, it will be released on May 15th.
>
> https://github.com/apache/orc/milestones
>
> Bests,
> Dongjoon.
>
> On 2022/03/04 13:11:15 Yiqun Zhang wrote:
> > Hi Openinx
> >
> > Thank you for initiating this discussion. I think we can get the
> > `TypeDescription` from the writer, and from the `TypeDescription` we
> > know which types are present and, more precisely, the maximum length
> > of each varchar/char. This will help us estimate the average width.
> >
> > Also, I agree with your suggestion. I will make a PR later to add the
> > `estimateMemory` public method for Writer.
> >
> > On 2022/03/04 04:01:04 OpenInx wrote:
> > > Hi Iceberg dev
> > >
> > > As we all know, in our current Apache Iceberg write path, the ORC
> > > file writer cannot just roll over to a new file once its byte size
> > > reaches the expected threshold. The core reason we haven't supported
> > > this before is the lack of a correct approach to estimate the byte
> > > size of an unclosed ORC writer.
> > >
> > > In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is
> > > trying to propose an estimation approach to fix this fundamentally
> > > (it also enables all those ORC writer unit tests that we had
> > > intentionally disabled before).
> > >
> > > The approach is: if a file is still unclosed, estimate its size in
> > > three steps (PR:
> > > https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
> > > ):
> > >
> > > 1. Size of data that has been written to stripes. The value is
> > > obtained by summing the offset and length of the writer's last
> > > stripe.
> > > 2. Size of data that has been submitted to the writer but has not
> > > been written to a stripe. When creating OrcFileAppender, the
> > > treeWriter is obtained through reflection, and its estimateMemory
> > > method is used to estimate how much memory is in use.
> > > 3. Size of data that has not been submitted to the writer, that is,
> > > the buffer. The buffer's maximum default size is used here.
> > >
> > > My feeling is:
> > >
> > > For the file-persisted bytes, I think using the last stripe's offset
> > > plus its length should be correct. For the memory-encoded batch
> > > vector, TreeWriter#estimateMemory should be okay. But for the batch
> > > vector whose rows have not yet been flushed to encoded memory, using
> > > batch.size alone isn't correct, because the rows can be of any data
> > > type, such as Integer, Long, Timestamp, String, etc. As their widths
> > > are not the same, I think we need to multiply an average width by
> > > batch.size (which is actually a row count).
> > >
> > > Another thing is the `TreeWriter#estimateMemory` method: the current
> > > `org.apache.orc.Writer` doesn't expose the `TreeWriter` field or the
> > > `estimateMemory` method publicly, so I suggest publishing a PR to
> > > the Apache ORC project to expose those interfaces in
> > > `org.apache.orc.Writer` (see:
> > > https://github.com/apache/iceberg/pull/3784/files#r819238427 ).
> > >
> > > I'd like to invite the Iceberg dev community to evaluate the current
> > > approach. Is there any other concern from the ORC experts' side?
> > >
> > > Thanks.
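
For concreteness, the three steps above might look roughly like this in code (a sketch only: UnclosedFileSizeEstimator, bufferedRows, and avgRowWidth are illustrative names, not the PR's actual code, and it assumes the public Writer API from ORC 1.7.4 rather than reflection on the internal TreeWriter):

    import java.io.IOException;
    import java.util.List;
    import org.apache.orc.StripeInformation;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    final class UnclosedFileSizeEstimator {
      static long estimate(Writer writer, int bufferedRows) throws IOException {
        // Step 1: bytes already flushed to closed stripes, i.e. the last
        // stripe's offset plus its length.
        long persisted = 0;
        List<StripeInformation> stripes = writer.getStripes();
        if (!stripes.isEmpty()) {
          StripeInformation last = stripes.get(stripes.size() - 1);
          persisted = last.getOffset() + last.getLength();
        }

        // Step 2: rows submitted to the writer but not yet flushed to a
        // stripe; estimateMemory() is public since ORC 1.7.4 (ORC-1123).
        long inMemory = writer.estimateMemory();

        // Step 3: rows still sitting in the VectorizedRowBatch. batch.size
        // is a row count, so multiply it by an average row width derived
        // from the schema instead of using it directly.
        long buffered = (long) bufferedRows * avgRowWidth(writer.getSchema());

        return persisted + inMemory + buffered;
      }

      // Crude stand-in: 8 bytes per top-level column. A real version would
      // walk the TypeDescription and use type-specific widths, e.g.
      // getMaxLength() for char/varchar columns.
      private static long avgRowWidth(TypeDescription schema) {
        List<TypeDescription> children = schema.getChildren();
        return children == null ? 8L : 8L * children.size();
      }
    }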