Thanks to openinx for opening this discussion.

One thing to note: the current approach has a known limitation. Because of
ORC's internal encoding optimizations, writing a large amount of duplicate
data produces some deviation between the estimated size and the actual
size. However, once the cached data is flushed (when the row count or byte
size exceeds the threshold), the estimate is corrected. That's one of the
reasons I made this change:
https://github.com/apache/iceberg/pull/3784#discussion_r819229491
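
For example, a column holding millions of copies of the same string can
dictionary-encode down to a small fraction of its raw width, so a
width-based estimate may run far too high until the stripe is flushed and
the real offset and length become available.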


On Fri, Mar 4, 2022 at 12:55 Kyle Bendickson <k...@tabular.io> wrote:

> Hi Openinx.
>
> Thanks for bringing this to our attention. And many thanks to hiliwei for
> their willingness to tackle problems both big and small.
>
> I wanted to say that almost any reasonably accurate estimate would most
> likely be better than the current situation, where the feature is disabled
> entirely.
>
> Thank you for your succinct summary of the situation. I tagged Dongjoon
> Hyun, one of the ORC VPs, in the PR and will reach out to him as well.
>
> I am inclined to agree that we need to consider the width of the types,
> as fields like binary or even string can potentially be quite wide
> compared to int.
>
> I like your suggestion to use an “average width” together with the batch
> size, though I want to make sure I follow the arithmetic: I would expect
> the average width to be multiplied by the batch size. Possibly I’m not
> understanding fully.
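>
> Purely as an illustration: with ORC's default batch size of 1024 rows and
> an assumed average row width of, say, 8 bytes, the unflushed batch would
> contribute roughly 1024 × 8 = 8192 bytes to the estimate.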
>
> How would you propose to compute an “average width” for the data that
> hasn’t been flushed to disk yet? And would it be an average width based on
> the actually observed data, or just on the types?
>
> Again, I think that any approach is better than none, and we can iterate
> on the statistics collection. But I am inclined to agree, points (1) and
> (2) seem ok. And it would be beneficial to consider the points raised
> regarding (3).
>
> Thanks for bringing this to the dev list.
>
> And many thanks to hiliwei for their work so far!
>
> - Kyle
>
> On Thu, Mar 3, 2022 at 8:01 PM OpenInx <open...@gmail.com> wrote:
>
>> Hi Iceberg dev
>>
>> As we all know, in the current Apache Iceberg write path, the ORC file
>> writer cannot simply roll over to a new file once its byte size reaches
>> the expected threshold. The core reason we haven't supported this before
>> is the lack of a correct approach for estimating the byte size of an
>> unclosed ORC writer.
>>
>> In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei proposes
>> an estimation approach to fix this fundamentally (it also re-enables all
>> the ORC writer unit tests that we had intentionally disabled before).
>>
>> The approach is: if a file is still unclosed, estimate its size in three
>> steps (a rough sketch follows the list; PR:
>> https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
>> )
>>
>> 1. The size of data already written to stripes. This value is obtained by
>> summing the offset and length of the writer's last stripe.
>> 2. The size of data that has been submitted to the writer but not yet
>> written to a stripe. When creating OrcFileAppender, the treeWriter is
>> obtained through reflection, and its estimateMemory method is used to
>> estimate how much memory is in use.
>> 3. The data that has not yet been submitted to the writer, i.e., the row
>> batch buffer. The default maximum buffer size is used here.
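>>
>> Putting the three parts together, a minimal sketch of the idea
>> (lastStripe() and bufferedBatchBytes() are hypothetical helpers;
>> StripeInformation and TreeWriter#estimateMemory are real ORC APIs):
>>
>>   // (1) persisted stripe bytes + (2) encoded-but-unflushed memory
>>   // + (3) rows still sitting in the row batch buffer.
>>   long estimateFileSize() {
>>     StripeInformation last = lastStripe();       // hypothetical helper
>>     long persisted = (last == null) ? 0 : last.getOffset() + last.getLength();
>>     long encoded = treeWriter.estimateMemory();  // ORC TreeWriter method
>>     long buffered = bufferedBatchBytes();        // hypothetical helper
>>     return persisted + encoded + buffered;
>>   }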
>>
>> My feeling is:
>>
>> For the file-persisted bytes, I think using the last stripe's offset plus
>> its length should be correct. For the in-memory encoded batch vector, the
>> TreeWriter#estimateMemory should be okay.
>> But for the batch vector whose rows have not yet been flushed to encoded
>> memory, using batch.size alone isn't correct, because the rows can be of
>> any data type, such as Integer, Long, Timestamp, String, etc. As their
>> widths are not the same, I think we need to multiply an average width by
>> batch.size (which is actually a row count).
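>>
>> For instance, a schema-derived average width could look like the following
>> rough sketch (the per-type byte widths are illustrative guesses, not
>> measured values; variable-width types would ideally use observed lengths):
>>
>>   // Walk the ORC TypeDescription and sum an assumed width per field.
>>   long averageRowWidth(TypeDescription schema) {
>>     long width = 0;
>>     for (TypeDescription child : schema.getChildren()) {
>>       switch (child.getCategory()) {
>>         case INT:       width += 4;  break;
>>         case LONG:
>>         case TIMESTAMP: width += 8;  break;
>>         case STRING:
>>         case BINARY:    width += 32; break; // guess for variable width
>>         default:        width += 8;  break;
>>       }
>>     }
>>     return width;
>>   }
>>
>>   // batch.size is a row count, so the unflushed bytes are roughly
>>   // batch.size * averageRowWidth(schema).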
>>
>> Another thing is about the `TreeWriter#estimateMemory` method. The current
>> `org.apache.orc.Writer` interface doesn't expose the `TreeWriter` field or
>> the `estimateMemory` method publicly, so I would suggest publishing a PR
>> to the Apache ORC project to expose those interfaces on
>> `org.apache.orc.Writer` (
>> see: https://github.com/apache/iceberg/pull/3784/files#r819238427 )
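>>
>> Until then, the workaround looks roughly like this (a sketch assuming the
>> writer is an org.apache.orc.impl.WriterImpl whose private treeWriter field
>> is what the PR reflects on; a public API would make this unnecessary):
>>
>>   import java.lang.reflect.Field;
>>   import org.apache.orc.Writer;
>>   import org.apache.orc.impl.writer.TreeWriter;
>>
>>   // Reflectively pull the private TreeWriter out of the ORC Writer so
>>   // that estimateMemory() can be called for the encoded, unflushed bytes.
>>   static long estimateWriterMemory(Writer writer) {
>>     try {
>>       Field field = writer.getClass().getDeclaredField("treeWriter");
>>       field.setAccessible(true);
>>       return ((TreeWriter) field.get(writer)).estimateMemory();
>>     } catch (ReflectiveOperationException e) {
>>       throw new RuntimeException("Cannot access ORC treeWriter", e);
>>     }
>>   }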
>>
>> I'd like to invite the Iceberg dev community to evaluate the current
>> approach. Are there any other concerns from the ORC experts' side?
>>
>> Thanks.
>>
>
