[DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

OpenInx Thu, 03 Mar 2022 20:01:29 -0800

Hi Iceberg dev

As we all know,  in our current apache iceberg write path,  the ORC file
writer cannot just roll over to a new file once its byte size reaches the
expected threshold.  The core reason that we don't support this before is:
  The lack of correct approach to estimate the byte size from an unclosed
ORC writer.

In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is trying
to propose an estimate approach to fix this fundamentally (Also enabled all
those ORC writer unit tests that we disabled intentionally before).

The approach is: If a file is still unclosed , let's estimate its size in
three steps ( PR:
https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
)

1. Size of data that has been written to stripe.The value is obtained by
summing the offset and length of the last stripe of the writer.
2. Size of data that has been submitted to the writer but has not been
written to the stripe. When creating OrcFileAppender, treeWriter is
obtained through reflection, and uses its estimateMemory to estimate how
much memory is being used.
3. Data that has not been submitted to the writer, that is, the size of the
buffer. The maximum default value of the buffer is used here.

My feeling is:

For the file-persisted bytes , I think using the last strip's offset plus
its length should be correct. For the memory encoded batch vector , the
TreeWriter#estimateMemory should be okay.
But for the batch vector whose rows did not flush to encoded memory, using
the batch.size shouldn't be correct. Because the rows can be any data type,
such as Integer, Long, Timestamp, String etc. As their widths are not the
same, I think we may need to use an average width minus the batch.size
(which is row count actually).

Another thing is about the `TreeWriter#estimateMemory` method, The current
`org.apache.orc.Writer` don't expose the `TreeWriter` field or
`estimateMemory` method to public, I will suggest to publish a PR to
apache ORC project to expose those interfaces in `org.apache.orc.Writer` (
see: https://github.com/apache/iceberg/pull/3784/files#r819238427 )

I'd like to invite the iceberg dev to evaluate the current approach. Is
there any other concern from the ORC experts' side ?

Thanks.

[DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

Reply via email to