We introduced that check to avoid expensive calls to determine the current
file size. We did not make it configurable because we expected the default to
work in most use cases and were reluctant to add yet another config that
nobody would ever set. I think we made a similar check configurable when
writing Parquet row groups [1]. We can do the same thing here.
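
To make this concrete, here is a rough sketch of what a configurable check
interval could look like. The class, property name, and method names below are
made up for illustration and are not the actual BaseTaskWriter code:

import java.util.function.LongSupplier;

// Sketch only: roll to a new file once the target size is reached, but probe
// the (potentially expensive) file size only every `sizeCheckRows` rows
// instead of the hard-coded ROWS_DIVISOR of 1000.
public class RollingFileWriterSketch {
  // Hypothetical table property, analogous to the Parquet row-group size
  // check made configurable in [1].
  public static final String SIZE_CHECK_ROWS = "write.target-file-size-check-rows";
  public static final int SIZE_CHECK_ROWS_DEFAULT = 1000;

  private final long targetFileSizeBytes;
  private final int sizeCheckRows;
  private long rowsInCurrentFile = 0;

  public RollingFileWriterSketch(long targetFileSizeBytes, int sizeCheckRows) {
    this.targetFileSizeBytes = targetFileSizeBytes;
    this.sizeCheckRows = sizeCheckRows;
  }

  // Called after each row is written; fileSizeInBytes wraps the expensive
  // size lookup so it runs only once every sizeCheckRows rows.
  public boolean shouldRollToNewFile(LongSupplier fileSizeInBytes) {
    rowsInCurrentFile++;
    return rowsInCurrentFile % sizeCheckRows == 0
        && fileSizeInBytes.getAsLong() >= targetFileSizeBytes;
  }

  // Reset the counter once the caller actually rolls to a new file.
  public void startNewFile() {
    rowsInCurrentFile = 0;
  }
}

A table with very large rows could then set the interval to something small
(even 1) so the writer rolls before the Parquet output buffer grows toward its
2GB limit, at the cost of more frequent size checks.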

[1] - https://github.com/apache/iceberg/pull/3181

- Anton

On Fri, Jul 19, 2024 at 13:18 Ha Cao <ha....@twosigma.com> wrote:

> Hello,
>
> I am writing to ask about the ROWS_DIVISOR in BaseTaskWriter. Based on
> this
> <https://sourcegraph.com/github.com/apache/iceberg/-/blob/core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java?L339>,
> we only check whether to roll over to a new file every 1000 rows, and we
> roll over only if the target file size in bytes has been reached at that
> point. However, I have some data with very large rows that reaches 2GB
> before 1000 rows, and because of that the CapacityByteArrayOutputStream
> <https://sourcegraph.com/github.com/apache/parquet-java/-/blob/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java?L176>
> throws an OutOfMemoryError when it tries to write more rows to the same
> output stream.
>
> Is there a reason that the file size check must only happen every 1000
> rows? Is it possible to make this number configurable to account for data
> with very large rows?
>
> Thank you!
>
> Best,
>
> Ha
>
