Hello,

I am writing to ask about ROWS_DIVISOR in BaseTaskWriter. Based on
this<https://sourcegraph.com/github.com/apache/iceberg/-/blob/core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java?L339>,
the writer only checks whether to roll over to a new file every 1000 rows,
and it rolls only if the file has reached the target file size in bytes by
then. However, I have some data with very large rows that reaches 2GB before
1000 rows, and because of that Parquet throws an OutOfMemoryError in
CapacityByteArrayOutputStream<https://sourcegraph.com/github.com/apache/parquet-java/-/blob/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java?L176>
when it tries to write more rows to the same output stream.
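For reference, this is roughly how I read the roll-over logic in the linked
BaseTaskWriter code (a paraphrase, not the exact source; see the Sourcegraph
link above for the real thing):

    // Paraphrase of the roll-over check in BaseTaskWriter's RollingFileWriter.
    private static final int ROWS_DIVISOR = 1000;

    public void write(T record) throws IOException {
      currentWriter.write(record);
      currentRows++;

      // The file size is only inspected once every ROWS_DIVISOR rows, so a
      // file can grow far past the target size when individual rows are big.
      if (shouldRollToNewFile()) {
        closeCurrent();
        openCurrent();
      }
    }

    private boolean shouldRollToNewFile() {
      return currentRows % ROWS_DIVISOR == 0
          && currentWriter.length() >= targetFileSizeInBytes;
    }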

Is there a reason the writer must only check the size and flush the output
stream to a new file every 1000 rows? Is it possible to make this number
configurable to account for data with very large rows?
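
To illustrate, something along these lines would work for my case (just a
sketch; the table property name below is hypothetical and does not exist in
Iceberg today):

    // Hypothetical: read the check interval from a table property instead of
    // the hard-coded ROWS_DIVISOR. "write.rows-divisor" is a made-up name.
    private final int rowsDivisor =
        PropertyUtil.propertyAsInt(table.properties(), "write.rows-divisor", 1000);

    private boolean shouldRollToNewFile() {
      return currentRows % rowsDivisor == 0
          && currentWriter.length() >= targetFileSizeInBytes;
    }

Even setting it to 1 for tables like mine would avoid the OOM, at the cost of
checking the file length on every row.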

Thank you!
Best,
Ha
