ifxchris commented on issue #1994:
URL: https://github.com/apache/iceberg-python/issues/1994#issuecomment-2897769050
Hi all,
we are experiencing the same issue, but in a more severe form:
~~~
# Estimate the average row size and derive how many rows fit into one
# target-sized file.
avg_row_size_bytes = tbl.nbytes / tbl.num_rows
target_rows_per_file = target_file_size // avg_row_size_bytes
# max_chunksize becomes 0 when a single row exceeds target_file_size.
batches = tbl.to_batches(max_chunksize=target_rows_per_file)
~~~
Our data is loaded from a parquet file with the following row-group metadata:
`Row group 0: count: 1 4.163 MB records start: 4 total(compressed): 4.163 MB total(uncompressed): 40.739 MB`
So the table contains only a single record.
According to `tbl.nbytes`, it takes up around 600 MB in memory.
Since this single record is larger than the 512 MB target file size, `target_rows_per_file`
is calculated as zero by the floor division.
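To make the arithmetic concrete, a short worked example with the numbers from above (the 512 MB figure is my assumption that the default `write.target-file-size-bytes` is in effect):

~~~
avg_row_size_bytes = 600 * 1024**2 / 1   # ~600 MB in memory, num_rows == 1
target_file_size = 512 * 1024**2         # assumed 512 MB default target
target_rows_per_file = target_file_size // avg_row_size_bytes
print(target_rows_per_file)              # 0.0 -> used as max_chunksize
~~~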
As a result, `max_chunksize` is set to 0 and pyiceberg crashes because of it.
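For illustration, here is a minimal sketch of a guard that would avoid the zero chunk size. The function name `bin_pack_batches` is hypothetical and this is not the actual pyiceberg code path, just the same sizing logic with a floor of one row per batch:

~~~
import pyarrow as pa

def bin_pack_batches(tbl: pa.Table, target_file_size: int) -> list[pa.RecordBatch]:
    """Split a table into batches of roughly target_file_size bytes each."""
    avg_row_size_bytes = tbl.nbytes / tbl.num_rows
    # Floor division yields 0 when a single row is larger than the target
    # file size; clamp to at least 1 so to_batches() gets a valid chunk size.
    target_rows_per_file = max(1, int(target_file_size // avg_row_size_bytes))
    return tbl.to_batches(max_chunksize=target_rows_per_file)
~~~

With a floor of one row per batch, an oversized record still produces a batch (and hence a data file) larger than the target, but the write no longer fails on a zero `max_chunksize`.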