bigluck commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1963740242
@kevinjqliu, your latest changes are mind-blowing
(https://github.com/apache/iceberg-python/issues/428#issuecomment-1962460623
for reference)
I have tested your latest changes on `c5ad.8xlarge` and `c5ad.16xlarge`
instances using my `10,000,000`-row table.
- On a `c5ad.8xlarge` instance with a `10Gbps NIC`, `32 cores`, and `64 GiB`
of RAM, it took an average of `3.9s` to write `14` Parquet files. Previously,
using `pynessie 0.6.0`, it took `31s`.
- On a `c5ad.16xlarge` instance with a `20Gbps NIC`, `64 cores`, and `128 GiB`
of RAM, it took approximately `3.6s` to complete the same task, compared to
`28.2s` using `pynessie 0.6.0`.
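
A quick sketch of the speedup these timings imply. It only does arithmetic on the figures quoted above; the dictionary layout and variable names are illustrative.

```python
# Timings quoted above, in seconds: (before, after) per instance type.
timings = {
    "c5ad.8xlarge": (31.0, 3.9),
    "c5ad.16xlarge": (28.2, 3.6),
}

# Speedup factor = old wall-clock time divided by new wall-clock time.
speedups = {name: before / after for name, (before, after) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
```

Both instance sizes come out at roughly an 8x improvement.
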
I have been experimenting with different settings to improve write
performance further, but without success.
I tried adjusting the `PYICEBERG_MAX_WORKERS` variable, but it did not make
much difference. This might be due to the small size of my dataset (only
`6.69 GB` in Arrow format), which resulted in only 14 output files.
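
One way to compare worker settings is to time the same job in a fresh process per value of `PYICEBERG_MAX_WORKERS`. The sketch below assumes the variable is read when the process sets up its worker pool, which is why each setting gets its own child process. The helper name `time_with_workers` and the `write_table.py` script mentioned in the comment are hypothetical; the demo command is a stand-in that just prints the variable.

```python
import os
import subprocess
import sys
import time


def time_with_workers(command, workers):
    """Run `command` in a child process with PYICEBERG_MAX_WORKERS set.

    Returns the elapsed wall-clock time in seconds.
    """
    env = dict(os.environ, PYICEBERG_MAX_WORKERS=str(workers))
    start = time.perf_counter()
    subprocess.run(command, env=env, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in for the real write job, e.g. [sys.executable, "write_table.py"]
    # (hypothetical script name, not part of the original comment).
    command = [
        sys.executable,
        "-c",
        "import os; print(os.environ['PYICEBERG_MAX_WORKERS'])",
    ]
    for workers in (8, 16, 32, 64):
        elapsed = time_with_workers(command, workers)
        print(f"max workers={workers}: {elapsed:.2f}s")
```
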
I also tested the `write.target-file-size-bytes` property, which produced 27
files when set to `268435456` and 54 files when set to `134217728`.
However, even when I set `PYICEBERG_MAX_WORKERS` to 64, the total write
operation still took 3.6 seconds.
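
The file counts above are consistent with a simple estimate: in-memory Arrow size divided by the target file size, rounded up. The sketch below checks that arithmetic. It rests on two assumptions that the comment does not state: that `6.69 GB` means GiB, and that the 14-file run used a target of `536870912` bytes (512 MiB). It is not a description of how the writer actually splits files.

```python
import math

GIB = 2**30
MIB = 2**20

# Assumption: the reported "6.69 GB" Arrow size is in GiB.
arrow_size_bytes = 6.69 * GIB

# write.target-file-size-bytes value -> number of files observed.
# Assumption: the 14-file run used a 512 MiB target (536870912 bytes).
observed = {536870912: 14, 268435456: 27, 134217728: 54}

for target, files in observed.items():
    estimate = math.ceil(arrow_size_bytes / target)
    print(f"target={target // MIB} MiB: estimated {estimate} files, observed {files}")
```

Under those assumptions the estimate matches all three observations, and halving the target roughly doubles the number of files.
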
Overall, I am very impressed with how it works now! Well done!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]