[GitHub] [hudi] Rap70r commented on issue #3697: [SUPPORT] Performance Tuning: How to speed up stages?

2021-12-02 Thread GitBox
Rap70r commented on issue #3697: URL: https://github.com/apache/hudi/issues/3697#issuecomment-984910975 Hi @xushiyan, I was wondering if there is a way to control the size of the parquet files created under a partition. For example, if a partition has 1 million records, it will probably create …
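
For context, Hudi exposes write-side configs that bound the size of the parquet files it produces in each partition. The sketch below is illustrative only, assuming a copy-on-write table written through the Spark datasource; the table name, record key, partition, precombine fields, and base path are placeholders, not taken from this issue.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

/** Sketch: write a batch to a Hudi copy-on-write table with bounded parquet file sizes.
  * Table name, key/partition/precombine fields, and the target path are hypothetical. */
def upsertWithFileSizing(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "my_table")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.partitionpath.field", "dt")
    .option("hoodie.datasource.write.precombine.field", "ts")
    // Roll over to a new parquet file once it reaches roughly this many bytes (~128 MB here).
    .option("hoodie.parquet.max.file.size", (128L * 1024 * 1024).toString)
    // Files under this size are treated as "small files" and are bin-packed with new inserts.
    .option("hoodie.parquet.small.file.limit", (100L * 1024 * 1024).toString)
    .mode(SaveMode.Append)
    .save(basePath)
}
```

Roughly, the number of files per partition then comes out to the partition's data volume divided by `hoodie.parquet.max.file.size`.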

[GitHub] [hudi] Rap70r commented on issue #3697: [SUPPORT] Performance Tuning: How to speed up stages?

2021-10-10 Thread GitBox
Rap70r commented on issue #3697: URL: https://github.com/apache/hudi/issues/3697#issuecomment-939628475 Hi @xushiyan, We increased the topic partitions from 50 to 400 and configured Spark to make full use of the available executors. The speed has improved to a good level. If there are no additional suggestions …
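
If the ingest reads through Spark's Kafka source (an assumption; the read path isn't shown in this preview), read parallelism follows the topic's partition count, which is why moving from 50 to 400 partitions raises the achievable parallelism. A minimal sketch, assuming the spark-sql-kafka connector is on the classpath and with placeholder broker and topic names:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: Spark's Kafka source creates one input partition per topic partition,
// so a 400-partition topic yields up to 400 parallel read tasks.
// Broker and topic names below are placeholders, not from the issue.
val spark = SparkSession.builder().appName("hudi-ingest").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events-topic")
  .option("startingOffsets", "latest")
  .load()
```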

[GitHub] [hudi] Rap70r commented on issue #3697: [SUPPORT] Performance Tuning: How to speed up stages?

2021-10-04 Thread GitBox
Rap70r commented on issue #3697: URL: https://github.com/apache/hudi/issues/3697#issuecomment-933893616 Hi @xushiyan, Here is an update on our latest tests. I have switched to the d3.xlarge instance type and used the following configs: `spark-submit --deploy-mode cluster --conf spark…`
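
As a rough sizing sketch only: assuming d3.xlarge nodes provide 4 vCPUs and 32 GiB of memory (verify against the EC2 instance table), one executor per node with one core and a few GiB held back for the OS/YARN is a common starting point. The values below are illustrative, not the configs used in this issue:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of executor sizing for an assumed 4 vCPU / 32 GiB node:
// keep ~1 core and ~8 GiB for the OS and YARN, giving one executor per node
// with 3 cores and ~21 GiB heap plus ~3 GiB overhead. Numbers are illustrative.
val spark = SparkSession.builder()
  .appName("hudi-upsert")
  .config("spark.executor.cores", "3")
  .config("spark.executor.memory", "21g")
  .config("spark.executor.memoryOverhead", "3g")
  // Kryo serialization is the serializer Hudi's docs generally recommend.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```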

[GitHub] [hudi] Rap70r commented on issue #3697: [SUPPORT] Performance Tuning: How to speed up stages?

2021-09-23 Thread GitBox
Rap70r commented on issue #3697: URL: https://github.com/apache/hudi/issues/3697#issuecomment-926186766 Hi @xushiyan, We did some tests using a different instance type (20 machines of type m5.2xlarge) and fewer partitions. Here's the job flow for an upsert of 130K records (330 MB) …
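
For an upsert this small (130K records, ~330 MB), oversized shuffle parallelism mostly adds task-scheduling overhead. A hedged sketch of the relevant knob, where 320 is simply about 2x the roughly 160 vCPUs of 20 m5.2xlarge nodes and everything else is a placeholder:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: for a ~330 MB / 130K-record upsert, a few hundred shuffle partitions is
// usually plenty. The exact values below are illustrative, not from the issue.
def upsertSmallBatch(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "my_table")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.partitionpath.field", "dt")
    .option("hoodie.datasource.write.precombine.field", "ts")
    // Parallelism of Hudi's upsert shuffle; ~2-3x total executor cores is a
    // common starting point for small incremental batches.
    .option("hoodie.upsert.shuffle.parallelism", "320")
    .mode(SaveMode.Append)
    .save(basePath)
}
```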

[GitHub] [hudi] Rap70r commented on issue #3697: [SUPPORT] Performance Tuning: How to speed up stages?

2021-09-22 Thread GitBox
Rap70r commented on issue #3697: URL: https://github.com/apache/hudi/issues/3697#issuecomment-924937292 Hello @xushiyan, Thank you for getting back to me. Just a clarification that the above data size (1714 megabytes, 1.4 million records) is the usual incremental data size we expect on each …
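
From the figures quoted here, the average record size works out to roughly 1.2 KB, which is also the kind of number Hudi's copy-on-write record-size estimate consumes. A small worked sketch; setting the estimate explicitly is an illustration, not something stated in the thread:

```scala
// Back-of-the-envelope from the figures in the comment: ~1714 MB over ~1.4M records
// works out to roughly 1.2 KB per record.
val incomingBytes   = 1714L * 1024 * 1024
val incomingRecords = 1400000L
val avgRecordSize   = incomingBytes / incomingRecords  // ≈ 1283 bytes

// Hudi uses an average-record-size estimate to bin-pack inserts into file groups;
// seeding it explicitly can help the first few commits before Hudi derives it from
// commit metadata. The value here is just the arithmetic above.
val hudiOpts = Map(
  "hoodie.copyonwrite.record.size.estimate" -> avgRecordSize.toString
)
```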