
[GitHub] [hudi] xiarixiaoyao commented on pull request #4012: [HUDI-2777] Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.

2022-03-29 Thread GitBox

xiarixiaoyao commented on pull request #4012: URL: https://github.com/apache/hudi/pull/4012#issuecomment-1082593749 @vinothchandar @xushiyan At line 121 of SparkRDDWriteClient.java, we call collect on the RDD, which triggers the computation again. ``` List writeStats = writeStatuses.

[GitHub] [hudi] xiarixiaoyao commented on pull request #4012: [HUDI-2777] Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.

2021-11-26 Thread GitBox

xiarixiaoyao commented on pull request #4012: URL: https://github.com/apache/hudi/pull/4012#issuecomment-980520231 @leesf @vinothchandar @xushiyan With a large amount of data, the performance gap is very obvious. Our environment uses 100 GB of data, and the performance is signif

[GitHub] [hudi] xiarixiaoyao commented on pull request #4012: [HUDI-2777] Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.

2021-11-18 Thread GitBox

xiarixiaoyao commented on pull request #4012: URL: https://github.com/apache/hudi/pull/4012#issuecomment-973649150 @hudi-bot run azure