[GitHub] [hudi] xiarixiaoyao commented on pull request #4012: [HUDI-2777] Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.

GitBox Tue, 29 Mar 2022 20:55:04 -0700


xiarixiaoyao commented on pull request #4012:
URL: https://github.com/apache/hudi/pull/4012#issuecomment-1082593749



   @vinothchandar @xushiyan 
   At line 121 of SparkRDDWriteClient.java we call collect on the RDD, which triggers the computation again:
   ```
       List<HoodieWriteStat> writeStats = writeStatuses.map(WriteStatus::getStat).collect();
   ```
   What happens if we call isEmpty at line 595 of HoodieSparkSqlWriter?
   ```
       if (writeResult.getWriteStatuses.rdd.filter(ws => ws.hasErrors).isEmpty()) {
   ```
   1) isEmpty triggers only a partial computation.
   2) At line 121 of SparkRDDWriteClient.java the computation is triggered again, because the cache is not available.
   
   What happens if we call count at line 595 of HoodieSparkSqlWriter instead?
   1) count triggers the full computation and populates the cache with the result.
   2) At line 121 of SparkRDDWriteClient.java the computation is not triggered again, because the result is already cached.
   
   So I think count is better than isEmpty.
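
To illustrate the reasoning above, here is a toy model in plain Java. It is not Spark or Hudi code: `ToyCachedRdd` and `ToyCacheDemo` are hypothetical names. The sketch assumes that a partition is cached once it has been computed, and that isEmpty stops at the first partition that yields an element. How many partitions isEmpty really touches in Spark depends on the data and on Spark's implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a persisted RDD (hypothetical, not a Spark class). */
final class ToyCachedRdd {
    private final List<List<Integer>> input;   // raw input, one inner list per partition
    private final List<List<Integer>> cache;   // null entry = partition not computed yet
    private int computeCalls = 0;              // number of partition computations so far

    ToyCachedRdd(List<List<Integer>> input) {
        this.input = input;
        this.cache = new ArrayList<>();
        for (int i = 0; i < input.size(); i++) {
            cache.add(null);
        }
    }

    /** Computes partition i on first access and serves it from the cache afterwards. */
    private List<Integer> partition(int i) {
        List<Integer> result = cache.get(i);
        if (result == null) {
            computeCalls++;
            result = new ArrayList<>();
            for (int x : input.get(i)) {
                result.add(x * 2);             // stand-in for the expensive write
            }
            cache.set(i, result);
        }
        return result;
    }

    /** Assumption: stops at the first partition that yields an element. */
    boolean isEmpty() {
        for (int i = 0; i < input.size(); i++) {
            if (!partition(i).isEmpty()) {
                return false;
            }
        }
        return true;
    }

    /** Visits every partition, so every partition ends up in the cache. */
    long count() {
        long total = 0;
        for (int i = 0; i < input.size(); i++) {
            total += partition(i).size();
        }
        return total;
    }

    /** Gathers all partitions, computing only those that are not cached yet. */
    List<Integer> collect() {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < input.size(); i++) {
            all.addAll(partition(i));
        }
        return all;
    }

    int computeCalls() {
        return computeCalls;
    }
}

final class ToyCacheDemo {
    /** Returns how many partitions collect has to compute after the first action. */
    static int computesDuringCollect(boolean useCount) {
        ToyCachedRdd rdd = new ToyCachedRdd(List.of(
                List.of(1, 2), List.of(3), List.of(4, 5), List.of(6)));
        if (useCount) {
            rdd.count();
        } else {
            rdd.isEmpty();
        }
        int before = rdd.computeCalls();
        rdd.collect();
        return rdd.computeCalls() - before;
    }

    public static void main(String[] args) {
        System.out.println("after isEmpty: " + computesDuringCollect(false)); // prints 3
        System.out.println("after count:   " + computesDuringCollect(true));  // prints 0
    }
}
```

With four non-empty partitions in this model, collect has to compute three partitions after isEmpty and none after count.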
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
