xiarixiaoyao commented on pull request #4012: URL: https://github.com/apache/hudi/pull/4012#issuecomment-1082593749
@vinothchandar @xushiyan At line 121 of SparkRDDWriteClient.java we call collect on the RDD, which triggers the computation again:

```
List<HoodieWriteStat> writeStats = writeStatuses.map(WriteStatus::getStat).collect();
```

If we call isEmpty at line 595 of HoodieSparkSqlWriter, what will happen?

```
if (writeResult.getWriteStatuses.rdd.filter(ws => ws.hasErrors).isEmpty()) {
```

1) isEmpty triggers only a partial computation.
2) At line 121 of SparkRDDWriteClient.java the computation is triggered again, because the cache is not available.

If we call count at line 595 of HoodieSparkSqlWriter instead, what will happen?

1) count triggers the full computation and causes the result to be cached.
2) At line 121 of SparkRDDWriteClient.java the computation is not triggered again, because the result is already cached.

So I think count is better than isEmpty.
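
To make the argument above concrete, here is a minimal sketch in plain Java. It is a hypothetical toy model, not Spark and not Hudi code: the class `CachedPartitions` and its methods `noErrors`, `errorCount`, `collect` and `computeCalls` are invented for illustration, and the per-partition cache is an assumption about how a persisted RDD behaves. `noErrors` stands in for `filter(ws => ws.hasErrors).isEmpty()` and stops at the first partition that contains an error; `errorCount` stands in for `count` and always evaluates every partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical toy model of a lazily computed, partitioned dataset with a
// per-partition cache. It only counts how often a partition is computed.
class CachedPartitions {
    // Each partition yields one hasErrors flag per simulated write status.
    private final List<Supplier<List<Boolean>>> partitions;
    private final List<List<Boolean>> cache = new ArrayList<>();
    private int computeCalls = 0;

    CachedPartitions(List<Supplier<List<Boolean>>> partitions) {
        this.partitions = partitions;
        for (int i = 0; i < partitions.size(); i++) {
            cache.add(null); // nothing is cached until a partition is evaluated
        }
    }

    // Evaluate one partition, reusing the cached result when it exists.
    private List<Boolean> partition(int i) {
        if (cache.get(i) == null) {
            computeCalls++;
            cache.set(i, partitions.get(i).get());
        }
        return cache.get(i);
    }

    // Stand-in for filter(hasErrors).isEmpty(): stops at the first error found.
    boolean noErrors() {
        for (int i = 0; i < partitions.size(); i++) {
            if (partition(i).contains(Boolean.TRUE)) {
                return false;
            }
        }
        return true;
    }

    // Stand-in for filter(hasErrors).count(): always evaluates every partition.
    long errorCount() {
        long n = 0;
        for (int i = 0; i < partitions.size(); i++) {
            for (Boolean hasErrors : partition(i)) {
                if (hasErrors) {
                    n++;
                }
            }
        }
        return n;
    }

    // Stand-in for collect(): needs every partition, cached or not.
    List<Boolean> collect() {
        List<Boolean> all = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) {
            all.addAll(partition(i));
        }
        return all;
    }

    int computeCalls() {
        return computeCalls;
    }

    // Three partitions; only the first one contains a failed write status.
    static CachedPartitions sample() {
        List<Supplier<List<Boolean>>> parts = new ArrayList<>();
        parts.add(() -> Arrays.asList(true, false));
        parts.add(() -> Arrays.asList(false));
        parts.add(() -> Arrays.asList(false, false));
        return new CachedPartitions(parts);
    }

    public static void main(String[] args) {
        CachedPartitions a = sample();
        a.noErrors();
        int beforeA = a.computeCalls();
        a.collect();
        System.out.println("computed during collect after noErrors: "
                + (a.computeCalls() - beforeA));

        CachedPartitions b = sample();
        b.errorCount();
        int beforeB = b.computeCalls();
        b.collect();
        System.out.println("computed during collect after errorCount: "
                + (b.computeCalls() - beforeB));
    }
}
```

In this toy model, `collect()` after `noErrors()` still has to compute the two partitions that the short-circuit skipped, while `collect()` after `errorCount()` computes none. How closely real Spark matches this depends on the storage level and on how many partitions `isEmpty` ends up scanning, so treat it as an illustration of the reasoning rather than a measurement.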