zhouyifan279 commented on PR #41628: URL: https://github.com/apache/spark/pull/41628#issuecomment-2100365610
To eliminate the data inconsistency issue, we should handle custom partitions in `HadoopMapReduceCommitProtocol.commitJob`, instead of writing to the final output path and then moving the partition directories to their custom locations:

1. Get all partitionPaths from `TaskCommitMessage.obj._2` (`TaskCommitMessage.obj._1` is empty, as we do not have `customPartitionLocations` at this step).
2. Use partitionPaths to get matchingPartitions, then get customPartitionLocations the same way this PR does.
3. Move partitionPaths to their final locations according to customPartitionLocations.

@jeanlyn @bowenliang123 @attilapiros what do you think?
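The three steps proposed above can be sketched as a small filesystem simulation. This is only an illustration of the control flow, not Spark code: `commit_partitions`, `resolve_custom_locations`, `catalog_locations`, and the directory layout are all hypothetical names, and the dictionary lookup stands in for whatever catalog query would produce matchingPartitions in the real commit protocol.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_custom_locations(partition_paths, catalog_locations):
    """Step 2: keep only the written partitions that have a custom location.

    `catalog_locations` is a hypothetical stand-in for the catalog lookup.
    """
    return {p: catalog_locations[p] for p in partition_paths if p in catalog_locations}


def commit_partitions(staging_dir, output_dir, partition_paths, catalog_locations):
    """Steps 1-3: move every staged partition directly to its final location."""
    custom = resolve_custom_locations(partition_paths, catalog_locations)
    final_locations = {}
    for part in sorted(partition_paths):
        src = Path(staging_dir) / part
        # Custom location if one is registered, otherwise the default table path.
        dst = Path(custom[part]) if part in custom else Path(output_dir) / part
        if dst.exists():
            # Overwrite handling is deliberately left out of this sketch.
            raise FileExistsError(f"destination already exists: {dst}")
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        final_locations[part] = dst
    return final_locations


def demo():
    """Stage two partitions; one of them has a custom location."""
    root = Path(tempfile.mkdtemp())
    staging = root / "staging"
    output = root / "warehouse" / "t"
    custom_dir = root / "external" / "dt_2024_01_02"
    for part in ("dt=2024-01-01", "dt=2024-01-02"):
        (staging / part).mkdir(parents=True)
        (staging / part / "part-00000").write_text("rows")
    partition_paths = {"dt=2024-01-01", "dt=2024-01-02"}  # step 1 input
    catalog_locations = {"dt=2024-01-02": str(custom_dir)}
    final = commit_partitions(staging, output, partition_paths, catalog_locations)
    return root, final
```

In this sketch each partition is moved exactly once, from staging to its final location, so no partition data passes through the default output path before reaching a custom location.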