nsivabalan commented on issue #17866:
URL: https://github.com/apache/hudi/issues/17866#issuecomment-4145534408
I am categorizing the cases into three.
- Arguments: list of partitions to stash, stash partition folder
No MDT & HDFS:
1. Compare the source folder and the target folder.
   - a. If the target folder is not found, ignore (the source was never copied over).
   - b. If the source is not found, ignore (the source has already been fully copied over).
   Track how many of the partitions (to be stashed) are yet to be copied over.
2. If we found any partitions to stash: // Ideally we should have a failed delete_partition if the previous attempt failed, in which case we should re-attempt the failed delete_partition.
   If not, trigger a delete_partition write operation in Hudi w/ a custom PreCommitValidator.
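Assuming HDFS rename is atomic (a partition folder is either fully moved to the stash or untouched), step 1 above might be sketched like this. `partitions_pending_stash` and the local-filesystem calls are hypothetical stand-ins for the DFS operations, not Hudi APIs:

```python
import os

def partitions_pending_stash(base_path, stash_path, partitions):
    """Step 1 sketch: return the partitions (to be stashed) that are
    yet to be copied over to the stash folder."""
    pending = []
    for partition in partitions:
        source = os.path.join(base_path, partition)
        target = os.path.join(stash_path, partition)
        if not os.path.isdir(target):
            # (a) target not found: the stash was never attempted
            pending.append(partition)
        elif not os.path.isdir(source):
            # (b) source not found: already fully copied over; ignore
            pass
        # both present should not occur if the DFS rename is atomic
    return pending
```

The returned list feeds the "how many partitions are yet to be copied over" tracking in step 1.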
No MDT and GCS:
1. Compare the source folder and the target folder.
   - a. If the target folder is not found, ignore (the source was never copied over).
   - b. If the source is empty, ignore (the source has already been fully copied over).
   - c. If files are found in both, copy the files from target back to source.
   Track how many of the partitions (to be stashed) are yet to be copied over (cases a and c).
2. If we found any partitions to stash: // Ideally we should have a failed delete_partition if the previous attempt failed, in which case we should re-attempt the failed delete_partition.
   If not, trigger a delete_partition write operation in Hudi w/ a custom PreCommitValidator.
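Since a GCS "rename" is a non-atomic copy-then-delete, a failed attempt can leave files in both folders (case c), and the reconciliation has to copy the stashed files back before re-attempting. A local-filesystem sketch of the per-partition decision, with `reconcile_gcs_partition` as a hypothetical helper:

```python
import os
import shutil

def reconcile_gcs_partition(base_path, stash_path, partition):
    """Step 1 sketch (GCS): return True if this partition is still
    pending stash (cases a and c), False if fully copied over (case b)."""
    source = os.path.join(base_path, partition)
    target = os.path.join(stash_path, partition)
    if not os.path.isdir(target):
        return True  # (a) target not found: stash never attempted
    source_files = os.listdir(source) if os.path.isdir(source) else []
    if not source_files:
        return False  # (b) source empty: fully copied over already
    # (c) files found in both: copy the files from target back to source,
    # restoring the source partition before the stash is re-attempted
    for name in os.listdir(target):
        shutil.copy2(os.path.join(target, name), os.path.join(source, name))
    return True
```

Partitions for which this returns True are the ones tracked as "yet to be copied over" in step 1.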
MDT use-cases:
Step 1 can be skipped.
2. If we found any partitions to stash: // Ideally we should have a failed delete_partition if the previous attempt failed, in which case we should re-attempt the failed delete_partition.
   If not, trigger a delete_partition write operation in Hudi w/ a custom PreCommitValidator.
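Step 2 is common to all three cases, and boils down to a Hudi Spark write in `delete_partition` mode with a pre-commit validator configured. A minimal sketch of the options involved; `hoodie.datasource.write.operation`, `hoodie.datasource.write.partitions.to.delete`, and `hoodie.precommit.validators` are existing Hudi configs, while the `StashPreCommitValidator` class name is an illustrative assumption:

```python
# Partitions the utility decided still need stashing (example values).
partitions_to_stash = ["2023/01/01", "2023/01/02"]

# Hudi Spark write options to trigger delete_partition with a custom
# pre-commit validator (validator class name is hypothetical).
hudi_options = {
    "hoodie.datasource.write.operation": "delete_partition",
    "hoodie.datasource.write.partitions.to.delete": ",".join(partitions_to_stash),
    "hoodie.precommit.validators": "org.example.StashPreCommitValidator",
}

# The write itself would look roughly like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

The stash parent folder and the partition list would additionally be passed through the commit's extraMetadata so the validator can read them back.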
Custom SparkPreCommitValidator:
1. Reads the stash partition parent folder and `list_of_partition_to_delete` from the extraMetadata of the ongoing operation (asserting that it is a delete_partition). If the dataset has an MDT, then assert that `list_of_partition_to_delete` is empty.
2. Creates a Spark task for each partition that is not in `list_of_partition_to_delete`. Within this task, get the source path (basepath/partition) and the dest path (stash_folder/partition), creating the latter's parent folders if they don't already exist. Then call `renameHelper` on source -> dest.
3. HDFS: For each partition in `list_of_partition_to_delete`, delete all files in the dataset partition. In case this deletePartition call fails, step (1) of the utility function will add these files back on the next attempt anyway. // won't the rename helper already do this? why do we need this step?
4. If any DFS operation fails, throw an exception.
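The per-partition task of validator step 2 might look like the following local-filesystem sketch, where `stash_partition` is a hypothetical stand-in for the Spark task body and `os.rename` stands in for the DFS `renameHelper`:

```python
import os

def stash_partition(base_path, stash_path, partition):
    """Step 2 sketch: move basepath/partition -> stash_folder/partition,
    creating the destination's parent folders if not already present."""
    source = os.path.join(base_path, partition)
    dest = os.path.join(stash_path, partition)
    # Create the parent folders under the stash root (e.g. stash/2023/01
    # for a partition path 2023/01/01) before renaming into place.
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.rename(source, dest)  # stand-in for renameHelper(source, dest)
```

Any exception from the rename would propagate out of the task, which is what step 4 relies on to fail the validation.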