nsivabalan commented on issue #17866:
URL: https://github.com/apache/hudi/issues/17866#issuecomment-4145534408

   I am categorizing the cases into three.
   
   - Arguments: list of partitions to stash, stash partition folder
   
   No MDT (metadata table) & HDFS:
   
   1. Compare the source folder and the target folder.
     - a. If the target folder is not found, ignore (copying the source was never attempted).
     - b. If the source folder is not found, ignore (the source was already fully copied over).
   
   Track how many partitions (to be stashed) are yet to be copied over.
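   A minimal sketch of this compare step for the HDFS case, using local paths as a stand-in for DFS calls (the function name and return shape are illustrative, not Hudi APIs):

   ```python
   from pathlib import Path

   def pending_stash_partitions(source_root, target_root, partitions_to_stash):
       """Step 1 for the no-MDT + HDFS case: decide which stash partitions
       still need to be copied over."""
       pending = []
       for partition in partitions_to_stash:
           source = Path(source_root) / partition
           target = Path(target_root) / partition
           if not target.exists():
               # (a) copy was never attempted -> still pending
               pending.append(partition)
           elif not source.exists():
               # (b) already fully copied over -> nothing to do
               continue
           # HDFS rename is atomic, so source and target should not both exist
       return pending
   ```

   Because HDFS renames are atomic, each partition should be wholly at the source or wholly at the target, which is why there is no partial-copy case here (unlike GCS below).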
   
   2. If we found any partitions to stash: // Ideally there should be a failed delete_partition if the previous attempt failed, in which case we should re-attempt that failed delete_partition.
   If not, trigger a delete_partition write operation in Hudi with a custom PreCommitValidator.
   
   No MDT and GCS: 
   1. Compare the source folder and the target folder.
     - a. If the target folder is not found, ignore (copying the source was never attempted).
     - b. If the source folder is empty, ignore (the source was already fully copied over).
     - c. If files are found in both, copy the files from target back to source.
   
   Track how many partitions (to be stashed) are yet to be copied over (cases a and c).
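   The GCS variant can be sketched the same way; since object-store folder copies are file-by-file (non-atomic), a failed attempt can leave files in both places, and case (c) copies them back so the next attempt starts from a complete source. Local paths stand in for GCS calls; names are illustrative:

   ```python
   import shutil
   from pathlib import Path

   def reconcile_gcs_partition(source_dir, target_dir):
       """Step 1 for the no-MDT + GCS case: classify one partition and,
       for a partial copy, restore target files back to the source."""
       src, dst = Path(source_dir), Path(target_dir)
       if not dst.exists():
           return "pending"                    # (a) copy never attempted
       if not src.exists() or not any(src.iterdir()):
           return "done"                       # (b) fully copied over already
       for f in dst.iterdir():                 # (c) partial: copy back to source
           shutil.copy2(f, src / f.name)
       return "pending"
   ```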
   
   2. If we found any partitions to stash: // Ideally there should be a failed delete_partition if the previous attempt failed, in which case we should re-attempt that failed delete_partition.
   If not, trigger a delete_partition write operation in Hudi with a custom PreCommitValidator.
   
   
   MDT use-cases: 
   
   Step 1 can be skipped.
   2. If we found any partitions to stash: // Ideally there should be a failed delete_partition if the previous attempt failed, in which case we should re-attempt that failed delete_partition.
   If not, trigger a delete_partition write operation in Hudi with a custom PreCommitValidator.
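   Step 2 is the same control flow in all three cases; a sketch with hypothetical callables standing in for the Hudi timeline lookup and write operations (none of these argument names are real Hudi APIs):

   ```python
   def run_stash_step_two(pending_partitions, find_failed_delete_partition,
                          retry_delete_partition, trigger_delete_partition):
       """Re-attempt a previously failed delete_partition if one exists on the
       timeline; otherwise trigger a fresh delete_partition write operation
       (which would carry the custom PreCommitValidator)."""
       if not pending_partitions:
           return "nothing-to-do"
       failed = find_failed_delete_partition()
       if failed is not None:
           retry_delete_partition(failed)
           return "re-attempted"
       trigger_delete_partition(pending_partitions)
       return "triggered"
   ```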
   
   Custom SparkPreCommitValidator:
   
   1. Read the stash partition parent folder and `list_of_partition_to_delete` from the extraMetadata of the ongoing operation (asserting that it is a delete_partition). If the dataset has an MDT, assert that `list_of_partition_to_delete` is empty.
   2. Create a Spark task for each partition that is not in `list_of_partition_to_delete`. Within this task, get the source path (basepath/partition) and the dest path (stash_folder/partition), creating the latter folders if they do not already exist. Then call `renameHelper` on source -> dest.
   3. HDFS: For each partition in `list_of_partition_to_delete`, delete all files in the dataset partition. If this deletePartition call fails, step (1) of the utility will add these files back on the next attempt anyway. // Won't the rename helper already do this? Why do we need this step?
   4. If any DFS operation fails, throw an exception.
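   The validator body above can be sketched as a sequential stand-in for the per-partition Spark tasks. The argument names mirror the extraMetadata fields described above, but the exact field layout and the `renameHelper` replacement are assumptions:

   ```python
   from pathlib import Path

   def validate_and_stash(base_path, stash_folder, partitions,
                          list_of_partition_to_delete, operation, has_mdt):
       """Sketch of the custom SparkPreCommitValidator logic, run
       sequentially instead of as one Spark task per partition."""
       assert operation == "delete_partition"
       to_delete = set(list_of_partition_to_delete)
       if has_mdt:
           assert not to_delete  # with an MDT the recovery list must be empty
       for partition in partitions:
           if partition in to_delete:
               continue
           src = Path(base_path) / partition
           dst = Path(stash_folder) / partition
           dst.parent.mkdir(parents=True, exist_ok=True)
           src.rename(dst)  # local stand-in for renameHelper(src, dst)
       # HDFS-only step 3: delete leftover files for partitions in the list
       for partition in to_delete:
           part_dir = Path(base_path) / partition
           if part_dir.exists():
               for f in part_dir.iterdir():
                   f.unlink()
               part_dir.rmdir()
   ```

   Any `OSError` raised by the rename or delete calls propagates, which matches step 4 (fail the validator on any DFS error).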
   
   

