[GitHub] [hudi] nsivabalan commented on issue #8584: [SUPPORT] Spark SQL query FileNotFoundException using cleaner policy KEEP_LATEST_BY_HOURS

via GitHub Fri, 28 Apr 2023 16:29:21 -0700


nsivabalan commented on issue #8584:
URL: https://github.com/apache/hudi/issues/8584#issuecomment-1528193678


   Hey @tpcross: can you share the entire contents of ".hoodie" for us to 
inspect? Since it's in S3, when you copy it locally, can you use rsync rather 
than "cp" so that the last-modified times stay intact? 
   
   From what I can glean, this is what you are reporting: 
   
   The file group of interest had only one file slice, dated 23rd Nov, 2022. 
   The query started around April 21, 2023, and new commits then added two new 
file slices.
   I assume that between these two points in time there were no other commits 
that created new file slices for the file group of interest. Can you confirm 
that? 
   
   But within 2.5 hours, the cleaner removed the file slice created on 23rd 
Nov, which the running query was still trying to read, and the query failed. 
   
   I went through the code. 
   From what I see, this is what the code is supposed to do. I still need to 
test/reproduce it to confirm, though. 
   
   Whenever clean planning kicks in, we deduce the earliest commit to retain 
based on the number of hours configured. For example, if you have configured 
12 hours, we walk through the timeline and choose the commit just before the 
12-hour mark. 
   
   Then, for each file group of interest: 
       Among all file slices, we choose the latest file slice just before 
the earliest commit to retain. So, in the above example, it should have chosen 
the file slice from 23rd Nov (again, I assume that after 23rd Nov, up until 
April 21, no other file slices were created). 
       Once obtained, we leave that file slice alone (the latest one just 
before the earliest commit to retain) and remove all earlier file slices. 
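   
   To make the two steps above concrete, here is a rough Python sketch of the 
logic as I described it. It is not the actual Hudi cleaner code: the function 
names, the use of plain datetimes for commit times, the 2-hour retention value, 
and the exact timestamps on April 21 are all assumptions made up for 
illustration. 

```python
from datetime import datetime, timedelta
from typing import List, Optional


def earliest_commit_to_retain(commits: List[datetime], now: datetime,
                              hours: int) -> Optional[datetime]:
    """Latest commit at or before (now - hours); None if there is none."""
    cutoff = now - timedelta(hours=hours)
    older = [c for c in commits if c <= cutoff]
    return max(older) if older else None


def slices_to_clean(slice_times: List[datetime],
                    retain_from: Optional[datetime]) -> List[datetime]:
    """File slices of one file group that the cleaner may delete."""
    if retain_from is None:
        return []
    before = sorted(t for t in slice_times if t < retain_from)
    # Keep the latest slice before retain_from; only earlier ones are removable.
    return before[:-1]


# Hypothetical timeline modelled on the report; exact times are made up.
nov_23 = datetime(2022, 11, 23)
apr_21_a = datetime(2023, 4, 21, 10, 0)
apr_21_b = datetime(2023, 4, 21, 11, 0)
now = datetime(2023, 4, 21, 12, 30)

retain_from = earliest_commit_to_retain([nov_23, apr_21_a, apr_21_b], now, hours=2)
removable = slices_to_clean([nov_23, apr_21_a, apr_21_b], retain_from)
```

   Under these assumptions `removable` comes out empty: the 23rd Nov slice is 
the latest one before the earliest commit to retain, so it is kept and there is 
nothing older to remove. 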
   
   So, I don't see any issue in the cleaning logic itself. 
   
   If you can confirm the details asked for above, that would be helpful.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
