bvaradar commented on issue #1860:
URL: https://github.com/apache/hudi/issues/1860#issuecomment-663413173


   I would expect the data to be the same across query engines unless there is 
some caching involved or GS is not returning a consistent listing view.
   
   With Hudi's Spark datasource integration, Hudi reuses Spark's Parquet data 
source implementation and merely applies a file-level path filter to pick 
which files to read. You can run something like 
select(distinct("_hoodie_file_name")) in both cases to see if any file is 
being missed. You can also run select(max("_hoodie_commit_time")) to find the 
latest commit time and check whether it is consistent across engines, which 
verifies atomicity. Otherwise, I suggest running similar experiments with 
plain Parquet or other datasets. 
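
   The two checks above could be scripted roughly as follows. This is only a 
sketch in PySpark: the helper names `hudi_snapshot_summary` and 
`compare_summaries` are made up for illustration, and the table path, the 
format name and how the second engine's result is obtained are assumptions 
that depend on your setup and Hudi version.

```python
def hudi_snapshot_summary(df):
    """Collect the distinct Hudi file names and the latest commit time from a
    Spark DataFrame that exposes the Hudi metadata columns."""
    # Imported lazily so compare_summaries() below can be used without Spark.
    from pyspark.sql import functions as F

    files = {
        row["_hoodie_file_name"]
        for row in df.select("_hoodie_file_name").distinct().collect()
    }
    max_commit = (
        df.agg(F.max("_hoodie_commit_time").alias("max_commit"))
        .first()["max_commit"]
    )
    return files, max_commit


def compare_summaries(a, b):
    """Compare two (file_names, max_commit_time) summaries.

    Reports files seen by only one side and whether both sides agree on the
    latest commit time."""
    files_a, commit_a = a
    files_b, commit_b = b
    return {
        "only_in_a": sorted(files_a - files_b),
        "only_in_b": sorted(files_b - files_a),
        "same_max_commit": commit_a == commit_b,
    }


# Hypothetical usage (path and format name are placeholders, not from the issue):
#   df = spark.read.format("hudi").load("<base path of the table>")
#   spark_side = hudi_snapshot_summary(df)
#   other_side = (<set of file names>, <max commit time>) from the other engine
#   print(compare_summaries(spark_side, other_side))
```

   A non-empty `only_in_a` or `only_in_b` would point to a file being missed on 
one side, and `same_max_commit` being False would suggest the two engines are 
not looking at the same commit.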
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

