bvaradar commented on issue #1860: URL: https://github.com/apache/hudi/issues/1860#issuecomment-663413173
I would expect the data to be the same across query engines unless there is some caching involved or GS is not giving a consistent listing view. With Hudi's Spark datasource integration, Hudi reuses Spark's Parquet data source implementation and merely applies a file-level path filter to pick which files to read. You can run something like select(distinct("_hoodie_file_name")) in both cases to see whether any file is getting missed. You can also run select(max("_hoodie_commit_time")) to find the highest commit time in each case and check that the two are consistent, as a check on atomicity. Otherwise, I suggest running similar experiments with Parquet or other datasets.
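
A rough sketch of the two checks above, in case it helps. It assumes you already have a Spark DataFrame loaded from the Hudi table in each setup; `Snapshot`, `snapshot_of` and `diff_snapshots` are hypothetical helper names made up for illustration, not part of Hudi. Only the `_hoodie_file_name` and `_hoodie_commit_time` columns come from the comment.

```python
from typing import FrozenSet, NamedTuple, Optional


class Snapshot(NamedTuple):
    """What one query setup sees (hypothetical helper type)."""
    files: FrozenSet[str]            # distinct _hoodie_file_name values
    max_commit_time: Optional[str]   # highest _hoodie_commit_time


def snapshot_of(df) -> Snapshot:
    """Summarise a Spark DataFrame that was loaded from a Hudi table.

    Assumes df exposes the Hudi metadata columns _hoodie_file_name
    and _hoodie_commit_time.
    """
    rows = df.select("_hoodie_file_name").distinct().collect()
    files = frozenset(row[0] for row in rows)
    # agg with a {column: function} dict returns a single-row DataFrame
    max_commit = df.agg({"_hoodie_commit_time": "max"}).collect()[0][0]
    return Snapshot(files, max_commit)


def diff_snapshots(a: Snapshot, b: Snapshot) -> dict:
    """Report files seen by only one side and whether max commits agree."""
    return {
        "only_in_a": sorted(a.files - b.files),
        "only_in_b": sorted(b.files - a.files),
        "same_max_commit": a.max_commit_time == b.max_commit_time,
    }
```

You would call `snapshot_of` once per setup and pass both results to `diff_snapshots`. A non-empty `only_in_a` or `only_in_b` points at a file one side is missing, and a differing max commit time suggests the two sides are not reading the same committed state.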