[GitHub] [hudi] bvaradar commented on issue #1860: [SUPPORT] Issue when querying from Spark Datasource if COW table is being written to at the same time

GitBox Fri, 24 Jul 2020 01:39:27 -0700


bvaradar commented on issue #1860:
URL: https://github.com/apache/hudi/issues/1860#issuecomment-663413173



   I would expect the data to be the same across query engines unless there is 
some caching or GS is not giving a consistent listing view.
   
   With Hudi's Spark datasource integration, Hudi reuses Spark's Parquet data 
source implementation and merely applies a file-level path filter to pick 
which files to read. You can run something like 
select(distinct("_hoodie_file_name")) in both cases to see whether any file is 
being missed. You can also run select(max("_hoodie_commit_time")) to find the 
latest commit time and check that it is the same in both cases, which is a 
check on atomicity. Otherwise, I suggest running similar experiments with 
plain Parquet or other datasets. 
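
   A minimal PySpark sketch of the two checks above, in case it helps. The 
helper names (snapshot_summary, compare_summaries) are hypothetical and not 
part of Hudi. It assumes the "hudi" format alias is available in your Hudi 
version and that the path you pass is already in the form your version 
expects; older releases may need "org.apache.hudi" and a partition glob.

```python
def snapshot_summary(spark, table_path):
    """Collect the distinct file names and the highest commit time visible to one read.

    `spark` is an existing SparkSession; `table_path` is whatever path your
    Hudi version expects (assumption: may need a partition glob on older releases).
    """
    # Imported lazily so the comparison helper below works without Spark installed.
    from pyspark.sql import functions as F

    df = spark.read.format("hudi").load(table_path)
    files = {row[0] for row in df.select("_hoodie_file_name").distinct().collect()}
    max_commit = df.agg(F.max("_hoodie_commit_time")).first()[0]
    return files, max_commit


def compare_summaries(a, b):
    """Diff two (files, max_commit) summaries taken from two reads or two engines."""
    files_a, commit_a = a
    files_b, commit_b = b
    return {
        "only_in_a": sorted(files_a - files_b),   # files the second read missed
        "only_in_b": sorted(files_b - files_a),   # files the first read missed
        "same_max_commit": commit_a == commit_b,  # False hints at a non-atomic view
    }
```

   Take one summary while the writer is running and one from the other query 
engine (or after the write finishes), then pass both to compare_summaries. A 
non-empty only_in_a or only_in_b list points at a file being missed.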
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
