Re: Difference/compatibility between original Parquet files and Hudi modified Parquet files

2021-10-14 Thread Vinoth Chandar
Sorry, dropped the ball on this. If you do the following, your queries will be correct and will not see any duplicates/partial data.
- For Spark, you now need to do spark.read.format("hudi").load()
- For Presto/Trino, when you sync the table metadata out to Hive metastores, Presto/Trino understand
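The reason a plain parquet read can show duplicates is that a Hudi table retains older file versions until they are cleaned, while a Hudi-aware reader resolves each record key to its latest commit. A minimal pure-Python sketch of that resolution (the rows and values are made up for illustration; this is not the actual Hudi reader, which you get via spark.read.format("hudi").load()):

```python
# Simulated rows as a plain parquet reader might see them across
# multiple retained file versions of the same Hudi file group.
# The _hoodie_* fields are Hudi's standard meta columns.
rows = [
    {"_hoodie_record_key": "id-1", "_hoodie_commit_time": "20210914100000", "price": 10.0},
    {"_hoodie_record_key": "id-1", "_hoodie_commit_time": "20210917120000", "price": 12.5},  # newer version
    {"_hoodie_record_key": "id-2", "_hoodie_commit_time": "20210914100000", "price": 7.0},
]

def latest_per_key(rows):
    """Keep only the newest commit per record key, which is in effect
    what a Hudi-aware snapshot query does."""
    best = {}
    for r in rows:
        k = r["_hoodie_record_key"]
        if k not in best or r["_hoodie_commit_time"] > best[k]["_hoodie_commit_time"]:
            best[k] = r
    return sorted(best.values(), key=lambda r: r["_hoodie_record_key"])

snapshot = latest_per_key(rows)
print([(r["_hoodie_record_key"], r["price"]) for r in snapshot])
# A plain spark.read.format("parquet") over all files would instead
# return all three rows, i.e. "id-1" twice.
```

Hudi's commit times are sortable timestamp strings, so a plain string comparison picks the latest version here.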

Re: Difference/compatibility between original Parquet files and Hudi modified Parquet files

2021-09-27 Thread Xiong Qiang
Hi Vinoth, Thank you very much for the detailed explanation. That is very helpful. For the downstream applications, we have Spark applications and Presto/Trino. For Spark, we use spark.read.format('parquet').load() to read the Parquet files for other processing. For Presto/Trino, we use AWS

Re: Difference/compatibility between original Parquet files and Hudi modified Parquet files

2021-09-22 Thread Vinoth Chandar
Hi, There is no format difference whatsoever. Hudi just adds additional footer metadata to the Parquet files (min/max key values and bloom filters) and some meta fields for tracking commit times for incremental queries and keys. Any standard Parquet reader can read the Parquet files in a Hudi table. These
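For reference, the meta fields mentioned above are ordinary columns prefixed with `_hoodie_`, so any standard Parquet reader surfaces them alongside the user's own columns. A sketch of what one record looks like to such a reader (the row values are made up for illustration):

```python
# The five standard Hudi meta fields, present as regular columns in
# every Hudi-written parquet file, alongside the user's own columns.
HUDI_META_FIELDS = [
    "_hoodie_commit_time",     # commit that wrote this record (enables incremental queries)
    "_hoodie_commit_seqno",    # sequence number of the record within that commit
    "_hoodie_record_key",      # the record's key
    "_hoodie_partition_path",  # partition the record lives in
    "_hoodie_file_name",       # data file that contains the record
]

# A made-up row as any standard parquet reader would surface it:
record = {
    "_hoodie_commit_time": "20210922093000",
    "_hoodie_commit_seqno": "20210922093000_0_1",
    "_hoodie_record_key": "id-1",
    "_hoodie_partition_path": "2021/09/22",
    "_hoodie_file_name": "abc123-0-1_20210922093000.parquet",
    "price": 12.5,  # user column, unchanged by Hudi
}

# Meta columns are just extra fields; user columns are untouched.
user_columns = [c for c in record if not c.startswith("_hoodie_")]
print(user_columns)
```

The footer additions (key ranges and bloom filters) live in the Parquet file metadata, which readers that don't know about them simply ignore.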

Difference/compatibility between original Parquet files and Hudi modified Parquet files

2021-09-17 Thread Xiong Qiang
Hi all, I am new to Hudi, so please forgive the naive questions. I was following the guides at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html and at https://hudi.incubator.apache.org/docs/quick-start-guide/. My goal is to load original Parquet files