Sorry, dropped the ball on this.
If you do the following, your queries will be correct and will not see any
duplicates or partial data.
- For Spark, you now need to do spark.read.format("hudi").load()
- For Presto/Trino, when you sync the table metadata out to Hive
metastores, Presto/Trino understand
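A minimal PySpark sketch of the snapshot read described above (the table path is a placeholder, and it assumes an active SparkSession with the Hudi Spark bundle on the classpath):

```python
def read_hudi_snapshot(spark, table_path: str):
    """Snapshot-read a Hudi table (spark is an active SparkSession).

    Using format("hudi") lets Hudi resolve each file group to its latest
    committed file slice, so the query sees no duplicates or partial data;
    a plain format("parquet") read would scan every file version on disk.
    """
    return spark.read.format("hudi").load(table_path)

# Usage, with a hypothetical table location:
# df = read_hudi_snapshot(spark, "s3://my-bucket/my_hudi_table")
# df.show()
```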
Hi Vinoth,
Thank you very much for the detailed explanation. That is very helpful.
For the downstream applications, we have Spark applications, and
Presto/Trino.
For Spark, we use spark.read.format('parquet').load() to read the Parquet
files for other processing.
For Presto/Trino, we use AWS
Hi,
There is no format difference whatsoever. Hudi just adds additional footer
metadata to the Parquet files (min/max record key values and bloom filters)
and some meta fields for tracking record keys and the commit times used by
incremental queries.
Any standard parquet reader can read the parquet files in a Hudi table.
These
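For reference, the meta fields mentioned above are ordinary columns prepended to each record, so any Parquet reader sees them in the schema. A small sketch (field names are the standard five from Hudi 0.x releases):

```python
# The five meta columns Hudi prepends to every record.
HOODIE_META_FIELDS = [
    "_hoodie_commit_time",     # commit timestamp, used by incremental queries
    "_hoodie_commit_seqno",    # sequence number of the record within a commit
    "_hoodie_record_key",      # the record's key
    "_hoodie_partition_path",  # partition the record belongs to
    "_hoodie_file_name",       # data file holding the record
]

def drop_meta_fields(columns):
    """Return only the original (non-Hudi) columns from a table schema."""
    return [c for c in columns if c not in HOODIE_META_FIELDS]
```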
Hi, all,
I am new to Hudi, so please forgive the naive questions.
I was following the guides at
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
and at https://hudi.incubator.apache.org/docs/quick-start-guide/.
My goal is to load original Parquet files