Hi,

There is no format difference whatsoever. Hudi just adds additional footers to the parquet files for min/max key values and bloom filters, plus some meta fields for tracking record keys and commit times for incremental queries. Any standard parquet reader can read the parquet files in a Hudi table.

Regarding these downstream applications: are they Spark jobs? What do you use to consume the parquet files?
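To make that concrete, here is a minimal sketch (plain Python, no Hudi or Spark dependency) of the extra meta columns a downstream reader would see in a copy-on-write table, and how it could simply drop them if it wants the original schema back. The meta field names are the standard Hudi ones; the sample record values are made up for illustration.

```python
# Standard Hudi meta columns added to every record in a Hudi table.
HUDI_META_FIELDS = {
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
}

def strip_hudi_meta(record: dict) -> dict:
    """Return the record without Hudi's bookkeeping columns,
    i.e. with the same schema the original parquet files had."""
    return {k: v for k, v in record.items() if k not in HUDI_META_FIELDS}

# Illustrative row as a plain parquet reader might surface it
# (values here are invented for the example):
row = {
    "_hoodie_commit_time": "20210917210905",
    "_hoodie_commit_seqno": "20210917210905_0_1",
    "_hoodie_record_key": "id:42",
    "_hoodie_partition_path": "2021/09/17",
    "_hoodie_file_name": "abc123_0-1-0_20210917210905.parquet",
    "id": 42,
    "payload": "hello",
}
print(strip_hudi_meta(row))  # only the original columns remain
```

In practice a reader can also just ignore the extra columns; they only matter if the downstream schema is strict about unexpected fields.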
The main thing your downstream reader needs to do is read a correct snapshot, i.e. only the latest committed files. Otherwise, you may end up with duplicate values. For example, when you issue the Hudi delete, Hudi will internally create a new version of the parquet files without the deleted rows. So if you are not careful about filtering for the latest file, you may end up reading both files and getting duplicates. All of this happens automatically if you are using a supported engine like Spark, Flink, Hive, Presto, Trino, ...

And yes, a Hudi (copy-on-write) dataset is a set of parquet files, with some metadata.

Hope that helps.

Thanks
Vinoth

On Fri, Sep 17, 2021 at 9:09 PM Xiong Qiang <[email protected]> wrote:

> Hi, all,
>
> I am new to Hudi, so please forgive me for naive questions.
>
> I was following the guides at
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> and at https://hudi.incubator.apache.org/docs/quick-start-guide/.
>
> My goal is to load the original Parquet files (written by a Spark
> application from Kafka to S3) into Hudi, delete some rows, and then save
> the modified Parquet files back to (a different path in) S3. There are
> other downstream applications that consume the original Parquet files for
> further processing.
>
> My question: *Is there any format difference between the original Parquet
> files and the Hudi-modified Parquet files?* Are the Hudi-modified Parquet
> files compatible with the original Parquet files? In other words, will
> other downstream applications (previously consuming the original Parquet
> files) be able to consume the modified Parquet files (i.e. the Hudi
> dataset) without any code change?
>
> In the docs, I have seen the phrase "Hudi dataset", which, in my
> understanding, is simply a Parquet file with accompanying Hudi metadata. I
> have also read the migration doc
> (https://hudi.incubator.apache.org/docs/migration_guide/).
> My understanding is that we can migrate from an original Parquet file to
> a Hudi dataset (Hudi-modified Parquet file). *Can we convert (or migrate)
> a Hudi dataset (Hudi-modified Parquet file) back to an original Parquet
> file to be consumed by other downstream applications?*
>
> Thank you very much!
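To illustrate the "read only the latest committed files" point from the reply above, here is a hedged, dependency-free sketch of snapshot filtering over Hudi-style copy-on-write base file names of the form `<fileId>_<writeToken>_<instantTime>.parquet`. The sample file names are invented; Hudi-aware engines do this bookkeeping for you, so this is only to show why reading every file naively produces duplicates.

```python
# Hudi copy-on-write keeps multiple versions of a base file per file
# group; a correct snapshot read keeps only the version with the
# latest commit (instant) time for each file id.
def latest_file_slices(file_names):
    """Pick the newest parquet file per file group, mimicking what
    Hudi-aware engines (Spark, Presto, Trino, ...) do automatically."""
    latest = {}
    for name in file_names:
        stem = name.rsplit(".", 1)[0]
        file_id, _write_token, instant_time = stem.split("_")
        # Instant times are fixed-width timestamps, so lexicographic
        # comparison matches chronological order.
        if file_id not in latest or instant_time > latest[file_id][0]:
            latest[file_id] = (instant_time, name)
    return sorted(name for _, name in latest.values())

files = [
    "fg1_1-0-1_20210917200000.parquet",  # version before the delete
    "fg1_1-0-1_20210917210000.parquet",  # rewritten after the delete
    "fg2_1-0-1_20210917200000.parquet",
]
# Reading all three files would surface the surviving fg1 rows twice;
# the snapshot view keeps only the latest slice per file group.
print(latest_file_slices(files))
```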