Difference/compatibility between original Parquet files and Hudi modified Parquet files

Xiong Qiang Fri, 17 Sep 2021 21:09:38 -0700

Hi, all,

I am new to Hudi, so please forgive my naive questions.


I was following the guides at
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
and at https://hudi.incubator.apache.org/docs/quick-start-guide/.

My goal is to load the original Parquet files (written by a Spark application
from Kafka to S3) into Hudi, delete some rows, and then save the result back
to a different path in S3 (the modified Parquet files). There are other
downstream applications that consume the original Parquet files for
further processing.

My question: *Is there any format difference between the original Parquet
files and the Hudi-modified Parquet files?* Are the Hudi-modified Parquet
files compatible with the original Parquet files? In other words, will
the downstream applications (which previously consumed the original Parquet
files) be able to consume the modified Parquet files (i.e. the Hudi
dataset) without any code change?

In the docs, I have seen the phrase "Hudi dataset", which, in my
understanding, is simply a set of Parquet files with accompanying Hudi
metadata. I have also read the migration doc (
https://hudi.incubator.apache.org/docs/migration_guide/). My understanding
is that we can migrate from the original Parquet files to a Hudi dataset
(Hudi-modified Parquet files). *Can we convert (or migrate) a Hudi dataset
(Hudi-modified Parquet files) back to plain Parquet files to be consumed by
other downstream applications?*
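
To make the question concrete, here is a minimal sketch of the conversion
back that I have in mind. It assumes that the only schema difference is the
set of `_hoodie_*` metadata columns that Hudi adds to each record. The helper
name `strip_hudi_columns` and the data columns `id`, `ts` and `payload` are
made up for illustration and are not part of any Hudi API.

```python
# Metadata columns that Hudi adds to every record it writes.
HUDI_META_COLUMNS = (
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
)


def strip_hudi_columns(columns):
    """Return the column names without Hudi's metadata columns, order preserved."""
    return [c for c in columns if c not in HUDI_META_COLUMNS]


# Hypothetical schema of a Hudi-written file (data column names are invented).
hudi_schema = list(HUDI_META_COLUMNS) + ["id", "ts", "payload"]
original_schema = strip_hudi_columns(hudi_schema)
```

With a Spark DataFrame `df` read from the Hudi dataset, I would expect
something like `df.select(*strip_hudi_columns(df.columns))` followed by a
plain Parquet write to produce files with the original schema, but I have
not verified that this is sufficient.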

Thank you very much!
