Hi,

There is no format difference whatsoever. Hudi just adds additional footers to the parquet files for min/max key values and bloom filters, plus some meta fields for tracking record keys and commit times for incremental queries. Any standard parquet reader can read the parquet files in a Hudi table.

Regarding these downstream applications: are they Spark jobs? What do you use to consume the parquet files?
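To make that concrete, here is a minimal sketch (plain Python, no Hudi or Spark dependency) of the extra meta columns a downstream reader would see in a copy-on-write table, and how it could simply drop them if it wants the original schema back. The meta field names are the standard Hudi ones; the sample record values are made up for illustration.

```python
# Standard Hudi meta columns added to every record in a Hudi table.
HUDI_META_FIELDS = {
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
}

def strip_hudi_meta(record: dict) -> dict:
    """Return the record without Hudi's bookkeeping columns,
    i.e. with the same schema the original parquet files had."""
    return {k: v for k, v in record.items() if k not in HUDI_META_FIELDS}

# Illustrative row as a plain parquet reader might surface it
# (values here are invented for the example):
row = {
    "_hoodie_commit_time": "20210917210905",
    "_hoodie_commit_seqno": "20210917210905_0_1",
    "_hoodie_record_key": "id:42",
    "_hoodie_partition_path": "2021/09/17",
    "_hoodie_file_name": "abc123_0-1-0_20210917210905.parquet",
    "id": 42,
    "payload": "hello",
}
print(strip_hudi_meta(row))  # only the original columns remain
```

In practice a reader can also just ignore the extra columns; they only matter if the downstream schema is strict about unexpected fields.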
The main thing your downstream reader needs to do is read a correct snapshot, i.e. only the latest committed files. Otherwise, you may end up with duplicate values. For example, when you issue the Hudi delete, Hudi will internally create a new version of the parquet files without the deleted rows. So if you are not careful about filtering for the latest file, you may end up reading both files and getting duplicates. All of this happens automatically if you are using a supported engine like Spark, Flink, Hive, Presto, Trino, ...

And yes, a Hudi (copy-on-write) dataset is a set of parquet files, with some metadata.

Hope that helps.

Thanks
Vinoth

On Fri, Sep 17, 2021 at 9:09 PM Xiong Qiang <[email protected]> wrote:

> Hi, all,
>
> I am new to Hudi, so please forgive me for naive questions.
>
> I was following the guides at
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> and at https://hudi.incubator.apache.org/docs/quick-start-guide/.
>
> My goal is to load the original Parquet files (written by a Spark
> application from Kafka to S3) into Hudi, delete some rows, and then save
> the modified Parquet files back to (a different path in) S3. There are
> other downstream applications that consume the original Parquet files for
> further processing.
>
> My question: *Is there any format difference between the original Parquet
> files and the Hudi-modified Parquet files?* Are the Hudi-modified Parquet
> files compatible with the original Parquet files? In other words, will
> other downstream applications (previously consuming the original Parquet
> files) be able to consume the modified Parquet files (i.e. the Hudi
> dataset) without any code change?
>
> In the docs, I have seen the phrase "Hudi dataset", which, in my
> understanding, is simply a Parquet file with accompanying Hudi metadata. I
> have also read the migration doc
> (https://hudi.incubator.apache.org/docs/migration_guide/).
> My understanding is that we can migrate from an original Parquet file to
> a Hudi dataset (Hudi-modified Parquet file). *Can we convert (or migrate)
> a Hudi dataset (Hudi-modified Parquet file) back to an original Parquet
> file to be consumed by other downstream applications?*
>
> Thank you very much!
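To illustrate the "read only the latest committed files" point from the reply above, here is a hedged, dependency-free sketch of snapshot filtering over Hudi-style copy-on-write base file names of the form `<fileId>_<writeToken>_<instantTime>.parquet`. The sample file names are invented; Hudi-aware engines do this bookkeeping for you, so this is only to show why reading every file naively produces duplicates.

```python
# Hudi copy-on-write keeps multiple versions of a base file per file
# group; a correct snapshot read keeps only the version with the
# latest commit (instant) time for each file id.
def latest_file_slices(file_names):
    """Pick the newest parquet file per file group, mimicking what
    Hudi-aware engines (Spark, Presto, Trino, ...) do automatically."""
    latest = {}
    for name in file_names:
        stem = name.rsplit(".", 1)[0]
        file_id, _write_token, instant_time = stem.split("_")
        # Instant times are fixed-width timestamps, so lexicographic
        # comparison matches chronological order.
        if file_id not in latest or instant_time > latest[file_id][0]:
            latest[file_id] = (instant_time, name)
    return sorted(name for _, name in latest.values())

files = [
    "fg1_1-0-1_20210917200000.parquet",  # version before the delete
    "fg1_1-0-1_20210917210000.parquet",  # rewritten after the delete
    "fg2_1-0-1_20210917200000.parquet",
]
# Reading all three files would surface the surviving fg1 rows twice;
# the snapshot view keeps only the latest slice per file group.
print(latest_file_slices(files))
```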