Hello all,

One feature that keeps coming up is the ability to use UPDATE, MERGE sql
syntax to support writing into Hudi tables. We have looked into the Spark 3
DataSource V2 APIs as well and found several issues that hinder us in
implementing this via the Spark APIs

- As of this writing, the UPDATE/MERGE syntax is not really opened up to
external datasources like Hudi. only DELETE is.
- DataSource V2 API offers no flexibility to perform any kind of
further transformations to the dataframe. Hudi supports keys, indexes,
preCombining and custom partitioning that ensures file sizes etc. All this
needs shuffling data, looking up/joining against other dataframes so forth.
Today, the DataSource V1 API allows this kind of further
partitions/transformations. But the V2 API is simply offers partition level
iteration once the user calls df.write.format("hudi")

One thought I had is to explore Apache Calcite and write an adapter for
Hudi. This frees us from being very dependent on a particular engine's
syntax support like Spark. Calcite is very popular by itself and supports
most of the key words and (also more streaming friendly syntax). To be
clear, we will still be using Spark/Flink underneath to perform the actual
writing, just that the SQL grammar is provided by Calcite.

To give a taste of how this will look like.

A) If the user wants to mutate a Hudi table using SQL

Instead of writing something like : spark.sql("UPDATE ....")
users will write : hudiSparkSession.sql("UPDATE ....")

B) To save a Spark data frame to a Hudi table
we continue to use Spark DataSource V1

The obvious challenge I see is the disconnect with the Spark DataFrame
ecosystem. Users would write MERGE SQL statements by joining against other
Spark DataFrames.
If we want those expressed in calcite as well, we need to also invest in
the full Query side support, which can increase the scope by a lot.
Some amount of investigation needs to happen, but ideally we should be able
to integrate with the sparkSQL catalog and reuse all the tables there.

I am sure there are some gaps in my thinking. Just starting this thread, so
we can discuss and others can chime in/correct me.

thanks
vinoth

Reply via email to