Hello all,

Just bumping this thread again.

thanks
vinoth

On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:

> Hello all,
>
> One feature that keeps coming up is the ability to use UPDATE/MERGE SQL
> syntax to write into Hudi tables. We have looked into the Spark 3
> DataSource V2 APIs as well, and found several issues that hinder us from
> implementing this via the Spark APIs:
>
> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> external datasources like Hudi; only DELETE is.
> - The DataSource V2 API offers no flexibility to perform further
> transformations on the dataframe. Hudi supports keys, indexes,
> preCombining and custom partitioning that ensures proper file sizes, etc.
> All of this needs shuffling data, looking up/joining against other
> dataframes, and so forth. Today, the DataSource V1 API allows this kind of
> further repartitioning/transformation, but the V2 API simply offers
> partition-level iteration once the user calls df.write.format("hudi").
> (See the sketch right after this list.)
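>
> To make that concrete, here is roughly what the V1 write path looks like
> today (a minimal sketch in the Spark shell; the config keys are standard
> Hudi write options, and the field names, table name and path are just
> examples):
>
>   // df is an existing DataFrame of incoming records.
>   // Because Hudi sees the whole DataFrame here, it can extract record keys,
>   // precombine duplicates, and shuffle/size files before writing to storage.
>   df.write.format("hudi").
>     option("hoodie.datasource.write.recordkey.field", "uuid").
>     option("hoodie.datasource.write.precombine.field", "ts").
>     option("hoodie.datasource.write.partitionpath.field", "region").
>     option("hoodie.table.name", "trips").
>     mode("append").
>     save("/tmp/hudi_trips")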
>
> One thought I had is to explore Apache Calcite and write an adapter for
> Hudi. This frees us from being heavily dependent on a particular engine's
> syntax support, like Spark's. Calcite is very popular in its own right and
> supports most of the key SQL keywords (and also a more streaming-friendly
> syntax). To be clear, we would still use Spark/Flink underneath to perform
> the actual writing; only the SQL grammar would be provided by Calcite.
>
> To give a taste of how this would look:
>
> A) If the user wants to mutate a Hudi table using SQL
>
> Instead of writing something like: spark.sql("UPDATE ....")
> users will write: hudiSparkSession.sql("UPDATE ....")
>
> B) To save a Spark DataFrame to a Hudi table,
> we continue to use the Spark DataSource V1 API.
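>
> As a sketch of (A), the statement itself could stay standard SQL; only the
> entry point changes. Note that hudiSparkSession is a hypothetical wrapper
> here, not an existing API:
>
>   // Hypothetical: a Hudi-provided session that parses the statement with
>   // Calcite and then plans the actual write on Spark underneath.
>   val hudiSparkSession = HudiSession.wrap(spark)  // hypothetical factory
>   hudiSparkSession.sql(
>     "UPDATE hudi_trips SET fare = fare * 1.1 WHERE region = 'sf'")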
>
> The obvious challenge I see is the disconnect with the Spark DataFrame
> ecosystem. Users would want to write MERGE SQL statements that join against
> other Spark DataFrames (see the example below).
> If we want those joins expressed in Calcite as well, we would need to
> invest in full query-side support, which could increase the scope by a lot.
> Some amount of investigation needs to happen, but ideally we should be
> able to integrate with the Spark SQL catalog and reuse all the tables there.
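>
> For instance, the kind of statement users would expect to work (again using
> the hypothetical hudiSparkSession; "updates" is a temp view registered from
> an ordinary Spark DataFrame, which is exactly where the catalog integration
> matters):
>
>   // Register the incoming DataFrame so the MERGE can join against it.
>   updatesDf.createOrReplaceTempView("updates")
>   hudiSparkSession.sql("""
>     MERGE INTO hudi_trips t
>     USING updates u
>     ON t.uuid = u.uuid
>     WHEN MATCHED THEN UPDATE SET t.fare = u.fare
>     WHEN NOT MATCHED THEN INSERT *
>   """)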
>
> I am sure there are some gaps in my thinking. Just starting this thread,
> so we can discuss and others can chime in/correct me.
>
> thanks
> vinoth
>
