Hello all, just bumping this thread again.
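
To make the Calcite idea a bit more concrete, here is a rough sketch of what the grammar side could look like. This is only an illustration, not committed code: it parses an UPDATE with Calcite's stock parser and inspects the statement kind; the actual translation into a Hudi upsert executed by Spark is not shown, and the object/statement names are made up for the example.

import org.apache.calcite.sql.SqlKind
import org.apache.calcite.sql.SqlNode
import org.apache.calcite.sql.parser.SqlParser

object CalciteGrammarSketch {

  // Calcite's stock parser already understands UPDATE / MERGE / DELETE,
  // so the grammar comes "for free"; the work on our side is translating
  // the parsed tree into Hudi write operations run by Spark/Flink.
  def parse(sql: String): SqlNode =
    SqlParser.create(sql).parseStmt()

  def main(args: Array[String]): Unit = {
    val stmt = parse(
      "UPDATE hudi_trips SET fare = fare * 1.1 WHERE driver = 'driver-213'")

    stmt.getKind match {
      case SqlKind.UPDATE => println("translate into a Hudi upsert via the Spark writer")
      case SqlKind.MERGE  => println("translate into a Hudi merge/upsert via the Spark writer")
      case other          => println(s"not handled yet: $other")
    }
  }
}

The hudiSparkSession.sql(...) entry point in (A) below would essentially wrap this: parse with Calcite, then hand the result to the existing Hudi Spark writer for the actual mutation.
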
thanks
vinoth

On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:
> Hello all,
>
> One feature that keeps coming up is the ability to use UPDATE, MERGE SQL
> syntax to support writing into Hudi tables. We have looked into the Spark 3
> DataSource V2 APIs as well and found several issues that hinder us from
> implementing this via the Spark APIs:
>
> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> external datasources like Hudi; only DELETE is.
> - The DataSource V2 API offers no flexibility to perform any kind of
> further transformation on the dataframe. Hudi supports keys, indexes,
> preCombining and custom partitioning that ensures file sizes, etc. All of
> this needs shuffling data, looking up/joining against other dataframes,
> and so forth. Today, the DataSource V1 API allows this kind of further
> repartitioning/transformation, but the V2 API simply offers partition-level
> iteration once the user calls df.write.format("hudi").
>
> One thought I had is to explore Apache Calcite and write an adapter for
> Hudi. This frees us from being very dependent on a particular engine's
> syntax support, like Spark's. Calcite is very popular by itself and supports
> most of the keywords (and also more streaming-friendly syntax). To be
> clear, we will still be using Spark/Flink underneath to perform the actual
> writing; just the SQL grammar is provided by Calcite.
>
> To give a taste of how this would look:
>
> A) If the user wants to mutate a Hudi table using SQL
>
> Instead of writing something like: spark.sql("UPDATE ....")
> users will write: hudiSparkSession.sql("UPDATE ....")
>
> B) To save a Spark data frame to a Hudi table,
> we continue to use Spark DataSource V1.
>
> The obvious challenge I see is the disconnect with the Spark DataFrame
> ecosystem. Users would write MERGE SQL statements by joining against other
> Spark DataFrames.
> If we want those expressed in Calcite as well, we need to also invest in
> full query-side support, which can increase the scope by a lot.
> Some amount of investigation needs to happen, but ideally we should be
> able to integrate with the Spark SQL catalog and reuse all the tables there.
>
> I am sure there are some gaps in my thinking. Just starting this thread
> so we can discuss and others can chime in/correct me.
>
> thanks
> vinoth
>
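
For reference, the DataSource V1 write path mentioned in (B) above is roughly what we already have today; the table and field names below are just placeholders from the quickstart-style configs:

import org.apache.spark.sql.SaveMode

// `df` is any Spark DataFrame; `basePath` is the table's base path on storage.
df.write.format("hudi").
  option("hoodie.table.name", "hudi_trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  mode(SaveMode.Append).
  save(basePath)

The Calcite proposal would not change this path; it only adds a SQL front end for mutations on top of it.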
