Thanks for starting this thread Vinoth.
In general, I definitely see the need for SQL-style semantics on Hudi tables.
Apache Calcite is a great option to consider, given the DataSource V2
limitations that you described.

Additionally, even if Spark DataSource V2 allowed for the flexibility, the same
SQL semantics would need to be supported in other engines like Flink to provide
the same experience to users - which in itself could be a considerable amount
of work.
So, if we're able to generalize the SQL story around Calcite, that would help
reduce redundant work in some sense.
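
To make this concrete, here is a minimal sketch of what a Calcite-fronted entry
point could look like - only org.apache.calcite.sql.parser.SqlParser is real
Calcite API; the HoodieSqlSketch object and the table/column names are made up
for illustration:

    import org.apache.calcite.sql.SqlNode
    import org.apache.calcite.sql.parser.SqlParser

    object HoodieSqlSketch {
      // Calcite's core grammar already understands UPDATE/MERGE/DELETE,
      // independent of what Spark's parser opens up to datasources.
      def parse(sql: String): SqlNode =
        SqlParser.create(sql).parseStmt()

      def main(args: Array[String]): Unit = {
        val node = parse("UPDATE hudi_table SET price = price * 1.1 WHERE id = 1")
        println(node.getKind) // prints UPDATE
        // A Hudi adapter would then translate the SqlNode into an upsert
        // on the underlying Spark/Flink write path.
      }
    }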
That said, I'm worried about a few things:

1) Like you pointed out, writing complex user jobs using Spark SQL syntax can
be harder for users who are moving from "Hudi syntax" to "Spark syntax" for
cross-table joins, merges etc. using DataFrames. One option is to look at
whether there are ways to plug in custom logical and physical plans in Spark
(see the first sketch after this list). That way, although the MERGE-on-Spark-SQL
functionality may not be as simple to implement, it wouldn't take away
performance or feature set for starters; in the future we could think of having
the entire query space be powered by Calcite like you mentioned.
2) If we continue to use DataSource V1, is there any downside to this from a
performance and optimization perspective when executing plans? I'm guessing
not, but I haven't delved into the code to see if there's anything different
apart from the API and spec (the second sketch below shows the V1 write hook
in question).
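
For (1), here is a rough sketch of the plug-in point I have in mind, using
Spark's SparkSessionExtensions mechanism (real Spark API, enabled through the
spark.sql.extensions config); HoodieSparkSessionExtension is a hypothetical
name and the rule body is a no-op placeholder:

    import org.apache.spark.sql.SparkSessionExtensions
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Registered via: spark.sql.extensions=<fully.qualified.ClassName>
    class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        // Post-hoc resolution is one place where an UPDATE/MERGE plan on a
        // Hudi table could be rewritten into a Hudi-specific write command.
        extensions.injectPostHocResolutionRule { session =>
          new Rule[LogicalPlan] {
            // No-op placeholder; real logic would pattern-match the plan.
            override def apply(plan: LogicalPlan): LogicalPlan = plan
          }
        }
      }
    }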
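
And for (2), a minimal sketch of the DataSource V1 write hook - this is the API
that hands over the full DataFrame before anything is written, which is what
gives us room to shuffle, precombine, size files, etc. The class and the
partition_path/record_key column names are illustrative, not Hudi's actual
implementation:

    import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
    import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
    import org.apache.spark.sql.types.StructType

    class HoodieV1WriteSketch extends CreatableRelationProvider {
      override def createRelation(
          sqlContext: SQLContext,
          mode: SaveMode,
          parameters: Map[String, String],
          data: DataFrame): BaseRelation = {
        // V1 exposes the whole DataFrame, so arbitrary reshaping is possible
        // before the write: repartitioning, index lookups, precombining, etc.
        val prepared = data
          .repartition(data("partition_path"))
          .sortWithinPartitions("record_key")
        // ... the actual Hudi write would happen here, using `prepared` ...
        val ctx = sqlContext
        new BaseRelation {
          override def sqlContext: SQLContext = ctx
          override def schema: StructType = prepared.schema
        }
      }
    }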

> On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <[email protected]> wrote:
> 
> Hello all,
> 
> Just bumping this thread again
> 
> thanks
> vinoth
> 
>> On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:
>> 
>> Hello all,
>> 
>> One feature that keeps coming up is the ability to use UPDATE, MERGE sql
>> syntax to support writing into Hudi tables. We have looked into the Spark 3
>> DataSource V2 APIs as well and found several issues that hinder us in
>> implementing this via the Spark APIs
>> 
>> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
>> external datasources like Hudi; only DELETE is.
>> - The DataSource V2 API offers no flexibility to perform any further
>> transformations on the DataFrame. Hudi supports keys, indexes,
>> preCombining and custom partitioning that ensures file sizes etc. All of
>> this needs shuffling data, looking up/joining against other DataFrames and
>> so forth. Today, the DataSource V1 API allows this kind of further
>> partitioning/transformation, but the V2 API simply offers partition-level
>> iteration once the user calls df.write.format("hudi").
>> 
>> One thought I had is to explore Apache Calcite and write an adapter for
>> Hudi. This frees us from being very dependent on a particular engine's
>> syntax support, like Spark's. Calcite is very popular by itself and supports
>> most of the key words (and also more streaming-friendly syntax). To be
>> clear, we will still be using Spark/Flink underneath to perform the actual
>> writing; it is just that the SQL grammar is provided by Calcite.
>> 
>> To give a taste of how this would look:
>> 
>> A) If the user wants to mutate a Hudi table using SQL
>> 
>> Instead of writing something like: spark.sql("UPDATE ....")
>> users will write: hudiSparkSession.sql("UPDATE ....")
>> 
>> B) To save a Spark data frame to a Hudi table
>> we continue to use Spark DataSource V1
>> 
>> The obvious challenge I see is the disconnect with the Spark DataFrame
>> ecosystem. Users would write MERGE SQL statements by joining against other
>> Spark DataFrames.
>> If we want those expressed in Calcite as well, we need to also invest in
>> full query-side support, which can increase the scope by a lot.
>> Some amount of investigation needs to happen, but ideally we should be
>> able to integrate with the Spark SQL catalog and reuse all the tables there.
>> 
>> I am sure there are some gaps in my thinking. Just starting this thread,
>> so we can discuss and others can chime in/correct me.
>> 
>> thanks
>> vinoth
>> 
