Yeah, I agree with Nishith that one option is to look at ways to plug in custom logical and physical plans in Spark. It can simplify the implementation and reuse the Spark SQL syntax, and users already familiar with Spark SQL will be able to pick up Hudi's SQL features more quickly.
In fact, Spark provides the SparkSessionExtensions interface for implementing custom syntax extensions and SQL rewrite rules:
https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
We can use SparkSessionExtensions to extend Hudi's SQL syntax with statements such as MERGE INTO and DDL.


On Dec 15, 2020, at 3:27 PM, Nishith <[email protected]> wrote:


Thanks for starting this thread, Vinoth.
In general, I definitely see the need for SQL-style semantics on Hudi tables. Apache Calcite is a great option to consider given that DataSource V2 has the limitations you described.

Additionally, even if Spark DataSource V2 allowed for the flexibility, the same SQL semantics would need to be supported in other engines like Flink to provide the same experience to users - which in itself could also be a considerable amount of work.
So, if we're able to generalize the SQL story around Calcite, that would help reduce redundant work in some sense.
That said, I'm worried about a few things:

1) Like you pointed out, writing complex user jobs using Spark SQL syntax can be harder for users who are moving from "Hudi syntax" to "Spark syntax" for cross-table joins, merges, etc. using DataFrames. One option is to look at whether there are ways to plug in custom logical and physical plans in Spark. That way, although the MERGE-on-Spark-SQL functionality may not be as simple to use, it wouldn't take away performance or feature set for starters; in the future we could think of having the entire query space powered by Calcite like you mentioned.
2) If we continue to use DataSource V1, is there any downside from a performance and optimization perspective when executing the plan? I'm guessing not, but I haven't delved into the code to see if there's anything different apart from the API and spec.


On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <[email protected]> wrote:


Hello all,


Just bumping this thread again


thanks
vinoth


On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:


Hello all,


One feature that keeps coming up is the ability to use UPDATE/MERGE SQL syntax to write into Hudi tables. We have looked into the Spark 3 DataSource V2 APIs as well and found several issues that hinder us from implementing this via the Spark APIs:


- As of this writing, the UPDATE/MERGE syntax is not really opened up to external datasources like Hudi; only DELETE is.
- The DataSource V2 API offers no flexibility to perform any kind of further transformations on the DataFrame. Hudi supports keys, indexes, preCombining and custom partitioning that ensures file sizes, etc. All of this needs shuffling data, looking up/joining against other DataFrames, and so forth. Today, the DataSource V1 API allows this kind of further repartitioning/transformation, but the V2 API simply offers partition-level iteration once the user calls df.write.format("hudi").


One thought I had is to explore Apache Calcite and write an adapter for
Hudi. This frees us from being very dependent on a particular engine's
syntax support, like Spark's. Calcite is very popular by itself and supports
most of the keywords (and also more streaming-friendly syntax). To be
clear, we will still be using Spark/Flink underneath to perform the actual
writing; only the SQL grammar is provided by Calcite.


To give a taste of how this would look:


A) If the user wants to mutate a Hudi table using SQL


Instead of writing something like: spark.sql("UPDATE ....")
users will write: hudiSparkSession.sql("UPDATE ....")
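Under the hood, the hudiSparkSession would hand the statement to Calcite for parsing. A very rough sketch, where only SqlParser is existing Calcite API and the translation step is ours to build (table/column names are just example values):

import org.apache.calcite.sql.parser.SqlParser

// Calcite's default grammar already understands UPDATE / DELETE / MERGE
val parser = SqlParser.create(
  "UPDATE hudi_trips SET fare = fare * 1.1 WHERE rider = 'rider-A'")
val sqlNode = parser.parseStmt()   // throws SqlParseException on invalid syntax
// From here, walk the SqlNode tree, build the corresponding Hudi upsert/delete
// DataFrame, and execute it through the existing Spark/Flink write path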


B) To save a Spark data frame to a Hudi table, we continue to use Spark DataSource V1.
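That is, something along the lines of today's write path (the option keys below are the usual Hudi write configs; table name, fields and path are just example values):

// df is an existing Spark DataFrame holding the records to upsert
df.write.format("hudi").
  option("hoodie.table.name", "hudi_trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  mode("append").
  save("/tmp/hudi_trips")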


The obvious challenge I see is the disconnect with the Spark DataFrame
ecosystem. Users would write MERGE SQL statements by joining against other
Spark DataFrames.
If we want those expressed in Calcite as well, we need to also invest in
the full query-side support, which can increase the scope by a lot.
Some amount of investigation needs to happen, but ideally we should be
able to integrate with the Spark SQL catalog and reuse all the tables there.


I am sure there are some gaps in my thinking. Just starting this thread,
so we can discuss and others can chime in/correct me.


thanks
vinoth
