I think Dongwook is investigating along the same lines, and it does seem better to pursue this first, before trying other approaches.
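[For readers skimming the quoted thread below: a minimal sketch of what the SparkSessionExtensions approach being discussed could look like. HoodieSparkSessionExtension and HoodieSqlParser are assumed names for illustration, not existing Hudi classes.]

```scala
import org.apache.spark.sql.SparkSessionExtensions

// Sketch only: HoodieSqlParser is a hypothetical ParserInterface
// implementation that would recognize Hudi-specific statements (e.g.
// MERGE INTO) and delegate everything else to Spark's own parser.
class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectParser { (session, delegate) =>
      new HoodieSqlParser(session, delegate) // hypothetical class
    }
    // Analyzer rules could be added here via injectResolutionRule /
    // injectPostHocResolutionRule to rewrite a parsed MERGE INTO on a
    // Hudi table into the existing upsert write path.
  }
}
```

Such a class would be enabled with --conf spark.sql.extensions=<fully.qualified.ClassName>, which Spark supports since 2.2.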
On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <[email protected]> wrote:

> Yeah, I agree with Nishith that one option is to look at ways to plug
> custom logical and physical plans into Spark. It can simplify the
> implementation and reuse the Spark SQL syntax, and users already familiar
> with Spark SQL will be able to pick up Hudi's SQL features more quickly.
> In fact, Spark provides the SparkSessionExtensions interface for
> implementing custom syntax extensions and SQL rewrite rules:
> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> We can use SparkSessionExtensions to extend the Hudi SQL syntax, such as
> MERGE INTO and DDL statements.
>
> On Dec 15, 2020, at 3:27 PM, Nishith <[email protected]> wrote:
>
> > Thanks for starting this thread Vinoth.
> > In general, I definitely see the need for SQL-style semantics on Hudi
> > tables. Apache Calcite is a great option to consider, given that
> > DataSource V2 has the limitations you described.
> >
> > Additionally, even if Spark DataSource V2 allowed for the flexibility,
> > the same SQL semantics would need to be supported in other engines like
> > Flink to provide the same experience to users - which in itself could
> > also be a considerable amount of work.
> > So, if we're able to generalize the SQL story around Calcite, that
> > would help reduce redundant work in some sense.
> > That said, I'm worried about a few things:
> >
> > 1) Like you pointed out, writing complex user jobs using Spark SQL
> > syntax can be harder for users who are moving from "Hudi syntax" to
> > "Spark syntax" for cross-table joins, merges, etc. using DataFrames.
> > One option is to look at whether there are ways to plug custom logical
> > and physical plans into Spark. That way, although the MERGE-on-SparkSQL
> > functionality may not be as simple to use, it wouldn't take away
> > performance or feature set for starters; in the future we could think
> > of having the entire query space be powered by Calcite, like you
> > mentioned.
> > 2) If we continue to use DataSource V1, is there any downside from a
> > performance and optimization perspective when executing the plan? I'm
> > guessing not, but I haven't delved into the code to see if there's
> > anything different apart from the API and spec.
> >
> > On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <[email protected]> wrote:
> >
> > > Hello all,
> > >
> > > Just bumping this thread again.
> > >
> > > thanks
> > > vinoth
> > >
> > > On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:
> > >
> > > > Hello all,
> > > >
> > > > One feature that keeps coming up is the ability to use UPDATE/MERGE
> > > > SQL syntax to write into Hudi tables. We have looked into the Spark
> > > > 3 DataSource V2 APIs as well, and found several issues that hinder
> > > > us from implementing this via the Spark APIs:
> > > >
> > > > - As of this writing, the UPDATE/MERGE syntax is not really opened
> > > > up to external datasources like Hudi; only DELETE is.
> > > > - The DataSource V2 API offers no flexibility to perform further
> > > > transformations on the DataFrame. Hudi supports keys, indexes,
> > > > preCombining, and custom partitioning that ensures file sizes, etc.
> > > > All of this needs shuffling data and looking up/joining against
> > > > other DataFrames. Today, the DataSource V1 API allows this kind of
> > > > repartitioning/transformation (see the sketch below), but the V2
> > > > API simply offers partition-level iteration once the user calls
> > > > df.write.format("hudi").
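[For reference, the V1 write path the bullets above describe looks roughly like this today. The table name, field names, and paths are made up for illustration; the hoodie.* option keys are the standard Hudi write configs.]

```scala
// Assumes an active SparkSession `spark`; paths and fields are illustrative.
val df = spark.read.format("parquet").load("/tmp/incoming_trips")

// Behind save(), Hudi is free to shuffle, look up its index, and size
// files; this is the flexibility V2's partition-level iteration removes.
df.write
  .format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .mode("append")
  .save("/tmp/hudi/trips")
```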
> > > > One thought I had is to explore Apache Calcite and write an
> > > > adapter for Hudi. This frees us from being so dependent on a
> > > > particular engine's syntax support, like Spark's. Calcite is very
> > > > popular in its own right and supports most of the key words (and
> > > > also more streaming-friendly syntax). To be clear, we would still
> > > > use Spark/Flink underneath to perform the actual writing; just the
> > > > SQL grammar would be provided by Calcite.
> > > >
> > > > To give a taste of how this would look:
> > > >
> > > > A) If the user wants to mutate a Hudi table using SQL, then instead
> > > > of writing something like spark.sql("UPDATE ...."), users would
> > > > write hudiSparkSession.sql("UPDATE ....").
> > > >
> > > > B) To save a Spark DataFrame to a Hudi table, we continue to use
> > > > Spark DataSource V1.
> > > >
> > > > The obvious challenge I see is the disconnect with the Spark
> > > > DataFrame ecosystem. Users would write MERGE SQL statements by
> > > > joining against other Spark DataFrames. If we want those expressed
> > > > in Calcite as well, we would need to also invest in full query-side
> > > > support, which could increase the scope by a lot. Some amount of
> > > > investigation needs to happen, but ideally we should be able to
> > > > integrate with the Spark SQL catalog and reuse all the tables there.
> > > >
> > > > I am sure there are some gaps in my thinking. Just starting this
> > > > thread so we can discuss, and others can chime in/correct me.
> > > >
> > > > thanks
> > > > vinoth
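[To make the "grammar from Calcite, execution on Spark/Flink" split above concrete, a rough sketch. Calcite's stock parser already accepts MERGE and UPDATE; everything inside the match is hypothetical translation logic a Hudi adapter would have to implement.]

```scala
import org.apache.calcite.sql.parser.SqlParser
import org.apache.calcite.sql.{SqlKind, SqlMerge, SqlNode}

object HudiSqlEntryPoint { // hypothetical name
  def parseAndDispatch(sql: String): Unit = {
    // Calcite provides the grammar; no engine is involved at this point.
    val node: SqlNode = SqlParser.create(sql).parseStmt()
    node.getKind match {
      case SqlKind.MERGE =>
        val merge = node.asInstanceOf[SqlMerge]
        // Hypothetical: map merge.getTargetTable / merge.getCondition
        // onto a Hudi upsert executed via the DataSource V1 write path.
        runHudiUpsert(merge)
      case SqlKind.UPDATE =>
        // Similar translation; UPDATE would also become a Hudi upsert.
        ???
      case _ =>
        // Plain queries could be delegated to the engine's own SQL support.
        ???
    }
  }

  private def runHudiUpsert(merge: SqlMerge): Unit = ??? // hypothetical
}
```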
