That’s awesome. Looks like we have a consensus on Calcite. Look forward to the RFC as well!
-Nishith

On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <[email protected]> wrote:

Sounds good. Look forward to an RFC/DISCUSS thread.

Thanks
Vinoth

On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <[email protected]> wrote:

Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I would
add support for SQL connectors for Hoodie on Flink soon ~
Currently, I'm preparing a refactoring of the current Flink writer code.

On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar <[email protected]> wrote:

Thanks Kabeer for the note on gmail. Did not realize that. :)

> My desired use case is that users use the Hoodie CLI to execute these SQLs.
> They can choose which engine to use via a CLI config option.

Yes, that is also another attractive aspect of this route. We can build out
a common SQL layer and have it translate to the underlying engine (sounds
like Hive, huh).
Longer term, if we really think we can more easily implement full DML + DDL
+ DQL this way, we can proceed with it.

As others pointed out, for Spark SQL it might be good to try the Spark
extensions route before we take this on more fully.

The other part where Calcite is great is all the support for
windowing/streaming in its syntax.
Danny, I guess we should be able to leverage that through a deeper
Flink/Hudi integration?

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <[email protected]> wrote:

I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.

On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <[email protected]> wrote:

Yeah, I agree with Nishith that one option is to look at ways to plug in
custom logical and physical plans in Spark. It can simplify the
implementation and reuse the Spark SQL syntax, and users already familiar
with Spark SQL will be able to pick up Hudi's SQL features more quickly.
In fact, Spark provides the SparkSessionExtensions interface for
implementing custom syntax extensions and SQL rewrite rules:
https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
We can use SparkSessionExtensions to extend Hudi SQL syntax such as
MERGE INTO and DDL.
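To make that concrete, here is a rough sketch of how such an extension could
be wired up. It is only illustrative: the class name is made up, and the
injected rule is a pass-through placeholder where Hudi-specific rewrites
(e.g. MERGE INTO -> upsert) would eventually go.

    import org.apache.spark.sql.SparkSessionExtensions
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Illustrative extension point: the injected rule currently returns the
    // plan unchanged; a real implementation would rewrite MERGE/UPDATE plans
    // into Hudi write operations.
    class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        extensions.injectResolutionRule { _ =>
          new Rule[LogicalPlan] {
            override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op placeholder
          }
        }
        // A Hudi-aware parser for MERGE INTO / DDL could additionally be
        // registered with extensions.injectParser { (session, delegate) => ... }
      }
    }

It could then be enabled either programmatically via
SparkSession.builder().withExtensions(new HoodieSparkSessionExtension) or
through the spark.sql.extensions config.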
On Dec 15, 2020, at 3:27 PM, Nishith <[email protected]> wrote:

Thanks for starting this thread, Vinoth.
In general, I definitely see the need for SQL-style semantics on Hudi
tables. Apache Calcite is a great option to consider given that DataSource
V2 has the limitations you described.

Additionally, even if Spark DataSource V2 allowed for the flexibility, the
same SQL semantics would need to be supported in other engines like Flink
to provide the same experience to users - which in itself could be a
considerable amount of work.
So, if we're able to generalize the SQL story around Calcite, that would
help reduce redundant work in some sense.
Although, I'm worried about a few things:

1) Like you pointed out, writing complex user jobs using Spark SQL syntax
can be harder for users who are moving from "Hudi syntax" to "Spark syntax"
for cross-table joins, merges etc. using data frames. One option is to look
at whether there are ways to plug in custom logical and physical plans in
Spark. This way, although the MERGE-on-Spark-SQL functionality may not be
as simple to use, it wouldn't take away performance or feature set for
starters; in the future we could think of having the entire query space be
powered by Calcite, like you mentioned.
2) If we continue to use DataSource V1, is there any downside to this from
a performance and optimization perspective when executing the plan? I'm
guessing not, but I haven't delved into the code to see if there's anything
different apart from the API and spec.

On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <[email protected]> wrote:

Hello all,

Just bumping this thread again.

thanks
vinoth

On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:

Hello all,

One feature that keeps coming up is the ability to use UPDATE and MERGE SQL
syntax to write into Hudi tables. We have looked into the Spark 3 DataSource
V2 APIs as well and found several issues that hinder us from implementing
this via the Spark APIs:

- As of this writing, the UPDATE/MERGE syntax is not really opened up to
external datasources like Hudi; only DELETE is.
- The DataSource V2 API offers no flexibility to perform any kind of further
transformations on the dataframe. Hudi supports keys, indexes, preCombining
and custom partitioning that ensures file sizes, etc. All of this needs
shuffling data, looking up/joining against other dataframes and so forth.
Today, the DataSource V1 API allows this kind of further
repartitioning/transformation, but the V2 API simply offers partition-level
iteration once the user calls df.write.format("hudi").

One thought I had is to explore Apache Calcite and write an adapter for
Hudi. This frees us from being very dependent on a particular engine's
syntax support, like Spark's. Calcite is very popular by itself and supports
most of the key words (and also more streaming-friendly syntax). To be
clear, we would still be using Spark/Flink underneath to perform the actual
writing; just the SQL grammar would be provided by Calcite.

To give a taste of how this would look:

A) If the user wants to mutate a Hudi table using SQL,
instead of writing something like: spark.sql("UPDATE ....")
users would write: hudiSparkSession.sql("UPDATE ....")

B) To save a Spark data frame to a Hudi table,
we continue to use Spark DataSource V1.
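As a rough, illustrative sketch of the two paths side by side. In (A),
hudiSparkSession and the UPDATE statement are hypothetical - that API does
not exist yet. In (B), the options are the typical Hudi DataSource V1 write
options; table and field names (hudi_trips, uuid, ts, partitionpath) are
just examples.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // (A) Hypothetical: a Hudi-aware session whose SQL grammar comes from
    // Calcite rather than Spark SQL.
    // hudiSparkSession.sql("UPDATE hudi_trips SET fare = 0.0 WHERE rider = 'rider-213'")

    // (B) Existing DataSource V1 write path, unchanged.
    def writeToHudi(df: DataFrame, basePath: String): Unit = {
      df.write.format("hudi")
        .option("hoodie.table.name", "hudi_trips")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode(SaveMode.Append)
        .save(basePath)
    }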
The obvious challenge I see is the disconnect with the Spark DataFrame
ecosystem. Users would write MERGE SQL statements by joining against other
Spark DataFrames. If we want those expressed in Calcite as well, we need to
also invest in the full query-side support, which could increase the scope
by a lot. Some amount of investigation needs to happen, but ideally we
should be able to integrate with the Spark SQL catalog and reuse all the
tables there.

I am sure there are some gaps in my thinking. Just starting this thread, so
we can discuss and others can chime in/correct me.

thanks
vinoth
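P.S. A small, illustrative note on the Calcite route above: Calcite's
standard grammar already parses UPDATE/MERGE/DELETE on its own, independent
of any engine. A tiny sketch, assuming calcite-core is on the classpath
(table and column names are made up):

    import org.apache.calcite.sql.parser.SqlParser

    object CalciteParseSketch {
      def main(args: Array[String]): Unit = {
        // Parse an UPDATE statement with Calcite's default parser; no Spark
        // or Flink involved. The resulting SqlNode could then be translated
        // into a Hudi upsert on whichever engine does the actual writing.
        val node = SqlParser
          .create("UPDATE hudi_trips SET fare = fare * 1.1 WHERE rider = 'rider-213'")
          .parseStmt()
        println(node.getKind)   // UPDATE
        println(node.toString)  // the statement, re-rendered from the parse tree
      }
    }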
