That’s awesome. Looks like we have a consensus on Calcite. Look forward to the RFC as well!
-Nishith

On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <[email protected]> wrote:

Sounds good. Look forward to an RFC/DISCUSS thread.

Thanks
Vinoth

On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <[email protected]> wrote:

Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I would
add support for SQL connectors for Hoodie on Flink soon ~
Currently, I'm preparing a refactoring of the current Flink writer code.

On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar <[email protected]> wrote:

Thanks Kabeer for the note on gmail. Did not realize that. :)

> My desired use case is that users use the Hoodie CLI to execute these SQLs.
> They can choose which engine to use via a CLI config option.

Yes, that is also another attractive aspect of this route. We can build out
a common SQL layer and have it translate to the underlying engine (sounds
like Hive, huh).
Longer term, if we really think we can more easily implement full DML + DDL
+ DQL this way, we can proceed with it.

As others pointed out, for Spark SQL it might be good to try the Spark
extensions route before we take this on more fully.

The other part where Calcite is great is all the support for
windowing/streaming in its syntax.
Danny, I guess we should be able to leverage that through a deeper
Flink/Hudi integration?

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <[email protected]> wrote:

I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.

On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <[email protected]> wrote:

Yeah, I agree with Nishith that one option is to look at ways to plug in
custom logical and physical plans in Spark. It can simplify the
implementation and reuse the Spark SQL syntax, and users already familiar
with Spark SQL will be able to pick up Hudi's SQL features more quickly.
In fact, Spark provides the SparkSessionExtensions interface for
implementing custom syntax extensions and SQL rewrite rules:
https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
We can use SparkSessionExtensions to extend Hudi SQL syntax such as
MERGE INTO and DDL.
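To make that concrete, here is a rough sketch of how such an extension could
be wired up. It is only illustrative: the class name is made up, and the
injected rule is a pass-through placeholder where Hudi-specific rewrites
(e.g. MERGE INTO -> upsert) would eventually go.

    import org.apache.spark.sql.SparkSessionExtensions
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Illustrative extension point: the injected rule currently returns the
    // plan unchanged; a real implementation would rewrite MERGE/UPDATE plans
    // into Hudi write operations.
    class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        extensions.injectResolutionRule { _ =>
          new Rule[LogicalPlan] {
            override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op placeholder
          }
        }
        // A Hudi-aware parser for MERGE INTO / DDL could additionally be
        // registered with extensions.injectParser { (session, delegate) => ... }
      }
    }

It could then be enabled either programmatically via
SparkSession.builder().withExtensions(new HoodieSparkSessionExtension) or
through the spark.sql.extensions config.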
On Dec 15, 2020, at 3:27 PM, Nishith <[email protected]> wrote:

Thanks for starting this thread, Vinoth.
In general, I definitely see the need for SQL-style semantics on Hudi
tables. Apache Calcite is a great option to consider given that DataSource
V2 has the limitations you described.

Additionally, even if Spark DataSource V2 allowed for the flexibility, the
same SQL semantics would need to be supported in other engines like Flink
to provide the same experience to users - which in itself could be a
considerable amount of work.
So, if we're able to generalize the SQL story around Calcite, that would
help reduce redundant work in some sense.
Although, I'm worried about a few things:

1) Like you pointed out, writing complex user jobs using Spark SQL syntax
can be harder for users who are moving from "Hudi syntax" to "Spark syntax"
for cross-table joins, merges etc. using data frames. One option is to look
at whether there are ways to plug in custom logical and physical plans in
Spark. This way, although the MERGE-on-Spark-SQL functionality may not be
as simple to use, it wouldn't take away performance or feature set for
starters; in the future we could think of having the entire query space be
powered by Calcite, like you mentioned.
2) If we continue to use DataSource V1, is there any downside to this from
a performance and optimization perspective when executing the plan? I'm
guessing not, but I haven't delved into the code to see if there's anything
different apart from the API and spec.

On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <[email protected]> wrote:

Hello all,

Just bumping this thread again.

thanks
vinoth

On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:

Hello all,

One feature that keeps coming up is the ability to use UPDATE and MERGE SQL
syntax to write into Hudi tables. We have looked into the Spark 3 DataSource
V2 APIs as well and found several issues that hinder us from implementing
this via the Spark APIs:

- As of this writing, the UPDATE/MERGE syntax is not really opened up to
external datasources like Hudi; only DELETE is.
- The DataSource V2 API offers no flexibility to perform any kind of further
transformations on the dataframe. Hudi supports keys, indexes, preCombining
and custom partitioning that ensures file sizes, etc. All of this needs
shuffling data, looking up/joining against other dataframes and so forth.
Today, the DataSource V1 API allows this kind of further
repartitioning/transformation, but the V2 API simply offers partition-level
iteration once the user calls df.write.format("hudi").

One thought I had is to explore Apache Calcite and write an adapter for
Hudi. This frees us from being very dependent on a particular engine's
syntax support, like Spark's. Calcite is very popular by itself and supports
most of the key words (and also more streaming-friendly syntax). To be
clear, we would still be using Spark/Flink underneath to perform the actual
writing; just the SQL grammar would be provided by Calcite.

To give a taste of how this would look:

A) If the user wants to mutate a Hudi table using SQL,
instead of writing something like: spark.sql("UPDATE ....")
users would write: hudiSparkSession.sql("UPDATE ....")

B) To save a Spark data frame to a Hudi table,
we continue to use Spark DataSource V1.
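As a rough, illustrative sketch of the two paths side by side. In (A),
hudiSparkSession and the UPDATE statement are hypothetical - that API does
not exist yet. In (B), the options are the typical Hudi DataSource V1 write
options; table and field names (hudi_trips, uuid, ts, partitionpath) are
just examples.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // (A) Hypothetical: a Hudi-aware session whose SQL grammar comes from
    // Calcite rather than Spark SQL.
    // hudiSparkSession.sql("UPDATE hudi_trips SET fare = 0.0 WHERE rider = 'rider-213'")

    // (B) Existing DataSource V1 write path, unchanged.
    def writeToHudi(df: DataFrame, basePath: String): Unit = {
      df.write.format("hudi")
        .option("hoodie.table.name", "hudi_trips")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode(SaveMode.Append)
        .save(basePath)
    }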
The obvious challenge I see is the disconnect with the Spark DataFrame
ecosystem. Users would write MERGE SQL statements by joining against other
Spark DataFrames. If we want those expressed in Calcite as well, we need to
also invest in the full query-side support, which could increase the scope
by a lot. Some amount of investigation needs to happen, but ideally we
should be able to integrate with the Spark SQL catalog and reuse all the
tables there.

I am sure there are some gaps in my thinking. Just starting this thread, so
we can discuss and others can chime in/correct me.

thanks
vinoth
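P.S. A small, illustrative note on the Calcite route above: Calcite's
standard grammar already parses UPDATE/MERGE/DELETE on its own, independent
of any engine. A tiny sketch, assuming calcite-core is on the classpath
(table and column names are made up):

    import org.apache.calcite.sql.parser.SqlParser

    object CalciteParseSketch {
      def main(args: Array[String]): Unit = {
        // Parse an UPDATE statement with Calcite's default parser; no Spark
        // or Flink involved. The resulting SqlNode could then be translated
        // into a Hudi upsert on whichever engine does the actual writing.
        val node = SqlParser
          .create("UPDATE hudi_trips SET fare = fare * 1.1 WHERE rider = 'rider-213'")
          .parseStmt()
        println(node.getKind)   // UPDATE
        println(node.toString)  // the statement, re-rendered from the parse tree
      }
    }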
