Yes, I think it should be OK.



On 2020-12-22 16:30:37, "Vinoth Chandar" <vin...@apache.org> wrote:
>Hi,
>
>I think what we are landing on finally is:
>
>- Keep pushing for SparkSQL support using the Spark extensions route
>- The Calcite effort will be a separate/orthogonal approach, down the line
>
>Please feel free to correct me, if I got this wrong.
>
>On Mon, Dec 21, 2020 at 3:30 AM pzwpzw <pengzhiwei2...@icloud.com.invalid>
>wrote:
>
>> Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
>> to process engine-independent SQL, for example most of the DDL and Hoodie CLI
>> commands, and also provide a parser for the common SQL extensions (e.g. MERGE
>> INTO). Engine-specific syntax can be handed off to the respective engines to
>> process. If the common SQL layer can handle the input SQL, it handles it;
>> otherwise it is routed to the engine for processing. In the long term, the
>> common layer will become richer and more complete.
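>>
>> A rough sketch of that routing (the function and callback names are made up
>> for illustration; only the Calcite parser calls are real Calcite APIs):
>>
>>   import org.apache.calcite.sql.SqlNode
>>   import org.apache.calcite.sql.parser.{SqlParseException, SqlParser}
>>
>>   // Try the engine-independent (Calcite-based) parser first; if it cannot
>>   // parse the statement, hand the raw SQL back to the engine's own parser.
>>   def route(sql: String,
>>             runOnCommonLayer: SqlNode => Unit,
>>             runOnEngine: String => Unit): Unit = {
>>     try {
>>       val node = SqlParser.create(sql, SqlParser.configBuilder().build()).parseStmt()
>>       runOnCommonLayer(node)                        // e.g. MERGE INTO, common DDL
>>     } catch {
>>       case _: SqlParseException => runOnEngine(sql) // engine-specific syntax
>>     }
>>   }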
>> On Dec 21, 2020 at 4:38 PM, 受春柏 <sc...@126.com> wrote:
>>
>> Hi,all
>>
>>
>> That's very good: Hudi SQL syntax can support Flink, Hive, and other
>> analysis components at the same time.
>> But there are some questions about Spark SQL. Spark SQL syntax is in
>> conflict with Calcite syntax. Is our strategy user migration or syntax
>> compatibility?
>> In addition, will it also support write SQL?
>>
>> On 2020-12-19 02:10:16, "Nishith" <n3.nas...@gmail.com> wrote:
>>
>> That’s awesome. Looks like we have a consensus on Calcite. Look forward to
>> the RFC as well!
>>
>>
>> -Nishith
>>
>>
>> On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vin...@apache.org> wrote:
>>
>>
>> Sounds good. Look forward to an RFC/DISCUSS thread.
>>
>>
>> Thanks
>>
>> Vinoth
>>
>>
>> On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <danny0...@apache.org> wrote:
>>
>>
>> Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite; I would
>> add support for SQL connectors for Hoodie Flink soon ~
>> Currently, I'm preparing a refactoring of the current Flink writer code.
>>
>>
>> On Fri, Dec 18, 2020 at 6:39 AM, Vinoth Chandar <vin...@apache.org> wrote:
>>
>>
>> Thanks Kabeer for the note on gmail. Did not realize that. :)
>>
>>
>> My desired use case is users using the Hoodie CLI to execute these SQLs.
>> They can choose which engine to use via a CLI config option.
>>
>>
>> Yes, that is also another attractive aspect of this route. We can build out
>> a common SQL layer and have this translate to the underlying engine (sounds
>> like Hive, huh).
>> Longer term, if we really think we can more easily implement full DML +
>> DDL + DQL, we can proceed with this.
>>
>>
>> As others pointed out, for Spark SQL, it might be good to try the Spark
>> extensions route, before we take this on more fully.
>>
>>
>> The other part where Calcite is great is all the support for
>> windowing/streaming in its syntax.
>> Danny, I guess we should be able to leverage that through a deeper
>> Flink/Hudi integration?
>>
>>
>>
>> On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vin...@apache.org> wrote:
>>
>>
>> I think Dongwook is investigating along the same lines, and it does seem
>> better to pursue this first, before trying other approaches.
>>
>>
>>
>>
>> On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pengzhiwei2...@icloud.com.invalid>
>> wrote:
>>
>>
>> Yeah, I agree with Nishith that one option is to look at ways to plug in
>> custom logical and physical plans in Spark. It can simplify the
>> implementation and reuse the Spark SQL syntax. Also, users familiar with
>> Spark SQL will be able to use Hudi's SQL features more quickly.
>> In fact, Spark provides the SparkSessionExtensions interface for
>> implementing custom syntax extensions and SQL rewrite rules:
>>
>> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
>>
>> We can use SparkSessionExtensions to extend Hudi SQL syntax, such as
>> MERGE INTO and DDL syntax.
>>
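>> For example, a minimal sketch of wiring a custom parser through
>> SparkSessionExtensions (HudiSqlParser and HudiSparkSessionExtension are
>> illustrative names, not an existing Hudi API; the parser below just
>> delegates everything to Spark's own parser):
>>
>>   import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
>>   import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
>>   import org.apache.spark.sql.catalyst.expressions.Expression
>>   import org.apache.spark.sql.catalyst.parser.ParserInterface
>>   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
>>   import org.apache.spark.sql.types.{DataType, StructType}
>>
>>   // A real Hudi parser would intercept MERGE INTO / Hudi DDL in parsePlan
>>   // and fall back to the delegate for everything else.
>>   class HudiSqlParser(session: SparkSession, delegate: ParserInterface)
>>       extends ParserInterface {
>>     override def parsePlan(sql: String): LogicalPlan = delegate.parsePlan(sql)
>>     override def parseExpression(sql: String): Expression = delegate.parseExpression(sql)
>>     override def parseTableIdentifier(sql: String): TableIdentifier =
>>       delegate.parseTableIdentifier(sql)
>>     override def parseFunctionIdentifier(sql: String): FunctionIdentifier =
>>       delegate.parseFunctionIdentifier(sql)
>>     override def parseTableSchema(sql: String): StructType = delegate.parseTableSchema(sql)
>>     override def parseDataType(sql: String): DataType = delegate.parseDataType(sql)
>>   }
>>
>>   // Entry point, registered via spark.sql.extensions or builder.withExtensions.
>>   class HudiSparkSessionExtension extends (SparkSessionExtensions => Unit) {
>>     override def apply(extensions: SparkSessionExtensions): Unit = {
>>       extensions.injectParser { (session, delegate) =>
>>         new HudiSqlParser(session, delegate)
>>       }
>>       // Resolution/rewrite rules could be injected here too,
>>       // e.g. extensions.injectResolutionRule(...).
>>     }
>>   }
>>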
>>
>> On Dec 15, 2020 at 3:27 PM, Nishith <n3.nas...@gmail.com> wrote:
>>
>>
>> Thanks for starting this thread, Vinoth.
>> In general, I definitely see the need for SQL-style semantics on Hudi
>> tables. Apache Calcite is a great option to consider given DataSource V2
>> has the limitations that you described.
>>
>> Additionally, even if Spark DataSource V2 allowed for the flexibility, the
>> same SQL semantics would need to be supported in other engines like Flink
>> to provide the same experience to users - which in itself could also be a
>> considerable amount of work.
>> So, if we're able to generalize the SQL story along Calcite, that would
>> help reduce redundant work in some sense.
>> Although, I'm worried about a few things:
>>
>> 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
>> can be harder for users who are moving from “Hudi syntax” to “Spark syntax”
>> for cross-table joins, merges, etc. using data frames. One option is to
>> look at whether there are ways to plug in custom logical and physical plans
>> in Spark. This way, although the MERGE on Spark SQL functionality may not
>> be as simple to use, it wouldn't take away performance and feature set for
>> starters; in the future we could think of having the entire query space be
>> powered by Calcite, like you mentioned.
>> 2) If we continue to use DataSource V1, is there any downside to this from
>> a performance and optimization perspective when executing the plan? I'm
>> guessing not, but I haven't delved into the code to see if there's anything
>> different apart from the API and spec.
>>
>>
>> On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vin...@apache.org> wrote:
>>
>>
>>
>> Hello all,
>>
>>
>>
>> Just bumping this thread again
>>
>>
>>
>> thanks
>>
>>
>> vinoth
>>
>>
>>
>> On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vin...@apache.org> wrote:
>>
>>
>>
>> Hello all,
>>
>> One feature that keeps coming up is the ability to use UPDATE, MERGE SQL
>> syntax to support writing into Hudi tables. We have looked into the Spark 3
>> DataSource V2 APIs as well and found several issues that hinder us in
>> implementing this via the Spark APIs:
>>
>> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
>> external datasources like Hudi; only DELETE is.
>> - The DataSource V2 API offers no flexibility to perform any kind of further
>> transformations to the dataframe. Hudi supports keys, indexes, preCombining
>> and custom partitioning that ensures file sizes, etc. All this needs
>> shuffling data, looking up/joining against other dataframes, and so forth.
>> Today, the DataSource V1 API allows this kind of further
>> partitioning/transformation, but the V2 API simply offers partition-level
>> iteration once the user calls df.write.format("hudi") (sketched below).
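>>
>> (For reference, a sketch of that existing DataSource V1 write path; inputDF
>> is assumed to be an existing DataFrame, the table and field names are only
>> illustrative, and the option keys are the standard Hudi write configs:)
>>
>>   // Hudi receives the whole DataFrame here and is free to shuffle it for
>>   // keys, precombine and file sizing before writing out files.
>>   inputDF.write
>>     .format("hudi")
>>     .option("hoodie.table.name", "hudi_trips")
>>     .option("hoodie.datasource.write.recordkey.field", "uuid")
>>     .option("hoodie.datasource.write.precombine.field", "ts")
>>     .option("hoodie.datasource.write.partitionpath.field", "partition")
>>     .mode("append")
>>     .save("/tmp/hudi_trips")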
>>
>>
>>
>> One thought I had is to explore Apache Calcite and write an adapter for
>> Hudi. This frees us from being very dependent on a particular engine's
>> syntax support, like Spark's. Calcite is very popular by itself and supports
>> most of the keywords (and also a more streaming-friendly syntax). To be
>> clear, we will still be using Spark/Flink underneath to perform the actual
>> writing; just the SQL grammar is provided by Calcite.
>>
>> To give a taste of how this would look:
>>
>> A) If the user wants to mutate a Hudi table using SQL:
>>
>> Instead of writing something like: spark.sql("UPDATE ....")
>> users will write: hudiSparkSession.sql("UPDATE ....")
>>
>> B) To save a Spark data frame to a Hudi table,
>> we continue to use Spark DataSource V1.
>>
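>> A very rough sketch of what could sit behind such a hudiSparkSession.sql()
>> call, assuming Calcite is only used to parse the statement and a
>> hypothetical translateToSparkWrite() hook hands the result to the existing
>> Spark write path:
>>
>>   import org.apache.calcite.sql.SqlNode
>>   import org.apache.calcite.sql.parser.SqlParser
>>
>>   // UPDATE/MERGE/DELETE are part of Calcite's core grammar, so parsing
>>   // needs no engine support at all. Table and column names are illustrative.
>>   val sql = "UPDATE hudi_trips SET fare = fare * 1.1 WHERE driver = 'driver-213'"
>>   val node: SqlNode =
>>     SqlParser.create(sql, SqlParser.configBuilder().build()).parseStmt()
>>   // node.getKind is SqlKind.UPDATE; translateToSparkWrite(node) would build
>>   // the changed-records DataFrame and reuse df.write.format("hudi") underneath.
>>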
>>
>>
>> The obvious challenge I see is the disconnect with the Spark DataFrame
>> ecosystem. Users would write MERGE SQL statements by joining against other
>> Spark DataFrames.
>> If we want those expressed in Calcite as well, we need to also invest in
>> the full query-side support, which can increase the scope by a lot.
>> Some amount of investigation needs to happen, but ideally we should be
>> able to integrate with the Spark SQL catalog and reuse all the tables there.
>>
>>
>>
>> I am sure there are some gaps in my thinking. Just starting this thread,
>> so we can discuss and others can chime in/correct me.
>>
>>
>>
>> thanks
>>
>>
>> vinoth