That's great, I can help with the Apache Calcite integration.

Vinoth Chandar <vin...@apache.org> wrote on Wed, Dec 23, 2020 at 12:29 AM:

> Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think.
> Love to have you involved.
>
> On Tue, Dec 22, 2020 at 3:20 AM pzwpzw <pengzhiwei2...@icloud.com.invalid>
> wrote:
>
> > Yes, it looks good.
> > We are building the Spark SQL extensions to support Hudi in our
> > internal version. I am interested in participating in the extension of
> > Spark SQL on Hudi.
> > On Dec 22, 2020 at 4:30 PM, Vinoth Chandar <vin...@apache.org> wrote:
> >
> > Hi,
> >
> > I think what we are finally landing on is:
> >
> > - Keep pushing for SparkSQL support using Spark extensions route
> > - Calcite effort will be a separate/orthogonal approach, down the line
> >
> > Please feel free to correct me, if I got this wrong.
> >
> > On Mon, Dec 21, 2020 at 3:30 AM pzwpzw <pengzhiwei2...@icloud.com.invalid>
> > wrote:
> >
> > Hi 受春柏, here is my point. We can use Calcite to build a common SQL
> > layer to process engine-independent SQL, for example most of the DDL,
> > the Hoodie CLI commands, and also a parser for the common SQL
> > extensions (e.g. MERGE INTO). Engine-specific syntax can be delegated
> > to the respective engines to process. If the common SQL layer can
> > handle the input SQL, it handles it; otherwise it is routed to the
> > engine for processing. In the long term, the common layer will become
> > richer and more complete.
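> >
> > A minimal sketch of that routing idea, where runOnCommonLayer is a
> > hypothetical stand-in for translating the Calcite tree into an engine
> > job:
> >
> >     import org.apache.calcite.sql.SqlNode
> >     import org.apache.calcite.sql.parser.{SqlParseException, SqlParser}
> >     import org.apache.spark.sql.SparkSession
> >
> >     // Hypothetical: translate the parsed SqlNode into a Spark/Flink job.
> >     def runOnCommonLayer(node: SqlNode): Unit = ???
> >
> >     // Try the common (Calcite) layer first; fall back to the engine's
> >     // own SQL support for engine-specific syntax it cannot parse.
> >     def route(spark: SparkSession, sql: String): Unit =
> >       try runOnCommonLayer(SqlParser.create(sql).parseStmt())
> >       catch { case _: SqlParseException => spark.sql(sql) }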
> >
> > On Dec 21, 2020 at 4:38 PM, 受春柏 <sc...@126.com> wrote:
> >
> >
> > Hi, all
> >
> > That's very good. Hudi SQL syntax can support Flink, Hive, and other
> > analysis components at the same time. But there are some questions
> > about Spark SQL: Spark SQL syntax is in conflict with Calcite syntax.
> > Is our strategy user migration or syntax compatibility?
> > In addition, will it also support write SQL?
> >
> > On 2020-12-19 02:10:16, "Nishith" <n3.nas...@gmail.com> wrote:
> >
> >
> > That’s awesome. Looks like we have a consensus on Calcite. Looking
> > forward to the RFC as well!
> >
> > -Nishith
> >
> >
> >
> > On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vin...@apache.org> wrote:
> >
> >
> >
> > Sounds good. Looking forward to an RFC/DISCUSS thread.
> >
> >
> >
> > Thanks,
> > Vinoth
> >
> >
> >
> > On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <danny0...@apache.org> wrote:
> >
> >
> >
> > Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I
> > would add support for SQL connectors for Hoodie Flink soon ~
> > Currently, I'm preparing a refactoring of the current Flink writer code.
> >
> >
> >
> > Vinoth Chandar <vin...@apache.org> wrote on Fri, Dec 18, 2020 at 6:39 AM:
> >
> >
> >
> > Thanks Kabeer for the note on gmail. Did not realize that. :)
> >
> >
> >
> > My desired use case is that users use the Hoodie CLI to execute these
> > SQLs. They can choose which engine to use via a CLI config option.
> >
> >
> >
> > Yes, that is also another attractive aspect of this route. We can build
> > out a common SQL layer and have this translate to the underlying engine
> > (sounds like Hive, huh).
> >
> >
> > Longer term, if we really think we can more easily implement full DML +
> > DDL + DQL, we can proceed with this.
> >
> >
> >
> > As others pointed out, for Spark SQL, it might be good to try the Spark
> > extensions route before we take this on more fully.
> >
> >
> >
> > The other part where Calcite is great is all the support for
> > windowing/streaming in its syntax.
> > Danny, I guess we should be able to leverage that through a deeper
> > Flink/Hudi integration?
> >
> >
> >
> >
> > On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vin...@apache.org>
> > wrote:
> >
> >
> >
> > I think Dongwook is investigating along the same lines, and it does
> > seem better to pursue this first, before trying other approaches.
> >
> >
> >
> >
> >
> > On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pengzhiwei2...@icloud.com.invalid>
> > wrote:
> >
> >
> >
> > Yeah, I agree with Nishith that one option is to look at ways to plug
> > in custom logical and physical plans in Spark. It can simplify the
> > implementation and reuse the Spark SQL syntax. Also, users familiar
> > with Spark SQL will be able to use Hudi's SQL features more quickly.
> > In fact, Spark provides the SparkSessionExtensions interface for
> > implementing custom syntax extensions and SQL rewrite rules:
> > https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> >
> >
> > We can use SparkSessionExtensions to extend Hoodie SQL syntax such
> > as MERGE INTO and DDL syntax.
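> >
> > As a minimal sketch of the injection mechanism against the Spark 2.4
> > ParserInterface (this HoodieSqlParser is a pass-through placeholder,
> > not an actual Hudi parser; a real one would specialize parsePlan for
> > the new syntax):
> >
> >     import org.apache.spark.sql.SparkSession
> >     import org.apache.spark.sql.catalyst.parser.ParserInterface
> >
> >     // Placeholder parser: a real one would handle MERGE INTO etc. in
> >     // parsePlan and delegate everything else to Spark's own parser.
> >     class HoodieSqlParser(session: SparkSession, delegate: ParserInterface)
> >         extends ParserInterface {
> >       override def parsePlan(sqlText: String) = delegate.parsePlan(sqlText)
> >       override def parseExpression(sqlText: String) = delegate.parseExpression(sqlText)
> >       override def parseTableIdentifier(sqlText: String) = delegate.parseTableIdentifier(sqlText)
> >       override def parseFunctionIdentifier(sqlText: String) = delegate.parseFunctionIdentifier(sqlText)
> >       override def parseTableSchema(sqlText: String) = delegate.parseTableSchema(sqlText)
> >       override def parseDataType(sqlText: String) = delegate.parseDataType(sqlText)
> >     }
> >
> >     // Inject the custom parser when building the session.
> >     val spark = SparkSession.builder()
> >       .master("local[*]")
> >       .withExtensions(_.injectParser((session, delegate) =>
> >         new HoodieSqlParser(session, delegate)))
> >       .getOrCreate()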
> >
> >
> >
> > On Dec 15, 2020 at 3:27 PM, Nishith <n3.nas...@gmail.com> wrote:
> >
> >
> >
> > Thanks for starting this thread, Vinoth.
> > In general, I definitely see the need for SQL-style semantics on Hudi
> > tables. Apache Calcite is a great option to consider, given that
> > DataSourceV2 has the limitations that you described.
> >
> >
> >
> > Additionally, even if Spark DataSourceV2 allowed for the flexibility,
> > the same SQL semantics would need to be supported in other engines like
> > Flink to provide the same experience to users - which in itself could
> > be a considerable amount of work.
> > So, if we’re able to generalize the SQL story around Calcite, that
> > would help reduce redundant work in some sense.
> > Although, I’m worried about a few things:
> >
> >
> >
> > 1) Like you pointed out, writing complex user jobs using Spark SQL
> > syntax can be harder for users who are moving from “Hudi syntax” to
> > “Spark syntax” for cross-table joins, merges, etc. using data frames.
> > One option is to look at whether there are ways to plug in custom
> > logical and physical plans in Spark. This way, although the MERGE on
> > Spark SQL functionality may not be as simple to use, it wouldn’t take
> > away performance and feature set for starters; in the future we could
> > think of having the entire query space be powered by Calcite, like you
> > mentioned.
> >
> >
> > 2) If we continue to use DataSourceV1, is there any downside to this
> > from a performance and optimization perspective when executing the
> > plan? I’m guessing not, but I haven’t delved into the code to see if
> > there’s anything different apart from the API and spec.
> >
> >
> >
> > On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vin...@apache.org>
> > wrote:
> >
> >
> >
> >
> > Hello all,
> >
> > Just bumping this thread again.
> >
> > thanks
> > vinoth
> >
> >
> >
> >
> > On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vin...@apache.org>
> > wrote:
> >
> >
> >
> >
> > Hello all,
> >
> > One feature that keeps coming up is the ability to use UPDATE/MERGE SQL
> > syntax to support writing into Hudi tables. We have looked into the
> > Spark 3 DataSource V2 APIs as well and found several issues that hinder
> > us from implementing this via the Spark APIs:
> >
> >
> >
> >
> > - As of this writing, the UPDATE/MERGE syntax is not really opened up
> > to external datasources like Hudi; only DELETE is.
> > - The DataSource V2 API offers no flexibility to perform any kind of
> > further transformations on the DataFrame. Hudi supports keys, indexes,
> > preCombining, and custom partitioning that ensures file sizes, etc. All
> > of this needs shuffling data, looking up/joining against other
> > DataFrames, and so forth. Today, the DataSource V1 API allows this kind
> > of further partitioning/transformation, but the V2 API simply offers
> > partition-level iteration once the user calls df.write.format("hudi"),
> > as the sketch below illustrates.
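> >
> > For illustration, a simplified sketch of the V1 hook (targeting Spark
> > 2.4, and not Hudi's actual DefaultSource): the provider receives the
> > whole DataFrame, so arbitrary shuffles and joins are possible before
> > writing.
> >
> >     import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
> >     import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
> >     import org.apache.spark.sql.types.StructType
> >
> >     // V1 hands the provider the entire DataFrame to write, so it is
> >     // free to shuffle, join against an index, plan file sizes, etc.
> >     class SketchSource extends CreatableRelationProvider {
> >       override def createRelation(
> >           ctx: SQLContext,
> >           mode: SaveMode,
> >           parameters: Map[String, String],
> >           data: DataFrame): BaseRelation = {
> >         // Arbitrary pre-write transformations are possible here.
> >         val prepared = data.repartition(
> >           parameters.getOrElse("hoodie.upsert.shuffle.parallelism", "200").toInt)
> >         // ... key generation, index lookup, small-file handling, write ...
> >         new BaseRelation {
> >           override def sqlContext: SQLContext = ctx
> >           override def schema: StructType = prepared.schema
> >         }
> >       }
> >     }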
> >
> >
> >
> >
> > One thought I had is to explore Apache Calcite and write an adapter
> > for Hudi. This frees us from being very dependent on a particular
> > engine's syntax support, like Spark's. Calcite is very popular by
> > itself and supports most of the keywords (and also more
> > streaming-friendly syntax). To be clear, we will still be using
> > Spark/Flink underneath to perform the actual writing; just the SQL
> > grammar is provided by Calcite.
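> >
> > As a small taste of the grammar-only role Calcite would play
> > (translating the parsed tree into a Spark/Flink job is the part the
> > adapter would still have to implement):
> >
> >     import org.apache.calcite.sql.parser.SqlParser
> >
> >     // Calcite's built-in grammar already parses MERGE/UPDATE; the
> >     // result is a SqlNode tree an adapter could translate into a Hudi
> >     // upsert on the underlying engine.
> >     val sql =
> >       """MERGE INTO hudi_tbl t USING updates u ON t.id = u.id
> >         |WHEN MATCHED THEN UPDATE SET price = u.price
> >         |WHEN NOT MATCHED THEN INSERT (id, price) VALUES (u.id, u.price)""".stripMargin
> >
> >     val node = SqlParser.create(sql).parseStmt()
> >     println(node.getKind) // MERGE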
> >
> >
> >
> >
> > To give a taste of how this would look:
> >
> > A) If the user wants to mutate a Hudi table using SQL:
> > instead of writing something like spark.sql("UPDATE ...."),
> > users will write hudiSparkSession.sql("UPDATE ....")
> >
> > B) To save a Spark DataFrame to a Hudi table,
> > we continue to use Spark DataSource V1, as in the sketch below.
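> >
> > A minimal sketch of that unchanged V1 write path, using the Hudi
> > datasource write configs (the field names and path are illustrative):
> >
> >     import org.apache.spark.sql.SaveMode
> >
> >     // df is any DataFrame; key/precombine fields are illustrative.
> >     df.write.format("hudi")
> >       .option("hoodie.table.name", "hudi_tbl")
> >       .option("hoodie.datasource.write.recordkey.field", "id")
> >       .option("hoodie.datasource.write.precombine.field", "ts")
> >       .mode(SaveMode.Append)
> >       .save("/tmp/hudi_tbl")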
> >
> >
> >
> >
> > The obvious challenge I see is the disconnect with the Spark DataFrame
> > ecosystem. Users would write MERGE SQL statements by joining against
> > other Spark DataFrames.
> >
> >
> >
> > If we want those expressed in Calcite as well, we need to also invest
> > in the full query-side support, which can increase the scope by a lot.
> > Some amount of investigation needs to happen, but ideally we should be
> > able to integrate with the Spark SQL catalog and reuse all the tables
> > there.
> >
> >
> >
> >
> > I am sure there are some gaps in my thinking. Just starting this
> > thread, so we can discuss and others can chime in/correct me.
> >
> >
> >
> >
> > thanks
> > vinoth
