Thanks Kabeer for the note on gmail. Did not realize that. :)

>> My desired use case is that users use the Hoodie CLI to execute these SQL
>> statements. They can choose which engine to use via a CLI config option.
Yes, that is also another attractive aspect of this route. We can build out a
common SQL layer and have it translate to the underlying engine (sounds like
Hive, huh). Longer term, if we really think we can more easily implement full
DML + DDL + DQL support, we can proceed with this. As others pointed out, for
Spark SQL it might be good to try the Spark extensions route before we take
this on more fully.

The other part where Calcite is great is all the support for
windowing/streaming in its syntax. Danny, I guess we should be able to
leverage that through a deeper Flink/Hudi integration?

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <[email protected]> wrote:

> I think Dongwook is investigating along the same lines, and it does seem
> better to pursue this first, before trying other approaches.
>
> On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <[email protected]> wrote:
>
> > Yeah, I agree with Nishith that one option is to look at ways to plug in
> > custom logical and physical plans in Spark. It can simplify the
> > implementation and reuse the Spark SQL syntax. Also, users familiar with
> > Spark SQL will be able to use Hudi's SQL features more quickly.
> > In fact, Spark already provides the SparkSessionExtensions interface for
> > implementing custom syntax extensions and SQL rewrite rules:
> > https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> > We could use SparkSessionExtensions to extend Hudi SQL syntax such as
> > MERGE INTO and DDL statements.
> >
> > On Dec 15, 2020, at 3:27 PM, Nishith <[email protected]> wrote:
> >
> > Thanks for starting this thread Vinoth.
> > In general, definitely see the need for SQL-style semantics on Hudi
> > tables. Apache Calcite is a great option to consider, given DataSource V2
> > has the limitations that you described.
> >
> > Additionally, even if Spark DataSource V2 allowed for the flexibility,
> > the same SQL semantics would need to be supported in other engines like
> > Flink to provide the same experience to users - which in itself could
> > also be a considerable amount of work. So, if we're able to generalize
> > the SQL story around Calcite, that would help reduce redundant work in
> > some sense. I'm worried about a few things, though:
> >
> > 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
> > can be harder for users who are moving from "Hudi syntax" to "Spark
> > syntax" for cross-table joins, merges etc. using data frames. One option
> > is to look at whether there are ways to plug in custom logical and
> > physical plans in Spark. This way, although the MERGE-on-Spark-SQL
> > functionality may not be as simple to use, it wouldn't take away
> > performance and feature set for starters; in the future we could think of
> > having the entire query space be powered by Calcite like you mentioned.
> > 2) If we continue to use DataSource V1, is there any downside from a
> > performance and optimization perspective when executing the plan? I'm
> > guessing not, but I haven't delved into the code to see if there's
> > anything different apart from the API and spec.
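(A quick sketch of the SparkSessionExtensions route described above, for
reference. The extension class name and the idea of a Hudi-aware parser are
assumptions; no such parser exists yet, so the snippet simply hands back
Spark's own parser delegate so that it compiles on its own.)

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.parser.ParserInterface

// Hypothetical extension entry point. A real implementation would return a
// parser that recognises Hudi statements (MERGE INTO, custom DDL) and
// delegates everything else to Spark's stock parser.
class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectParser { (_: SparkSession, delegate: ParserInterface) =>
      delegate // swap in a Hudi-aware ParserInterface wrapping `delegate`
    }
    // SparkSessionExtensions also exposes hooks for analysis/optimizer rules
    // and planner strategies (injectResolutionRule, injectOptimizerRule, ...),
    // which is where SQL rewrite rules would plug in.
  }
}

The extension would then be enabled per session through the standard config,
e.g. --conf spark.sql.extensions=<fully qualified name of the extension class>.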
> > On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <[email protected]> wrote:
> >
> > > Hello all,
> > >
> > > Just bumping this thread again.
> > >
> > > thanks
> > > vinoth
> > >
> > > On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:
> > >
> > > > Hello all,
> > > >
> > > > One feature that keeps coming up is the ability to use UPDATE/MERGE
> > > > SQL syntax to write into Hudi tables. We have looked into the Spark 3
> > > > DataSource V2 APIs as well and found several issues that hinder us
> > > > from implementing this via the Spark APIs:
> > > >
> > > > - As of this writing, the UPDATE/MERGE syntax is not really opened up
> > > > to external datasources like Hudi; only DELETE is.
> > > > - The DataSource V2 API offers no flexibility to perform any further
> > > > transformations on the dataframe. Hudi supports keys, indexes,
> > > > preCombining and custom partitioning that ensures file sizes, etc.
> > > > All of this needs shuffling data, looking up/joining against other
> > > > dataframes and so forth. Today, the DataSource V1 API allows this kind
> > > > of further partitioning/transformation, but the V2 API simply offers
> > > > partition-level iteration once the user calls df.write.format("hudi").
> > > >
> > > > One thought I had is to explore Apache Calcite and write an adapter
> > > > for Hudi. This frees us from being very dependent on a particular
> > > > engine's syntax support, like Spark's. Calcite is very popular by
> > > > itself and supports most of the keywords (and also a more
> > > > streaming-friendly syntax). To be clear, we will still be using
> > > > Spark/Flink underneath to perform the actual writing; just the SQL
> > > > grammar is provided by Calcite.
> > > >
> > > > To give a taste of what this would look like:
> > > >
> > > > A) If the user wants to mutate a Hudi table using SQL:
> > > > Instead of writing something like: spark.sql("UPDATE ....")
> > > > users will write: hudiSparkSession.sql("UPDATE ....")
> > > >
> > > > B) To save a Spark data frame to a Hudi table,
> > > > we continue to use Spark DataSource V1.
> > > >
> > > > The obvious challenge I see is the disconnect with the Spark DataFrame
> > > > ecosystem. Users would write MERGE SQL statements by joining against
> > > > other Spark DataFrames. If we want those expressed in Calcite as well,
> > > > we would also need to invest in full query-side support, which can
> > > > increase the scope by a lot. Some amount of investigation needs to
> > > > happen, but ideally we should be able to integrate with the Spark SQL
> > > > catalog and reuse all the tables there.
> > > >
> > > > I am sure there are some gaps in my thinking. Just starting this
> > > > thread, so we can discuss and others can chime in/correct me.
> > > >
> > > > thanks
> > > > vinoth
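(To make option A) concrete, here is a minimal sketch of the Calcite side. It
only shows that Calcite's stock parser already accepts MERGE INTO; the table
and column names are made up, and the step that would translate the parsed
tree into a Hudi upsert - the hudiSparkSession wrapper proposed above - does
not exist yet.)

import org.apache.calcite.sql.SqlNode
import org.apache.calcite.sql.parser.SqlParser

object MergeParseSketch {
  def main(args: Array[String]): Unit = {
    // The MERGE grammar comes for free with Calcite; only the translation of
    // the parsed tree into a Hudi write (via Spark/Flink) needs to be built.
    val sql =
      """MERGE INTO hudi_trips AS t
        |USING trip_updates AS u
        |ON t.uuid = u.uuid
        |WHEN MATCHED THEN UPDATE SET fare = u.fare
        |WHEN NOT MATCHED THEN INSERT (uuid, fare) VALUES (u.uuid, u.fare)""".stripMargin

    val node: SqlNode = SqlParser.create(sql).parseStmt()
    println(node.getKind) // prints MERGE
  }
}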

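(And for option B), a sketch of the DataSource V1 write path that would keep
doing the actual writing underneath. The table name, field names and paths are
illustrative only; the option keys are the standard hoodie.datasource write
configs.)

import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiV1WriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-v1-write-sketch").getOrCreate()

    // Incoming changes to apply; path and schema are made up for illustration.
    val updates = spark.read.parquet("/tmp/trip_updates")

    // The V1 path is what lets Hudi do key-based indexing, preCombine
    // de-duplication and file-sizing-aware partitioning before writing files.
    updates.write.format("hudi")
      .option("hoodie.table.name", "hudi_trips")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.partitionpath.field", "partition_date")
      .mode(SaveMode.Append)
      .save("/tmp/hudi_trips")
  }
}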