Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Vinoth Chandar Tue, 22 Dec 2020 08:30:01 -0800

Sounds great. I think there will be an RFC/DISCUSS thread once 0.7.0 is out.
Would love to have you involved.


On Tue, Dec 22, 2020 at 3:20 AM pzwpzw <pengzhiwei2...@icloud.com.invalid>
wrote:

> Yes, it looks good.
> We are building Spark SQL extensions to support Hudi in
> our internal version.
> I am interested in participating in the Spark SQL extension work for Hudi.
>
> On Dec 22, 2020, at 4:30 PM, Vinoth Chandar <vin...@apache.org> wrote:
>
> Hi,
>
> I think what we are finally landing on is:
>
> - Keep pushing for SparkSQL support using the Spark extensions route
> - The Calcite effort will be a separate/orthogonal approach, down the line
>
> Please feel free to correct me, if I got this wrong.
>
> On Mon, Dec 21, 2020 at 3:30 AM pzwpzw <pengzhiwei2...@icloud.com.invalid>
> wrote:
>
> Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
> to process engine-independent SQL, for example most of the DDL and Hoodie CLI
> commands, and also to provide a parser for the common SQL extensions (e.g. MERGE
> INTO). Engine-specific syntax can be passed to the respective engines to
> process. If the common SQL layer can handle the input SQL, it handles
> it. Otherwise the SQL is routed to the engine for processing. In the long term, the
> common layer will become richer and more complete.
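
The routing idea in the paragraph above could be sketched roughly as follows, in Python. Everything here is hypothetical: the pattern list, `common_layer_can_handle`, `route_sql` and the `fake_engine` callback are illustrative names only, and a real common layer would use a Calcite parser rather than regular expressions.

```python
import re
from typing import Callable

# Hypothetical stand-in for the common SQL layer: the statements it claims to
# understand (DDL and MERGE INTO). A real layer would use a Calcite parser.
COMMON_LAYER_PATTERNS = [
    re.compile(r"^\s*(CREATE|ALTER|DROP)\s+TABLE\b", re.IGNORECASE),
    re.compile(r"^\s*MERGE\s+INTO\b", re.IGNORECASE),
]


def common_layer_can_handle(sql: str) -> bool:
    """Return True if the common layer recognises the statement."""
    return any(p.search(sql) for p in COMMON_LAYER_PATTERNS)


def route_sql(sql: str, engine_execute: Callable[[str], str]) -> str:
    """Handle the statement in the common layer if possible, else delegate."""
    if common_layer_can_handle(sql):
        return "common-layer: " + sql.strip()
    return engine_execute(sql)


def fake_engine(sql: str) -> str:
    # Placeholder for handing the statement to an engine such as Spark or Flink.
    return "engine: " + sql.strip()


if __name__ == "__main__":
    print(route_sql("MERGE INTO t USING s ON t.id = s.id", fake_engine))
    print(route_sql("SELECT * FROM t", fake_engine))
```

The fallback keeps engine-specific syntax working unchanged while the common layer grows over time.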
>
> On Dec 21, 2020, at 4:38 PM, 受春柏 <sc...@126.com> wrote:
>
> Hi all,
>
> That's very good. Hudi SQL syntax can support Flink, Hive and other analysis
> components at the same time.
> But I have some questions about SparkSQL. SparkSQL syntax
> conflicts with Calcite syntax. Is our strategy
> user migration or syntax compatibility?
> In addition, will it also support write SQL?
>
> On 2020-12-19 02:10:16, "Nishith" <n3.nas...@gmail.com> wrote:
>
> That’s awesome. Looks like we have a consensus on Calcite. Look forward to
> the RFC as well!
>
>
>
> -Nishith
>
>
>
> On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vin...@apache.org> wrote:
>
>
>
> Sounds good. Look forward to an RFC/DISCUSS thread.
>
> Thanks
> Vinoth
>
>
>
> On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <danny0...@apache.org> wrote:
>
>
>
> Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I will
> add support for SQL connectors for Hoodie Flink soon ~
> Currently, I'm preparing a refactoring of the current Flink writer code.
>
>
>
> On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar <vin...@apache.org> wrote:
>
>
>
> Thanks Kabeer for the note on gmail. Did not realize that. :)
>
>
>
> My desired use case is that users use the Hoodie CLI to execute these SQL
> statements. They can choose which engine to use via a CLI config option.
>
>
>
> Yes, that is also another attractive aspect of this route. We can build out
> a common SQL layer and have this translate to the underlying engine (sounds
> like Hive, huh).
> Longer term, if we really think we can more easily implement a full DML +
> DDL + DQL, we can proceed with this.
>
> As others pointed out, for Spark SQL, it might be good to try the Spark
> extensions route, before we take this on more fully.
>
> The other part where Calcite is great is all the support for
> windowing/streaming in its syntax.
> Danny, I guess we should be able to leverage that through a deeper
> Flink/Hudi integration?
>
>
>
>
> On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vin...@apache.org> wrote:
>
> I think Dongwook is investigating along the same lines, and it does seem
> better to pursue this first, before trying other approaches.
>
>
>
>
>
> On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pengzhiwei2...@icloud.com.invalid> wrote:
>
>
>
> Yeah, I agree with Nishith that one option is to look at the ways to
> plug in custom logical and physical plans in Spark. It can simplify the
> implementation and reuse the Spark SQL syntax. Also, users familiar with
> Spark SQL will be able to use Hudi's SQL features more quickly.
> In fact, Spark provides the SparkSessionExtensions interface for
> implementing custom syntax extensions and SQL rewrite rules:
>
> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
>
> We can use SparkSessionExtensions to extend the Hoodie SQL syntax with, for
> example, MERGE INTO and DDL syntax.
>
>
>
> On Dec 15, 2020, at 3:27 PM, Nishith <n3.nas...@gmail.com> wrote:
>
>
>
> Thanks for starting this thread, Vinoth.
> In general, I definitely see the need for SQL-style semantics on Hudi
> tables. Apache Calcite is a great option to consider given that DatasourceV2
> has the limitations that you described.
>
> Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
> same SQL semantics need to be supported in other engines like Flink to
> provide the same experience to users, which in itself could also be a
> considerable amount of work.
> So, if we’re able to generalize the SQL story along Calcite, that would
> help reduce redundant work in some sense.
>
>
> Although, I’m worried about a few things:
>
> 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
> can be harder for users who are moving from “Hudi syntax” to “Spark syntax”
> for cross-table joins, merges, etc. using data frames. One option is to look
> at whether there are ways to plug in custom logical and physical plans in
> Spark. This way, although the merge-on-SparkSQL functionality may not be as
> simple to use, it wouldn’t take away performance and feature set for
> starters. In the future, we could think of having the entire query space be
> powered by Calcite, like you mentioned.
>
> 2) If we continue to use DatasourceV1, is there any downside to this from
> a performance and optimization perspective when executing the plan? I’m
> guessing not, but I haven’t delved into the code to see if there’s anything
> different apart from the API and spec.
>
>
>
> On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vin...@apache.org> wrote:
>
> Hello all,
>
> Just bumping this thread again.
>
> thanks
> vinoth
>
>
>
>
> On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vin...@apache.org> wrote:
>
>
>
>
> Hello all,
>
>
>
>
> One feature that keeps coming up is the ability to use UPDATE, MERGE SQL
> syntax to support writing into Hudi tables. We have looked into the Spark 3
> DataSource V2 APIs as well and found several issues that hinder us in
> implementing this via the Spark APIs:
>
>
>
>
> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> external datasources like Hudi. Only DELETE is.
> - DataSource V2 API offers no flexibility to perform any kind of
> further transformations to the dataframe. Hudi supports keys, indexes,
> preCombining and custom partitioning that ensures file sizes etc. All this
> needs shuffling data, looking up/joining against other dataframes and so forth.
> Today, the DataSource V1 API allows this kind of further
> partitions/transformations. But the V2 API simply offers partition-level
> iteration once the user calls df.write.format("hudi")
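
To illustrate why the steps listed above need more than partition-level iteration, here is a small Python sketch of precombining: collapsing an incoming batch to one record per key, keeping the record with the largest ordering value. The field names `id` and `ts` and the function `precombine` are assumptions made for illustration, not Hudi APIs. On a distributed engine, bringing all records for a key together is the part that requires a shuffle.

```python
from typing import Dict, Iterable, List


def precombine(records: Iterable[dict], key_field: str, ordering_field: str) -> List[dict]:
    """Keep one record per key: the one with the largest ordering value."""
    latest: Dict[object, dict] = {}
    for rec in records:
        key = rec[key_field]
        current = latest.get(key)
        # Replace only on a strictly larger ordering value, so in this sketch
        # the first record seen wins ties.
        if current is None or rec[ordering_field] > current[ordering_field]:
            latest[key] = rec
    return list(latest.values())


if __name__ == "__main__":
    batch = [
        {"id": 1, "ts": 10, "value": "a"},
        {"id": 2, "ts": 5, "value": "b"},
        {"id": 1, "ts": 12, "value": "c"},
    ]
    for rec in precombine(batch, key_field="id", ordering_field="ts"):
        print(rec)
```

The sketch runs on a single in-memory list; it only shows the per-key reduction, not indexing or file sizing.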
>
>
>
>
> One thought I had is to explore Apache Calcite and write an adapter for
> Hudi. This frees us from being very dependent on a particular engine's
> syntax support, like Spark's. Calcite is very popular by itself and supports
> most of the keywords (and also more streaming-friendly syntax). To be
> clear, we will still be using Spark/Flink underneath to perform the actual
> writing, just that the SQL grammar is provided by Calcite.
>
>
>
>
> To give a taste of what this will look like:
>
> A) If the user wants to mutate a Hudi table using SQL
>
> Instead of writing something like : spark.sql("UPDATE ....")
> users will write : hudiSparkSession.sql("UPDATE ....")
>
> B) To save a Spark data frame to a Hudi table
>
> we continue to use Spark DataSource V1
>
>
>
>
> The obvious challenge I see is the disconnect with the Spark DataFrame
> ecosystem. Users would write MERGE SQL statements by joining against other
> Spark DataFrames.
> If we want those expressed in Calcite as well, we need to also invest in
> the full Query side support, which can increase the scope by a lot.
> Some amount of investigation needs to happen, but ideally we should be
> able to integrate with the sparkSQL catalog and reuse all the tables there.
>
>
>
>
> I am sure there are some gaps in my thinking. Just starting this thread,
> so we can discuss and others can chime in/correct me.
>
> thanks
> vinoth
