Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2021-01-21 Thread pzwpzw


Thanks, vino yang! I have moved the doc to RFC-25. We can continue the
discussion there.

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2021-01-21 Thread vino yang

Hi zhiwei,

Done! Now, you should have cwiki permission.

Best,
Vino

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2021-01-21 Thread pzwpzw


That is great! Can you give me permission for the cwiki? My cwiki id is:
zhiwei.
I will move it there and continue the discussion.

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2021-01-21 Thread Gary Li

Hi pengzhiwei,

Thanks for the proposal. That's a great feature. Can we move the design doc to
a cwiki page as a new RFC? We can continue the discussion from there.

Thanks,

Best Regards,
Gary Li


Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2021-01-20 Thread pzwpzw

Hi, we have implemented the Spark SQL extension for Hudi in our internal
version. The document below describes the main implementation, including the
extended SQL syntax and the implementation scheme on Spark. I am waiting for
your feedback. Any comments are welcome.

https://docs.google.com/document/d/1KC6Rae67CUaCUpKoIAkM6OTAGuOWFPD9qtNfqchAl1o/edit#heading=h.oeoy1y14sifu

On December 23, 2020, at 12:30 AM, Vinoth Chandar wrote:

Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think.
Would love to have you involved.

On Tue, Dec 22, 2020 at 3:20 AM pzwpzw wrote:

Yes, it looks good.
We are building Spark SQL extensions to support Hudi in our internal version.
I am interested in participating in the extension of SparkSQL on Hudi.

On December 22, 2020, at 4:30 PM, Vinoth Chandar wrote:

Hi,

I think what we are finally landing on is:

- Keep pushing for SparkSQL support using the Spark extensions route
- The Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me if I got this wrong.

On Mon, Dec 21, 2020 at 3:30 AM pzwpzw wrote:

Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
that processes engine-independent SQL, for example most of the DDL and the
Hoodie CLI commands, and that also provides a parser for the common SQL
extensions (e.g. MERGE INTO). Engine-specific syntax can be left to the
respective engines to process. If the common SQL layer can handle the input
SQL, it handles it; otherwise the statement is routed to the engine for
processing. In the long term, the common layer will become richer and more
complete.
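
The routing described above (the common layer handles a statement when it
can, and otherwise passes it to the engine) can be sketched in plain Java.
This is a minimal illustration under assumptions, not Hudi or Calcite code:
`EngineSqlHandler`, `CommonSqlLayer`, `SqlRouter`, and the list of statement
prefixes are all hypothetical, and a real common layer would use a Calcite
parser rather than prefix matching.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical handler for one execution engine (for example Spark or Flink). */
interface EngineSqlHandler {
    String execute(String sql);
}

/** Hypothetical common SQL layer that handles engine-independent statements. */
final class CommonSqlLayer {
    // Statement prefixes treated as engine independent in this sketch (an assumption).
    private static final List<String> COMMON_PREFIXES =
            List.of("CREATE TABLE", "DROP TABLE", "ALTER TABLE", "SHOW COMMITS");

    /** Returns a result if the common layer can handle the statement, otherwise empty. */
    Optional<String> tryHandle(String sql) {
        String normalized = sql.trim().toUpperCase(Locale.ROOT);
        for (String prefix : COMMON_PREFIXES) {
            if (normalized.startsWith(prefix)) {
                return Optional.of("common-layer:" + prefix);
            }
        }
        return Optional.empty();
    }
}

/** Tries the common layer first and routes everything else to the engine. */
final class SqlRouter {
    private final CommonSqlLayer commonLayer;
    private final EngineSqlHandler engine;

    SqlRouter(CommonSqlLayer commonLayer, EngineSqlHandler engine) {
        this.commonLayer = commonLayer;
        this.engine = engine;
    }

    String run(String sql) {
        // The engine is only consulted when the common layer declines the statement.
        return commonLayer.tryHandle(sql).orElseGet(() -> engine.execute(sql));
    }
}
```

With an engine handler such as `sql -> "engine:" + sql`, a call to
`run("create table t (id int)")` is answered by the common layer, while
`run("SELECT * FROM t")` falls through to the engine.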

On December 21, 2020, at 4:38 PM, 受春柏 wrote:

Hi all,

That's very good. Hudi SQL syntax can then support Flink, Hive, and other
analysis components at the same time.
But there are some questions about SparkSQL. SparkSQL syntax conflicts with
Calcite syntax. Is our strategy user migration or syntax compatibility?
In addition, will it also support write SQL?

At 2020-12-19 02:10:16, "Nishith" wrote:

That's awesome. Looks like we have a consensus on Calcite. Look forward to
the RFC as well!

-Nishith

On Dec 18, 2020, at 9:03 AM, Vinoth Chandar wrote:

Sounds good. Look forward to an RFC/DISCUSS thread.

Thanks
Vinoth

On Thu, Dec 17, 2020 at 6:04 PM Danny Chan wrote:

Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I will
add support for SQL connectors for Hoodie Flink soon ~
Currently, I'm preparing a refactoring of the current Flink writer code.

On Friday, December 18, 2020, at 6:39 AM, Vinoth Chandar wrote:

Thanks Kabeer for the note on gmail. Did not realize that. :)

My desired use case is that users use the Hoodie CLI to execute these SQLs.
They can choose which engine to use through a CLI config option.

Yes, that is also another attractive aspect of this route. We can build out
a common SQL layer and have this translate to the underlying engine (sounds
like Hive, huh).
Longer term, if we really think we can more easily implement a full DML +
DDL + DQL, we can proceed with this.

As others pointed out, for Spark SQL, it might be good to try the Spark
extensions route before we take this on more fully.

The other part where Calcite is great is all the support for
windowing/streaming in its syntax.
Danny, I guess we should be able to leverage that through a deeper
Flink/Hudi integration?

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar wrote:

I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.

On Tue, Dec 15, 2020 at 1:38 AM pzwpzw wrote:

Yeah, I agree with Nishith that one option is to look at the ways to plug in
custom logical and physical plans in Spark. It can simplify the
implementation and reuse the Spark SQL syntax. Also, users familiar with
Spark SQL will be able to use Hudi's SQL features more quickly.

In fact, Spark provides the SparkSessionExtensions interface for
implementing custom syntax extensions and SQL rewrite rules.

https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html

We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such as
MERGE INTO and DDL syntax.
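
Spark's `SparkSessionExtensions.injectParser` hook hands the extension the
session's existing parser, so a custom parser can recognize the new
statements and delegate everything else. The plain-Java sketch below models
only that delegation pattern, with no Spark dependency. `SqlParser` and
`HoodieExtendedParser` are hypothetical names, not Spark or Hudi classes, and
the returned strings stand in for logical plans.

```java
/** Hypothetical stand-in for an engine's SQL parser; not a Spark interface. */
interface SqlParser {
    String parsePlan(String sqlText);
}

/** Hypothetical parser that recognizes MERGE INTO and delegates all other SQL. */
final class HoodieExtendedParser implements SqlParser {
    private final SqlParser delegate;

    HoodieExtendedParser(SqlParser delegate) {
        this.delegate = delegate;
    }

    @Override
    public String parsePlan(String sqlText) {
        String[] tokens = sqlText.trim().split("\\s+");
        boolean isMergeInto = tokens.length >= 3
                && tokens[0].equalsIgnoreCase("MERGE")
                && tokens[1].equalsIgnoreCase("INTO");
        if (isMergeInto) {
            // A real extension would build a logical plan node here;
            // this sketch only labels the statement with its target table.
            return "MergeIntoPlan(target=" + tokens[2] + ")";
        }
        // Anything the extension does not recognize goes to the wrapped parser.
        return delegate.parsePlan(sqlText);
    }
}
```

Wrapping a base parser such as `sql -> "EnginePlan(" + sql + ")"` means that
`MERGE INTO h0 USING s0 ON ...` is handled by the extension, while ordinary
queries still reach the engine's own parser unchanged.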

On December 15, 2020, at 3:27 PM, Nishith wrote:

Thanks for starting this thread, Vinoth.

In general, I definitely see the need for SQL-style semantics on Hudi
tables. Apache Calcite is a great option to consider, given that
DatasourceV2 has the limitations that you described.

Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
same SQL semantics need to be supported in other engines like Flink to
provide the same experience to users, which in itself could also be a
considerable amount of work.

So, if we’re able to generalize on the SQL story

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-28 Thread wei li

First, I think it is necessary to improve Spark SQL support, because the main
scenario for Hudi is the data lake or warehouse, and Spark has a strong
ecosystem in this field.

Second, in the long run, Hudi needs a more general SQL layer, and it is very
necessary to embrace Calcite. Then, based on Hudi, a powerful data management
and processing service can be built.

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-22 Thread Danny Chan

That's great, I can help with the Apache Calcite integration.

Vinoth Chandar  于2020年12月23日周三 上午12:29写道：

> Sounds great. There will be a RFC/DISCUSS thread once 0.7.0 is out I think.
> love to have you involved.
>
> On Tue, Dec 22, 2020 at 3:20 AM pzwpzw 
> wrote:
>
> > Yes, it looks good .
> > We are building the spark sql extensions to support for hudi in
> > our internal version.
> > I am interested in participating in the extension of SparkSQL on hudi.
> > 2020年12月22日 下午4:30，Vinoth Chandar  写道：
> >
> > Hi,
> >
> > I think what we are landing on finally is.
> >
> > - Keep pushing for SparkSQL support using Spark extensions route
> > - Calcite effort will be a separate/orthogonal approach, down the line
> >
> > Please feel free to correct me, if I got this wrong.
> >
> > On Mon, Dec 21, 2020 at 3:30 AM pzwpzw  .invalid>
> > wrote:
> >
> > Hi 受春柏 ，here is my point. We can use Calcite to build a common sql layer
> >
> > to process engine independent SQL, for example most of the DDL、Hoodie CLI
> >
> > command and also provide parser for the common SQL extensions(e.g. Merge
> >
> > Into). The Engine-related syntax can be taught to the respective engines
> to
> >
> > process. If the common sql layer can handle the input sql, it handle
> >
> > it.Otherwise it is routed to the engine for processing. In long term, the
> >
> > common layer will more and more rich and perfect.
> >
> > 2020年12月21日 下午4:38，受春柏  写道：
> >
> >
> > Hi,all
> >
> >
> >
> > That's very good,Hudi SQL syntax can support Flink、hive and other
> analysis
> >
> > components at the same time,
> >
> > But there are some questions about SparkSQL. SparkSQL syntax is in
> >
> > conflict with Calctite syntax.Is our strategy
> >
> > user migration or syntax compatibility?
> >
> > In addition ，will it also support write SQL?
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > 在 2020-12-19 02:10:16，"Nishith"  写道：
> >
> >
> > That’s awesome. Looks like we have a consensus on Calcite. Look forward
> to
> >
> > the RFC as well!
> >
> >
> >
> > -Nishith
> >
> >
> >
> > On Dec 18, 2020, at 9:03 AM, Vinoth Chandar  wrote:
> >
> >
> >
> > Sounds good. Look forward to a RFC/DISCUSS thread.
> >
> >
> >
> > Thanks
> >
> >
> > Vinoth
> >
> >
> >
> > On Thu, Dec 17, 2020 at 6:04 PM Danny Chan  wrote:
> >
> >
> >
> > Yes, Apache Flink basically reuse the DQL syntax of Apache Calcite, i
> would
> >
> >
> > add support for SQL connectors of Hoodie Flink soon ~
> >
> >
> > Currently, i'm preparing a refactoring to the current Flink writer code.
> >
> >
> >
> > On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar wrote:
> >
> > Thanks Kabeer for the note on gmail. Did not realize that. :)
> >
> > My desired use case is for users to use the Hoodie CLI to execute these
> > SQL statements. They can choose which engine to use via a CLI config
> > option.
> >
> > Yes, that is also another attractive aspect of this route. We can build out
> > a common SQL layer and have this translate to the underlying engine (sounds
> > like Hive, huh).
> > Longer term, if we really think we can more easily implement a full DML +
> > DDL + DQL, we can proceed with this.
> >
> > As others pointed out, for Spark SQL, it might be good to try the Spark
> > extensions route before we take this on more fully.
> >
> > The other part where Calcite is great is all the support for
> > windowing/streaming in its syntax.
> > Danny, I guess we should be able to leverage that through a deeper
> > Flink/Hudi integration?
> >
> > On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar wrote:
> >
> > I think Dongwook is investigating along the same lines, and it does seem
> > better to pursue this first, before trying other approaches.
> >
> > On Tue, Dec 15, 2020 at 1:38 AM pzwpzw wrote:
> >
> > Yeah, I agree with Nishith that one option is to look at ways to plug in
> > custom logical and physical plans in Spark. It can simplify the
> > implementation and reuse the Spark SQL syntax. Also, users familiar with
> > Spark SQL will be able to use Hudi's SQL features more quickly.
> > In fact, Spark provides the SparkSessionExtensions interface for
> > implementing custom syntax extensions and SQL rewrite rules:
> >
> > https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> >
> > We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such
> > as MERGE INTO and DDL syntax.
> >
> > On Dec 15, 2020, at 3:27 PM, Nishith wrote:
> >
> > Thanks for starting this thread, Vinoth.
> > In general, I definitely see the need for SQL-style semantics on Hudi
> > tables. Apac

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-22 Thread Vinoth Chandar

Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think.
Would love to have you involved.

On Tue, Dec 22, 2020 at 3:20 AM pzwpzw 
wrote:

> Yes, it looks good.
> We are building Spark SQL extensions to support Hudi in
> our internal version.
> I am interested in participating in the extension of SparkSQL on Hudi.
>
> On Dec 22, 2020, at 4:30 PM, Vinoth Chandar wrote:
>
> Hi,
>
> I think what we are landing on finally is.
>
> - Keep pushing for SparkSQL support using Spark extensions route
> - Calcite effort will be a separate/orthogonal approach, down the line
>
> Please feel free to correct me, if I got this wrong.
>
> On Mon, Dec 21, 2020 at 3:30 AM pzwpzw wrote:
>
> Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
> to process engine-independent SQL, for example most of the DDL and the
> Hoodie CLI commands, and also to provide a parser for the common SQL
> extensions (e.g. MERGE INTO). The engine-related syntax can be handed to
> the respective engines to process. If the common SQL layer can handle the
> input SQL, it handles it; otherwise the SQL is routed to the engine for
> processing. In the long term, the common layer will become richer and more
> complete.
>
> On Dec 21, 2020, at 4:38 PM, 受春柏 wrote:
>
> Hi all,
>
> That's very good. Hudi SQL syntax can support Flink, Hive, and other
> analysis components at the same time.
> But there are some questions about SparkSQL. SparkSQL syntax conflicts
> with Calcite syntax. Is our strategy user migration or syntax
> compatibility?
> In addition, will it also support write SQL?
>
> At 2020-12-19 02:10:16, "Nishith" wrote:
>
> That’s awesome. Looks like we have a consensus on Calcite. Look forward to
> the RFC as well!
>
> -Nishith
>
> On Dec 18, 2020, at 9:03 AM, Vinoth Chandar wrote:
>
> Sounds good. Look forward to a RFC/DISCUSS thread.
>
> Thanks
> Vinoth
>
> On Thu, Dec 17, 2020 at 6:04 PM Danny Chan wrote:
>
> Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I will
> add support for SQL connectors for Hoodie Flink soon.
> Currently, I'm preparing a refactoring of the current Flink writer code.
>
> On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar wrote:
>
> Thanks Kabeer for the note on gmail. Did not realize that. :)
>
> My desired use case is for users to use the Hoodie CLI to execute these
> SQL statements. They can choose which engine to use via a CLI config
> option.
>
> Yes, that is also another attractive aspect of this route. We can build out
> a common SQL layer and have this translate to the underlying engine (sounds
> like Hive, huh).
> Longer term, if we really think we can more easily implement a full DML +
> DDL + DQL, we can proceed with this.
>
> As others pointed out, for Spark SQL, it might be good to try the Spark
> extensions route before we take this on more fully.
>
> The other part where Calcite is great is all the support for
> windowing/streaming in its syntax.
> Danny, I guess we should be able to leverage that through a deeper
> Flink/Hudi integration?
>
> On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar wrote:
>
> I think Dongwook is investigating along the same lines, and it does seem
> better to pursue this first, before trying other approaches.
>
> On Tue, Dec 15, 2020 at 1:38 AM pzwpzw wrote:
>
> Yeah, I agree with Nishith that one option is to look at ways to plug in
> custom logical and physical plans in Spark. It can simplify the
> implementation and reuse the Spark SQL syntax. Also, users familiar with
> Spark SQL will be able to use Hudi's SQL features more quickly.
> In fact, Spark provides the SparkSessionExtensions interface for
> implementing custom syntax extensions and SQL rewrite rules:
>
> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
>
> We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such
> as MERGE INTO and DDL syntax.
>
> On Dec 15, 2020, at 3:27 PM, Nishith wrote:
>
> Thanks for starting this thread, Vinoth.
> In general, I definitely see the need for SQL-style semantics on Hudi
> tables. Apache Calcite is a great option to consider, given that
> DatasourceV2 has the limitations that you described.
>
> Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
> same SQL semantics need to be supported in other engines like Flink to
> provide the same experience to users, which in itself could also be a
> considerable amount of work.
> So, if we’re able to generalize the SQL story along Calcite, that would
> help reduce redundant work in some sense.
> Although, I’m worried about a few things:
>
> 1) Like you pointed out, writing complex user jobs using Spark SQL

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-22 Thread pzwpzw

Yes, it looks good.
We are building Spark SQL extensions to support Hudi in our internal
version.
I am interested in participating in the extension of SparkSQL on Hudi.

On Dec 22, 2020, at 4:30 PM, Vinoth Chandar wrote:

Hi,

I think what we are landing on finally is.

- Keep pushing for SparkSQL support using Spark extensions route
- Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me, if I got this wrong.

On Mon, Dec 21, 2020 at 3:30 AM pzwpzw
wrote:

Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
to process engine-independent SQL, for example most of the DDL and the
Hoodie CLI commands, and also to provide a parser for the common SQL
extensions (e.g. MERGE INTO). The engine-related syntax can be handed to
the respective engines to process. If the common SQL layer can handle the
input SQL, it handles it; otherwise the SQL is routed to the engine for
processing. In the long term, the common layer will become richer and more
complete.

On Dec 21, 2020, at 4:38 PM, 受春柏 wrote:

Hi all,

That's very good. Hudi SQL syntax can support Flink, Hive, and other
analysis components at the same time.
But there are some questions about SparkSQL. SparkSQL syntax conflicts
with Calcite syntax. Is our strategy user migration or syntax
compatibility?
In addition, will it also support write SQL?

At 2020-12-19 02:10:16, "Nishith" wrote:

That’s awesome. Looks like we have a consensus on Calcite. Look forward to
the RFC as well!

-Nishith

On Dec 18, 2020, at 9:03 AM, Vinoth Chandar wrote:

Sounds good. Look forward to a RFC/DISCUSS thread.

Thanks

Vinoth

On Thu, Dec 17, 2020 at 6:04 PM Danny Chan wrote:

Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I will
add support for SQL connectors for Hoodie Flink soon.
Currently, I'm preparing a refactoring of the current Flink writer code.

On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar wrote:

Thanks Kabeer for the note on gmail. Did not realize that. :)

My desired use case is for users to use the Hoodie CLI to execute these
SQL statements. They can choose which engine to use via a CLI config
option.

Yes, that is also another attractive aspect of this route. We can build out
a common SQL layer and have this translate to the underlying engine (sounds
like Hive, huh).
Longer term, if we really think we can more easily implement a full DML +
DDL + DQL, we can proceed with this.

As others pointed out, for Spark SQL, it might be good to try the Spark
extensions route before we take this on more fully.

The other part where Calcite is great is all the support for
windowing/streaming in its syntax.
Danny, I guess we should be able to leverage that through a deeper
Flink/Hudi integration?

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar wrote:

I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.

On Tue, Dec 15, 2020 at 1:38 AM pzwpzw wrote:

Yeah, I agree with Nishith that one option is to look at ways to plug in
custom logical and physical plans in Spark. It can simplify the
implementation and reuse the Spark SQL syntax. Also, users familiar with
Spark SQL will be able to use Hudi's SQL features more quickly.
In fact, Spark provides the SparkSessionExtensions interface for
implementing custom syntax extensions and SQL rewrite rules:

https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html

We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such
as MERGE INTO and DDL syntax.

On Dec 15, 2020, at 3:27 PM, Nishith wrote:

Thanks for starting this thread, Vinoth.
In general, I definitely see the need for SQL-style semantics on Hudi
tables. Apache Calcite is a great option to consider, given that
DatasourceV2 has the limitations that you described.

Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
same SQL semantics need to be supported in other engines like Flink to
provide the same experience to users, which in itself could also be a
considerable amount of work.
So, if we’re able to generalize the SQL story along Calcite, that would
help reduce redundant work in some sense.
Although, I’m worried about a few things:

1) Like you pointed out, writing complex user jobs using Spark SQL syntax
can be harder for users who are moving from “Hudi syntax” to “Spark syntax”
for cross-table joins, merges, etc. using data frames. One option is to look
at whether there are ways to plug in custom logical and physical plans in
Spark. This way, although the merge-on-SparkSQL functionality may not be as
simple to use, it wouldn’t take away performance and feature set for
starters. In the future we could think of having the entire query space be
powered by Calcite, like you mentioned.
2) If we

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-22 Thread Vinoth Chandar

Hi,

I think what we are landing on finally is.

- Keep pushing for SparkSQL support using Spark extensions route
- Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me, if I got this wrong.

On Mon, Dec 21, 2020 at 3:30 AM pzwpzw 
wrote:

> Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
> to process engine-independent SQL, for example most of the DDL and the
> Hoodie CLI commands, and also to provide a parser for the common SQL
> extensions (e.g. MERGE INTO). The engine-related syntax can be handed to
> the respective engines to process. If the common SQL layer can handle the
> input SQL, it handles it; otherwise the SQL is routed to the engine for
> processing. In the long term, the common layer will become richer and more
> complete.
>
> On Dec 21, 2020, at 4:38 PM, 受春柏 wrote:
>
> Hi all,
>
> That's very good. Hudi SQL syntax can support Flink, Hive, and other
> analysis components at the same time.
> But there are some questions about SparkSQL. SparkSQL syntax conflicts
> with Calcite syntax. Is our strategy user migration or syntax
> compatibility?
> In addition, will it also support write SQL?
>
> At 2020-12-19 02:10:16, "Nishith" wrote:
>
> That’s awesome. Looks like we have a consensus on Calcite. Look forward to
> the RFC as well!
>
> -Nishith
>
> On Dec 18, 2020, at 9:03 AM, Vinoth Chandar wrote:
>
> Sounds good. Look forward to a RFC/DISCUSS thread.
>
> Thanks
> Vinoth
>
> On Thu, Dec 17, 2020 at 6:04 PM Danny Chan wrote:
>
> Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I will
> add support for SQL connectors for Hoodie Flink soon.
> Currently, I'm preparing a refactoring of the current Flink writer code.
>
> On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar wrote:
>
> Thanks Kabeer for the note on gmail. Did not realize that. :)
>
> My desired use case is for users to use the Hoodie CLI to execute these
> SQL statements. They can choose which engine to use via a CLI config
> option.
>
> Yes, that is also another attractive aspect of this route. We can build out
> a common SQL layer and have this translate to the underlying engine (sounds
> like Hive, huh).
> Longer term, if we really think we can more easily implement a full DML +
> DDL + DQL, we can proceed with this.
>
> As others pointed out, for Spark SQL, it might be good to try the Spark
> extensions route before we take this on more fully.
>
> The other part where Calcite is great is all the support for
> windowing/streaming in its syntax.
> Danny, I guess we should be able to leverage that through a deeper
> Flink/Hudi integration?
>
> On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar wrote:
>
> I think Dongwook is investigating along the same lines, and it does seem
> better to pursue this first, before trying other approaches.
>
> On Tue, Dec 15, 2020 at 1:38 AM pzwpzw wrote:
>
> Yeah, I agree with Nishith that one option is to look at ways to plug in
> custom logical and physical plans in Spark. It can simplify the
> implementation and reuse the Spark SQL syntax. Also, users familiar with
> Spark SQL will be able to use Hudi's SQL features more quickly.
> In fact, Spark provides the SparkSessionExtensions interface for
> implementing custom syntax extensions and SQL rewrite rules:
>
> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
>
> We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such
> as MERGE INTO and DDL syntax.
>
> On Dec 15, 2020, at 3:27 PM, Nishith wrote:
>
> Thanks for starting this thread, Vinoth.
> In general, I definitely see the need for SQL-style semantics on Hudi
> tables. Apache Calcite is a great option to consider, given that
> DatasourceV2 has the limitations that you described.
>
> Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
> same SQL semantics need to be supported in other engines like Flink to
> provide the same experience to users, which in itself could also be a
> considerable amount of work.
> So, if we’re able to generalize the SQL story along Calcite, that would
> help reduce redundant work in some sense.
> Although, I’m worried about a few things:
>
> 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
> can be harder for users who are moving from “Hudi syntax” to “Spark syntax”
> for cross-table joins, merges, etc. using data frames. One option is to look
> at whether there are ways to plug in custom logical and physical plans in
> Spark. This way, although the merge-on-SparkSQL functionality may not be as
> simple to use, it wouldn’t take away performance and feature set for
> starters. In the future we could think of having the entire query space be
> powered by Calcite, like you mentioned.
> 2) If we continue t

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-21 Thread pzwpzw

Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer to
process engine-independent SQL, for example most of the DDL and the Hoodie CLI
commands, and also to provide a parser for the common SQL extensions (e.g.
MERGE INTO). The engine-related syntax can be handed to the respective engines
to process. If the common SQL layer can handle the input SQL, it handles it;
otherwise the SQL is routed to the engine for processing. In the long term,
the common layer will become richer and more complete.
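
A minimal sketch of the "handle it in the common layer, otherwise route it to the engine" idea described above. Everything here is an assumption made for illustration: the names `COMMON_PREFIXES`, `common_layer_can_handle`, and `route` are hypothetical, and the simple prefix check only stands in for whatever a real Calcite-based parser would decide. It is not Hudi or Calcite code.

```python
from typing import Callable, Dict

# Hypothetical list of statement prefixes owned by the engine-independent layer.
# A real implementation would ask the parser instead of matching prefixes.
COMMON_PREFIXES = ("CREATE TABLE", "DROP TABLE", "ALTER TABLE", "MERGE INTO")


def common_layer_can_handle(sql: str) -> bool:
    """Return True if the statement starts with a prefix the common layer owns."""
    normalized = " ".join(sql.strip().upper().split())
    return normalized.startswith(COMMON_PREFIXES)


def route(sql: str, engine: str, engines: Dict[str, Callable[[str], str]]) -> str:
    """Handle the statement in the common layer if possible, else pass it to the engine."""
    if common_layer_can_handle(sql):
        return f"common-layer: {' '.join(sql.split())}"
    # Fallback: the statement is engine-specific, so the chosen engine processes it.
    return engines[engine](sql)


# Stand-in engine handlers; real ones would submit the SQL to Spark or Flink.
engines = {
    "spark": lambda sql: f"spark: {sql}",
    "flink": lambda sql: f"flink: {sql}",
}
```

With this shape, growing the common layer over time only means extending what `common_layer_can_handle` accepts; the engine fallback stays unchanged.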

On Dec 21, 2020, at 4:38 PM, 受春柏 wrote:

Hi all,

That's very good. Hudi SQL syntax can support Flink, Hive, and other analysis
components at the same time.
But there are some questions about SparkSQL. SparkSQL syntax conflicts with
Calcite syntax. Is our strategy user migration or syntax compatibility?
In addition, will it also support write SQL?

At 2020-12-19 02:10:16, "Nishith" wrote:

That’s awesome. Looks like we have a consensus on Calcite. Look forward to the
RFC as well!

-Nishith

On Dec 18, 2020, at 9:03 AM, Vinoth Chandar wrote:

Sounds good. Look forward to a RFC/DISCUSS thread.

Thanks
Vinoth

On Thu, Dec 17, 2020 at 6:04 PM Danny Chan wrote:

Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I will
add support for SQL connectors for Hoodie Flink soon.
Currently, I'm preparing a refactoring of the current Flink writer code.

On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar wrote:

Thanks Kabeer for the note on gmail. Did not realize that. :)

My desired use case is for users to use the Hoodie CLI to execute these
SQL statements. They can choose which engine to use via a CLI config
option.

Yes, that is also another attractive aspect of this route. We can build out
a common SQL layer and have this translate to the underlying engine (sounds
like Hive, huh).
Longer term, if we really think we can more easily implement a full DML +
DDL + DQL, we can proceed with this.

As others pointed out, for Spark SQL, it might be good to try the Spark
extensions route before we take this on more fully.

The other part where Calcite is great is all the support for
windowing/streaming in its syntax.
Danny, I guess we should be able to leverage that through a deeper
Flink/Hudi integration?

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar
wrote:

I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.

On Tue, Dec 15, 2020 at 1:38 AM pzwpzw
wrote:

Yeah, I agree with Nishith that one option is to look at ways to plug in
custom logical and physical plans in Spark. It can simplify the
implementation and reuse the Spark SQL syntax. Also, users familiar with
Spark SQL will be able to use Hudi's SQL features more quickly.
In fact, Spark provides the SparkSessionExtensions interface for
implementing custom syntax extensions and SQL rewrite rules:

https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html

We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such
as MERGE INTO and DDL syntax.

On Dec 15, 2020, at 3:27 PM, Nishith wrote:

Thanks for starting this thread, Vinoth.
In general, I definitely see the need for SQL-style semantics on Hudi
tables. Apache Calcite is a great option to consider, given that
DatasourceV2 has the limitations that you described.

Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
same SQL semantics need to be supported in other engines like Flink to
provide the same experience to users, which in itself could also be a
considerable amount of work.
So, if we’re able to generalize the SQL story along Calcite, that would
help reduce redundant work in some sense.
Although, I’m worried about a few things:

1) Like you pointed out, writing complex user jobs using Spark SQL syntax
can be harder for users who are moving from “Hudi syntax” to “Spark syntax”
for cross-table joins, merges, etc. using data frames. One option is to look
at whether there are ways to plug in custom logical and physical plans in
Spark. This way, although the merge-on-SparkSQL functionality may not be as
simple to use, it wouldn’t take away performance and feature set for
starters. In the future we could think of having the entire query space be
powered by Calcite, like you mentioned.
2) If we continue to use DatasourceV1, is there any downside to this from
a performance and optimization perspective when executing the plan? I’m
guessing not, but haven’t delved into the code to see if there’s anything
different apart from the API and spec.

On Dec 14, 2020, at 11:06 PM, Vinoth Chandar
wrote:

Hello all,

Just bumping this thread again

thanks

vinoth

On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar
wrote:

Hello all,

One feature that keeps coming up is the ability to use UPDATE, MERGE SQL
syntax to support writing into Hudi tables. We have looked into the Spark 3
DataSource V2 APIs as well and found several issues that hinder us in
implement
