Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2021-01-21 Thread pzwpzw

Thanks, vino yang! I have moved the doc to RFC-25. We can continue the
discussion there.

On January 22, 2021 at 9:27 AM, vino yang wrote:


Hi zhiwei,

Done! Now, you should have cwiki permission.

Best,
Vino

On Friday, January 22, 2021 at 12:06 AM, pzwpzw wrote:


That is great! Can you give me permission to the cwiki? My cwiki id
is: zhiwei .
I will move it there and continue the discussion.


On January 21, 2021 at 11:19 PM, Gary Li wrote:


Hi pengzhiwei,


Thanks for the proposal. That’s a great feature. Can we move the design
doc to the cwiki page as a new RFC? We can continue the discussion from there.


Thanks,

Best Regards,
Gary Li

From: pzwpzw 
Reply-To: "dev@hudi.apache.org" 
Date: Wednesday, January 20, 2021 at 11:52 PM
To: "dev@hudi.apache.org" 
Cc: "dev@hudi.apache.org" 
Subject: Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite


Hi, we have implemented the Spark SQL extension for Hudi in our internal
version. Here is the main implementation, including the extended SQL syntax
and the implementation scheme on Spark. I look forward to your feedback.
Any comments are welcome!

https://docs.google.com/document/d/1KC6Rae67CUaCUpKoIAkM6OTAGuOWFPD9qtNfqchAl1o/edit#heading=h.oeoy1y14sifu
On December 23, 2020 at 12:30 AM, Vinoth Chandar wrote:
Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think.
Would love to have you involved.


On Tue, Dec 22, 2020 at 3:20 AM pzwpzw wrote:




Yes, it looks good.
We are building the Spark SQL extensions to support Hudi in our internal
version.
I am interested in participating in the extension of Spark SQL on Hudi.
On December 22, 2020 at 4:30 PM, Vinoth Chandar wrote:


Hi,


I think what we are finally landing on is:

- Keep pushing for Spark SQL support using the Spark extensions route
- The Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me if I got this wrong.


On Mon, Dec 21, 2020 at 3:30 AM pzwpzw wrote:


Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
to process engine-independent SQL (for example, most of the DDL and Hoodie CLI
commands) and also provide a parser for the common SQL extensions (e.g. MERGE
INTO). Engine-specific syntax can be handed to the respective engine to
process. If the common SQL layer can handle the input SQL, it handles it;
otherwise the statement is routed to the engine for processing. In the long
term, the common layer will become richer and more complete.
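The routing idea described above can be sketched as a small dispatcher: the common layer claims the statements it understands (most DDL, the MERGE INTO extension) and everything else falls through to the configured engine. The sketch below is a toy illustration in Python; the statement classification and engine names are invented for the example and are not Hudi APIs.

```python
import re

# Statement prefixes the hypothetical common layer understands.
COMMON_PREFIXES = ("CREATE TABLE", "DROP TABLE", "ALTER TABLE", "MERGE INTO")

def route(sql: str, engine: str) -> str:
    """Return which component should process `sql`.

    The common layer handles engine-independent statements (most DDL and
    the MERGE INTO extension); anything else is routed to the configured
    engine (e.g. Spark or Flink) for processing.
    """
    normalized = re.sub(r"\s+", " ", sql).strip().upper()
    if normalized.startswith(COMMON_PREFIXES):
        return "common-layer"
    return engine

print(route("MERGE INTO t USING s ON t.id = s.id", "spark"))  # common-layer
print(route("SELECT * FROM t", "flink"))                      # flink
```

The fallback is what keeps the two layers loosely coupled: the common layer never needs to understand every engine dialect, only the statements it owns.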


On December 21, 2020 at 4:38 PM, 受春柏 wrote:

Hi all,

That's very good: Hudi SQL syntax could support Flink, Hive, and other
analysis components at the same time.
But there are some questions about Spark SQL. Spark SQL syntax is in
conflict with Calcite syntax. Is our strategy user migration or syntax
compatibility?
In addition, will it also support write SQL?
On 2020-12-19 02:10:16, Nishith wrote:

That’s awesome. Looks like we have a consensus on Calcite. Looking forward
to the RFC as well!

-Nishith
On Dec 18, 2020, at 9:03 AM, Vinoth Chandar wrote:

Sounds good. Looking forward to an RFC/DISCUSS thread.

Thanks,
Vinoth
On Thu, Dec 17, 2020 at 6:04 PM Danny Chan wrote:

Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite; I would
add support for SQL connectors for Hudi on Flink soon.
Currently, I'm preparing a refactoring of the current Flink writer code.

On Friday, December 18, 2020 at 6:39 AM, Vinoth Chandar wrote:






Thanks Kabeer for the note on gmail. Did not realize that. :)

> My desired use case is users using the Hoodie CLI to execute these SQLs.
> They can choose which engine to use by a CLI config option.

Yes, that is also another attractive aspect of this route. We can build out
a common SQL layer and have it translate to the underlying engine (sounds
like Hive, huh?).
Longer term, if we really think we can more easily implement a full DML +
DDL + DQL, we can proceed with this.

As others pointed out, for Spark SQL it might be good to try the Spark
extensions route before we take this on more fully.

The other part where Calcite is great is all the support for
windowing/streaming in its syntax.
Danny, I guess we should be able to leverage that through a deeper
Flink/Hudi integration?
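The CLI idea mentioned in the quote (execute SQL through the Hoodie CLI and pick the execution engine via a config option) could look roughly like the following sketch. The command name, flag name, and engine list are invented for illustration; they are not actual Hoodie CLI options.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """A toy 'hoodie-sql' command: the SQL text is positional and the
    engine that should execute it is chosen by a config option."""
    parser = argparse.ArgumentParser(prog="hoodie-sql")
    parser.add_argument("sql", help="SQL statement to execute")
    parser.add_argument(
        "--engine", choices=["spark", "flink"], default="spark",
        help="engine used for statements the common layer cannot handle")
    return parser

# Example invocation: hoodie-sql --engine flink "SELECT * FROM t"
args = build_parser().parse_args(["--engine", "flink", "SELECT * FROM t"])
print(args.engine, "<-", args.sql)  # flink <- SELECT * FROM t
```

A single flag like this keeps the user-facing surface identical across engines, which is exactly what makes the common-layer approach attractive.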








On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar wrote:






I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.
On Tue, Dec 15, 2020 at 1:38 AM pzwpzw wrote:






Yeah, I agree with Nishith that one option is to look at ways to plug
custom logical and physical plans into Spark. It can simplify the
implementation and reuse the Spark SQL syntax. Also, users familiar with
Spark SQL will be able to use Hudi's SQL features more quickly.
In fact, Spark provides the SparkSessionExtensions interface for
implementing custom syntax extensions and SQL rewrite rules:

https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html

We can use SparkSessionExtensions to add extended Hudi SQL syntax such as
MERGE INTO and DDL statements.

On December 15, 2020 at 3:27 PM, Nishith wrote:

Thanks for starting this thread, Vinoth.
In general, I definitely see the need for SQL-style semantics on Hudi
tables. Apache Calcite is a great option to consider, given that
DataSourceV2 has the limitations you described.

Additionally, even if Spark DataSourceV2 allowed for the flexibility, the
same SQL semantics would need to be supported in other engines like Flink
to provide the same experience to users, which in itself could also be a
considerable amount of work.
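The extension mechanism works by injecting a parser that recognizes the extended statements itself and delegates everything else to Spark's default parser. A real implementation would be Scala code registered via `SparkSessionExtensions.injectParser`; the toy below only illustrates the delegation pattern in plain Python, and all class and plan names are invented for the sketch.

```python
class DefaultParser:
    """Stand-in for Spark's built-in SQL parser."""
    def parse(self, sql: str) -> str:
        return f"spark-plan({sql})"

class HudiExtensionParser:
    """Handles the extended MERGE INTO syntax itself and delegates every
    other statement to the wrapped default parser, mirroring how an
    injected parser extension typically behaves."""
    def __init__(self, delegate: DefaultParser):
        self.delegate = delegate

    def parse(self, sql: str) -> str:
        if sql.strip().upper().startswith("MERGE INTO"):
            return f"hudi-merge-plan({sql})"
        return self.delegate.parse(sql)

parser = HudiExtensionParser(DefaultParser())
print(parser.parse("MERGE INTO t USING s ON t.id = s.id"))
print(parser.parse("SELECT 1"))  # spark-plan(SELECT 1)
```

Because unrecognized statements fall through to the delegate, the extension stays compatible with all existing Spark SQL, which is why this route avoids the syntax-conflict concern raised earlier in the thread.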







Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-28 Thread wei li
First, I think it is necessary to improve the Spark SQL support, because the
main scenario for Hudi is the data lake or warehouse, and Spark has a strong
ecosystem in this field.

Second, in the long run Hudi needs a more general SQL layer, and it is very
worthwhile to embrace Calcite. A powerful data management and processing
service could then be built on top of Hudi.

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-22 Thread Danny Chan
That's great, I can help with the Apache Calcite integration.

Vinoth Chandar  于2020年12月23日周三 上午12:29写道:

> Sounds great. There will be a RFC/DISCUSS thread once 0.7.0 is out I think.
> love to have you involved.
>
> On Tue, Dec 22, 2020 at 3:20 AM pzwpzw 
> wrote:
>
> > Yes, it looks good .
> > We are building the spark sql extensions to support for hudi in
> > our internal version.
> > I am interested in participating in the extension of SparkSQL on hudi.
> > 2020年12月22日 下午4:30,Vinoth Chandar  写道:
> >
> > Hi,
> >
> > I think what we are landing on finally is.
> >
> > - Keep pushing for SparkSQL support using Spark extensions route
> > - Calcite effort will be a separate/orthogonal approach, down the line
> >
> > Please feel free to correct me, if I got this wrong.
> >
> > On Mon, Dec 21, 2020 at 3:30 AM pzwpzw  .invalid>
> > wrote:
> >
> > Hi 受春柏 ,here is my point. We can use Calcite to build a common sql layer
> >
> > to process engine independent SQL, for example most of the DDL、Hoodie CLI
> >
> > command and also provide parser for the common SQL extensions(e.g. Merge
> >
> > Into). The Engine-related syntax can be taught to the respective engines
> to
> >
> > process. If the common sql layer can handle the input sql, it handle
> >
> > it.Otherwise it is routed to the engine for processing. In long term, the
> >
> > common layer will more and more rich and perfect.
> >
> > 2020年12月21日 下午4:38,受春柏  写道:
> >
> >
> > Hi,all
> >
> >
> >
> > That's very good,Hudi SQL syntax can support Flink、hive and other
> analysis
> >
> > components at the same time,
> >
> > But there are some questions about SparkSQL. SparkSQL syntax is in
> >
> > conflict with Calctite syntax.Is our strategy
> >
> > user migration or syntax compatibility?
> >
> > In addition ,will it also support write SQL?
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > 在 2020-12-19 02:10:16,"Nishith"  写道:
> >
> >
> > That’s awesome. Looks like we have a consensus on Calcite. Look forward
> to
> >
> > the RFC as well!
> >
> >
> >
> > -Nishith
> >
> >
> >
> > On Dec 18, 2020, at 9:03 AM, Vinoth Chandar  wrote:
> >
> >
> >
> > Sounds good. Look forward to a RFC/DISCUSS thread.
> >
> >
> >
> > Thanks
> >
> >
> > Vinoth
> >
> >
> >
> > On Thu, Dec 17, 2020 at 6:04 PM Danny Chan  wrote:
> >
> >
> >
> > Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I
> > would add support for SQL connectors for Hoodie on Flink soon ~
> > Currently, I'm preparing a refactoring of the current Flink writer code.
> >
> >
> >
> > On Fri, Dec 18, 2020 at 6:39 AM, Vinoth Chandar wrote:
> >
> >
> >
> > Thanks Kabeer for the note on gmail. Did not realize that. :)
> >
> >
> >
> > My desired use case is that users use the Hoodie CLI to execute these SQLs.
> > They can choose which engine to use via a CLI config option.
> >
> >
> >
> > Yes, that is also another attractive aspect of this route. We can build
> > out a common SQL layer and have it translate to the underlying engine
> > (sounds like Hive, huh).
> > Longer term, if we really think we can more easily implement a full DML +
> > DDL + DQL, we can proceed with this.
> >
> >
> >
> > As others pointed out, for Spark SQL, it might be good to try the Spark
> >
> >
> > extensions route, before we take this on more fully.
> >
> >
> >
> > The other part where Calcite is great is all the support for
> > windowing/streaming in its syntax.
> > Danny, I guess we should be able to leverage that through a deeper
> > Flink/Hudi integration?
> >
> >
> >
> >
> > On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar 
> >
> >
> > wrote:
> >
> >
> >
> > I think Dongwook is investigating along the same lines, and it does seem
> > better to pursue this first, before trying other approaches.
> >
> >
> >
> >
> >
> > On Tue, Dec 15, 2020 at 1:38 AM pzwpzw  >
> >
> > .invalid>
> >
> >
> > wrote:
> >
> >
> >
> > Yeah, I agree with Nishith that one option is to look at ways to
> > plug in custom logical and physical plans in Spark. It can simplify
> > the implementation and reuse the Spark SQL syntax. Also, users familiar
> > with Spark SQL will be able to use Hudi's SQL features more quickly.
> > In fact, Spark provides the SparkSessionExtensions interface for
> > implementing custom syntax extensions and SQL rewrite rules.
> >
> >
> >
> >
> >
> >
> >
> >
> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> >
> >
> > .
> >
> >
> > We can use the SparkSessionExtensions to extend Hudi SQL syntax,
> > such as MERGE INTO and DDL syntax.
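For context on what such an extension would have to implement, a MERGE INTO statement applies keyed upserts to a target table. The sketch below illustrates only the row-level semantics in plain Python — it is not Hudi's or Spark's implementation, and the table data and key name are made up for illustration:

```python
# Illustrative only: the upsert semantics behind MERGE INTO.
# Matched target rows are updated from the source; unmatched
# source rows are inserted. Not Hudi/Spark code.

def merge_into(target, source, key):
    """Upsert `source` rows into `target`, matching on `key`."""
    merged = {row[key]: row for row in target}  # index existing rows by key
    for row in source:
        merged[row[key]] = row                  # update if matched, else insert
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
source = [{"id": 2, "val": "B"}, {"id": 3, "val": "c"}]
print(merge_into(target, source, "id"))
```

A real extension would express the same update-or-insert decision as a logical plan over the target table's record keys rather than in-memory dicts.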
> >
> >
> >
> > On Dec 15, 2020 at 3:27 PM, Nishith wrote:
> >
> >
> >
> > Thanks for starting this thread Vinoth.
> >
> >
> > In general, definitely see the need for SQL-style semantics on Hudi
> > tables. Apac

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-22 Thread Vinoth Chandar
Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think.
Love to have you involved.

On Tue, Dec 22, 2020 at 3:20 AM pzwpzw 
wrote:

> Yes, it looks good .
> We are building the spark sql extensions to support for hudi in
> our internal version.
> I am interested in participating in the extension of SparkSQL on hudi.

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-22 Thread pzwpzw

Yes, it looks good.
We are building the Spark SQL extensions to support Hudi in our internal
version.
I am interested in participating in the extension of Spark SQL on Hudi.
On Dec 22, 2020 at 4:30 PM, Vinoth Chandar wrote:


Hi,

I think what we are landing on finally is.

- Keep pushing for SparkSQL support using Spark extensions route
- Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me, if I got this wrong.


Reply:Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-22 Thread 受春柏



Yes, I think it should be OK.



On 2020-12-22 16:30:37, "Vinoth Chandar" wrote:
>Hi,
>
>I think what we are finally landing on is:
>
>- Keep pushing for SparkSQL support using Spark extensions route
>- Calcite effort will be a separate/orthogonal approach, down the line
>
>Please feel free to correct me, if I got this wrong.
>

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-22 Thread Vinoth Chandar
Hi,

I think what we are finally landing on is:

- Keep pushing for SparkSQL support using Spark extensions route
- Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me, if I got this wrong.

On Mon, Dec 21, 2020 at 3:30 AM pzwpzw 
wrote:

> Hi 受春柏 ,here is my point. We can use Calcite to build a common sql layer
> to process engine independent SQL,  for example most of the DDL、Hoodie CLI
> command and also provide parser for the common SQL extensions(e.g. Merge
> Into). The Engine-related syntax can be taught to the respective engines to
> process. If the common sql layer can handle the input sql, it handle
> it.Otherwise it is routed to the engine for processing. In long term, the
> common layer will more and more rich and perfect.

Reply:Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-21 Thread 受春柏
Hi, pzwpzw
I see what you mean. It is very necessary to implement a common layer for Hudi,
and we are also planning to implement Spark SQL write capabilities for SQL-based
ETL processing. The common layer and Spark SQL write support can combine to form
Hudi's SQL capabilities.

At 2020-12-21 19:30:36, "pzwpzw"  wrote:

Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer to
process engine-independent SQL — for example, most of the DDL and the Hoodie CLI
commands — and also to provide a parser for the common SQL extensions (e.g. MERGE
INTO). The engine-specific syntax can be delegated to the respective engines to
process. If the common SQL layer can handle the input SQL, it handles it;
otherwise it is routed to the engine for processing. In the long term, the common
layer will become richer and more complete.

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-21 Thread pzwpzw

Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer to
process engine-independent SQL — for example, most of the DDL and the Hoodie CLI
commands — and also to provide a parser for the common SQL extensions (e.g. MERGE
INTO). The engine-specific syntax can be delegated to the respective engines to
process. If the common SQL layer can handle the input SQL, it handles it;
otherwise it is routed to the engine for processing. In the long term, the common
layer will become richer and more complete.
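The dispatch described here can be sketched as a simple fallback chain (a minimal illustration of the idea only — the statement list and handler names are hypothetical, and no actual Calcite or Hudi API is used):

```python
# Sketch of the proposed common-SQL-layer dispatch: the common layer
# executes the statement kinds it recognizes (DDL, MERGE INTO, ...);
# anything else is routed to the underlying engine. All names are
# hypothetical, for illustration only.

COMMON_PREFIXES = ("CREATE TABLE", "DROP TABLE", "MERGE INTO")

def common_layer_handles(sql: str) -> bool:
    """True if the (hypothetical) engine-independent layer can run this SQL."""
    return sql.strip().upper().startswith(COMMON_PREFIXES)

def execute(sql: str, engine_name: str) -> str:
    if common_layer_handles(sql):
        return f"common layer executed: {sql.split()[0]}"
    return f"routed to {engine_name}"

print(execute("MERGE INTO t USING s ON t.id = s.id", "spark"))
print(execute("SELECT * FROM t", "flink"))
```

In a real implementation the prefix check would be a Calcite parse attempt, with parse failure (or an unsupported statement kind) triggering the engine fallback.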
On Dec 21, 2020 at 4:38 PM, 受春柏 wrote:


Hi, all

That's very good — Hudi SQL syntax can support Flink, Hive, and other analysis
components at the same time.
But there are some questions about Spark SQL: Spark SQL syntax conflicts
with Calcite syntax. Is our strategy
user migration or syntax compatibility?
In addition, will it also support write SQL?
On 2020-12-19 02:10:16, "Nishith" wrote:

That’s awesome. Looks like we have a consensus on Calcite. Look forward to the 
RFC as well!


-Nishith


On Dec 15, 2020 at 3:27 PM, Nishith wrote:


Thanks for starting this thread Vinoth.
In general, definitely see the need for SQL style semantics on Hudi
tables. Apache Calcite is a great option to considering given
DatasourceV2
has the limitations that you described.


Additionally, even if Spark DatasourceV2 allowed for the flexibility,
the
same SQL semantics needs to be supported in other engines like Flink
to
provide the same experience to users - which in itself could also be
considerable amount of work.
So, if we’re able to generalize on the SQL story along Calcite, that
would
help reduce redundant work in some sense.
Although, I’m worried about a few things


1) Like you pointed out, writing complex user jobs using Spark SQL syntax
can be harder for users who are moving from “Hudi syntax” to “Spark syntax”
for cross-table joins, merges, etc. using data frames. One option is to
look at whether there are ways to plug in custom logical and physical
plans in Spark. This way, although the merge-on-Spark-SQL functionality
may not be as simple to use, it wouldn’t take away performance or feature
set for starters; in the future we could think of having the entire query
space be powered by Calcite, like you mentioned.
2) If we continue to use DataSource V1, is there any downside to this from
a performance and optimization perspective when executing the plan? I’m
guessing not, but I haven’t delved into the code to see if there’s
anything different apart from the API and spec.


On Dec 14, 2020, at 11:06 PM, Vinoth Chandar 
wrote:




Hello all,




Just bumping this thread again




thanks


vinoth




On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar 
wrote:




Hello all,




One feature that keeps coming up is the ability to use UPDATE, MERGE SQL
syntax to support writing into Hudi tables. We have looked into the Spark 3
DataSource V2 APIs as well and found several issues that hinder us in
implement

Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-21 Thread 受春柏
Hi all,


That's very good. Hudi SQL syntax can support Flink, Hive, and other
analysis components at the same time.
But there are some questions about Spark SQL: Spark SQL syntax is in
conflict with Calcite syntax. Is our strategy user migration or syntax
compatibility?
In addition, will it also support write SQL?

On 2020-12-19 02:10:16, "Nishith" wrote:
>That’s awesome. Looks like we have a consensus on Calcite. Look forward to the 
>RFC as well!
>
>-Nishith
>
>> On Dec 18, 2020, at 9:03 AM, Vinoth Chandar  wrote:
>> 
>> Sounds good. Look forward to a RFC/DISCUSS thread.
>> 
>> Thanks
>> Vinoth
>> 
>>> On Thu, Dec 17, 2020 at 6:04 PM Danny Chan  wrote:
>>> 
>>> Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite; I would
>>> add support for SQL connectors for Hudi on Flink soon ~
>>> Currently, I'm preparing a refactoring of the current Flink writer code.
>>> 
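A Hudi table exposed through such a Flink SQL connector might be declared roughly like this (a sketch only; the connector name and option keys are assumptions, not the final design):

```sql
-- Hypothetical connector/option names for illustration only.
CREATE TABLE hudi_orders (
  id BIGINT,
  amount DOUBLE,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///warehouse/hudi_orders'
)
```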
>>> On Fri, Dec 18, 2020 at 6:39 AM, Vinoth Chandar wrote:
>>> 
 Thanks Kabeer for the note on gmail. Did not realize that.  :)
 
>> My desired use case is that users use the Hoodie CLI to execute these SQLs.
>> They can choose which engine to use via a CLI config option.
 
 Yes, that is also another attractive aspect of this route. We can build out
 a common SQL layer and have this translate to the underlying engine (sounds
 like Hive huh)