Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Thanks, vino yang! I have moved the doc to RFC-25. We can continue the discussion there.
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Hi zhiwei,

Done! Now you should have cwiki permission.

Best,
Vino
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
That is great! Can you give me permission to the cwiki? My cwiki id is: zhiwei. I will move it there and continue the discussion.
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Hi pengzhiwei,

Thanks for the proposal. That's a great feature. Can we move the design doc to a cwiki page as a new RFC? We can continue the discussion from there.

Thanks,

Best Regards,
Gary Li
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Hi,

We have implemented the Spark SQL extension for Hudi in our internal version. Here is the main implementation, including the extended SQL syntax and the implementation scheme on Spark. I am waiting for your feedback. Any comments are welcome~

https://docs.google.com/document/d/1KC6Rae67CUaCUpKoIAkM6OTAGuOWFPD9qtNfqchAl1o/edit#heading=h.oeoy1y14sifu

On Dec 23, 2020, at 12:30 AM, Vinoth Chandar wrote:

Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think. Love to have you involved.

On Tue, Dec 22, 2020 at 3:20 AM pzwpzw wrote:

Yes, it looks good. We are building Spark SQL extensions to support Hudi in our internal version. I am interested in participating in the extension of SparkSQL on Hudi.

On Dec 22, 2020, at 4:30 PM, Vinoth Chandar wrote:

Hi,

I think what we are finally landing on is:

- Keep pushing for SparkSQL support using the Spark extensions route
- The Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me if I got this wrong.

On Mon, Dec 21, 2020 at 3:30 AM pzwpzw wrote:

Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer to process engine-independent SQL, for example most of the DDL and the Hoodie CLI commands, and also to provide a parser for the common SQL extensions (e.g. MERGE INTO). Engine-related syntax can be left to the respective engines to process. If the common SQL layer can handle the input SQL, it handles it. Otherwise the SQL is routed to the engine for processing. In the long term, the common layer will become richer and more complete.

On Dec 21, 2020, at 4:38 PM, 受春柏 wrote:

Hi all,

That's very good. Hudi SQL syntax can then support Flink, Hive, and other analysis components at the same time. But there are some questions about SparkSQL. SparkSQL syntax conflicts with Calcite syntax. Is our strategy user migration or syntax compatibility? In addition, will it also support write SQL?

At 2020-12-19 02:10:16, "Nishith" wrote:

That's awesome. Looks like we have a consensus on Calcite. Look forward to the RFC as well!

-Nishith

On Dec 18, 2020, at 9:03 AM, Vinoth Chandar wrote:

Sounds good. Look forward to an RFC/DISCUSS thread.

Thanks
Vinoth

On Thu, Dec 17, 2020 at 6:04 PM Danny Chan wrote:

Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I will add support for SQL connectors for Hoodie Flink soon ~ Currently, I'm preparing a refactoring of the current Flink writer code.

On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar wrote:

Thanks Kabeer for the note on gmail. Did not realize that. :)

> My desired use case is for users to use the Hoodie CLI to execute these SQLs. They can choose which engine to use via a CLI config option.

Yes, that is also another attractive aspect of this route. We can build out a common SQL layer and have this translate to the underlying engine (sounds like Hive, huh?).

Longer term, if we really think we can more easily implement a full DML + DDL + DQL, we can proceed with this.

As others pointed out, for Spark SQL, it might be good to try the Spark extensions route before we take this on more fully.

The other part where Calcite is great is all the support for windowing/streaming in its syntax. Danny, I guess we should be able to leverage that through a deeper Flink/Hudi integration?

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar wrote:

I think Dongwook is investigating along the same lines, and it does seem better to pursue this first, before trying other approaches.

On Tue, Dec 15, 2020 at 1:38 AM pzwpzw wrote:

Yeah, I agree with Nishith that one option is to look at the ways to plug in custom logical and physical plans in Spark. It can simplify the implementation and reuse the Spark SQL syntax. Also, users familiar with Spark SQL will be able to use Hudi's SQL features more quickly. In fact, Spark provides the SparkSessionExtensions interface for implementing custom syntax extensions and SQL rewrite rules:

https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html

We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such as MERGE INTO and DDL syntax.

On Dec 15, 2020, at 3:27 PM, Nishith wrote:

Thanks for starting this thread, Vinoth. In general, I definitely see the need for SQL-style semantics on Hudi tables. Apache Calcite is a great option to consider, given that DatasourceV2 has the limitations that you described.

Additionally, even if Spark DatasourceV2 allowed for the flexibility, the same SQL semantics need to be supported in other engines like Flink to provide the same experience to users, which in itself could also be a considerable amount of work. So, if we're able to generalize on the SQL story
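
[Editor's sketch] The routing idea quoted in this message (the common SQL layer handles the statements it understands and hands everything else to the selected engine) might look roughly like the following. This is only an illustration under stated assumptions, not Hudi's or Calcite's actual design: the class `CommonSqlLayer`, the `SHOW COMMITS` statement, and the engine stubs are all hypothetical names made up for the example, and a real layer would use a proper parser rather than a regular expression.

```python
import re
from typing import Callable, Dict

# An engine is modeled as a plain function from SQL text to a result string.
Engine = Callable[[str], str]


class CommonSqlLayer:
    """Hypothetical engine-independent layer: handle what it knows, route the rest."""

    def __init__(self, engines: Dict[str, Engine]) -> None:
        self.engines = engines
        # Statements the common layer claims for itself (illustrative only).
        self.handlers = [
            (re.compile(r"^\s*SHOW\s+COMMITS\b", re.IGNORECASE), self._show_commits),
        ]

    def _show_commits(self, sql: str) -> str:
        # Stand-in for an engine-independent, CLI-style command.
        return "common-layer: listed commits"

    def execute(self, sql: str, engine: str) -> str:
        for pattern, handler in self.handlers:
            if pattern.match(sql):
                return handler(sql)
        # Not recognized: route the statement to the engine the user selected.
        return self.engines[engine](sql)


# Stub engines; real ones would submit the SQL to Spark or Flink.
layer = CommonSqlLayer({
    "spark": lambda sql: f"spark ran: {sql}",
    "flink": lambda sql: f"flink ran: {sql}",
})
print(layer.execute("SHOW COMMITS ON trips", engine="spark"))
print(layer.execute("SELECT * FROM trips", engine="flink"))
```

The `engine` argument plays the role of the CLI config option mentioned earlier in the thread: the user picks an engine once, and only statements the common layer does not recognize ever reach it.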
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
First, I think it is necessary to improve Spark SQL, because the main scenario of Hudi is the data lake or warehouse, and Spark has a strong ecosystem in this field. Second, in the long run, Hudi needs a more general SQL layer, and it is very necessary to embrace Calcite. Then, based on Hudi, a powerful data management and processing service can be built.

On 2020/12/22 08:30:37, Vinoth Chandar wrote:
> Hi,
>
> I think what we are landing on finally is.
>
> - Keep pushing for SparkSQL support using Spark extensions route
> - Calcite effort will be a separate/orthogonal approach, down the line
>
> Please feel free to correct me, if I got this wrong.
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
That's great, I can help with the Apache Calcite integration.
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Sounds great. There will be a RFC/DISCUSS thread once 0.7.0 is out I think. love to have you involved.
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Yes, it looks good. We are building Spark SQL extensions to support Hudi in our internal version. I am interested in participating in the SparkSQL extension work on Hudi.
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Hi,

I think what we are finally landing on is:

- Keep pushing for SparkSQL support using the Spark extensions route
- The Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me if I got this wrong.
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer that processes engine-independent SQL, for example most of the DDL and the Hoodie CLI commands, and that also provides a parser for the common SQL extensions (e.g. MERGE INTO). Engine-specific syntax can be left to the respective engines to process. If the common SQL layer can handle the input SQL, it handles it; otherwise the statement is routed to the engine for processing. In the long term, the common layer will become richer and more complete.

On December 21, 2020, at 4:38 PM, 受春柏 wrote:

Hi all,

That's very good: Hudi SQL syntax can support Flink, Hive and other analysis components at the same time. But there are some questions about SparkSQL. SparkSQL syntax conflicts with Calcite syntax. Is our strategy user migration or syntax compatibility? In addition, will it also support write SQL?

On 2020-12-19 02:10:16, "Nishith" wrote:

That’s awesome. Looks like we have a consensus on Calcite. Look forward to the RFC as well!

-Nishith

On Dec 18, 2020, at 9:03 AM, Vinoth Chandar wrote:

Sounds good. Look forward to an RFC/DISCUSS thread.

Thanks
Vinoth

On Thu, Dec 17, 2020 at 6:04 PM Danny Chan wrote:

Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I will add support for SQL connectors for Hoodie Flink soon. Currently, I'm preparing a refactoring of the current Flink writer code.

On Friday, December 18, 2020, at 6:39 AM, Vinoth Chandar wrote:

Thanks Kabeer for the note on gmail. Did not realize that. :)

My desired use case is for users to use the Hoodie CLI to execute these SQL statements. They can choose which engine to use through a CLI config option.

Yes, that is also another attractive aspect of this route. We can build out a common SQL layer and have it translate to the underlying engine (sounds like Hive, huh). Longer term, if we really think we can more easily implement a full DML + DDL + DQL, we can proceed with this.

As others pointed out, for Spark SQL, it might be good to try the Spark extensions route before we take this on more fully.

The other part where Calcite is great is all the support for windowing/streaming in its syntax. Danny, I guess we should be able to leverage that through a deeper Flink/Hudi integration?

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar wrote:

I think Dongwook is investigating along the same lines, and it does seem better to pursue this first, before trying other approaches.

On Tue, Dec 15, 2020 at 1:38 AM pzwpzw wrote:

Yeah, I agree with Nishith that one option is to look at the ways to plug in custom logical and physical plans in Spark. It can simplify the implementation and reuse the Spark SQL syntax. Users familiar with Spark SQL will also be able to use Hudi's SQL features more quickly.

In fact, Spark provides the SparkSessionExtensions interface for implementing custom syntax extensions and SQL rewrite rules: https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html . We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such as MERGE INTO and DDL syntax.

On December 15, 2020, at 3:27 PM, Nishith wrote:

Thanks for starting this thread, Vinoth.

In general, I definitely see the need for SQL-style semantics on Hudi tables. Apache Calcite is a great option to consider, given that DatasourceV2 has the limitations that you described.

Additionally, even if Spark DatasourceV2 allowed for the flexibility, the same SQL semantics need to be supported in other engines like Flink to provide the same experience to users - which in itself could also be a considerable amount of work. So, if we’re able to generalize the SQL story along Calcite, that would help reduce redundant work in some sense.

Although, I’m worried about a few things:

1) Like you pointed out, writing complex user jobs using Spark SQL syntax can be harder for users who are moving from “Hudi syntax” to “Spark syntax” for cross-table joins, merges etc. using data frames. One option is to look at whether there are ways to plug in custom logical and physical plans in Spark. This way, although the merge-on-SparkSQL functionality may not be as simple to use, it wouldn’t take away performance and feature set for starters. In the future we could think of having the entire query space be powered by Calcite, like you mentioned.

2) If we continue to use DatasourceV1, is there any downside to this from a performance and optimization perspective when executing the plan - I’m guessing not, but I haven’t delved into the code to see if there’s anything different apart from the API and spec.

On Dec 14, 2020, at 11:06 PM, Vinoth Chandar wrote:

Hello all,

Just bumping this thread again.

thanks
vinoth

On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar wrote:

Hello all,

One feature that keeps coming up is the ability to use UPDATE, MERGE sql syntax to support writing into Hudi tables. We have looked into the Spark 3 DataSource V2 APIs as well and found several issues that hinder us in implement
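
The routing idea from pzwpzw's reply at the top of this message (the common SQL layer handles what it understands, everything else goes to the engine) could be sketched roughly as below. This is only an illustration under stated assumptions: the regular expressions stand in for a real Calcite-based parser, and the names `COMMON_PATTERNS`, `classify` and `route` are hypothetical, not part of Hudi, Calcite or Spark.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-in for a Calcite-based parser: the statement kinds
# that the engine-independent layer claims to understand.
COMMON_PATTERNS = {
    "ddl": re.compile(r"^\s*(create|drop|alter)\s+table\b", re.IGNORECASE),
    "merge": re.compile(r"^\s*merge\s+into\b", re.IGNORECASE),
}


def classify(sql: str) -> Optional[str]:
    """Return the statement kind the common layer can handle, or None."""
    for kind, pattern in COMMON_PATTERNS.items():
        if pattern.search(sql):
            return kind
    return None


def route(
    sql: str,
    common_handler: Callable[[str, str], object],
    engine_handler: Callable[[str], object],
) -> object:
    """Try the common layer first; fall back to the engine otherwise."""
    kind = classify(sql)
    if kind is not None:
        # The common layer recognises the statement and processes it itself.
        return common_handler(kind, sql)
    # Engine-specific syntax is passed through untouched.
    return engine_handler(sql)
```

For example, `route("SELECT 1", common, engine)` would call `engine`, while a `MERGE INTO ...` statement would be handed to `common` with kind `"merge"`. In a real implementation the classification step would come from whether the shared parser accepts the statement, not from pattern matching.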