Thanks vino yang! I have moved the doc to RFC-25. We can continue the discussion there.

On Jan 22, 2021, at 9:27 AM, vino yang <yanghua1...@gmail.com> wrote:


Hi zhiwei,

Done! Now, you should have cwiki permission.

Best,
Vino

On Fri, Jan 22, 2021 at 12:06 AM, pzwpzw <pengzhiwei2...@icloud.com.invalid> wrote:


That is great! Can you give me permission to the cwiki? My cwiki id is: zhiwei.
I will move it there and continue the discussion.


On Jan 21, 2021, at 11:19 PM, Gary Li <garyli1...@outlook.com> wrote:


Hi pengzhiwei,

Thanks for the proposal. That’s a great feature. Can we move the design doc to a cwiki page as a new RFC? We can continue the discussion from there.

Thanks,

Best Regards,
Gary Li

From: pzwpzw <pengzhiwei2...@icloud.com.INVALID>
Reply-To: "dev@hudi.apache.org" <dev@hudi.apache.org>
Date: Wednesday, January 20, 2021 at 11:52 PM
To: "dev@hudi.apache.org" <dev@hudi.apache.org>
Cc: "dev@hudi.apache.org" <dev@hudi.apache.org>
Subject: Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite


Hi, we have implemented the Spark SQL extension for Hudi in our internal version. Here is the main design, including the extended SQL syntax and the implementation scheme on Spark. I am waiting for your feedback. Any comments are welcome~

https://docs.google.com/document/d/1KC6Rae67CUaCUpKoIAkM6OTAGuOWFPD9qtNfqchAl1o/edit#heading=h.oeoy1y14sifu




On Dec 23, 2020, at 12:30 AM, Vinoth Chandar <vin...@apache.org> wrote:

Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think. Love to have you involved.


On Tue, Dec 22, 2020 at 3:20 AM pzwpzw <pengzhiwei2...@icloud.com.invalid> wrote:




Yes, it looks good.
We are building the Spark SQL extensions to support Hudi in our internal version.
I am interested in participating in the extension of Spark SQL on Hudi.

On Dec 22, 2020, at 4:30 PM, Vinoth Chandar <vin...@apache.org> wrote:


Hi,

I think what we are finally landing on is:

- Keep pushing for Spark SQL support using the Spark extensions route
- The Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me if I got this wrong.


On Mon, Dec 21, 2020 at 3:30 AM pzwpzw <pengzhiwei2...@icloud.com.invalid> wrote:


Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer to process engine-independent SQL, for example most of the DDL, the Hoodie CLI commands, and also to provide a parser for the common SQL extensions (e.g. MERGE INTO). The engine-related syntax can be handed off to the respective engines to process. If the common SQL layer can handle the input SQL, it handles it; otherwise it is routed to the engine for processing. In the long term, the common layer will become richer and more complete.
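Just to make the routing idea concrete, here is a rough sketch (not the actual design; the handler methods and the set of statement kinds are only illustrative) of how the common layer could parse a statement with Calcite and decide whether to handle it itself or hand it off to the engine:

    import org.apache.calcite.sql.SqlKind
    import org.apache.calcite.sql.parser.SqlParser

    object CommonSqlLayer {
      // Hypothetical handlers: statements the common layer understands are
      // executed here, everything else is passed through to Spark/Flink.
      private def handleInCommonLayer(sql: String): Unit = println(s"common layer: $sql")
      private def delegateToEngine(sql: String): Unit = println(s"engine: $sql")

      def route(sql: String): Unit = {
        // Calcite's default parser already understands MERGE/UPDATE/DELETE.
        val node = SqlParser.create(sql).parseStmt()
        node.getKind match {
          case SqlKind.MERGE | SqlKind.UPDATE | SqlKind.DELETE => handleInCommonLayer(sql)
          case _                                               => delegateToEngine(sql)
        }
      }
    }

With something like this, a MERGE INTO statement would be parsed and handled by the common layer, while an engine-specific statement falls through to Spark or Flink.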


On Dec 21, 2020, at 4:38 PM, 受春柏 <sc...@126.com> wrote:




Hi all,

That's very good. Hudi SQL syntax can support Flink, Hive, and other analysis components at the same time. But there are some questions about Spark SQL: Spark SQL syntax is in conflict with Calcite syntax. Is our strategy user migration or syntax compatibility?
In addition, will it also support write SQL?

On 2020-12-19 02:10:16, "Nishith" <n3.nas...@gmail.com> wrote:

That’s awesome. Looks like we have a consensus on Calcite. Look forward to the RFC as well!

-Nishith

On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vin...@apache.org> wrote:

Sounds good. Look forward to an RFC/DISCUSS thread.

Thanks
Vinoth

On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <danny0...@apache.org> wrote:

Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I would add support for the SQL connectors of Hoodie Flink soon ~
Currently, I'm preparing a refactoring of the current Flink writer code.

On Fri, Dec 18, 2020 at 6:39 AM, Vinoth Chandar <vin...@apache.org> wrote:

Thanks Kabeer for the note on gmail. Did not realize that. :)

> My desired use case is that users use the Hoodie CLI to execute these SQLs.
> They can choose which engine to use by a CLI config option.

Yes, that is also another attractive aspect of this route. We can build out a common SQL layer and have this translate to the underlying engine (sounds like Hive, huh). Longer term, if we really think we can more easily implement full DML + DDL + DQL, we can proceed with this.

As others pointed out, for Spark SQL, it might be good to try the Spark extensions route before we take this on more fully.

The other part where Calcite is great is all the support for windowing/streaming in its syntax. Danny, I guess we should be able to leverage that through a deeper Flink/Hudi integration?

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vin...@apache.org> wrote:

I think Dongwook is investigating along the same lines, and it does seem better to pursue this first, before trying other approaches.

On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pengzhiwei2...@icloud.com.invalid> wrote:

Yeah, I agree with Nishith that one option is to look at ways to plug in custom logical and physical plans in Spark. It can simplify the implementation and reuse the Spark SQL syntax. Also, users familiar with Spark SQL will be able to use Hudi's SQL features more quickly.
In fact, Spark provides the SparkSessionExtensions interface for implementing custom syntax extensions and SQL rewrite rules:

https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such as MERGE INTO and DDL syntax.
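For illustration, here is a minimal sketch (assuming Spark 2.4.x; HoodieSparkSessionExtension and HoodieAnalysisRule are placeholder names, not the actual implementation) of how such an extension could be wired in through SparkSessionExtensions:

    import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Placeholder rewrite rule: this is where a MERGE INTO / UPDATE plan on a
    // Hudi table would be rewritten into the corresponding Hudi write commands.
    case class HoodieAnalysisRule(session: SparkSession) extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op placeholder
    }

    // Entry point Spark invokes when the class is registered via the
    // "spark.sql.extensions" config or SparkSession.builder().withExtensions(...).
    class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        // Analysis-time rewrite rule for the extended statements.
        extensions.injectResolutionRule(HoodieAnalysisRule)
        // A parser for the extended grammar (MERGE INTO, Hudi DDL) would be
        // registered with extensions.injectParser((session, delegate) => ...).
      }
    }

A session built with this extension enabled would then accept the extended statements through the normal spark.sql(...) entry point.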

On Dec 15, 2020, at 3:27 PM, Nishith <n3.nas...@gmail.com> wrote:

Thanks for starting this thread Vinoth.

In general, I definitely see the need for SQL-style semantics on Hudi tables. Apache Calcite is a great option to consider, given DataSourceV2 has the limitations that you described.

Additionally, even if Spark DataSourceV2 allowed for the flexibility, the same SQL semantics needs to be supported in other engines like Flink to provide the same experience to users - which in itself could also be a considerable amount of work. So, if we're able to generalize the SQL story around Calcite, that would help reduce redundant work in some sense. Although, I'm worried about a few things:

1) Like you pointed out, writing complex user jobs using Spark SQL syntax can be harder for users who are moving from “Hudi syntax” to “Spark syntax” for cross-table joins, merges, etc. using data frames. One option is to look at whether there are ways to plug in custom logical and physical plans in Spark. This way, although the MERGE-on-Spark-SQL functionality may not be as simple to use, it wouldn't take away performance and feature set for starters, and in the future we could think of having the entire query space be powered by Calcite like you mentioned.

2) If we continue to use DataSourceV1, is there any downside to this from a performance and optimization perspective when executing the plan? I'm guessing not, but I haven't delved into the code to see if there's anything different apart from the API and spec.

On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vin...@apache.org> wrote:

Hello all,

Just bumping this thread again

thanks
vinoth

On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vin...@apache.org> wrote:

Hello all,

One feature that keeps coming up is the ability to use UPDATE and MERGE SQL syntax to support writing into Hudi tables. We have looked into the Spark 3 DataSource V2 APIs as well and found several issues that hinder us in implementing this via the Spark APIs:

- As of this writing, the UPDATE/MERGE syntax is not really opened up to external datasources like Hudi; only DELETE is.
- The DataSource V2 API offers no flexibility to perform any kind of further transformations on the dataframe. Hudi supports keys, indexes, preCombining, and custom partitioning that ensures file sizes etc. All of this needs shuffling data, looking up/joining against other dataframes, and so forth.

Today, the DataSource V1 API allows this kind of further partitioning/transformation, but the V2 API simply offers partition-level iteration once the user calls df.write.format("hudi").
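For context, this is roughly what a write looks like through the existing DataSource V1 path, where Hudi receives the whole DataFrame and can shuffle it for keying, precombining and file sizing (df, the field names and the path are only illustrative; the options are the standard hoodie.datasource.write.* configs):

    // df is assumed to be an existing DataFrame with uuid/ts/dt columns.
    df.write.format("hudi").
      option("hoodie.table.name", "my_table").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.datasource.write.partitionpath.field", "dt").
      mode("append").
      save("/tmp/hudi/my_table")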

One thought I had is to explore Apache Calcite and write an adapter for Hudi. This frees us from being very dependent on a particular engine's syntax support, like Spark's. Calcite is very popular by itself and supports most of the keywords (and also more streaming-friendly syntax). To be clear, we will still be using Spark/Flink underneath to perform the actual writing; just the SQL grammar is provided by Calcite.

To give a taste of how this would look:

A) If the user wants to mutate a Hudi table using SQL:
Instead of writing something like spark.sql("UPDATE ...."), users will write hudiSparkSession.sql("UPDATE ....").

B) To save a Spark data frame to a Hudi table, we continue to use Spark DataSource V1.
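As a sketch of what (A) might look like with the Calcite-provided grammar (hudiSparkSession is the hypothetical entry point described above; table and column names are made up):

    // Mutation expressed in Calcite's SQL grammar; Spark/Flink still performs
    // the actual write underneath via the Hudi adapter.
    hudiSparkSession.sql(
      """MERGE INTO hudi_trips t
        |USING trip_updates u ON t.uuid = u.uuid
        |WHEN MATCHED THEN UPDATE SET fare = u.fare
        |WHEN NOT MATCHED THEN INSERT (uuid, fare) VALUES (u.uuid, u.fare)""".stripMargin)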

The obvious challenge I see is the disconnect with the Spark DataFrame ecosystem. Users would write MERGE SQL statements by joining against other Spark DataFrames. If we want those expressed in Calcite as well, we need to also invest in the full query-side support, which can increase the scope by a lot.

Some amount of investigation needs to happen, but ideally we should be able to integrate with the Spark SQL catalog and reuse all the tables there.

I am sure there are some gaps in my thinking. Just starting this thread so we can discuss, and others can chime in/correct me.

thanks
vinoth