Sounds great. I think there will be an RFC/DISCUSS thread once 0.7.0 is out. Would love to have you involved.
On Tue, Dec 22, 2020 at 3:20 AM pzwpzw <pengzhiwei2...@icloud.com.invalid> wrote:

> Yes, it looks good.
> We are building Spark SQL extensions to support Hudi in
> our internal version.
> I am interested in participating in the SparkSQL extension work for Hudi.
>
> On Dec 22, 2020, at 4:30 PM, Vinoth Chandar <vin...@apache.org> wrote:
>
> > Hi,
> >
> > I think what we are finally landing on is:
> >
> > - Keep pushing for SparkSQL support using the Spark extensions route
> > - The Calcite effort will be a separate/orthogonal approach, down the line
> >
> > Please feel free to correct me if I got this wrong.
> >
> > On Mon, Dec 21, 2020 at 3:30 AM pzwpzw <pengzhiwei2...@icloud.com.invalid> wrote:
> >
> > Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
> > to process engine-independent SQL, for example most of the DDL and Hoodie CLI
> > commands, and also to provide a parser for the common SQL extensions (e.g.
> > MERGE INTO). Engine-specific syntax can be passed to the respective engines
> > to process. If the common SQL layer can handle the input SQL, it handles
> > it. Otherwise the SQL is routed to the engine for processing. In the long
> > term, the common layer will become richer and more complete.
> >
> > On Dec 21, 2020, at 4:38 PM, 受春柏 <sc...@126.com> wrote:
> >
> > > Hi all,
> > >
> > > That's very good. Hudi SQL syntax can support Flink, Hive and other
> > > analysis components at the same time.
> > > But there are some questions about SparkSQL. SparkSQL syntax
> > > conflicts with Calcite syntax. Is our strategy
> > > user migration or syntax compatibility?
> > > In addition, will it also support write SQL?
> > >
> > > At 2020-12-19 02:10:16, "Nishith" <n3.nas...@gmail.com> wrote:
> > >
> > > That’s awesome. Looks like we have a consensus on Calcite. Look forward to
> > > the RFC as well!
> > >
> > > -Nishith
> > >
> > > On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > Sounds good. Look forward to an RFC/DISCUSS thread.
> > > Thanks
> > > Vinoth
> > >
> > > On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <danny0...@apache.org> wrote:
> > >
> > > Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I will
> > > add support for SQL connectors for Hoodie Flink soon ~
> > > Currently, I'm preparing a refactoring of the current Flink writer code.
> > >
> > > On Fri, Dec 18, 2020 at 6:39 AM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > Thanks Kabeer for the note on gmail. Did not realize that. :)
> > >
> > > My desired use case is that users use the Hoodie CLI to execute these SQLs.
> > > They can choose which engine to use via a CLI config option.
> > >
> > > Yes, that is also another attractive aspect of this route. We can build out
> > > a common SQL layer and have this translate to the underlying engine (sounds
> > > like Hive huh)
> > > Longer term, if we really think we can more easily implement a full DML +
> > > DDL + DQL, we can proceed with this.
> > >
> > > As others pointed out, for Spark SQL, it might be good to try the Spark
> > > extensions route, before we take this on more fully.
> > >
> > > The other part where Calcite is great is all the support for
> > > windowing/streaming in its syntax.
> > > Danny, I guess we should be able to leverage that through a deeper
> > > Flink/Hudi integration?
> > >
> > > On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > I think Dongwook is investigating along the same lines, and it does seem
> > > better to pursue this first, before trying other approaches.
> > >
> > > On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pengzhiwei2...@icloud.com.invalid> wrote:
> > >
> > > Yeah, I agree with Nishith that one option is to look at ways to
> > > plug in custom logical and physical plans in Spark. It can simplify the
> > > implementation and reuse the Spark SQL syntax. Also, users familiar with
> > > Spark SQL will be able to use Hudi's SQL features more quickly.
> > > In fact, Spark provides the SparkSessionExtensions API for
> > > implementing custom syntax extensions and SQL rewrite rules:
> > > https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> > > We can use SparkSessionExtensions to extend the Hoodie SQL syntax, such
> > > as MERGE INTO and DDL syntax.
> > >
> > > On Dec 15, 2020, at 3:27 PM, Nishith <n3.nas...@gmail.com> wrote:
> > >
> > > Thanks for starting this thread Vinoth.
> > > In general, I definitely see the need for SQL style semantics on Hudi
> > > tables. Apache Calcite is a great option to consider given that DatasourceV2
> > > has the limitations that you described.
> > >
> > > Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
> > > same SQL semantics need to be supported in other engines like Flink to
> > > provide the same experience to users, which in itself could also be a
> > > considerable amount of work.
> > > So, if we’re able to generalize on the SQL story along Calcite, that would
> > > help reduce redundant work in some sense.
> > > Although, I’m worried about a few things:
> > >
> > > 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
> > > can be harder for users who are moving from “Hudi syntax” to “Spark syntax”
> > > for cross table joins, merges etc. using data frames.
> > > One option is to look
> > > at whether there are ways to plug in custom logical and physical plans in
> > > Spark. This way, although the merge-on-SparkSQL functionality may not be as
> > > simple to use, it wouldn’t take away performance and feature set for
> > > starters. In the future we could think of having the entire query space be
> > > powered by Calcite like you mentioned.
> > > 2) If we continue to use DatasourceV1, is there any downside to this from
> > > a performance and optimization perspective when executing the plan? I’m
> > > guessing not but haven’t delved into the code to see if there’s anything
> > > different apart from the API and spec.
> > >
> > > On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > > Hello all,
> > > >
> > > > Just bumping this thread again
> > > >
> > > > thanks
> > > > vinoth
> > > >
> > > > On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > >
> > > > Hello all,
> > > >
> > > > One feature that keeps coming up is the ability to use UPDATE, MERGE SQL
> > > > syntax to support writing into Hudi tables. We have looked into the Spark 3
> > > > DataSource V2 APIs as well and found several issues that hinder us in
> > > > implementing this via the Spark APIs:
> > > >
> > > > - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> > > >   external datasources like Hudi. Only DELETE is.
> > > > - The DataSource V2 API offers no flexibility to perform any kind of
> > > >   further transformations to the dataframe. Hudi supports keys, indexes,
> > > >   preCombining and custom partitioning that ensures file sizes etc. All this
> > > >   needs shuffling data, looking up/joining against other dataframes and so
> > > >   forth. Today, the DataSource V1 API allows this kind of further
> > > >   partitions/transformations.
> > > >   But the V2 API simply offers partition-level
> > > >   iteration once the user calls df.write.format("hudi")
> > > >
> > > > One thought I had is to explore Apache Calcite and write an adapter for
> > > > Hudi. This frees us from being very dependent on a particular engine's
> > > > syntax support like Spark. Calcite is very popular by itself and supports
> > > > most of the keywords (and also more streaming-friendly syntax). To be
> > > > clear, we will still be using Spark/Flink underneath to perform the actual
> > > > writing, just that the SQL grammar is provided by Calcite.
> > > >
> > > > To give a taste of how this will look:
> > > >
> > > > A) If the user wants to mutate a Hudi table using SQL
> > > >
> > > > Instead of writing something like: spark.sql("UPDATE ....")
> > > > users will write: hudiSparkSession.sql("UPDATE ....")
> > > >
> > > > B) To save a Spark data frame to a Hudi table
> > > > we continue to use Spark DataSource V1
> > > >
> > > > The obvious challenge I see is the disconnect with the Spark DataFrame
> > > > ecosystem. Users would write MERGE SQL statements by joining against other
> > > > Spark DataFrames.
> > > > If we want those expressed in Calcite as well, we need to also invest in
> > > > the full Query side support, which can increase the scope by a lot.
> > > > Some amount of investigation needs to happen, but ideally we should be
> > > > able to integrate with the sparkSQL catalog and reuse all the tables there.
> > > >
> > > > I am sure there are some gaps in my thinking. Just starting this thread,
> > > > so we can discuss and others can chime in/correct me.
> > > >
> > > > thanks
> > > > vinoth
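
For readers of the archive, here is a minimal sketch of the routing idea raised in the thread: a common SQL layer handles the statements it understands and hands everything else to the underlying engine. Every name here (CommonSqlLayer, handled_prefixes, fake_spark_engine) is hypothetical and is not a Hudi, Calcite or Spark API. The keyword-prefix check is only a stand-in for a real parser such as the one Calcite would provide.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CommonSqlLayer:
    """Hypothetical dispatcher: handles engine-independent statements itself
    and routes everything else to the configured engine."""

    # Callable that runs a statement on the engine (assumption: takes and returns str).
    engine: Callable[[str], str]
    # Leading keywords the common layer claims to understand (assumed list).
    handled_prefixes: List[str] = field(
        default_factory=lambda: ["CREATE TABLE", "DROP TABLE", "SHOW"]
    )

    def can_handle(self, sql: str) -> bool:
        # Collapse whitespace and upper-case so the prefix match is case-insensitive.
        normalized = " ".join(sql.split()).upper()
        return any(normalized.startswith(p) for p in self.handled_prefixes)

    def execute(self, sql: str) -> str:
        if self.can_handle(sql):
            # A real layer would parse and run the statement here; this only tags it.
            return "common:" + " ".join(sql.split())
        # Anything the common layer does not recognise goes to the engine unchanged.
        return self.engine(sql)


def fake_spark_engine(sql: str) -> str:
    """Stand-in for a real engine call such as spark.sql(sql)."""
    return "spark:" + sql


layer = CommonSqlLayer(engine=fake_spark_engine)
ddl_result = layer.execute("create table t (id int)")
dml_result = layer.execute("MERGE INTO t USING s ON t.id = s.id")
```

In this sketch the DDL statement stays in the common layer while the MERGE statement falls through to the engine, which mirrors the "handle it if you can, otherwise route it" behaviour described above.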