I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.



On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <[email protected]>
wrote:

>    Yeah, I agree with Nishith that one option is to look at ways to
> plug in custom logical and physical plans in Spark. That can simplify the
> implementation and reuse the Spark SQL syntax, and users familiar with
> Spark SQL will be able to pick up Hudi's SQL features more quickly.
> In fact, Spark provides the SparkSessionExtensions interface for
> implementing custom syntax extensions and SQL rewrite rules:
> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html.
> We can use SparkSessionExtensions to extend Hudi SQL syntax such
> as MERGE INTO and DDL statements.
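>
> As a rough sketch (the class names HoodieSparkSessionExtension and
> HoodieSqlParser below are hypothetical; SparkSessionExtensions and its
> injectParser/injectResolutionRule hooks are the real Spark APIs), the wiring
> could look something like this:
>
>   import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
>
>   // Hypothetical extension, enabled via spark.sql.extensions=<this class name>
>   class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
>     override def apply(extensions: SparkSessionExtensions): Unit = {
>       // Wrap the default parser so MERGE INTO / Hudi DDL statements can be
>       // intercepted, while everything else falls through to the delegate.
>       extensions.injectParser { (session, delegate) =>
>         new HoodieSqlParser(session, delegate) // hypothetical parser
>       }
>       // Resolution/rewrite rules for the new statements can be injected too:
>       // extensions.injectResolutionRule { session => HoodieAnalysis(session) }
>     }
>   }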
>
> On Dec 15, 2020, at 3:27 PM, Nishith <[email protected]> wrote:
>
> Thanks for starting this thread, Vinoth.
> In general, I definitely see the need for SQL-style semantics on Hudi
> tables. Apache Calcite is a great option to consider, given that DatasourceV2
> has the limitations you described.
>
> Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
> same SQL semantics would need to be supported in other engines like Flink to
> provide the same experience to users - which in itself could also be a
> considerable amount of work.
> So, if we're able to generalize the SQL story around Calcite, that would
> help reduce redundant work in some sense.
> That said, I'm worried about a few things:
>
> 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
> can be harder for users who are moving from "Hudi syntax" to "Spark syntax"
> for cross-table joins, merges etc. using data frames. One option is to look
> at whether there are ways to plug in custom logical and physical plans in
> Spark. That way, although the MERGE-on-SparkSQL functionality may not be as
> simple to use, it wouldn't take away performance or feature set for
> starters; in the future we could think of having the entire query space be
> powered by Calcite like you mentioned.
> 2) If we continue to use DatasourceV1, is there any downside from
> a performance and optimization perspective when executing the plan? I'm
> guessing not, but I haven't delved into the code to see if there's anything
> different apart from the API and spec.
>
> On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <[email protected]> wrote:
>
>
> Hello all,
>
>
> Just bumping this thread again
>
>
> thanks
>
> vinoth
>
>
> On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:
>
>
> Hello all,
>
> One feature that keeps coming up is the ability to use UPDATE/MERGE SQL
> syntax to support writing into Hudi tables. We have looked into the Spark 3
> DataSource V2 APIs as well and found several issues that hinder us in
> implementing this via the Spark APIs:
>
> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> external datasources like Hudi; only DELETE is.
> - The DataSource V2 API offers no flexibility to perform any kind of
> further transformation on the dataframe. Hudi supports keys, indexes,
> preCombining and custom partitioning that ensures file sizes, etc. All of this
> needs shuffling data, looking up/joining against other dataframes, and so forth.
> Today, the DataSource V1 API allows this kind of further
> partitioning/transformation, but the V2 API simply offers partition-level
> iteration once the user calls df.write.format("hudi").
>
> One thought I had is to explore Apache Calcite and write an adapter for
> Hudi. This frees us from being very dependent on a particular engine's
> syntax support, like Spark's. Calcite is very popular by itself and supports
> most of the key words (and also more streaming-friendly syntax). To be
> clear, we would still be using Spark/Flink underneath to perform the actual
> writing; only the SQL grammar is provided by Calcite.
>
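> To make the Calcite piece a bit more concrete, here is a minimal sketch
> (table and column names are made up) of Calcite supplying the grammar: its
> core parser already understands UPDATE/MERGE and yields a SqlNode tree that
> a Hudi adapter could translate into an upsert, independent of Spark's SQL
> support.
>
>   import org.apache.calcite.sql.parser.SqlParser
>
>   // Parse an UPDATE statement with Calcite's built-in parser; no Spark involved.
>   val sql = "UPDATE hudi_trips SET fare = fare * 1.1 WHERE driver = 'driver-001'"
>   val sqlNode = SqlParser.create(sql).parseStmt()
>   // sqlNode is a SqlUpdate node; an adapter would walk this tree and turn it
>   // into a Hudi upsert, with Spark/Flink doing the actual writing.
>   println(sqlNode.getKind) // UPDATE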
>
> To give a taste of how this would look:
>
> A) If the user wants to mutate a Hudi table using SQL,
> instead of writing something like: spark.sql("UPDATE ....")
> users will write: hudiSparkSession.sql("UPDATE ....")
>
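> For example, a MERGE written against such a session might look roughly like
> this (hudiSparkSession and the table/column names are purely illustrative;
> nothing like this exists today):
>
>   // Illustrative only: "hudiSparkSession" would wrap a normal SparkSession
>   // and route the statement through the Calcite-provided grammar.
>   hudiSparkSession.sql(
>     """MERGE INTO hudi_trips AS t
>       |USING trip_updates AS s
>       |ON t.uuid = s.uuid
>       |WHEN MATCHED THEN UPDATE SET t.fare = s.fare
>       |WHEN NOT MATCHED THEN
>       |  INSERT (uuid, fare, driver) VALUES (s.uuid, s.fare, s.driver)
>       |""".stripMargin)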
>
> B) To save a Spark data frame to a Hudi table,
> we continue to use Spark DataSource V1.
>
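> That write path already exists today; a minimal sketch (the path, table name
> and field names are just placeholders, and df is an existing DataFrame):
>
>   import org.apache.spark.sql.SaveMode
>
>   // Standard Hudi DataSource V1 write: keying, precombine, indexing and file
>   // sizing all happen behind this call.
>   df.write.format("hudi")
>     .option("hoodie.table.name", "hudi_trips")
>     .option("hoodie.datasource.write.recordkey.field", "uuid")
>     .option("hoodie.datasource.write.precombine.field", "ts")
>     .option("hoodie.datasource.write.partitionpath.field", "driver")
>     .mode(SaveMode.Append)
>     .save("/tmp/hudi_trips")
>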
> The obvious challenge I see is the disconnect with the Spark DataFrame
> ecosystem. Users would write MERGE SQL statements by joining against other
> Spark DataFrames.
> If we want those expressed in Calcite as well, we need to also invest in
> the full query-side support, which can increase the scope by a lot.
> Some amount of investigation needs to happen, but ideally we should be
> able to integrate with the Spark SQL catalog and reuse all the tables there.
>
> I am sure there are some gaps in my thinking. Just starting this thread,
> so we can discuss and others can chime in/correct me.
>
>
> thanks
>
> vinoth
>
>
>
