I don't have much knowledge of the catalog APIs, but is there an option of exploring a Spark catalog-based table to create a Hudi table? I do know that with Spark 3.2, you can attach a required Distribution (a.k.a. partitioning) and Sort order to your table's writes. But I am still not sure about custom transformations for indexing, etc.
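As a minimal sketch of that Spark 3.2 capability, assuming the `RequiresDistributionAndOrdering` interface from the Spark 3.2 connector API; the class and column names here (`HudiV2Write`, `partition_path`, `record_key`) are hypothetical placeholders, not anything Hudi ships:

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortDirection, SortOrder}
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering

// Hypothetical V2 Write that asks Spark (3.2+) to repartition and sort
// incoming rows before they reach the data writers.
class HudiV2Write extends RequiresDistributionAndOrdering {

  // Cluster rows by the table's partition column, so each write task
  // sees all rows for a given partition path.
  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.identity("partition_path")))

  // Within each task, sort rows by record key before writing.
  override def requiredOrdering(): Array[SortOrder] =
    Array[SortOrder](Expressions.sort(Expressions.column("record_key"), SortDirection.ASCENDING))
}
```

Spark plugs the required distribution and ordering into the write's physical plan, so the repartition and sort happen before rows ever reach the per-partition writers.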
Also, regarding Option 2, is there a way to avoid explicitly asking users to switch to hudi_v2 once we have one? For example, from within our V1 createRelation, can't we bypass it and delegate to hudi_v2 somehow? Directly using hudi_v2 should also be an option. I need to explore more along these lines, but I am just putting it out there. Once we make some headway on this (with some Spark expertise), I can definitely contribute from my side on this project.

--
Regards,
-Sivabalan

On Thu, Jul 15, 2021 at 12:13 AM Vinoth Chandar <vin...@apache.org> wrote:

> Folks,
>
> As you may know, we still use the V1 API, given the flexibility it offers
> to further transform the dataframe after one calls `df.write.format()`,
> which lets us implement a fully featured write pipeline with precombining,
> indexing, and custom partitioning. The V2 API takes this away and instead
> provides a very restrictive, partition-level write interface that simply
> hands each writer an Iterator<InternalRow>.
>
> That said, V2 has evolved (again) in Spark 3, and we would like to move to
> the V2 APIs at some point, for both querying and writing. This thread
> summarizes a few approaches we can take.
>
> *Option 1: Introduce a pre-write hook in the Spark Datasource API*
>
> If the datasource API provided a simple way to further transform the
> dataframe after the call to df.write.format(), we would be able to move
> much of our HoodieSparkSQLWriter logic into that hook and make the
> transition.
>
> Sivabalan engaged with the Spark community around this, without luck.
> Anyone who can help revive this or make a more successful attempt, please
> chime in.
>
> *Option 2: Introduce a new datasource hudi_v2 for Spark Datasource + a
> HoodieSparkClient API*
>
> We would limit datasource writes to simply bulk_inserts or
> insert_overwrites. All other write operations would be supported via a new
> HoodieSparkClient API (similar to all the write clients we have, but
> working with Dataset<Row>). Queries would be supported on the V2 APIs.
> This would be done only for Spark 3.
>
> We would still keep the current V1 support for as long as Spark supports
> it. Obviously, users would have to migrate their pipelines to hudi_v2 at
> some point, if datasource V1 support is dropped.
>
> My concern is that having two datasources would cause greater confusion
> for users.
>
> Maybe there are other approaches that I did not list here. Please add
> them.
>
> Thanks
> Vinoth
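To make concrete the V1 flexibility described above, here is a minimal Scala sketch of a `CreatableRelationProvider`, the V1 entry point that receives the whole DataFrame and can therefore reshape it before writing. The class name and column names (`record_key`, `partition_path`) are hypothetical placeholders, and the transformations only gesture at what HoodieSparkSQLWriter actually does:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}

// Hypothetical V1-style source: createRelation sees the whole DataFrame,
// so precombining, custom partitioning, and index-friendly layout can all
// happen here -- exactly the hook the restrictive V2 write path lacks.
class HudiV1StyleSource extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    val prepared = data
      .dropDuplicates("record_key")          // stand-in for the precombine/dedup step
      .repartition(data("partition_path"))   // custom partitioning
      .sortWithinPartitions("record_key")    // layout that helps indexing
    // ... hand `prepared` to the actual write path, then return a relation
    // over the written table for reads ...
    ???  // sketch only
  }
}
```

Under the V2 write path there is no equivalent single point that sees the whole dataset, which is what makes precombining and indexing hard to express there.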