I don't have much knowledge of the catalog APIs, but is there an option of exploring a Spark catalog-based table to create a Hudi table? I do know that with Spark 3.2, you can attach a required Distribution (a.k.a. partitioning) and Sort order to your table's writes. But I am still not sure about custom transformations for indexing, etc.
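As a minimal sketch of that Spark 3.2 capability, assuming the `RequiresDistributionAndOrdering` interface from the Spark 3.2 connector API; the class and column names here (`HudiV2Write`, `partition_path`, `record_key`) are hypothetical placeholders, not anything Hudi ships:

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortDirection, SortOrder}
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering

// Hypothetical V2 Write that asks Spark (3.2+) to repartition and sort
// incoming rows before they reach the data writers.
class HudiV2Write extends RequiresDistributionAndOrdering {

  // Cluster rows by the table's partition column, so each write task
  // sees all rows for a given partition path.
  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.identity("partition_path")))

  // Within each task, sort rows by record key before writing.
  override def requiredOrdering(): Array[SortOrder] =
    Array[SortOrder](Expressions.sort(Expressions.column("record_key"), SortDirection.ASCENDING))
}
```

Spark plugs the required distribution and ordering into the write's physical plan, so the repartition and sort happen before rows ever reach the per-partition writers.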
Also, regarding Option 2, is there a way to avoid explicitly asking users to switch to hudi_v2 once we have one? For example, from within our V1 createRelation, can't we bypass it and delegate to hudi_v2 somehow? Directly using hudi_v2 should also be an option. I need to explore more along these lines, but I am just putting it out there. Once we make some headway on this (with some Spark expertise), I can definitely contribute from my side on this project.

--
Regards,
-Sivabalan

On Thu, Jul 15, 2021 at 12:13 AM Vinoth Chandar <vin...@apache.org> wrote:

> Folks,
>
> As you may know, we still use the V1 API, given the flexibility it offers
> to further transform the dataframe after one calls `df.write.format()`,
> which lets us implement a fully featured write pipeline with precombining,
> indexing, and custom partitioning. The V2 API takes this away and instead
> provides a very restrictive, partition-level write interface that simply
> hands each writer an Iterator<InternalRow>.
>
> That said, V2 has evolved (again) in Spark 3, and we would like to move to
> the V2 APIs at some point, for both querying and writing. This thread
> summarizes a few approaches we can take.
>
> *Option 1: Introduce a pre-write hook in the Spark Datasource API*
>
> If the datasource API provided a simple way to further transform the
> dataframe after the call to df.write.format(), we would be able to move
> much of our HoodieSparkSQLWriter logic into that hook and make the
> transition.
>
> Sivabalan engaged with the Spark community around this, without luck.
> Anyone who can help revive this or make a more successful attempt, please
> chime in.
>
> *Option 2: Introduce a new datasource hudi_v2 for Spark Datasource + a
> HoodieSparkClient API*
>
> We would limit datasource writes to simply bulk_inserts or
> insert_overwrites. All other write operations would be supported via a new
> HoodieSparkClient API (similar to all the write clients we have, but
> working with Dataset<Row>). Queries would be supported on the V2 APIs.
> This would be done only for Spark 3.
>
> We would still keep the current V1 support for as long as Spark supports
> it. Obviously, users would have to migrate their pipelines to hudi_v2 at
> some point, if datasource V1 support is dropped.
>
> My concern is that having two datasources would cause greater confusion
> for users.
>
> Maybe there are other approaches that I did not list here. Please add
> them.
>
> Thanks
> Vinoth
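To make concrete the V1 flexibility described above, here is a minimal Scala sketch of a `CreatableRelationProvider`, the V1 entry point that receives the whole DataFrame and can therefore reshape it before writing. The class name and column names (`record_key`, `partition_path`) are hypothetical placeholders, and the transformations only gesture at what HoodieSparkSQLWriter actually does:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}

// Hypothetical V1-style source: createRelation sees the whole DataFrame,
// so precombining, custom partitioning, and index-friendly layout can all
// happen here -- exactly the hook the restrictive V2 write path lacks.
class HudiV1StyleSource extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    val prepared = data
      .dropDuplicates("record_key")          // stand-in for the precombine/dedup step
      .repartition(data("partition_path"))   // custom partitioning
      .sortWithinPartitions("record_key")    // layout that helps indexing
    // ... hand `prepared` to the actual write path, then return a relation
    // over the written table for reads ...
    ???  // sketch only
  }
}
```

Under the V2 write path there is no equivalent single point that sees the whole dataset, which is what makes precombining and indexing hard to express there.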