Re: [DISCUSS] Move to spark v2 datasource API

2021-07-20 Thread Vinoth Chandar
Hi Siva, Regarding the ability to specify distribution and sorting: can they be dynamic, and not just set at table creation time? Hudi is really a storage system, i.e. it has a specific layout of data with multiple tables (ro, rt) exposed. So all of these "file" management APIs tend to fit poorly at times. To your

Re: [DISCUSS] Move to spark v2 datasource API

2021-07-15 Thread Sivabalan
I don't have much knowledge of the catalog, but is there an option of exploring a Spark catalog based table to create a Hudi table? I do know that with Spark 3.2, you can add Distribution (a.k.a. partitioning) and Sort order to your table. But I'm still not sure about custom transformations for indexing, etc. Also,

[DISCUSS] Move to spark v2 datasource API

2021-07-14 Thread Vinoth Chandar
Folks, As you may know, we still use the V1 API, given the flexibility it offers to further transform the dataframe after one calls `df.write.format()`, to implement a fully featured write pipeline with precombining, indexing, and custom partitioning. The V2 API takes this away and rather provides a very