@sivabalan the current plan is to add this only for `_hoodie_record_key`. But I'm hoping to make the implementation general enough to support other columns as well going forward :)
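
To make the "general enough" part a bit more concrete, here is a rough sketch of one possible shape, not a settled design. Every name in it (`VirtualColumnSketch`, `VirtualColumn`, `fromFields`, the `city` and `trip_id` fields, the `:` separator) is made up for illustration and is not an existing Hudi API. A row is modeled as a plain `Map` to keep the example self-contained. The idea is that a virtual column is recomputed from other stored columns at read time rather than written to DFS.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch only: none of these names are existing Hudi APIs.
final class VirtualColumnSketch {

    /** A column computed from other stored columns instead of being written to DFS. */
    interface VirtualColumn {
        String name();

        String compute(Map<String, Object> row);
    }

    /** Builds a virtual column by joining the values of the given source fields. */
    static VirtualColumn fromFields(String name, List<String> sourceFields, String separator) {
        return new VirtualColumn() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public String compute(Map<String, Object> row) {
                return sourceFields.stream()
                        .map(field -> {
                            Object value = row.get(field);
                            if (value == null) {
                                // A key cannot be rebuilt if one of its inputs is absent.
                                throw new IllegalArgumentException("missing source field: " + field);
                            }
                            return value.toString();
                        })
                        .collect(Collectors.joining(separator));
            }
        };
    }

    public static void main(String[] args) {
        // Assumed example row; field names are illustrative.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("trip_id", "t-42");
        row.put("city", "sf");

        VirtualColumn recordKey =
                fromFields("_hoodie_record_key", List.of("city", "trip_id"), ":");
        System.out.println(recordKey.name() + " = " + recordKey.compute(row));
    }
}
```

Because the interface only knows a column name and a function over the row, the same shape could in principle cover the partition path or other columns later, which is the generality I'm hoping for.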

On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <[email protected]> wrote:

> +1 for virtual record keys. Do you also propose to generalize this for
> the partition path as well?
>
> On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <[email protected]> wrote:
>
> > This is a good option to have. :)
> >
> > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <[email protected]> wrote:
> >
> > > IIRC _hoodie_record_key was supposed to be this standardized key field. :)
> > > Anyways, it's good to provide this option to the user.
> > > So +1 for RFC/further discussion.
> > >
> > > To level set, I want to also share some of the benefits of having an
> > > explicit key column.
> > > a) If you build your data lake using a bunch of Hudi tables, you now
> > > have a standardized data model.
> > > b) Even if your key generator changes, it does not affect the existing
> > > data's keys, and updates will be matched correctly.
> > >
> > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
> > > <[email protected]> wrote:
> > >
> > > > +1. This should be good to have as an option. If everybody agrees,
> > > > please go ahead with the RFC and we can discuss details there.
> > > > Balaji.V
> > > >
> > > > On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
> > > > <[email protected]> wrote:
> > > >
> > > > Hi everyone!
> > > >
> > > > I was hoping to discuss adding support for making `_hoodie_record_key`
> > > > a virtual column :)
> > > >
> > > > Context:
> > > > Currently, _hoodie_record_key is written to DFS as a column in the
> > > > Parquet file. In our production systems at Uber, however,
> > > > _hoodie_record_key contains data that can be found in a different
> > > > column (or set of columns). This means that we are storing
> > > > duplicated data.
> > > >
> > > > Proposal:
> > > > In the interest of improving storage efficiency, we could add confs /
> > > > abstract classes that can construct the _hoodie_record_key given
> > > > other columns. That way we do not have to store duplicated data on DFS.
> > > >
> > > > Any thoughts on this?
> > > >
> > > > Best,
> > > > Modi
>
> --
> Regards,
> -Sivabalan
