Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Pratyaksh Sharma Fri, 21 Aug 2020 01:20:54 -0700

This is a good option to have. :)

On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <[email protected]> wrote:


> IIRC _hoodie_record_key was supposed to be this standardized key field. :)
> Anyways, it's good to provide this option to the user.
> So +1 for RFC/further discussion.
>
> To level set, I want to also share some of the benefits of having an
> explicit key column.
> a) if you build your data lake using a bunch of hudi tables, now you have a
> standardized data model
> b) Even if your key generator changes, it does not affect the existing
> data's keys, and updates will still be matched correctly.
>
> On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
> <[email protected]> wrote:
>
> > +1. This should be good to have as an option. If everybody agrees, please
> > go ahead with RFC and we can discuss details there.
> > Balaji.V
> >
> > On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
> > <[email protected]> wrote:
> >
> >  Hi everyone!
> >
> > I was hoping to discuss adding support for making `_hoodie_record_key` a
> > virtual column :)
> >
> > Context:
> > Currently, _hoodie_record_key is written to DFS, as a column in the Parquet
> > file. In our production systems at Uber however, _hoodie_record_key
> > contains data that can be found in a different column (or set of columns).
> > This means that we are storing duplicated data.
> >
> > Proposal:
> > In the interest of improving storage efficiency, we could add confs /
> > abstract classes that can construct the _hoodie_record_key given other
> > columns. That way we do not have to store duplicated data on DFS.
> >
> > Any thoughts on this?
> >
> > Best,
> > Modi
> >
>
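
To make the proposal quoted above a bit more concrete, here is a minimal sketch of what "an abstract class that can construct the _hoodie_record_key given other columns" might look like. It is written in Python purely for illustration. Every name in it (RecordKeyBuilder, ColumnsKeyBuilder, with_virtual_key, the ":" separator, the example column names) is hypothetical and is not part of Hudi's API; the actual design would be settled in the RFC.

```python
from abc import ABC, abstractmethod
from typing import Mapping, Sequence


class RecordKeyBuilder(ABC):
    """Hypothetical interface: derive the record key from other columns."""

    @abstractmethod
    def build_key(self, row: Mapping[str, object]) -> str:
        """Return the record key for one row, without reading a stored key."""


class ColumnsKeyBuilder(RecordKeyBuilder):
    """Builds the key by joining the values of the configured columns."""

    def __init__(self, key_columns: Sequence[str], separator: str = ":"):
        if not key_columns:
            raise ValueError("at least one key column is required")
        self.key_columns = list(key_columns)
        self.separator = separator  # assumed separator, not Hudi's format

    def build_key(self, row: Mapping[str, object]) -> str:
        # Refuse to build a key from missing or null columns.
        missing = [c for c in self.key_columns if row.get(c) is None]
        if missing:
            raise KeyError(f"missing key columns: {missing}")
        return self.separator.join(str(row[c]) for c in self.key_columns)


def with_virtual_key(row: Mapping[str, object], builder: RecordKeyBuilder) -> dict:
    """Return a copy of the row with _hoodie_record_key computed on read."""
    out = dict(row)
    out["_hoodie_record_key"] = builder.build_key(row)
    return out


if __name__ == "__main__":
    builder = ColumnsKeyBuilder(["city_id", "trip_id"])
    stored_row = {"city_id": 7, "trip_id": "abc", "fare": 12.5}
    print(with_virtual_key(stored_row, builder))
```

The stored row carries only the source columns, and the key is materialized when the row is read. The trade-off raised earlier in the thread still applies: if the builder or its configured columns change, keys computed for existing data change with it, which an explicitly stored key column avoids.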
