Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Vinoth Chandar Wed, 02 Sep 2020 08:08:20 -0700

Hi Gary,

It should not be that bad, right. We can just create a new implementation
of HoodieRecord where getKey() can dynamically fetch a field's value as the
recordKey.
Then the rest of the code should work as-is?


On Tue, Sep 1, 2020 at 5:22 PM Gary Li <garyli1...@outlook.com> wrote:

> Thanks for the proposal. I am a bit concerning about this feature.
> _hoodie_record_key is the primary key of the Hudi table and we need this
> field for indexing, sorting, merging e.t.c. We are using _hoodie_record_key
> very frequently so I can see the overhead on both write side and read side.
> I am not sure if this is worth it. Make _hoodie_partition_path virtual make
> more sense to me.
>
> On the implementation side, if we support this as an option, we need
> special handles in many places. For example, the LogScanner,
> _hoodie_record_key is the key of the lookup Hashmap and this Map could be
> spilled to the disk as well. There should be significant amount of work
> from my perspective.
>
> -1 for making _hoodie_record_key virtual.
> +1 for _hoodie_partition_path
>
> Best Regards,
> Gary Li
>
>
> On 8/22/20, 9:09 PM, "Sivabalan" <n.siv...@gmail.com> wrote:
>
>     Aah, yes. That’s right.
>
>     On Sat, Aug 22, 2020 at 2:43 AM Vinoth Chandar <vin...@apache.org>
> wrote:
>
>     > All of the remaining meta fields compress very very nicely. They have
>     >
>     > almost no overhead.
>     >
>     >
>     >
>     > On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi <m...@uber.com.invalid
> >
>     >
>     > wrote:
>     >
>     >
>     >
>     > > @sivabalan the current plan is to only add this for
> hoodie_record_key.
>     > But
>     >
>     > > I'm hoping to make the implementation general enough to add other
> columns
>     >
>     > > as well going forward :)
>     >
>     > >
>     >
>     > > On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <n.siv...@gmail.com>
> wrote:
>     >
>     > >
>     >
>     > > > +1 for virtual record keys. Do you also propose to generalize
> this for
>     >
>     > > > partition path as well ?
>     >
>     > > >
>     >
>     > > >
>     >
>     > > > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <
>     > pratyaks...@gmail.com>
>     >
>     > > > wrote:
>     >
>     > > >
>     >
>     > > > > This is a good option to have. :)
>     >
>     > > > >
>     >
>     > > > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <
> vin...@apache.org>
>     >
>     > > > wrote:
>     >
>     > > > >
>     >
>     > > > > > IIRC _hoodie_record_key was supposed to this standardized key
>     > field.
>     >
>     > > :)
>     >
>     > > > > > Anyways, it's good to provide this option to the user.
>     >
>     > > > > > So +1 for. RFC/further discussion.
>     >
>     > > > > >
>     >
>     > > > > > To level set, I want to also share some of the benefits of
> having
>     > an
>     >
>     > > > > > explicit key column.
>     >
>     > > > > > a) if you build your data lake using a bunch of hudi tables,
> now
>     > you
>     >
>     > > > > have a
>     >
>     > > > > > standardized data model
>     >
>     > > > > > b) Even if your key generator changes, it does not affect the
>     >
>     > > existing
>     >
>     > > > > > data's keys. and updates will be matched correctly.
>     >
>     > > > > >
>     >
>     > > > > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
>     >
>     > > > > > <v.bal...@ymail.com.invalid> wrote:
>     >
>     > > > > >
>     >
>     > > > > > >  +1. This should be good to have as an option. If everybody
>     > agrees,
>     >
>     > > > > > please
>     >
>     > > > > > > go ahead with RFC and we can discuss details there.
>     >
>     > > > > > > Balaji.V    On Tuesday, August 18, 2020, 04:37:18 PM PDT,
>     > Abhishek
>     >
>     > > > Modi
>     >
>     > > > > > > <m...@uber.com.invalid> wrote:
>     >
>     > > > > > >
>     >
>     > > > > > >  Hi everyone!
>     >
>     > > > > > >
>     >
>     > > > > > > I was hoping to discuss adding support for making
>     >
>     > > > `_hoodie_record_key`
>     >
>     > > > > a
>     >
>     > > > > > > virtual column :)
>     >
>     > > > > > >
>     >
>     > > > > > > Context:
>     >
>     > > > > > > Currently, _hoodie_record_key is written to DFS, as a
> column in
>     > the
>     >
>     > > > > > Parquet
>     >
>     > > > > > > file. In our production systems at Uber however,
>     > _hoodie_record_key
>     >
>     > > > > > > contains data that can be found in a different column (or
> set of
>     >
>     > > > > > columns).
>     >
>     > > > > > > This means that we are storing duplicated data.
>     >
>     > > > > > >
>     >
>     > > > > > > Proposal:
>     >
>     > > > > > > In the interest of improving storage efficiency, we could
> add
>     >
>     > > confs /
>     >
>     > > > > > > abstract classes that can construct the _hoodie_record_key
> given
>     >
>     > > > other
>     >
>     > > > > > > columns. That way we do not have to store duplicated data
> on DFS.
>     >
>     > > > > > >
>     >
>     > > > > > > Any thoughts on this?
>     >
>     > > > > > >
>     >
>     > > > > > > Best,
>     >
>     > > > > > > Modi
>     >
>     > > > > > >
>     >
>     > > > > >
>     >
>     > > > >
>     >
>     > > >
>     >
>     > > >
>     >
>     > > > --
>     >
>     > > > Regards,
>     >
>     > > > -Sivabalan
>     >
>     > > >
>     >
>     > >
>     >
>     > --
>     Regards,
>     -Sivabalan
>
>

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Reply via email to