Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Vinoth Chandar Wed, 09 Sep 2020 14:34:04 -0700

+1

Good point btw, gary! we have to have some mechanism to reconstruct the key
for the reader side.


On Fri, Sep 4, 2020 at 10:08 AM [email protected] <[email protected]>
wrote:

>  Regarding queries taking a hit, we need to distinguish queries which do
> not need _hoodie_record_key column vs those that need them. I would think
> the most common case would be the ones that does not required
> _hoodie_record_key. This is similar to how hive query integration for
> bootstrapped table is supported. We should implement on similar lines to
> preserve fast path.
> At the end of the day, this is going to be an optional feature. For cases,
> where the storage overhead of _hoodie_record_key is higher, this feature
> would help.
> Thanks,Balaji.V
>     On Wednesday, September 2, 2020, 05:07:49 PM PDT, Gary Li <
> [email protected]> wrote:
>
>  Yes it works this way, IMO the performance of reading certain fields from
> the row and then to build the key from there is not as efficient as having
> a separate key from the actual record. But if we only make it virtual in
> the parquet then this should be fine.
>
> For parquet, the downside I could see is we won’t be able to use
> ColumnBatch to fetch the key anymore. To get the key, we might need to scan
> the entire row. But since most of the data are not being touched during the
> upsert then I could see the storage benefit over the read performance.
>
> From the reader side, are we trying to make _hoodie_record_key still exist
> in the table view or excluding it from the query side? If we want to still
> include it, then we might need to support the virtual feature for multiple
> query engines to get the consistent view.
>
> No mean to stop this making progress. Just some questions pop into my
> head. I think my questions are all solvable. Happy to discuss more in the
> RFC if we move forward :)
>
> Best,
> Gary
>
> Gary Li
> ________________________________
> From: Vinoth Chandar <[email protected]>
> Sent: Wednesday, September 2, 2020 11:07:23 PM
> To: [email protected] <[email protected]>
> Subject: Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column
>
> Hi Gary,
>
> It should not be that bad, right. We can just create a new implementation
> of HoodieRecord where getKey() can dynamically fetch a field's value as the
> recordKey.
> Then the rest of the code should work as-is?
>
> On Tue, Sep 1, 2020 at 5:22 PM Gary Li <[email protected]> wrote:
>
> > Thanks for the proposal. I am a bit concerning about this feature.
> > _hoodie_record_key is the primary key of the Hudi table and we need this
> > field for indexing, sorting, merging e.t.c. We are using
> _hoodie_record_key
> > very frequently so I can see the overhead on both write side and read
> side.
> > I am not sure if this is worth it. Make _hoodie_partition_path virtual
> make
> > more sense to me.
> >
> > On the implementation side, if we support this as an option, we need
> > special handles in many places. For example, the LogScanner,
> > _hoodie_record_key is the key of the lookup Hashmap and this Map could be
> > spilled to the disk as well. There should be significant amount of work
> > from my perspective.
> >
> > -1 for making _hoodie_record_key virtual.
> > +1 for _hoodie_partition_path
> >
> > Best Regards,
> > Gary Li
> >
> >
> > On 8/22/20, 9:09 PM, "Sivabalan" <[email protected]> wrote:
> >
> >    Aah, yes. That’s right.
> >
> >    On Sat, Aug 22, 2020 at 2:43 AM Vinoth Chandar <[email protected]>
> > wrote:
> >
> >    > All of the remaining meta fields compress very very nicely. They
> have
> >    >
> >    > almost no overhead.
> >    >
> >    >
> >    >
> >    > On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi
> <[email protected]
> > >
> >    >
> >    > wrote:
> >    >
> >    >
> >    >
> >    > > @sivabalan the current plan is to only add this for
> > hoodie_record_key.
> >    > But
> >    >
> >    > > I'm hoping to make the implementation general enough to add other
> > columns
> >    >
> >    > > as well going forward :)
> >    >
> >    > >
> >    >
> >    > > On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <[email protected]>
> > wrote:
> >    >
> >    > >
> >    >
> >    > > > +1 for virtual record keys. Do you also propose to generalize
> > this for
> >    >
> >    > > > partition path as well ?
> >    >
> >    > > >
> >    >
> >    > > >
> >    >
> >    > > > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <
> >    > [email protected]>
> >    >
> >    > > > wrote:
> >    >
> >    > > >
> >    >
> >    > > > > This is a good option to have. :)
> >    >
> >    > > > >
> >    >
> >    > > > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <
> > [email protected]>
> >    >
> >    > > > wrote:
> >    >
> >    > > > >
> >    >
> >    > > > > > IIRC _hoodie_record_key was supposed to this standardized
> key
> >    > field.
> >    >
> >    > > :)
> >    >
> >    > > > > > Anyways, it's good to provide this option to the user.
> >    >
> >    > > > > > So +1 for. RFC/further discussion.
> >    >
> >    > > > > >
> >    >
> >    > > > > > To level set, I want to also share some of the benefits of
> > having
> >    > an
> >    >
> >    > > > > > explicit key column.
> >    >
> >    > > > > > a) if you build your data lake using a bunch of hudi tables,
> > now
> >    > you
> >    >
> >    > > > > have a
> >    >
> >    > > > > > standardized data model
> >    >
> >    > > > > > b) Even if your key generator changes, it does not affect
> the
> >    >
> >    > > existing
> >    >
> >    > > > > > data's keys. and updates will be matched correctly.
> >    >
> >    > > > > >
> >    >
> >    > > > > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
> >    >
> >    > > > > > <[email protected]> wrote:
> >    >
> >    > > > > >
> >    >
> >    > > > > > >  +1. This should be good to have as an option. If
> everybody
> >    > agrees,
> >    >
> >    > > > > > please
> >    >
> >    > > > > > > go ahead with RFC and we can discuss details there.
> >    >
> >    > > > > > > Balaji.V    On Tuesday, August 18, 2020, 04:37:18 PM PDT,
> >    > Abhishek
> >    >
> >    > > > Modi
> >    >
> >    > > > > > > <[email protected]> wrote:
> >    >
> >    > > > > > >
> >    >
> >    > > > > > >  Hi everyone!
> >    >
> >    > > > > > >
> >    >
> >    > > > > > > I was hoping to discuss adding support for making
> >    >
> >    > > > `_hoodie_record_key`
> >    >
> >    > > > > a
> >    >
> >    > > > > > > virtual column :)
> >    >
> >    > > > > > >
> >    >
> >    > > > > > > Context:
> >    >
> >    > > > > > > Currently, _hoodie_record_key is written to DFS, as a
> > column in
> >    > the
> >    >
> >    > > > > > Parquet
> >    >
> >    > > > > > > file. In our production systems at Uber however,
> >    > _hoodie_record_key
> >    >
> >    > > > > > > contains data that can be found in a different column (or
> > set of
> >    >
> >    > > > > > columns).
> >    >
> >    > > > > > > This means that we are storing duplicated data.
> >    >
> >    > > > > > >
> >    >
> >    > > > > > > Proposal:
> >    >
> >    > > > > > > In the interest of improving storage efficiency, we could
> > add
> >    >
> >    > > confs /
> >    >
> >    > > > > > > abstract classes that can construct the _hoodie_record_key
> > given
> >    >
> >    > > > other
> >    >
> >    > > > > > > columns. That way we do not have to store duplicated data
> > on DFS.
> >    >
> >    > > > > > >
> >    >
> >    > > > > > > Any thoughts on this?
> >    >
> >    > > > > > >
> >    >
> >    > > > > > > Best,
> >    >
> >    > > > > > > Modi
> >    >
> >    > > > > > >
> >    >
> >    > > > > >
> >    >
> >    > > > >
> >    >
> >    > > >
> >    >
> >    > > >
> >    >
> >    > > > --
> >    >
> >    > > > Regards,
> >    >
> >    > > > -Sivabalan
> >    >
> >    > > >
> >    >
> >    > >
> >    >
> >    > --
> >    Regards,
> >    -Sivabalan
> >
> >
>

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Reply via email to