Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

[email protected] Fri, 04 Sep 2020 10:09:12 -0700

 Regarding queries taking a hit, we need to distinguish queries which do not 
need _hoodie_record_key column vs those that need them. I would think the most 
common case would be the ones that does not required _hoodie_record_key. This 
is similar to how hive query integration for bootstrapped table is supported. 
We should implement on similar lines to preserve fast path.
At the end of the day, this is going to be an optional feature. For cases, 
where the storage overhead of _hoodie_record_key is higher, this feature would 
help.
Thanks,Balaji.V
    On Wednesday, September 2, 2020, 05:07:49 PM PDT, Gary Li 
<[email protected]> wrote:  
 
 Yes it works this way, IMO the performance of reading certain fields from the 
row and then to build the key from there is not as efficient as having a 
separate key from the actual record. But if we only make it virtual in the 
parquet then this should be fine.


For parquet, the downside I could see is we won’t be able to use ColumnBatch to 
fetch the key anymore. To get the key, we might need to scan the entire row. 
But since most of the data are not being touched during the upsert then I could 
see the storage benefit over the read performance.

>From the reader side, are we trying to make _hoodie_record_key still exist in 
>the table view or excluding it from the query side? If we want to still 
>include it, then we might need to support the virtual feature for multiple 
>query engines to get the consistent view.

No mean to stop this making progress. Just some questions pop into my head. I 
think my questions are all solvable. Happy to discuss more in the RFC if we 
move forward :)

Best,
Gary

Gary Li
________________________________
From: Vinoth Chandar <[email protected]>
Sent: Wednesday, September 2, 2020 11:07:23 PM
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Hi Gary,

It should not be that bad, right. We can just create a new implementation
of HoodieRecord where getKey() can dynamically fetch a field's value as the
recordKey.
Then the rest of the code should work as-is?

On Tue, Sep 1, 2020 at 5:22 PM Gary Li <[email protected]> wrote:

> Thanks for the proposal. I am a bit concerning about this feature.
> _hoodie_record_key is the primary key of the Hudi table and we need this
> field for indexing, sorting, merging e.t.c. We are using _hoodie_record_key
> very frequently so I can see the overhead on both write side and read side.
> I am not sure if this is worth it. Make _hoodie_partition_path virtual make
> more sense to me.
>
> On the implementation side, if we support this as an option, we need
> special handles in many places. For example, the LogScanner,
> _hoodie_record_key is the key of the lookup Hashmap and this Map could be
> spilled to the disk as well. There should be significant amount of work
> from my perspective.
>
> -1 for making _hoodie_record_key virtual.
> +1 for _hoodie_partition_path
>
> Best Regards,
> Gary Li
>
>
> On 8/22/20, 9:09 PM, "Sivabalan" <[email protected]> wrote:
>
>    Aah, yes. That’s right.
>
>    On Sat, Aug 22, 2020 at 2:43 AM Vinoth Chandar <[email protected]>
> wrote:
>
>    > All of the remaining meta fields compress very very nicely. They have
>    >
>    > almost no overhead.
>    >
>    >
>    >
>    > On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi <[email protected]
> >
>    >
>    > wrote:
>    >
>    >
>    >
>    > > @sivabalan the current plan is to only add this for
> hoodie_record_key.
>    > But
>    >
>    > > I'm hoping to make the implementation general enough to add other
> columns
>    >
>    > > as well going forward :)
>    >
>    > >
>    >
>    > > On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <[email protected]>
> wrote:
>    >
>    > >
>    >
>    > > > +1 for virtual record keys. Do you also propose to generalize
> this for
>    >
>    > > > partition path as well ?
>    >
>    > > >
>    >
>    > > >
>    >
>    > > > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <
>    > [email protected]>
>    >
>    > > > wrote:
>    >
>    > > >
>    >
>    > > > > This is a good option to have. :)
>    >
>    > > > >
>    >
>    > > > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <
> [email protected]>
>    >
>    > > > wrote:
>    >
>    > > > >
>    >
>    > > > > > IIRC _hoodie_record_key was supposed to this standardized key
>    > field.
>    >
>    > > :)
>    >
>    > > > > > Anyways, it's good to provide this option to the user.
>    >
>    > > > > > So +1 for. RFC/further discussion.
>    >
>    > > > > >
>    >
>    > > > > > To level set, I want to also share some of the benefits of
> having
>    > an
>    >
>    > > > > > explicit key column.
>    >
>    > > > > > a) if you build your data lake using a bunch of hudi tables,
> now
>    > you
>    >
>    > > > > have a
>    >
>    > > > > > standardized data model
>    >
>    > > > > > b) Even if your key generator changes, it does not affect the
>    >
>    > > existing
>    >
>    > > > > > data's keys. and updates will be matched correctly.
>    >
>    > > > > >
>    >
>    > > > > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
>    >
>    > > > > > <[email protected]> wrote:
>    >
>    > > > > >
>    >
>    > > > > > >  +1. This should be good to have as an option. If everybody
>    > agrees,
>    >
>    > > > > > please
>    >
>    > > > > > > go ahead with RFC and we can discuss details there.
>    >
>    > > > > > > Balaji.V    On Tuesday, August 18, 2020, 04:37:18 PM PDT,
>    > Abhishek
>    >
>    > > > Modi
>    >
>    > > > > > > <[email protected]> wrote:
>    >
>    > > > > > >
>    >
>    > > > > > >  Hi everyone!
>    >
>    > > > > > >
>    >
>    > > > > > > I was hoping to discuss adding support for making
>    >
>    > > > `_hoodie_record_key`
>    >
>    > > > > a
>    >
>    > > > > > > virtual column :)
>    >
>    > > > > > >
>    >
>    > > > > > > Context:
>    >
>    > > > > > > Currently, _hoodie_record_key is written to DFS, as a
> column in
>    > the
>    >
>    > > > > > Parquet
>    >
>    > > > > > > file. In our production systems at Uber however,
>    > _hoodie_record_key
>    >
>    > > > > > > contains data that can be found in a different column (or
> set of
>    >
>    > > > > > columns).
>    >
>    > > > > > > This means that we are storing duplicated data.
>    >
>    > > > > > >
>    >
>    > > > > > > Proposal:
>    >
>    > > > > > > In the interest of improving storage efficiency, we could
> add
>    >
>    > > confs /
>    >
>    > > > > > > abstract classes that can construct the _hoodie_record_key
> given
>    >
>    > > > other
>    >
>    > > > > > > columns. That way we do not have to store duplicated data
> on DFS.
>    >
>    > > > > > >
>    >
>    > > > > > > Any thoughts on this?
>    >
>    > > > > > >
>    >
>    > > > > > > Best,
>    >
>    > > > > > > Modi
>    >
>    > > > > > >
>    >
>    > > > > >
>    >
>    > > > >
>    >
>    > > >
>    >
>    > > >
>    >
>    > > > --
>    >
>    > > > Regards,
>    >
>    > > > -Sivabalan
>    >
>    > > >
>    >
>    > >
>    >
>    > --
>    Regards,
>    -Sivabalan
>
>

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Reply via email to