+1 Good point btw, gary! we have to have some mechanism to reconstruct the key for the reader side.
On Fri, Sep 4, 2020 at 10:08 AM [email protected] <[email protected]> wrote: > Regarding queries taking a hit, we need to distinguish queries which do > not need _hoodie_record_key column vs those that need them. I would think > the most common case would be the ones that does not required > _hoodie_record_key. This is similar to how hive query integration for > bootstrapped table is supported. We should implement on similar lines to > preserve fast path. > At the end of the day, this is going to be an optional feature. For cases, > where the storage overhead of _hoodie_record_key is higher, this feature > would help. > Thanks,Balaji.V > On Wednesday, September 2, 2020, 05:07:49 PM PDT, Gary Li < > [email protected]> wrote: > > Yes it works this way, IMO the performance of reading certain fields from > the row and then to build the key from there is not as efficient as having > a separate key from the actual record. But if we only make it virtual in > the parquet then this should be fine. > > For parquet, the downside I could see is we won’t be able to use > ColumnBatch to fetch the key anymore. To get the key, we might need to scan > the entire row. But since most of the data are not being touched during the > upsert then I could see the storage benefit over the read performance. > > From the reader side, are we trying to make _hoodie_record_key still exist > in the table view or excluding it from the query side? If we want to still > include it, then we might need to support the virtual feature for multiple > query engines to get the consistent view. > > No mean to stop this making progress. Just some questions pop into my > head. I think my questions are all solvable. Happy to discuss more in the > RFC if we move forward :) > > Best, > Gary > > Gary Li > ________________________________ > From: Vinoth Chandar <[email protected]> > Sent: Wednesday, September 2, 2020 11:07:23 PM > To: [email protected] <[email protected]> > Subject: Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column > > Hi Gary, > > It should not be that bad, right. We can just create a new implementation > of HoodieRecord where getKey() can dynamically fetch a field's value as the > recordKey. > Then the rest of the code should work as-is? > > On Tue, Sep 1, 2020 at 5:22 PM Gary Li <[email protected]> wrote: > > > Thanks for the proposal. I am a bit concerning about this feature. > > _hoodie_record_key is the primary key of the Hudi table and we need this > > field for indexing, sorting, merging e.t.c. We are using > _hoodie_record_key > > very frequently so I can see the overhead on both write side and read > side. > > I am not sure if this is worth it. Make _hoodie_partition_path virtual > make > > more sense to me. > > > > On the implementation side, if we support this as an option, we need > > special handles in many places. For example, the LogScanner, > > _hoodie_record_key is the key of the lookup Hashmap and this Map could be > > spilled to the disk as well. There should be significant amount of work > > from my perspective. > > > > -1 for making _hoodie_record_key virtual. > > +1 for _hoodie_partition_path > > > > Best Regards, > > Gary Li > > > > > > On 8/22/20, 9:09 PM, "Sivabalan" <[email protected]> wrote: > > > > Aah, yes. That’s right. > > > > On Sat, Aug 22, 2020 at 2:43 AM Vinoth Chandar <[email protected]> > > wrote: > > > > > All of the remaining meta fields compress very very nicely. They > have > > > > > > almost no overhead. > > > > > > > > > > > > On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi > <[email protected] > > > > > > > > > wrote: > > > > > > > > > > > > > @sivabalan the current plan is to only add this for > > hoodie_record_key. > > > But > > > > > > > I'm hoping to make the implementation general enough to add other > > columns > > > > > > > as well going forward :) > > > > > > > > > > > > > > On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <[email protected]> > > wrote: > > > > > > > > > > > > > > > +1 for virtual record keys. Do you also propose to generalize > > this for > > > > > > > > partition path as well ? > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma < > > > [email protected]> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > This is a good option to have. :) > > > > > > > > > > > > > > > > > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar < > > [email protected]> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > IIRC _hoodie_record_key was supposed to this standardized > key > > > field. > > > > > > > :) > > > > > > > > > > Anyways, it's good to provide this option to the user. > > > > > > > > > > So +1 for. RFC/further discussion. > > > > > > > > > > > > > > > > > > > > To level set, I want to also share some of the benefits of > > having > > > an > > > > > > > > > > explicit key column. > > > > > > > > > > a) if you build your data lake using a bunch of hudi tables, > > now > > > you > > > > > > > > > have a > > > > > > > > > > standardized data model > > > > > > > > > > b) Even if your key generator changes, it does not affect > the > > > > > > > existing > > > > > > > > > > data's keys. and updates will be matched correctly. > > > > > > > > > > > > > > > > > > > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan > > > > > > > > > > <[email protected]> wrote: > > > > > > > > > > > > > > > > > > > > > +1. This should be good to have as an option. If > everybody > > > agrees, > > > > > > > > > > please > > > > > > > > > > > go ahead with RFC and we can discuss details there. > > > > > > > > > > > Balaji.V On Tuesday, August 18, 2020, 04:37:18 PM PDT, > > > Abhishek > > > > > > > > Modi > > > > > > > > > > > <[email protected]> wrote: > > > > > > > > > > > > > > > > > > > > > > Hi everyone! > > > > > > > > > > > > > > > > > > > > > > I was hoping to discuss adding support for making > > > > > > > > `_hoodie_record_key` > > > > > > > > > a > > > > > > > > > > > virtual column :) > > > > > > > > > > > > > > > > > > > > > > Context: > > > > > > > > > > > Currently, _hoodie_record_key is written to DFS, as a > > column in > > > the > > > > > > > > > > Parquet > > > > > > > > > > > file. In our production systems at Uber however, > > > _hoodie_record_key > > > > > > > > > > > contains data that can be found in a different column (or > > set of > > > > > > > > > > columns). > > > > > > > > > > > This means that we are storing duplicated data. > > > > > > > > > > > > > > > > > > > > > > Proposal: > > > > > > > > > > > In the interest of improving storage efficiency, we could > > add > > > > > > > confs / > > > > > > > > > > > abstract classes that can construct the _hoodie_record_key > > given > > > > > > > > other > > > > > > > > > > > columns. That way we do not have to store duplicated data > > on DFS. > > > > > > > > > > > > > > > > > > > > > > Any thoughts on this? > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > Modi > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Regards, > > > > > > > > -Sivabalan > > > > > > > > > > > > > > > > > > > > > -- > > Regards, > > -Sivabalan > > > > >
