[HUDI-5158][PROPOSE] Add column pruning support to any payload

2022-11-03 Thread Teng Huo
Hi all,

We have a feature proposal for extending the MOR table read performance 
improvement to any payload.

Background

In HUDI-3217, Alexey added column pruning support in HoodieMergeOnReadRDD, 
which is a really nice feature. It can speed up MOR _rt table queries 
significantly.

However, this performance improvement is gated by whitelistedPayloadClasses, 
so column pruning is only supported for OverwriteWithLatestAvroPayload. Any 
other payload class implementation can't utilise this feature.


Proposal

After studying Alexey's implementation of this feature in HoodieMergeOnReadRDD, 
we added 2 new methods to the HoodieRecordPayload interface that tell 
HoodieMergeOnReadRDD whether a payload class supports column pruning, and 
whether any extra columns are needed for merging.
We have implemented this feature on the Spark side, and have also started the 
dev work for supporting it in Trino.

Code reference: https://gist.github.com/TengHuo/48068bf1810ed771b388862271e53266
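Since the actual code lives in the gist above, here is only a rough sketch of what such interface additions could look like. All names below (MergeAwarePayload, supportsColumnPruning, getExtraColumnsForMerge, LatestByTsPayload) are illustrative assumptions for this sketch, not the actual API from the gist or from Hudi:

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch: two default methods let the reader side (e.g.
// HoodieMergeOnReadRDD) ask a payload whether column pruning is safe,
// and which extra columns its merge logic needs.
interface MergeAwarePayload {
    // Whether this payload's merge logic only depends on the projected
    // columns, so the reader may safely prune unrequested columns.
    default boolean supportsColumnPruning() {
        return false; // conservative default: read the full schema
    }

    // Extra columns the payload needs for merging (e.g. an ordering field),
    // which the reader must retain even if the query does not select them.
    default Set<String> getExtraColumnsForMerge() {
        return Collections.emptySet();
    }
}

// Example payload that picks the latest record by a "ts" field,
// and therefore needs "ts" retained during pruning.
class LatestByTsPayload implements MergeAwarePayload {
    @Override
    public boolean supportsColumnPruning() {
        return true;
    }

    @Override
    public Set<String> getExtraColumnsForMerge() {
        return Collections.singleton("ts");
    }
}

public class Demo {
    public static void main(String[] args) {
        MergeAwarePayload p = new LatestByTsPayload();
        System.out.println(p.supportsColumnPruning());   // prints: true
        System.out.println(p.getExtraColumnsForMerge()); // prints: [ts]
    }
}
```

With default methods, existing payload classes keep working unchanged (pruning stays disabled for them), while payloads that opt in declare exactly which columns the reader must keep.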


Related issues

https://issues.apache.org/jira/browse/HUDI-3217
[HUDI-3217] RFC-46: Optimize Record Payload handling
https://issues.apache.org/jira/browse/HUDI-5158
[HUDI-5158] Add column pruning support to any payload


We plan to contribute this feature to the Hudi community once it is completed. 
May I ask if there are any suggestions about this feature? Really appreciate it.


Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread Teng Huo
Nice feature!
@stream2000

Just one question, can it work with compaction logs? I mean, if there are some 
log files already marked in a compaction plan, will they be deleted by TTL?
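The partition-based TTL mechanism described in the quoted thread below (expiry by size, age, or sub-partition count, followed by the delete-partition path) could be sketched roughly as follows. All class names, parameters, and thresholds here are illustrative assumptions, not actual Hudi APIs:

```java
// Rough sketch of the partition-TTL idea: a partition is considered
// outdated when any configured limit (age, total size, or sub-partition
// count) is exceeded. Expired partitions would then go through the
// delete-partition path (a replace commit marking the data deleted,
// with the real deletion left to the clean service).
class PartitionTtlPolicy {
    private final long maxAgeMs;
    private final long maxSizeBytes;
    private final int maxSubPartitions;

    PartitionTtlPolicy(long maxAgeMs, long maxSizeBytes, int maxSubPartitions) {
        this.maxAgeMs = maxAgeMs;
        this.maxSizeBytes = maxSizeBytes;
        this.maxSubPartitions = maxSubPartitions;
    }

    // A partition expires if it breaches any one of the limits.
    boolean isExpired(long createdAtMs, long sizeBytes, int subPartitions, long nowMs) {
        return (nowMs - createdAtMs) > maxAgeMs
                || sizeBytes > maxSizeBytes
                || subPartitions > maxSubPartitions;
    }
}

public class TtlDemo {
    public static void main(String[] args) {
        // Example limits: 30-day age, 10 GiB size, 1000 sub-partitions.
        PartitionTtlPolicy policy = new PartitionTtlPolicy(
                30L * 24 * 3600 * 1000, 10L << 30, 1000);

        long now = System.currentTimeMillis();
        long fortyDaysAgo = now - 40L * 24 * 3600 * 1000;
        long tenDaysAgo = now - 10L * 24 * 3600 * 1000;

        System.out.println(policy.isExpired(fortyDaysAgo, 1L << 20, 5, now)); // prints: true
        System.out.println(policy.isExpired(tenDaysAgo, 1L << 20, 5, now));   // prints: false
    }
}
```

How such a check interacts with log files already referenced by a pending compaction plan (the question above) is exactly the kind of edge case an RFC would need to pin down.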

From: sagar sumit 
Sent: Wednesday, October 19, 2022 2:42:36 PM
To: dev@hudi.apache.org 
Subject: Re: [DISCUSS] Hudi data TTL

+1 Very nice idea. Looking forward to the RFC!

On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
wrote:

> great proposal. Partition TTL is a good starting point. we can extend it to
> other TTL strategies like column-based, and make it customizable and
> pluggable. Looking forward to the RFC!
>
> On Wed, Oct 19, 2022 at 11:40 AM Jian Feng 
> wrote:
>
> > Good idea,
> > this is definitely worth an RFC.
> > btw, should it only depend on Hudi's partitions? I feel it should be a more
> > common feature, since sometimes customers' data cannot be updated across
> > partitions
> >
> >
> > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:
> >
> > > Hi all, we have implemented a partition-based data TTL management, with
> > > which we can manage TTL for Hudi partitions by size, expiry time, and
> > > sub-partition count. When a partition is detected as outdated, we use the
> > > delete partition interface to delete it, which will generate a replace
> > > commit to mark the data as deleted. The real deletion will then be done
> > > by the clean service.
> > >
> > >
> > > If the community is interested in this idea, maybe we can propose an RFC
> > > to discuss it in detail.
> > >
> > >
> > > > On Oct 19, 2022, at 10:06, Vinoth Chandar  wrote:
> > > >
> > > > +1 love to discuss this on an RFC proposal.
> > > >
> > > > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> > > wrote:
> > > >
> > > >> That's a very interesting idea.
> > > >>
> > > >> Do you want to take a stab at writing a full proposal (in the form
> of
> > > RFC)
> > > >> for it?
> > > >>
> > > >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang  >
> > > >> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> Do we have a plan to integrate data TTL into Hudi, so we don't have
> > > >>> to schedule an offline Spark job to delete outdated data? Just set a
> > > >>> TTL config, and then the writer or some offline service will delete
> > > >>> old data as expected.
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > *Jian Feng,冯健*
> > Shopee | Engineer | Data Infrastructure
> >
>
>
> --
> Best,
> Shiyan
>