Hi,

>> In my use case I don’t want to make any change in consumer logic for
>> downstream tools so KEEP_LATEST_FILE_VERSIONS and
>> CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" works.
Noted. Thanks for letting us know.

Thanks
Vinoth


On Fri, Jun 14, 2019 at 8:04 AM Jaimin Shah <shahjaimin0...@gmail.com>
wrote:

> Hi
>
> I am also in favour of retaining the KEEP_LATEST_FILE_VERSIONS policy.
>
> I suspect many people are using Hudi as a solution to manage parquet files
> that are consumed by downstream tools. In my use case I don’t want to make
> any change in consumer logic for downstream tools, so
> KEEP_LATEST_FILE_VERSIONS and CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" works.
>
> Also, I can control when downstream jobs start consuming data, so I don’t
> face issues with files being deleted while a query is running.
>
>
> On Thursday, 13 June 2019, Vinoth Chandar <vin...@apache.org> wrote:
>
> > Yes. We always keep at least one version, since deleting it could fail
> > the queries.
> > Thanks for the feedback. Will not remove it then.
> >
> > We can work towards Impala support for your use case as a long-term
> > solution, and maybe revisit this later.
> >
> > On Tue, Jun 11, 2019 at 9:54 PM Gary Li <yanjia.gary...@gmail.com> wrote:
> >
> > > Thanks, Vinoth. That's very helpful.
> > >
> > > When using data consumers that don't support the hoodie format, I have
> > > to use KEEP_LATEST_FILE_VERSIONS and
> > > CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" to keep the parquet files
> > > clean, as discussed in
> > > https://github.com/apache/incubator-hudi/issues/715 . When I use
> > > KEEP_LATEST_COMMITS with hoodie.cleaner.commits.retained = "1", I will
> > > still have two versions of parquet files.
> > >
> > > Compared with running batch jobs, this approach actually makes my
> > > situation much better. So I'd recommend not retiring
> > > KEEP_LATEST_FILE_VERSIONS, since some people might find it as useful
> > > as I do.
> > >
> > > Thanks!
> > > Gary
> > >
> > >
> > > On Tue, Jun 11, 2019 at 9:20 AM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > > Cool. So, the cleaning policy determines how we clean up older versions
> > > > of file groups (simplistically, old parquet and log files) to bound
> > > > storage growth.
> > > >
> > > > KEEP_LATEST_COMMITS (default): Retains (does not delete) any file
> > > > (slice) that was touched in the last X commits. The idea here is that
> > > > you are able to pull the incremental changes worth up to X commits.
> > > > KEEP_LATEST_FILE_VERSIONS: If you are not interested in incremental
> > > > pull at all, you can choose to just retain X files (slices) per file
> > > > group (i.e. files that share the same prefix) instead. This could
> > > > result in fewer files in some cases.
> > > >
> > > > In practice, we always use KEEP_LATEST_COMMITS. I keep thinking about
> > > > starting a discussion to retire KEEP_LATEST_FILE_VERSIONS, actually.
> > > >
> > > > Hope that helps.
> > > >
> > > > On Tue, Jun 11, 2019 at 9:05 AM Gary Li <yanjia.gary...@gmail.com> wrote:
> > > >
> > > > > Hello Vinoth,
> > > > >
> > > > > Yes, that’s what I mean.
> > > > >
> > > > > Thanks
> > > > > Gary
> > > > >
> > > > > On Tue, Jun 11, 2019 at 9:03 AM Vinoth Chandar <vin...@apache.org> wrote:
> > > > >
> > > > > > Hi Gary,
> > > > > >
> > > > > > Do you mean the cleaning policy? KEEP_LATEST_FILE_VERSIONS vs
> > > > > > KEEP_LATEST_COMMITS?
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > > > > On Mon, Jun 10, 2019 at 9:47 PM Gary Li <yanjia.gary...@gmail.com> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I got a little confused when looking at the compaction policy.
> > > > > > > What is the difference between KEEP_LATEST_COMMIT vs
> > > > > > > KEEP_LATEST_VERSION? What is the exact definition of "COMMIT"
> > > > > > > and "VERSION"?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Gary
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
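
For readers following the thread, the two cleaning policies discussed above can be compared with a small simulation. This is only a rough sketch and not Hudi's actual cleaner code: the function names are hypothetical, commit times are modelled as plain integers, and the rule that KEEP_LATEST_COMMITS also keeps the newest slice written before the retained window is an assumption based on the remark that at least one version is always kept so running queries do not fail. Under that assumption, the sketch reproduces the "two versions" behaviour reported above for a retained commit count of 1.

```python
from typing import List, Set


def clean_keep_latest_file_versions(slice_commits: List[int], retained: int) -> Set[int]:
    """Commit times of the slices to delete for one file group,
    keeping only the newest `retained` slices."""
    if retained < 1:
        raise ValueError("retained must be at least 1")
    ordered = sorted(slice_commits)
    # Everything except the last `retained` entries is eligible for deletion.
    return set(ordered[:-retained])


def clean_keep_latest_commits(slice_commits: List[int], timeline: List[int],
                              retained: int) -> Set[int]:
    """Commit times of the slices to delete for one file group, keeping
    every slice written in the last `retained` commits of a non-empty timeline.

    Assumption (not stated in full in the thread): the newest slice that
    predates the retained window is kept as well, so a query that started
    before the window still finds its files.
    """
    if retained < 1:
        raise ValueError("retained must be at least 1")
    earliest_retained = sorted(timeline)[-retained:][0]
    older = sorted(c for c in slice_commits if c < earliest_retained)
    # Delete all slices older than the window except the newest of them.
    return set(older[:-1])


# One file group that was rewritten in each of four commits.
timeline = [1, 2, 3, 4]
file_group = [1, 2, 3, 4]

deleted = clean_keep_latest_file_versions(file_group, retained=1)
print(sorted(set(file_group) - deleted))  # [4]

deleted = clean_keep_latest_commits(file_group, timeline, retained=1)
print(sorted(set(file_group) - deleted))  # [3, 4]
```

With one retained file version, a single parquet file per file group survives, which is what consumers that read the directory directly need. With one retained commit, two slices survive in this model.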
