Hi,

>> In my use case I don’t want to make any change in consumer logic for
>> downstream tools so KEEP_LATEST_FILE_VERSIONS and
>> CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" works.

Noted. Thanks for letting us know.

Thanks
Vinoth

On Fri, Jun 14, 2019 at 8:04 AM Jaimin Shah <shahjaimin0...@gmail.com> wrote:

> Hi
>
> I am also in favour of retaining the KEEP_LATEST_FILE_VERSIONS policy.
>
> I suspect many people are using hudi as a solution to manage parquet which
> is consumed by downstream tools. In my use case I don’t want to make any
> change in consumer logic for downstream tools so KEEP_LATEST_FILE_VERSIONS
> and CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" works.
>
> Also I can control when to start consuming data from downstream jobs, so I
> don’t face issues with files being deleted while a query is running, etc.
>
> On Thursday, 13 June 2019, Vinoth Chandar <vin...@apache.org> wrote:
>
> > Yes. We always keep at least one version out, since deleting it could
> > fail the queries.
> > Thanks for the feedback. Will not remove it then.
> >
> > We can work towards Impala support for your use case as a long-term
> > solution, and maybe revisit this later.
> >
> > On Tue, Jun 11, 2019 at 9:54 PM Gary Li <yanjia.gary...@gmail.com> wrote:
> >
> > > Thanks, Vinoth. That's very helpful.
> > >
> > > When I was using data consumers that don't support the hoodie format, I
> > > had to use KEEP_LATEST_FILE_VERSIONS and
> > > CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" to keep the parquet files
> > > clean, as discussed in
> > > https://github.com/apache/incubator-hudi/issues/715 . When I use
> > > KEEP_LATEST_COMMITS with hoodie.cleaner.commits.retained = "1", I will
> > > still have two versions of parquet files.
> > >
> > > Compared with running batch jobs, this actually makes my situation much
> > > better. So I'd recommend not retiring KEEP_LATEST_FILE_VERSIONS, as
> > > some people might find it as useful as I do.
> > >
> > > Thanks!
> > > Gary
> > >
> > > On Tue, Jun 11, 2019 at 9:20 AM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > > Cool.
> > > > So, the cleaning policy determines how we clean up older versions of
> > > > file groups (simplistically, old parquet and log files) to bound
> > > > storage growth:
> > > >
> > > > KEEP_LATEST_COMMITS (default) : Retains (does not delete) any file
> > > > (slice) that was touched in the last X commits. The idea here is that
> > > > you are able to pull the incremental changes worth up to X commits.
> > > > KEEP_LATEST_FILE_VERSIONS : If you are not interested in incremental
> > > > pull at all, you can choose to just retain X files (slices) per file
> > > > group (i.e. files that share the same prefix) instead. This could
> > > > result in fewer files in some cases.
> > > >
> > > > In practice, we always use KEEP_LATEST_COMMITS. I keep thinking about
> > > > starting a discussion to retire KEEP_LATEST_FILE_VERSIONS, actually.
> > > >
> > > > Hope that helps.
> > > >
> > > > On Tue, Jun 11, 2019 at 9:05 AM Gary Li <yanjia.gary...@gmail.com> wrote:
> > > >
> > > > > Hello Vinoth,
> > > > >
> > > > > Yes, that’s what I mean.
> > > > >
> > > > > Thanks
> > > > > Gary
> > > > >
> > > > > On Tue, Jun 11, 2019 at 9:03 AM Vinoth Chandar <vin...@apache.org> wrote:
> > > > >
> > > > > > Hi Gary,
> > > > > >
> > > > > > Do you mean cleaning policy? KEEP_LATEST_FILE_VERSIONS vs
> > > > > > KEEP_LATEST_COMMITS ?
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > > > On Mon, Jun 10, 2019 at 9:47 PM Gary Li <yanjia.gary...@gmail.com> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I was a little confused when looking at the compaction policy.
> > > > > > > What is the difference between KEEP_LATEST_COMMIT vs
> > > > > > > KEEP_LATEST_VERSION? What is the exact definition of "COMMIT"
> > > > > > > and "VERSION"?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Gary
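
The difference between the two cleaning policies discussed in this thread can be sketched with a toy model. This is not Hudi's cleaner code: the function names, the `FileGroups` type, and the integer commit times are all made up for illustration. The sketch also assumes that KEEP_LATEST_COMMITS keeps one extra version from before the retained commit window, which is one reading of "we always keep at least one version out" and would match the two parquet versions Gary reports with hoodie.cleaner.commits.retained = "1".

```python
from typing import Dict, List

# Toy model (hypothetical, not a Hudi API): each file group maps to the
# commit times that wrote a new version (file slice) of it, oldest first.
FileGroups = Dict[str, List[int]]


def keep_latest_file_versions(groups: FileGroups, versions_retained: int) -> FileGroups:
    """Retain only the newest `versions_retained` slices of every file group."""
    if versions_retained < 1:
        raise ValueError("versions_retained must be at least 1")
    return {fg: commits[-versions_retained:] for fg, commits in groups.items()}


def keep_latest_commits(groups: FileGroups, timeline: List[int],
                        commits_retained: int) -> FileGroups:
    """Retain slices written in the last `commits_retained` commits.

    Assumption of this sketch: the newest slice written before that window
    is kept as well, so a reader pinned to the oldest retained commit still
    finds a file to read.
    """
    if commits_retained < 1:
        raise ValueError("commits_retained must be at least 1")
    # Earliest commit inside the retained window (timeline is oldest first).
    window_start = timeline[-commits_retained:][0]
    retained = {}
    for fg, commits in groups.items():
        in_window = [c for c in commits if c >= window_start]
        older = [c for c in commits if c < window_start]
        retained[fg] = older[-1:] + in_window
    return retained


# Three commits; "fg-a" is rewritten by every commit, "fg-b" only by the first.
timeline = [1, 2, 3]
groups = {"fg-a": [1, 2, 3], "fg-b": [1]}

print(keep_latest_file_versions(groups, 1))      # {'fg-a': [3], 'fg-b': [1]}
print(keep_latest_commits(groups, timeline, 1))  # {'fg-a': [2, 3], 'fg-b': [1]}
```

Under these assumptions, a file group that is rewritten by every commit ends up with one version under KEEP_LATEST_FILE_VERSIONS = 1 and two versions under KEEP_LATEST_COMMITS = 1, which is the behaviour described above for consumers that read the parquet files directly.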