Vinoth, thanks. We are evaluating Hudi at the moment for a very specific use case.
We are also looking at Hive 3.0, but I still don't see a way to do
incremental pulls on it. We feel it might be possible to identify the new
commits using some of the internal APIs, and we are checking that. We also
came across Databricks Delta, which seems conceptually similar to Hudi,
though their storage format is not yet documented and internals
documentation is generally lacking. We would also be very interested in
Hudi's time travel capabilities, such as for building historical ML
training data sets.

Roshan

On Tue, May 7, 2019 at 9:16 PM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Roshan,
>
> Thanks for writing. Yes, the user needs to manage the _commit_time
> watermark on the HiveIncrementalPuller path. You also need to set the
> table in incremental mode, providing a start commit_time and max_commits
> to pull, as documented. The DeltaStreamer tool will manage it for you
> automatically, but it supports Spark SQL.
>
> At Uber, we have built some custom (yet simple) tools to do these steps
> in your workflow scheduler.
>
> For example, say your commit timeline now has commits c1, c2, c3 and you
> are at time t=0 (t corresponding to a commit timestamp):
>
> 1) Use HoodieTableMetaClient to obtain the source table's commit
> timeline and determine the range of commits to pull after t=0, i.e.
> c1, c2, c3
> 2) Ask HiveIncrementalPuller to pull 3 commits from commit time=0
> 3) Save c3 somewhere (a MySQL table or a folder on DFS)
> 4) Before the next run, say there are new commits c4, c5. We set t=c3
> and end up pulling 2 commits from c3, as above.
>
> We'd love to work with you, if you are interested in standardizing this
> flow inside Hudi itself. :)
>
> On Mon, May 6, 2019 at 11:50 PM Roshan Nair (Data Platform)
> <roshan.n...@flipkart.com.invalid> wrote:
>
> > Hi,
> >
> > We are trying to work out how to use Hudi for incremental pulls. In
> > our scenario, we would like to read from a Hudi table incrementally,
> > so that every subsequent read only reads new data.
> >
> > In the incremental HiveQL example in the quickstart (
> > http://hudi.incubator.apache.org/quickstart.html#incremental-hiveql),
> > it appears that I can filter on _hoodie_commit_time to select only
> > those records that have not been processed yet. Hudi will ensure
> > snapshot isolation, so no new partial writes are visible to this
> > reader.
> >
> > The next time I want an incremental set, how do I set the
> > _hoodie_commit_time in the query?
> >
> > Is the expectation that the user will identify the max
> > _hoodie_commit_time in the result of the query and then use this to
> > set the _hoodie_commit_time filter for the next incremental query?
> >
> > Roshan
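For anyone automating steps 1-4 above, here is a rough Java sketch of the
watermark loop. It assumes a Hudi (incubating) release where
HoodieTableMetaClient and HoodieTimeline expose the methods shown (exact
constructors and method names have moved around across versions), and the
base path and file-based watermark store are placeholders for
illustration, not part of Hudi:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

public class IncrementalPullDriver {

  // Placeholder paths; substitute your table's base path and a durable
  // watermark store (a MySQL row or a DFS file, as suggested above).
  private static final String BASE_PATH = "hdfs:///tables/source_table";
  private static final Path WATERMARK_FILE =
      Paths.get("/tmp/source_table.watermark");

  public static void main(String[] args) throws IOException {
    String lastCommit = loadWatermark(); // "0" on the very first run

    // 1) Read the source table's completed commit timeline and find the
    //    commits after the watermark (c1, c2, c3 on the first run).
    HoodieTableMetaClient metaClient =
        new HoodieTableMetaClient(new Configuration(), BASE_PATH);
    HoodieTimeline completed = metaClient.getActiveTimeline()
        .getCommitTimeline().filterCompletedInstants();
    List<HoodieInstant> newCommits = completed
        .findInstantsAfter(lastCommit, Integer.MAX_VALUE)
        .getInstants()
        .collect(Collectors.toList());

    if (newCommits.isEmpty()) {
      return; // nothing committed since the last run
    }

    // 2) Invoke HiveIncrementalPuller here (invocation elided) to pull
    //    newCommits.size() commits starting from lastCommit.

    // 3) Persist the newest commit time so the next run starts from it.
    String latest = newCommits.get(newCommits.size() - 1).getTimestamp();
    saveWatermark(latest);

    // 4) On the next run loadWatermark() returns c3, so only c4, c5, ...
    //    are pulled.
  }

  private static String loadWatermark() throws IOException {
    return Files.exists(WATERMARK_FILE)
        ? new String(Files.readAllBytes(WATERMARK_FILE)).trim()
        : "0";
  }

  private static void saveWatermark(String commitTime) throws IOException {
    Files.write(WATERMARK_FILE, commitTime.getBytes());
  }
}

Each run then only pulls commits after the saved watermark, and the pulled
data can be consumed with the quickstart-style filter on
_hoodie_commit_time, so a reader never sees the same commit twice.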