Hi Roshan,

Thanks for writing. Yes, the user needs to manage the _hoodie_commit_time
watermark on the HiveIncrementalPuller path. You also need to put the table
in incremental mode, providing a start commit time and the max commits to
pull, as documented. The DeltaStreamer tool will manage this for you
automatically, but it supports Spark SQL.
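On the Hive side, putting a table in incremental mode looks roughly like
the session settings below (the table name `mytable` and the timestamp are
placeholders; the property names follow the Hudi docs, so double-check them
against the version you run):

```sql
-- Consume only new commits, starting after the given commit timestamp,
-- and cap how many commits a single run may pull.
set hoodie.mytable.consume.mode=INCREMENTAL;
set hoodie.mytable.consume.start.timestamp=20190506;
set hoodie.mytable.consume.max.commits=3;

-- Filtering on _hoodie_commit_time keeps records from older commits out.
select * from mytable where `_hoodie_commit_time` > '20190506';
```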

At Uber, we have built some custom (yet simple) tools to do these steps in
the workflow scheduler.

For example, let's say your commit timeline now has commits c1, c2, c3 and
you are at time t=0 (t corresponding to a commit timestamp).

1) Use HoodieTableMetaClient to obtain the source table's commit timeline
and determine the range of commits to pull after t=0,
     i.e. c1, c2, c3
2) Ask HiveIncrementalPuller to pull 3 commits from commit time=0
3) Save c3 somewhere (a MySQL table or a folder on DFS)
4) Before the next run, say there are new commits c4, c5. We set t=c3 and
end up pulling 2 commits from c3, as above.
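The loop above can be sketched in plain Java. This is only a model: the
commit timeline is a simple list of commit ids, whereas in a real job it
would come from HoodieTableMetaClient, and the checkpoint would be
persisted to MySQL or DFS as in step 3:

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalPullSketch {

  // Given the source table's commit timeline and the last checkpointed
  // commit, return the commits still to be pulled, in commit-time order.
  // String comparison works here because real Hudi commit times are
  // fixed-width timestamps (and the toy ids c1..c5 are same-length).
  static List<String> commitsToPull(List<String> timeline, String checkpoint) {
    List<String> pending = new ArrayList<>();
    for (String commit : timeline) {
      if (commit.compareTo(checkpoint) > 0) {
        pending.add(commit);
      }
    }
    return pending;
  }

  public static void main(String[] args) {
    // Run 1: timeline has c1, c2, c3; nothing pulled yet (t=0).
    List<String> run1 = commitsToPull(List.of("c1", "c2", "c3"), "0");
    System.out.println("run 1 pulls: " + run1);     // c1, c2, c3

    // Step 3: save the last pulled commit (c3) as the new checkpoint.
    String checkpoint = run1.get(run1.size() - 1);

    // Run 2: two new commits c4, c5 have since arrived.
    List<String> run2 =
        commitsToPull(List.of("c1", "c2", "c3", "c4", "c5"), checkpoint);
    System.out.println("run 2 pulls: " + run2);     // c4, c5
  }
}
```

In a real scheduler each run would hand the pending range to
HiveIncrementalPuller and only advance the checkpoint once the pull
succeeds, so a failed run simply retries the same range.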

We'd love to work with you if you are interested in standardizing this
flow inside Hudi itself. :)

On Mon, May 6, 2019 at 11:50 PM Roshan Nair (Data Platform)
<roshan.n...@flipkart.com.invalid> wrote:

> Hi,
>
> We are trying to work out how to use hudi for incremental pulls. In our
> scenario, we would like to read from a hudi table incrementally, so that
> every subsequent read only reads new data.
>
> In the incremental hiveql example in the quickstart (
> http://hudi.incubator.apache.org/quickstart.html#incremental-hiveql), it
> appears that I can filter on _hoodie_commit_time to select only those
> records that have not been processed yet. Hudi will ensure snapshot
> isolation, so no new partial writes are visible to this reader.
>
> The next time I want an incremental set, how do I set the
> _hoodie_commit_time in the query?
>
> Is the expectation that the user will identify the max _hoodie_commit_time
> in the result of the query and then use this to set the _hoodie_commit_time
> filter for the next incremental query?
>
> Roshan
>
