Hello again Roshan, :)

I assume "reader" here means an incremental reader on the incremental view
(so I am not following your comment on read-optimized vs write-optimized
views at the top).

>> Then my understanding is that the reader would not receive row 1 in the
>> result from HiveIncrementalPuller for the commits c1-c3

In step b, it will see row 1 as of c3 (we filter files based on a start and
end commit range, so if c1, c2, c3 is the list as of step b, you will see
the record). In step d, it will see row 1 again as of c4.

This should not change between COW and MOR. Work on the incremental view for
MOR backed by logs is underway, and it should give you the same semantics as
above.
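
To make that concrete, here is a tiny self-contained sketch. This is plain
Java that just models the (start, end] commit-range filter described above;
it is not the actual Hudi API, and the commit times/values are made up:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class IncrementalPullSketch {

      // Latest value of row 1 within a (startTs, endTs] commit range,
      // or null if row 1 was not touched in that range.
      static String pull(Map<String, String> row1ByCommit,
                         String startTs, String endTs) {
        String result = null;
        for (Map.Entry<String, String> e : row1ByCommit.entrySet()) {
          String ts = e.getKey();
          if (ts.compareTo(startTs) > 0 && ts.compareTo(endTs) <= 0) {
            result = e.getValue(); // later commits in the range win
          }
        }
        return result;
      }

      public static void main(String[] args) {
        Map<String, String> commits = new LinkedHashMap<>();
        commits.put("c3", "row1@c3"); // step a: row 1 updated in c3
        commits.put("c4", "row1@c4"); // step c: row 1 updated again in c4

        System.out.println(pull(commits, "c0", "c3")); // pull for c1-c3: row1@c3
        System.out.println(pull(commits, "c3", "c4")); // next pull: row1@c4
      }
    }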

Happy to expand more if I am missing something.

Thanks
Vinoth


On Tue, May 14, 2019 at 2:07 AM Roshan Nair (Data Platform)
<roshan.n...@flipkart.com.invalid> wrote:

> Vinoth,
>
> This is related to the difference between read-optimized and
> write-optimized views.
>
> > 1) Use HoodieTableMetaClient and obtain the source table's commit
> > timeline and determine the range of commits to pull after t=0,
> > i.e. c1, c2, c3
> > 2) Ask HiveIncrementalPuller to pull 3 commits from commit time=0
>
> Say we are running a COW table and:
> a. Row 1 was updated in c3
> b. A reader executes step 1
> c. A writer updates row 1 in commit 4 (c4)
> d. The reader proceeds to step 2.
>
> Then my understanding is that the reader would not receive row 1 in the
> result from HiveIncrementalPuller for the commits c1-c3. The reader would
> get the value of row 1 at c4 on the next read (provided row 1 was not
> updated subsequently). This should usually not be a problem, and I assume
> it would happen infrequently, as the reader would typically not wait before
> actually executing the read.
>
> However, if we were running a MOR table (and provided the compaction job
> had not run between step 1 and step 2), we would receive the value of
> row 1 at state c3.
>
> Is this correct?
>
> Roshan
>
>
> On Thu, May 9, 2019 at 3:39 AM Vinoth Chandar <vin...@apache.org> wrote:
>
> > Sounds good. Please keep us posted.
> >
> > On Wed, May 8, 2019 at 12:02 AM Roshan Nair (Data Platform)
> > <roshan.n...@flipkart.com.invalid> wrote:
> >
> > > Vinoth,
> > >
> > > Thanks. We are evaluating Hudi at the moment for a very specific use
> > > case.
> > >
> > > We are also looking at Hive 3.0, but I still don't see a way to do
> > > incremental pulls on it. We feel it might be possible to identify the
> > > new commits using some of the internal APIs, and we are checking that.
> > >
> > > We also came across Databricks Delta, which seems conceptually similar
> > > to Hudi, though their storage format is not yet documented and internals
> > > documentation is generally lacking.
> > >
> > > We would also be very interested in Hudi for its time travel
> > > capabilities, such as for building historical ML training data sets.
> > >
> > > Roshan
> > >
> > >
> > > On Tue, May 7, 2019 at 9:16 PM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > > Hi Roshan,
> > > >
> > > > Thanks for writing. Yes, the user needs to manage the _commit_time
> > > > watermark on the HiveIncrementalPuller path. You also need to set the
> > > > table in incremental mode, providing a start commit_time and max_commits
> > > > to pull, as documented. The DeltaStreamer tool will manage this for you
> > > > automatically, but it supports SparkSQL.
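> > > >
> > > > For the Hive side, the incremental mode settings would look roughly like
> > > > this. A sketch only: the HiveServer2 URL and table name are made up, and
> > > > I'm going from memory on the hoodie.*.consume.* property names, so please
> > > > check them against the docs:
> > > >
> > > >     import java.sql.Connection;
> > > >     import java.sql.DriverManager;
> > > >     import java.sql.Statement;
> > > >
> > > >     public class IncrementalModeSettings {
> > > >       public static void main(String[] args) throws Exception {
> > > >         try (Connection conn = DriverManager.getConnection(
> > > >                  "jdbc:hive2://hiveserver:10000/default");
> > > >              Statement stmt = conn.createStatement()) {
> > > >           // consume my_table incrementally, starting right after this
> > > >           // commit time (the watermark), at most 3 commits per run
> > > >           stmt.execute("set hoodie.my_table.consume.mode=INCREMENTAL");
> > > >           stmt.execute("set hoodie.my_table.consume.start.timestamp=20190101010101");
> > > >           stmt.execute("set hoodie.my_table.consume.max.commits=3");
> > > >           // then issue the incremental query in this same session
> > > >         }
> > > >       }
> > > >     }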
> > > >
> > > > At Uber, we have built some custom (yet simple) tools to do these
> > > > steps in your workflow scheduler.
> > > >
> > > > For e.g., let's say your commit timeline now has commits c1, c2, c3 and
> > > > you are at time t=0 (t corresponding to commit timestamp):
> > > >
> > > > 1) Use HoodieTableMetaClient and obtain the source table's commit
> > > > timeline, then determine the range of commits to pull after t=0,
> > > > i.e. c1, c2, c3
> > > > 2) Ask HiveIncrementalPuller to pull 3 commits from commit time=0
> > > > 3) Save c3 somewhere (a mysql table or a folder on dfs)
> > > > 4) Before the next run, say there are new commits c4, c5. We set t=c3
> > > > and end up pulling 2 commits (c4, c5) as above (rough sketch below).
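> > > >
> > > > Class/method names in the sketch below are from memory of the Hudi client
> > > > API (the package was com.uber.hoodie before the org.apache.hudi move), so
> > > > the exact signatures may differ by version; readWatermark/saveWatermark
> > > > are hypothetical helpers you would write yourself:
> > > >
> > > >     import org.apache.hadoop.conf.Configuration;
> > > >     import org.apache.hudi.common.table.HoodieTableMetaClient;
> > > >     import org.apache.hudi.common.table.timeline.HoodieTimeline;
> > > >
> > > >     public class PullNewCommits {
> > > >       public static void main(String[] args) {
> > > >         // 1) read the source table's completed commit timeline, then
> > > >         //    take commits strictly after the saved watermark, capped at 3
> > > >         HoodieTableMetaClient metaClient =
> > > >             new HoodieTableMetaClient(new Configuration(), "/path/to/table");
> > > >         HoodieTimeline completed = metaClient.getActiveTimeline()
> > > >             .getCommitTimeline().filterCompletedInstants();
> > > >         HoodieTimeline toPull = completed.findInstantsAfter(readWatermark(), 3);
> > > >
> > > >         // 2) hand this range to HiveIncrementalPuller (invocation elided)
> > > >
> > > >         // 3) persist the newest pulled commit; 4) the next run resumes there
> > > >         if (toPull.lastInstant().isPresent()) {
> > > >           saveWatermark(toPull.lastInstant().get().getTimestamp());
> > > >         }
> > > >       }
> > > >
> > > >       // hypothetical helpers: keep the watermark in a mysql table or on dfs
> > > >       static String readWatermark() { return "0"; }
> > > >       static void saveWatermark(String commitTime) { /* mysql/dfs write */ }
> > > >     }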
> > > >
> > > > We'd love to work with you if you are interested in standardizing
> > > > this flow inside Hudi itself. :)
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, May 6, 2019 at 11:50 PM Roshan Nair (Data Platform)
> > > > <roshan.n...@flipkart.com.invalid> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are trying to work out how to use Hudi for incremental pulls. In our
> > > > > scenario, we would like to read from a Hudi table incrementally, so that
> > > > > every subsequent read only reads new data.
> > > > >
> > > > > In the incremental HiveQL example in the quickstart
> > > > > (http://hudi.incubator.apache.org/quickstart.html#incremental-hiveql), it
> > > > > appears that I can filter on _hoodie_commit_time to select only those
> > > > > records that have not been processed yet. Hudi will ensure snapshot
> > > > > isolation, so no new partial writes are visible to this reader.
> > > > >
> > > > > The next time I want an incremental set, how do I set the
> > > > > _hoodie_commit_time in the query?
> > > > >
> > > > > Is the expectation that the user will identify the max
> > > > > _hoodie_commit_time in the result of the query and then use this to set
> > > > > the _hoodie_commit_time filter for the next incremental query?
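> > > > >
> > > > > In other words, something like this (a sketch; the table name is made up
> > > > > and I'm just building the HiveQL strings):
> > > > >
> > > > >     // watermark taken from the previous incremental read
> > > > >     String lastCommitTime = "20190101010101";
> > > > >
> > > > >     // pull only records committed after the watermark
> > > > >     String incrementalQuery =
> > > > >         "SELECT * FROM my_table WHERE _hoodie_commit_time > '"
> > > > >             + lastCommitTime + "'";
> > > > >
> > > > >     // the max commit time in that result becomes the next watermark
> > > > >     String nextWatermarkQuery =
> > > > >         "SELECT MAX(_hoodie_commit_time) FROM my_table"
> > > > >             + " WHERE _hoodie_commit_time > '" + lastCommitTime + "'";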
> > > > >
> > > > > Roshan
> > > > >
> > > >
> > >
> >
>
