Re: [DISCUSS] querying commit metadata from spark DataSource

Shiyan Xu Fri, 12 Jun 2020 08:40:46 -0700

Yes, tickets linked.

On Thu, Jun 11, 2020 at 10:50 AM Vinoth Chandar <vin...@apache.org> wrote:


> Thanks Raymond!
>
> yes.. we can make this a config and leave it to the user to decide if they
> want to use a global table for all their hudi tables (or) keep
> one error table for each hudi table..
>
> For this effort, does it make sense to take a dependency on the
> multi-writer jira HUDI-944, that liwei filed?
>
> On Wed, Jun 10, 2020 at 7:49 PM Shiyan Xu <xu.shiyan.raym...@gmail.com>
> wrote:
>
> > Yes, Vinoth, it does go a bit too far with first-class support for this
> > data.
> > A global error table can do the job easily. As we discussed yesterday,
> > parallel local error tables with an `_errors` suffix could also help in
> > some scenarios, such as different product teams managing their own tables,
> > or the 2B case where customers manage their own data. These users would
> > prefer good segregation of errors and other related data. Let me note
> > down the points in RFC-20 for further discussion. Thanks for the feedback!
> >
> > On Wed, Jun 3, 2020 at 9:31 PM Vinoth Chandar <vin...@apache.org> wrote:
> >
> > > Hi Raymond,
> > >
> > > I am not sure generalizing this to all metadata like - errors and
> > metrics -
> > > would be a good idea. We can certainly implement logging errors to a
> > common
> > > errors hudi table, with a certain schema. But these can be just regular
> > > “hudi” format tables.
> > >
> > > Unlike the timeline metadata, these are really external data, not
> related
> > > > to a given table’s core functioning.. we don’t necessarily want to keep
> > one
> > > error table per hudi table..
> > >
> > > Thoughts?
> > >
> > > On Tue, Jun 2, 2020 at 5:34 PM Shiyan Xu <xu.shiyan.raym...@gmail.com>
> > > wrote:
> > >
> > > > I also encountered use cases where I'd like to programmatically query
> > > > metadata.
> > > > +1 on the idea of format(“hudi-timeline”)
> > > >
> > > > I also feel that the metadata can be extended further to include more
> > > info
> > > > like, errors, metrics/write statistics, etc. Like the newly proposed
> > > error
> > > > handling, we could also store all metrics or write stats there too,
> and
> > > > relate them to the timeline actions.
> > > >
> > > > A potential use case could be, with all these info encapsulated
> within
> > > > > metadata, we may be able to derive some insightful results (by checking
> > > > against some benchmarks) and answer questions like: does table A need
> > > more
> > > > tuning? does table B exceed error budget?
> > > >
> > > > Programmatic query to these metadata can help manage many tables in
> > > > diagnosis and inspection. We may need different read formats like
> > > > format("hudi-errors") or format("hudi-metrics")
> > > >
> > > > > Sorry this sidetracked from the original question.. These are really
> > > > > rough high-level thoughts, and may show signs of over-engineering.
> > > > > Would like to hear some feedback. Thanks.
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Jun 1, 2020 at 9:28 PM Satish Kotha
> > <satishko...@uber.com.invalid
> > > >
> > > > wrote:
> > > >
> > > > > Got it. I'll look into implementation choices for creating a new
> data
> > > > > source. Appreciate all the feedback.
> > > > >
> > > > > On Mon, Jun 1, 2020 at 7:53 PM Vinoth Chandar <vin...@apache.org>
> > > wrote:
> > > > >
> > > > > > >Is it to separate data and metadata access?
> > > > > > Correct. We already have modes for querying data using
> > > format("hudi").
> > > > I
> > > > > > feel it will get very confusing to mix data and metadata in the
> > same
> > > > > > source.. for e.g a lot of options we support for data may not
> even
> > > make
> > > > > > sense for the TimelineRelation.
> > > > > >
> > > > > > >This class seems like a list of static methods, I'm not seeing
> > where
> > > > > these
> > > > > > are accessed from
> > > > > > That's the public API for obtaining this information for
> Scala/Java
> > > > > Spark.
> > > > > > > If you have a way of calling this from Python without painful
> > > > > > > bridges (e.g. Jython), that might be a tactical solution that can
> > > > > > > meet your needs.
> > > > > >
> > > > > > On Mon, Jun 1, 2020 at 5:07 PM Satish Kotha
> > > > <satishko...@uber.com.invalid
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the feedback.
> > > > > > >
> > > > > > > What is the advantage of doing
> > > > > > > spark.read.format(“hudi-timeline”).load(basepath) as opposed to
> > > doing
> > > > > new
> > > > > > > relation? Is it to separate data and metadata access?
> > > > > > >
> > > > > > > Are you looking for similar functionality as
> > > HoodieDatasourceHelpers?
> > > > > > > >
> > > > > > > This class seems like a list of static methods, I'm not seeing
> > > where
> > > > > > these
> > > > > > > are accessed from. But, I need a way to query metadata details
> > > easily
> > > > > > > in pyspark.
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar <
> vin...@apache.org
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Also please take a look at
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_HUDI-2D309&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=4xNSsHvHqd0Eym5a_ZpDVwlq_iJaZ0Rdk0u0SMLXZ0c&m=NLHsTFjPharIb29R1o1lWgYLCr1KIZZB4WGPt4IQnOE&s=fGOaSc8PxPJ8yqczQyzYtsqWMEXAbWdeKh-5xltbVG0&e=
> > > > > > > > .
> > > > > > > >
> > > > > > > > This was an effort to make the timeline more generalized for
> > > > querying
> > > > > > > (for
> > > > > > > > a different purpose).. but good to revisit now..
> > > > > > > >
> > > > > > > > On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org <
> > > > > > vbal...@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > I strongly recommend using a separate datasource relation (option 1)
> > > > > > > > > > to query timeline. It is elegant and fits well with spark APIs.
> > > > > > > > > > Thanks.
> > > > > > > > > > Balaji.V
> > > > > > > > > >
> > > > > > > > > > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
> > > > > > > > > > <vin...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > >  Hi satish,
> > > > > > > > >
> > > > > > > > > Are you looking for similar functionality as
> > > > > HoodieDatasourceHelpers?
> > > > > > > > >
> > > > > > > > > We have historically relied on cli to inspect the table,
> > which
> > > > does
> > > > > > not
> > > > > > > > > > lend itself well to programmatic access.. overall I like
> > > option
> > > > > 1 -
> > > > > > > > > allowing the timeline to be queryable with a standard
> schema
> > > does
> > > > > > seem
> > > > > > > > way
> > > > > > > > > nicer.
> > > > > > > > >
> > > > > > > > > I am wondering though if we should introduce a new view.
> > > Instead
> > > > we
> > > > > > can
> > > > > > > > use
> > > > > > > > > a different data source name -
> > > > > > > > > spark.read.format(“hudi-timeline”).load(basepath). We can
> > start
> > > > by
> > > > > > just
> > > > > > > > > allowing querying of active timeline and expand this to
> > archive
> > > > > > > timeline?
> > > > > > > > >
> > > > > > > > > > What do others think?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, May 29, 2020 at 2:37 PM Satish Kotha
> > > > > > > > <satishko...@uber.com.invalid
> > > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hello folks,
> > > > > > > > > >
> > > > > > > > > > We have a use case to incrementally generate data for
> hudi
> > > > table
> > > > > > (say
> > > > > > > > > > 'table2')  by transforming data from other hudi
> table(say,
> > > > > table1).
> > > > > > > We
> > > > > > > > > want
> > > > > > > > > > to atomically store commit timestamps read from table1
> into
> > > > > table2
> > > > > > > > commit
> > > > > > > > > > metadata.
> > > > > > > > > >
> > > > > > > > > > This is similar to how DeltaStreamer operates with kafka
> > > > offsets.
> > > > > > > > > However,
> > > > > > > > > > DeltaStreamer is java code and can easily query kafka
> > offset
> > > > > > > processed
> > > > > > > > by
> > > > > > > > > > creating metaclient for target table. We want to use
> > pyspark
> > > > and
> > > > > I
> > > > > > > > don't
> > > > > > > > > > see a good way to query commit metadata of table1 from
> > > > > DataSource.
> > > > > > > > > >
> > > > > > > > > > I'm considering making one of the below changes to hoodie
> > to
> > > > make
> > > > > > > this
> > > > > > > > > > easier.
> > > > > > > > > >
> > > > > > > > > > Option1: Add new relation in hudi-spark to query commit
> > > > metadata.
> > > > > > > This
> > > > > > > > > > relation would present a 'metadata view' to query and
> > filter
> > > > > > > metadata.
> > > > > > > > > >
> > > > > > > > > > Option2: Add other DataSource options on top of
> incremental
> > > > > > querying
> > > > > > > to
> > > > > > > > > > allow fetching from source table. For example, users can
> > > > specify
> > > > > > > > > > 'hoodie.consume.metadata.table: table2BasePath'  and
> issue
> > > > > > > incremental
> > > > > > > > > > query on table1. Then, IncrementalRelation would go read
> > > table2
> > > > > > > > metadata
> > > > > > > > > > first to identify 'consume.start.timestamp' and start
> > > > incremental
> > > > > > > read
> > > > > > > > on
> > > > > > > > > > table1 with that timestamp.
> > > > > > > > > >
> > > > > > > > > > Option 2 looks simpler to implement. But, seems a bit
> hacky
> > > > > because
> > > > > > > we
> > > > > > > > > are
> > > > > > > > > > > reading metadata from table2 when data source is table1.
> > > > > > > > > >
> > > > > > > > > > Option1 is a bit more complex. But, it is cleaner and not
> > > > tightly
> > > > > > > > coupled
> > > > > > > > > > to incremental reads. For example, use cases other than
> > > > > incremental
> > > > > > > > reads
> > > > > > > > > > can leverage same relation to query metadata
> > > > > > > > > >
> > > > > > > > > > What do you guys think? Let me know if there are other
> > > simpler
> > > > > > > > solutions.
> > > > > > > > > > Appreciate any feedback.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Satish
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
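
The checkpoint lookup described under Option 2 in the original message (read table2's commit metadata first, then start the incremental read on table1 from the stored timestamp) can be sketched in plain Python. This is only an illustration under stated assumptions, not Hudi's implementation: the dictionary layout, the `commit_time` and `extra_metadata` field names, and the `source.commit.checkpoint` key are all hypothetical, and the sketch assumes commit times are fixed-width timestamp strings so that string order matches time order.

```python
from typing import Any, Iterable, Mapping, Optional

# Hypothetical metadata key under which a writer records the last table1
# commit it consumed. Not a real Hudi config name.
CHECKPOINT_KEY = "source.commit.checkpoint"


def resolve_start_timestamp(
    target_commits: Iterable[Mapping[str, Any]],
    checkpoint_key: str = CHECKPOINT_KEY,
) -> Optional[str]:
    """Return the checkpoint stored in the newest target commit that has one.

    Each commit is assumed to look like:
        {"commit_time": "20200601120000", "extra_metadata": {...}}
    Returns None when no commit carries a checkpoint (e.g. first run).
    """
    # Walk commits newest first; assumes fixed-width timestamp strings.
    newest_first = sorted(
        target_commits, key=lambda c: c["commit_time"], reverse=True
    )
    for commit in newest_first:
        extra = commit.get("extra_metadata") or {}
        if checkpoint_key in extra:
            return extra[checkpoint_key]
    return None


# Example: table2 commits, some of which recorded a table1 checkpoint.
table2_commits = [
    {"commit_time": "20200529100000",
     "extra_metadata": {CHECKPOINT_KEY: "20200529090000"}},
    {"commit_time": "20200601130000",
     "extra_metadata": {CHECKPOINT_KEY: "20200601120000"}},
    {"commit_time": "20200601140000", "extra_metadata": {}},  # e.g. a clean
]

start_ts = resolve_start_timestamp(table2_commits)
```

The resulting `start_ts` would then be passed as the begin instant of an incremental read on table1. With a queryable timeline (Option 1), the list of commits could come from a DataFrame instead of being built by hand.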
