Re: [DISCUSS] querying commit metadata from spark DataSource

Satish Kotha Mon, 01 Jun 2020 17:07:16 -0700

Thanks for the feedback.

What is the advantage of doing
spark.read.format("hudi-timeline").load(basepath) as opposed to adding a new
relation? Is it to separate data and metadata access?


> Are you looking for similar functionality as HoodieDatasourceHelpers?
>
This class seems to be a collection of static methods, and I don't see where
they are called from. What I need is an easy way to query metadata details
from pyspark.
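
To make the need concrete, here is a minimal sketch of the kind of lookup I
mean, done by hand in plain Python. It is only an illustration under stated
assumptions, not an existing Hudi API: it assumes the table lives on a local
filesystem, that completed commits are JSON files named <instant_time>.commit
under <basepath>/.hoodie, and that each file carries an "extraMetadata" map.
The function name and the key used in the usage note are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def latest_commit_extra_metadata(base_path: str, key: str) -> Optional[str]:
    """Return the value stored under `key` in the newest completed commit
    that contains it, or None if no commit has that key."""
    timeline_dir = Path(base_path) / ".hoodie"
    # Assumption: completed commits are JSON files named <instant_time>.commit,
    # and instant times sort lexicographically in chronological order.
    commit_files = sorted(
        timeline_dir.glob("*.commit"), key=lambda p: p.stem, reverse=True
    )
    for commit_file in commit_files:
        metadata = json.loads(commit_file.read_text())
        # Assumption: user-supplied metadata lives in an "extraMetadata" map.
        extra = metadata.get("extraMetadata") or {}
        if key in extra:
            return extra[key]
    return None
```

For example, latest_commit_extra_metadata(table2_base_path,
"table1.last.commit.ts") would give the table1 timestamp to resume from. This
walks the timeline files directly and does not work against HDFS or cloud
storage as written, which is why a proper relation would be preferable.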


On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar <vin...@apache.org> wrote:

> Also please take a look at
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_HUDI-2D309&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=4xNSsHvHqd0Eym5a_ZpDVwlq_iJaZ0Rdk0u0SMLXZ0c&m=NLHsTFjPharIb29R1o1lWgYLCr1KIZZB4WGPt4IQnOE&s=fGOaSc8PxPJ8yqczQyzYtsqWMEXAbWdeKh-5xltbVG0&e=
> .
>
> This was an effort to make the timeline more generalized for querying (for
> a different purpose), but it is good to revisit now.
>
> On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org <vbal...@apache.org>
> wrote:
>
> >
> > I strongly recommend using a separate datasource relation (option 1) to
> > query timeline. It is elegant and fits well with spark APIs.
> > Thanks.
> > Balaji.V
> >
> > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
> > <vin...@apache.org> wrote:
> >
> > Hi Satish,
> >
> > Are you looking for similar functionality as HoodieDatasourceHelpers?
> >
> > We have historically relied on the CLI to inspect the table, which does not
> > lend itself well to programmatic access. Overall I like option 1:
> > allowing the timeline to be queryable with a standard schema does seem
> > way nicer.
> >
> > I am wondering, though, whether we should introduce a new view. Instead, we
> > can use a different data source name:
> > spark.read.format("hudi-timeline").load(basepath). We can start by just
> > allowing querying of the active timeline and later expand this to the
> > archived timeline?
> >
> > What do others think?
> >
> >
> >
> >
> > On Fri, May 29, 2020 at 2:37 PM Satish Kotha <satishko...@uber.com.invalid>
> > wrote:
> >
> > > Hello folks,
> > >
> > > We have a use case to incrementally generate data for a hudi table (say,
> > > 'table2') by transforming data from another hudi table (say, 'table1').
> > > We want to atomically store the commit timestamps read from table1 in
> > > table2's commit metadata.
> > >
> > > This is similar to how DeltaStreamer operates with kafka offsets. However,
> > > DeltaStreamer is java code and can easily query the kafka offsets processed
> > > by creating a metaclient for the target table. We want to use pyspark, and
> > > I don't see a good way to query the commit metadata of table1 from
> > > DataSource.
> > >
> > > I'm considering making one of the below changes to hoodie to make this
> > > easier.
> > >
> > > Option 1: Add a new relation in hudi-spark to query commit metadata. This
> > > relation would present a 'metadata view' to query and filter metadata.
> > >
> > > Option 2: Add other DataSource options on top of incremental querying to
> > > allow fetching from the source table. For example, users can specify
> > > 'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
> > > query on table1. Then, IncrementalRelation would first read table2's
> > > metadata to identify 'consume.start.timestamp' and start the incremental
> > > read on table1 from that timestamp.
> > >
> > > Option 2 looks simpler to implement, but it seems a bit hacky because we
> > > are reading metadata from table2 when the data source is table1.
> > >
> > > Option 1 is a bit more complex, but it is cleaner and not tightly coupled
> > > to incremental reads. For example, use cases other than incremental reads
> > > can leverage the same relation to query metadata.
> > >
> > > What do you guys think? Let me know if there are other simpler solutions.
> > > Appreciate any feedback.
> > >
> > > Thanks
> > > Satish
> > >
>
