Thanks, Raymond! Yes, we can make this a config and leave it to the user to
decide whether they want a single global error table for all their Hudi
tables or one error table per Hudi table.

For this effort, does it make sense to take a dependency on the
multi-writer JIRA, HUDI-944, that liwei filed?
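
For illustration only, a minimal sketch of how such a config could surface
to a pyspark writer. The error-table option name below is hypothetical and
does not exist in Hudi; the table name, record key, and precombine options
are the standard write options.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a", 1000)], ["id", "val", "ts"])

    # Hypothetical sketch: 'hoodie.write.error.table.base.path' does not
    # exist in Hudi today; it only illustrates the choice discussed above.
    (df.write.format("hudi")
       .option("hoodie.table.name", "table1")
       .option("hoodie.datasource.write.recordkey.field", "id")
       .option("hoodie.datasource.write.precombine.field", "ts")
       # Global: all tables route error records to one shared path...
       .option("hoodie.write.error.table.base.path", "/data/hudi/global_errors")
       # ...or, left unset, a per-table default like '<basePath>_errors'.
       .mode("append")
       .save("/data/hudi/table1"))
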
On Wed, Jun 10, 2020 at 7:49 PM Shiyan Xu <xu.shiyan.raym...@gmail.com> wrote:

> Yes, Vinoth, it does go a bit too far with first-class support for these
> data. A global error table can do the job easily. As we discussed
> yesterday, parallel local error tables with an `_errors` suffix could also
> benefit some scenarios, like different product teams managing their own
> tables, or a B2B case where customers manage their own data. These would
> prefer good segregation of errors and other related data. Let me note down
> the points in RFC-20 for further discussion. Thanks for the feedback!
>
> On Wed, Jun 3, 2020 at 9:31 PM Vinoth Chandar <vin...@apache.org> wrote:
>
> > Hi Raymond,
> >
> > I am not sure generalizing this to all metadata - like errors and
> > metrics - would be a good idea. We can certainly implement logging
> > errors to a common errors Hudi table with a certain schema. But these
> > can be just regular "hudi" format tables.
> >
> > Unlike the timeline metadata, these are really external data, not
> > related to a given table's core functioning. We don't necessarily want
> > to keep one error table per Hudi table.
> >
> > Thoughts?
> >
> > On Tue, Jun 2, 2020 at 5:34 PM Shiyan Xu <xu.shiyan.raym...@gmail.com>
> > wrote:
> >
> > > I also encountered use cases where I'd like to programmatically query
> > > metadata. +1 on the idea of format("hudi-timeline").
> > >
> > > I also feel that the metadata can be extended further to include more
> > > info like errors, metrics/write statistics, etc. Like the newly
> > > proposed error handling, we could also store all metrics or write
> > > stats there too, and relate them to the timeline actions.
> > >
> > > A potential use case could be: with all this info encapsulated within
> > > metadata, we may be able to derive some insightful results (by
> > > checking against some benchmarks) and answer questions like: does
> > > table A need more tuning? Does table B exceed its error budget?
> > >
> > > Programmatic queries against this metadata can help manage many
> > > tables for diagnosis and inspection. We may need different read
> > > formats like format("hudi-errors") or format("hudi-metrics").
> > >
> > > Sorry, this sidetracked from the original question. These are really
> > > rough high-level thoughts and may show signs of over-engineering.
> > > Would like to hear some feedback. Thanks.
> > >
> > > On Mon, Jun 1, 2020 at 9:28 PM Satish Kotha
> > > <satishko...@uber.com.invalid> wrote:
> > >
> > > > Got it. I'll look into implementation choices for creating a new
> > > > data source. Appreciate all the feedback.
> > > >
> > > > On Mon, Jun 1, 2020 at 7:53 PM Vinoth Chandar <vin...@apache.org>
> > > > wrote:
> > > >
> > > > > > Is it to separate data and metadata access?
> > > > > Correct. We already have modes for querying data using
> > > > > format("hudi"). I feel it will get very confusing to mix data and
> > > > > metadata in the same source; e.g., a lot of the options we support
> > > > > for data may not even make sense for the TimelineRelation.
> > > > >
> > > > > > This class seems like a list of static methods, I'm not seeing
> > > > > > where these are accessed from
> > > > > That's the public API for obtaining this information from
> > > > > Scala/Java Spark. If you have a way of calling this from Python
> > > > > through some bridge without painful glue (e.g., Jython), that
> > > > > might be a tactical solution that can meet your needs.
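
One workable bridge from pyspark is py4j's JVM gateway, which can invoke
the Scala/Java helpers directly when the hudi-spark bundle is on the driver
classpath. A minimal sketch, assuming the HoodieDataSourceHelpers method
names (latestCommit, listCommitsSince), which may differ across Hudi
versions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Reach into the JVM via py4j and grab a Hadoop FileSystem handle.
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(
        spark._jsc.hadoopConfiguration())

    # Call the static helpers on org.apache.hudi.HoodieDataSourceHelpers.
    helpers = jvm.org.apache.hudi.HoodieDataSourceHelpers
    latest = helpers.latestCommit(fs, "/data/hudi/table1")
    commits = helpers.listCommitsSince(fs, "/data/hudi/table1", "000")
    print(latest, list(commits))
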
> > > > > On Mon, Jun 1, 2020 at 5:07 PM Satish Kotha
> > > > > <satishko...@uber.com.invalid> wrote:
> > > > >
> > > > > > Thanks for the feedback.
> > > > > >
> > > > > > What is the advantage of doing
> > > > > > spark.read.format("hudi-timeline").load(basepath) as opposed to
> > > > > > adding a new relation? Is it to separate data and metadata
> > > > > > access?
> > > > > >
> > > > > > > Are you looking for similar functionality as
> > > > > > > HoodieDatasourceHelpers?
> > > > > > This class seems like a list of static methods; I'm not seeing
> > > > > > where these are accessed from. But I need a way to query
> > > > > > metadata details easily in pyspark.
> > > > > >
> > > > > > On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar
> > > > > > <vin...@apache.org> wrote:
> > > > > >
> > > > > > > Also please take a look at
> > > > > > > https://issues.apache.org/jira/browse/HUDI-309
> > > > > > >
> > > > > > > This was an effort to make the timeline more generalized for
> > > > > > > querying (for a different purpose), but it is good to revisit
> > > > > > > now.
> > > > > > >
> > > > > > > On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org
> > > > > > > <vbal...@apache.org> wrote:
> > > > > > >
> > > > > > > > I strongly recommend using a separate datasource relation
> > > > > > > > (option 1) to query the timeline. It is elegant and fits
> > > > > > > > well with Spark APIs.
> > > > > > > > Thanks,
> > > > > > > > Balaji.V
> > > > > > > >
> > > > > > > > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
> > > > > > > > <vin...@apache.org> wrote:
> > > > > > > >
> > > > > > > > Hi Satish,
> > > > > > > >
> > > > > > > > Are you looking for similar functionality as
> > > > > > > > HoodieDatasourceHelpers?
> > > > > > > >
> > > > > > > > We have historically relied on the CLI to inspect the table,
> > > > > > > > which does not lend itself well to programmatic access.
> > > > > > > > Overall I like option 1 - allowing the timeline to be
> > > > > > > > queryable with a standard schema does seem way nicer.
> > > > > > > >
> > > > > > > > I am wondering, though, if we should introduce a new view.
> > > > > > > > Instead, we can use a different data source name -
> > > > > > > > spark.read.format("hudi-timeline").load(basepath). We can
> > > > > > > > start by just allowing querying of the active timeline and
> > > > > > > > expand this to the archived timeline?
> > > > > > > >
> > > > > > > > What do others think?
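
As a sketch of how this proposed (not yet implemented) 'hudi-timeline'
source might look from pyspark; the column names (timestamp, action, state)
are assumptions about a possible timeline schema, not a defined API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Proposed API sketch only: format("hudi-timeline") does not exist yet.
    timeline = spark.read.format("hudi-timeline").load("/data/hudi/table1")
    (timeline
       .filter("action = 'commit' AND state = 'COMPLETED'")
       .orderBy("timestamp", ascending=False)
       .show(5))
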
> > > > > > > > On Fri, May 29, 2020 at 2:37 PM Satish Kotha
> > > > > > > > <satishko...@uber.com.invalid> wrote:
> > > > > > > >
> > > > > > > > > Hello folks,
> > > > > > > > >
> > > > > > > > > We have a use case to incrementally generate data for a
> > > > > > > > > hudi table (say 'table2') by transforming data from
> > > > > > > > > another hudi table (say 'table1'). We want to atomically
> > > > > > > > > store commit timestamps read from table1 into table2's
> > > > > > > > > commit metadata.
> > > > > > > > >
> > > > > > > > > This is similar to how DeltaStreamer operates with Kafka
> > > > > > > > > offsets. However, DeltaStreamer is Java code and can
> > > > > > > > > easily query the Kafka offset processed by creating a
> > > > > > > > > metaclient for the target table. We want to use pyspark,
> > > > > > > > > and I don't see a good way to query the commit metadata
> > > > > > > > > of table1 from the DataSource.
> > > > > > > > >
> > > > > > > > > I'm considering making one of the below changes to hoodie
> > > > > > > > > to make this easier.
> > > > > > > > >
> > > > > > > > > Option 1: Add a new relation in hudi-spark to query commit
> > > > > > > > > metadata. This relation would present a 'metadata view' to
> > > > > > > > > query and filter metadata.
> > > > > > > > >
> > > > > > > > > Option 2: Add another DataSource option on top of
> > > > > > > > > incremental querying to allow fetching from the source
> > > > > > > > > table. For example, users can specify
> > > > > > > > > 'hoodie.consume.metadata.table: table2BasePath' and issue
> > > > > > > > > an incremental query on table1. Then IncrementalRelation
> > > > > > > > > would first read table2's metadata to identify
> > > > > > > > > 'consume.start.timestamp' and start the incremental read
> > > > > > > > > on table1 with that timestamp.
> > > > > > > > >
> > > > > > > > > Option 2 looks simpler to implement, but seems a bit hacky
> > > > > > > > > because we are reading metadata from table2 when the data
> > > > > > > > > source is table1.
> > > > > > > > >
> > > > > > > > > Option 1 is a bit more complex, but it is cleaner and not
> > > > > > > > > tightly coupled to incremental reads. For example, use
> > > > > > > > > cases other than incremental reads can leverage the same
> > > > > > > > > relation to query metadata.
> > > > > > > > >
> > > > > > > > > What do you guys think? Let me know if there are other,
> > > > > > > > > simpler solutions. Appreciate any feedback.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Satish
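
For illustration, a sketch of how Option 2 might look to a pyspark user.
'hoodie.consume.metadata.table' is the proposed option from the message
above and does not exist; the query-type option follows hudi-spark's
incremental read options, whose exact names have varied across versions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Option 2 sketch: instead of passing an explicit
    # 'hoodie.datasource.read.begin.instanttime', IncrementalRelation
    # would derive the begin instant from table2's latest commit metadata.
    df = (spark.read.format("hudi")
            .option("hoodie.datasource.query.type", "incremental")
            .option("hoodie.consume.metadata.table", "/data/hudi/table2")
            .load("/data/hudi/table1"))
    df.show()
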