Hi Ryan,

Thanks for the response. So would it be correct to say that lineage is an aggregate of provenance?

I was hoping to find a tool that would help us get into the weeds, as you say. Ideally it would take my provenance data and compile something like a Sankey diagram <https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fx4xib51xxsv21.png%3Fauto%3Dwebp%26s%3Da92899ca0e36d86959d07dbd778c59ba27c7b962> to show paths and volumes of data, let me drill down into individual paths, and also let me search for a given record ID and see its path through the system.
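For illustration, here's a rough sketch of the kind of thing I have in mind, assuming the provenance events have already been aggregated into (source component, destination component, record count) edges — the field names and the plotly-based rendering are just my assumption, not something NiFi produces out of the box:

# Rough sketch (assumed input shape, not NiFi output): build a Sankey diagram
# from provenance events that have been aggregated beforehand into
# (source, destination, count) edges.
import plotly.graph_objects as go

# Hypothetical aggregated edges: component -> component, with record counts.
edges = [
    ("IngestCSV", "ValidateRecords", 500_000),
    ("ValidateRecords", "MergeToOutput", 480_000),
    ("ValidateRecords", "FilteredOut", 20_000),
]

# Map component names to node indices, as plotly's Sankey trace expects.
labels = sorted({name for edge in edges for name in edge[:2]})
index = {name: i for i, name in enumerate(labels)}

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[index[src] for src, _, _ in edges],
        target=[index[dst] for _, dst, _ in edges],
        value=[count for _, _, count in edges],
    ),
))
fig.write_html("provenance_sankey.html")  # open in a browser to inspect paths and volumes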
To answer your questions:

- NiFi is the orchestrator that does most of the ETL work and routing / decision making, but there are other services involved too. So ideally I'd like something to close that loop, so I can get end-to-end provenance in one system that a customer support agent could use too.

Thanks,
Eric

On Sun, Apr 21, 2024, 3:35 PM Ryan Jendoubi <[email protected]> wrote:

> Hi Eric,
>
> Based on what I've seen, I think you'd find challenges using Atlas for
> the purpose you described.
>
> I have only seen Atlas used for tracking lineage relationships between
> conceptual data *sets*, not the provenance of individual data *records*.
> I'm not even sure how one could accurately track that using the usual
> Entity-based model one normally sees in Atlas. One could *massively*
> extend that model to represent each data *record* as a discrete Entity,
> but if you did that, you would effectively be duplicating *all your data*
> in hundreds of thousands of record Entities at *each stage* in your
> pipeline.
>
> Personally I don't think the term "lineage" as used by Atlas and the
> term "provenance" as used by NiFi are that semantically distinct; but
> regardless, if we think of them both as some kind of "historical
> information", it's natural that NiFi captures it in much higher detail,
> because it's a tool for working at scale with actual *data*, whereas
> Atlas is designed only for working with *metadata* - and to keep the
> volume of that metadata manageable, it's abstracted to the data*set*
> level. Consider also that most NiFi installations are configured to
> aggressively age out data from the provenance repository, because again,
> keeping all that detail indefinitely would massively balloon your
> storage needs.
>
> I think the only way to use Atlas to investigate a question like the one
> you describe would be to trace through the lineage of datasets and look
> in detail at the Process Entities which form the link between the
> Dataset entities, specifically examining the SQL or other logic recorded
> there, and considering how it would affect the particular record(s) you
> are investigating. That would be a manual process - I'm not aware of any
> tools to "replay" a piece of data through an Atlas lineage.
>
> And likewise, I'm afraid I haven't ever seen anything clever enough to
> follow logic and cross-reference between lineage in Atlas and provenance
> in NiFi.
>
> I guess most folks only need to get into the weeds of tracking
> individual records like this fairly rarely - hence no one has invested
> the time to build tooling around the process.
>
> If this comes up often enough for you that you start to automate various
> parts of the process, I'm sure the community would be interested in any
> scripts etc. you may feel able to share 🙂
>
> A couple of other thoughts:
>
> - Is your overall pipeline driven by some central orchestrator, e.g.
> Airflow? If so, perhaps that would be another useful source of truth for
> tracking records from one place to another?
> - Do you have any configs you can change to stop temporary
> tables/datasets from being dropped for a specific run? That way you
> could search through each step in the overall pipeline (start in the
> middle and work outwards) to see where your target records have been cut
> out.
>
> And of course, hopefully I am wrong and someone else will chime in with
> a simple solution to your challenge!
>
> Kind regards,
>
> --Ryan
>
> On Sat, Apr 20, 2024 at 4:54 PM Eric Secules <[email protected]> wrote:
>
>> Hello,
>>
>> I'm researching tools to store and visualize data lineage to help debug
>> issues in my ETL pipeline, for instance, why a particular piece of data
>> got filtered out prior to loading into the database, or why it didn't
>> get merged into an output CSV file. Does Atlas have any limitations in
>> visualizing or searching lineage where a data row has either been split
>> from or merged into a large batch file containing around 500,000 rows,
>> and I want to find only the rows that got filtered in a certain
>> direction?
>>
>> Also, would it be possible to join event streams from NiFi provenance
>> with higher-level business process events that I send to Atlas through
>> Kafka, to get a full picture of what happened and also to take
>> advantage of the split, route and merge events that are already there?
>>
>> Thanks,
>> Eric
>>
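As a rough illustration of the manual, dataset-level tracing Ryan describes above — walking outward from a dataset and inspecting the Process entities that link datasets — here is a minimal sketch against the Atlas v2 lineage REST endpoint. The base URL, credentials and starting GUID are placeholders, and the response field names follow the Atlas v2 lineage API as I understand it; they may differ between Atlas versions:

# Minimal sketch: fetch dataset-level lineage from Atlas and print each hop,
# so the Process entities between datasets (and the SQL or other logic they
# record) can be identified and inspected by hand.
import requests

ATLAS_URL = "http://atlas.example.com:21000"   # placeholder
AUTH = ("admin", "admin")                      # placeholder credentials
dataset_guid = "00000000-0000-0000-0000-000000000000"  # GUID of the dataset to start from

resp = requests.get(
    f"{ATLAS_URL}/api/atlas/v2/lineage/{dataset_guid}",
    params={"direction": "BOTH", "depth": 3},
    auth=AUTH,
)
resp.raise_for_status()
lineage = resp.json()

entities = lineage.get("guidEntityMap", {})
for rel in lineage.get("relations", []):
    src = entities.get(rel["fromEntityId"], {})
    dst = entities.get(rel["toEntityId"], {})
    # Each hop is dataset -> process or process -> dataset; the process side
    # is where the transformation logic worth examining tends to live.
    print(f'{src.get("displayText")} ({src.get("typeName")}) -> '
          f'{dst.get("displayText")} ({dst.get("typeName")})')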
