Hi Eric,

Based on what I've seen, I think you'd find challenges using Atlas for the purpose you described.
I have only seen Atlas used for tracking lineage relationships between conceptual data *sets*, not the provenance of individual data *records*. I'm not even sure how one could accurately track that using the usual Entity-based model one normally sees in Atlas. One could *massively* extend that model to represent each data *record* as a discrete Entity, but if you did that, you would effectively be duplicating *all your data* in hundreds of thousands of record Entities at *each stage* in your pipeline.

Personally I don't think the term "lineage" as used by Atlas and the term "provenance" as used by NiFi are that semantically distinct; but regardless, if we think of them both as some kind of "historical information", it's natural that NiFi captures it in much higher detail, because NiFi is a tool for working at scale with actual *data*, whereas Atlas is designed only for working with *metadata* - and to keep the volume of that metadata manageable, it's abstracted to the data*set* level. Consider also that most NiFi installations are configured to aggressively age out data from the provenance repository, because again, keeping all that detail indefinitely would massively balloon your storage needs.

I think the only way to use Atlas to investigate a question like the one you describe would be to trace through the lineage of datasets and look in detail at the Process Entities which form the links between the Dataset Entities - specifically examining the SQL or other logic recorded there, and considering how it would affect the particular record(s) you are investigating. That would be a manual process - I'm not aware of any tools to "replay" a piece of data through an Atlas lineage. And likewise, I'm afraid I have never seen anything so clever as to be able to follow logic and cross-reference between lineage in Atlas and provenance in NiFi.
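If you do go down that manual route, the raw material would be the JSON that Atlas's lineage REST endpoint (GET /api/atlas/v2/lineage/{guid}) returns for your final dataset. Just as a rough sketch of the kind of helper script I mean - not tested against a live Atlas, and the "is this a process?" check below is only a heuristic over typeName, not anything official - you could collect each Process Entity together with the datasets on either side of it, then go read the logic attached to each one:

```python
# Sketch: given a parsed Atlas v2 lineage response, list each Process
# entity with the datasets feeding into and out of it, so you can then
# inspect the SQL/logic each process applies to your records.
# Assumes the standard response shape: "guidEntityMap" mapping guid ->
# entity, and "relations" as fromEntityGuid/toEntityGuid edges.

def processes_in_lineage(lineage):
    entities = lineage["guidEntityMap"]

    def name(guid):
        # Fall back to the raw guid if qualifiedName is absent.
        ent = entities.get(guid, {})
        return ent.get("attributes", {}).get("qualifiedName", guid)

    def is_process(guid):
        # Heuristic: hive_process, spark_process, nifi_flow_path, etc.
        return "process" in entities.get(guid, {}).get("typeName", "").lower()

    result = {}
    for rel in lineage.get("relations", []):
        src, dst = rel["fromEntityGuid"], rel["toEntityGuid"]
        if is_process(dst):    # dataset -> process edge
            result.setdefault(dst, {"inputs": [], "outputs": []})["inputs"].append(name(src))
        elif is_process(src):  # process -> dataset edge
            result.setdefault(src, {"inputs": [], "outputs": []})["outputs"].append(name(dst))
    # Key the result by the process's qualifiedName for readability.
    return {name(guid): io for guid, io in result.items()}
```

Feed it the lineage JSON for your end table and you'd get back something like {"filter_orders_job": {"inputs": [...], "outputs": [...]}} for each hop - then the actual investigation (reading each process's query and asking "would this drop my record?") is still on you, I'm afraid.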
I guess most folks only need to get into the weeds of tracking individual records like this fairly rarely - hence no-one has invested the time to build tooling around the process. If this comes up often enough for you that you start to automate various parts of the process, I'm sure the community would be interested in any scripts etc. you may feel able to share 🙂

Couple of other thoughts:

- Is your overall pipeline driven by some central orchestrator, e.g. Airflow? If so, perhaps that would be another useful source of truth for tracking records from one place to another?
- Do you have any configs you can change in order to stop temporary tables/datasets from being dropped for a specific run? That way you could search through each step in the overall pipeline (start in the middle and work outwards) to see where your target records have been cut out.

And of course, hopefully I am wrong and someone else will chime in with a simple solution to your challenge!

Kind regards,
--Ryan

On Sat, Apr 20, 2024 at 4:54 PM Eric Secules <[email protected]> wrote:

> Hello,
>
> I'm researching tools to store and visualize data lineage to help debug
> issues in my ETL pipeline, for instance, why a particular piece of data got
> filtered out prior to loading into the database or why it didn't get merged
> into an output CSV file. Does Atlas have any limitations in visualizing or
> searching lineage where a data rows has either been split or merged from/to
> a large batch file containing like 500,000 rows and I want to find only the
> rows that got filtered in a certain direction?
>
> Also would it be possible to join event streams from NiFi provenance with
> higher level business process events that I send to Atlas through Kafka to
> get a full picture of what happened and also to take advantage of the
> split, route and merge events that are already there?
>
> Thanks,
> Eric
>
