Hi Ryan,

Thanks for the response. So would it be correct to say that lineage is an
aggregate of provenance? I was hoping to find a tool that would help us get
into the weeds, as you say: something that takes my provenance data and
compiles it into something like a Sankey diagram
<https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fx4xib51xxsv21.png%3Fauto%3Dwebp%26s%3Da92899ca0e36d86959d07dbd778c59ba27c7b962>
showing paths and volumes of data, and then lets me drill down into
individual paths or search for a given record ID and see its path through
the system.
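
To make it concrete, here's roughly what I have in mind (an untested sketch;
the NiFi host is made up and the endpoint/field names are from memory, so
treat it as illustrative): pull recent provenance events from the NiFi REST
API, group them per FlowFile, and plot the component-to-component hop counts
as a Sankey diagram with plotly.

import time
from collections import Counter, defaultdict

import plotly.graph_objects as go
import requests

NIFI = "https://nifi.example.com:8443/nifi-api"   # placeholder host; auth/TLS handling omitted

# 1. Submit a provenance query and poll until it finishes.
resp = requests.post(f"{NIFI}/provenance",
                     json={"provenance": {"request": {"maxResults": 10000}}}).json()
prov_id = resp["provenance"]["id"]
while not resp["provenance"]["finished"]:
    time.sleep(1)
    resp = requests.get(f"{NIFI}/provenance/{prov_id}").json()
events = resp["provenance"]["results"]["provenanceEvents"]
requests.delete(f"{NIFI}/provenance/{prov_id}")   # clean up the server-side query

# 2. Order each FlowFile's events (eventId is good enough for a sketch)
#    and count component-to-component hops.
by_flowfile = defaultdict(list)
for e in events:
    by_flowfile[e["flowFileUuid"]].append((e["eventId"], e["componentName"]))
hops = Counter()
for path in by_flowfile.values():
    names = [name for _, name in sorted(path)]
    hops.update(zip(names, names[1:]))

# 3. Feed the hop counts into a Sankey: node = component, link = hop volume.
labels = sorted({name for hop in hops for name in hop})
index = {name: i for i, name in enumerate(labels)}
go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(source=[index[a] for a, b in hops],
              target=[index[b] for a, b in hops],
              value=list(hops.values())),
)).show()

Drill-down and record-ID search would then just be filters over the same
event list before step 2, which is the part I'd rather not build and
maintain myself.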

To answer your questions,

   - NiFi is the orchestrator that does most of the ETL work and the routing
   / decision making, but there are other services involved too. So ideally
   I'd like something that closes that loop and gives me end-to-end
   provenance in one system that a customer support agent could use as well
   (rough sketch of what I mean below).
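
What I mean by closing the loop, roughly: the non-NiFi services would emit a
minimal provenance-style event keyed by record ID to somewhere central, so
support staff can line those up against NiFi's provenance in one place. A
sketch of the kind of event I have in mind (the topic, field names and
kafka-python usage are all just illustrative):

import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

from kafka import KafkaProducer   # pip install kafka-python

@dataclass
class RecordEvent:
    record_id: str     # the business record ID support would search on
    service: str       # which service touched the record
    action: str        # e.g. FILTERED, MERGED, LOADED
    detail: str = ""   # free-text reason, rule name, target file, etc.
    timestamp: str = ""

producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",        # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode(),
)

def emit(event: RecordEvent) -> None:
    # Stamp the event and publish it to a shared topic for later correlation.
    event.timestamp = event.timestamp or datetime.now(timezone.utc).isoformat()
    producer.send("record-provenance", asdict(event))  # hypothetical topic

emit(RecordEvent(record_id="ORD-12345", service="csv-merger",
                 action="FILTERED", detail="failed schema check"))
producer.flush()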


Thanks,
Eric

On Sun, Apr 21, 2024, 3:35 PM Ryan Jendoubi <[email protected]> wrote:

> Hi Eric,
>
> Based on what I've seen, I think you'd find challenges using Atlas for the
> purpose you described.
>
> I have only seen Atlas used for tracking lineage relationships between
> conceptual data *sets*, not the provenance of individual data *records*.
> I'm not even sure how one could accurately track that using the usual
> Entity-based model one normally sees in Atlas. One could *massively*
> extend that model to represent each data *record* as a discrete Entity,
> but if you did that you would effectively be duplicating *all your data*
> across hundreds of thousands of record Entities at *each stage* of your
> pipeline.
>
> Personally I don't think the term "lineage" as Atlas uses it and the term
> "provenance" as NiFi uses it are all that semantically distinct; but
> regardless, if we think of them both as some kind of "historical
> information", it's natural that NiFi captures it in much higher detail,
> because it's a tool for working at scale with actual *data*, whereas Atlas
> is designed to work only with *metadata* - and to keep the volume of that
> metadata manageable, it's abstracted to the data*set* level.
> Consider also that most NiFi installations are configured to aggressively
> age out data from the provenance repository, because again, keeping all
> that detail indefinitely would massively balloon your storage needs.
>
> I think the only way to use Atlas to investigate a question like the one
> you describe would be to trace through the lineage of datasets and look in
> detail at the Process Entities which form the link between the Dataset
> entities, specifically examining the SQL or other logic recorded there, and
> considering how it would affect the particular record(s) you are
> investigating. That would be a manual process - I'm not aware of any tools
> to "replay" a piece of data through an Atlas lineage.
>
> And likewise, I'm afraid I haven't ever seen anything clever enough to
> follow the logic and cross-reference between lineage in Atlas and
> provenance in NiFi.
>
> I guess most folks only need to get into the weeds of tracking individual
> records like this fairly rarely - hence no-one has invested the time to
> build tooling around the process.
>
> If this comes up often enough for you that you start to automate various
> parts of the process, I'm sure the community would be interested in any
> scripts etc. you may feel able to share 🙂
>
> Couple of other thoughts:
>
>    - Is your overall pipeline driven by some central orchestrator, e.g.
>    Airflow? If so, perhaps that would be another useful source of truth for
>    tracking records from one place to another?
>    - Do you have any configs you could change to stop temporary
>    tables/datasets from being dropped for a specific run? That way you could
>    search each step in the overall pipeline (start in the middle and work
>    outwards) to see where your target records were cut out (rough sketch
>    below).
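>
> For that second idea, here's the sort of thing I mean, as a rough sketch -
> the stage names and the check_stage function are entirely hypothetical
> (the check might be a SQL count against a retained temp table, a grep over
> a kept intermediate file, etc.); the point is just that presence/absence
> of a record is usually monotone along the pipeline, so you can bisect:
>
> STAGES = ["raw_landing", "dedupe", "filter_invalid", "enrich", "merge_csv"]
>
> def check_stage(stage: str, record_id: str) -> bool:
>     """Return True if record_id is still present after this stage (stub)."""
>     raise NotImplementedError("query the retained table/file for this stage")
>
> def first_missing_stage(record_id: str):
>     # Classic bisection: find the first stage after which the record is gone.
>     # Assumes the record is dropped once and never re-introduced downstream.
>     lo, hi = 0, len(STAGES)
>     while lo < hi:
>         mid = (lo + hi) // 2
>         if check_stage(STAGES[mid], record_id):
>             lo = mid + 1      # still present here, look further downstream
>         else:
>             hi = mid          # already missing here, look upstream
>     return STAGES[lo] if lo < len(STAGES) else None   # None: never dropped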
>
>
> And of course, hopefully I am wrong and someone else will chime in with a
> simple solution to your challenge!
>
> Kind regards,
>
> --Ryan
>
> On Sat, Apr 20, 2024 at 4:54 PM Eric Secules <[email protected]> wrote:
>
>> Hello,
>>
>> I'm researching tools to store and visualize data lineage to help debug
>> issues in my ETL pipeline, for instance, why a particular piece of data got
>> filtered out prior to loading into the database or why it didn't get merged
>> into an output CSV file. Does Atlas have any limitations in visualizing or
>> searching lineage where data rows have either been split out of or merged
>> into a large batch file containing, say, 500,000 rows, and I want to find
>> only the rows that got filtered in a certain direction?
>>
>> Also would it be possible to join event streams from NiFi provenance with
>> higher level business process events that I send to Atlas through Kafka to
>> get a full picture of what happened and also to take advantage of the
>> split, route and merge events that are already there?
>>
>> Thanks,
>> Eric
>>
>
