Hi Eric,

Based on what I've seen, I think you'd find challenges using Atlas for the purpose you described.
I have only seen Atlas used for tracking lineage relationships between conceptual data *sets*, not the provenance of individual data *records*. I'm not even sure how one could accurately track that using the usual Entity-based model one normally sees in Atlas. One could *massively* extend that model to represent each data *record* as a discrete Entity, but if you did that, you would effectively be duplicating *all your data* in hundreds of thousands of record Entities at *each stage* in your pipeline.

Personally I don't think the term "lineage" as used by Atlas and the term "provenance" as used by NiFi are that semantically distinct; but regardless, if we think of them both as some kind of "historical information", it's natural that NiFi captures it in much higher detail, because NiFi is a tool for working at scale with actual *data*, whereas Atlas is designed only for working with *metadata* - and to keep the volume of that metadata manageable, it's abstracted to the data*set* level. Consider also that most NiFi installations are configured to aggressively age out data from the provenance repository, because again, keeping all that detail indefinitely would massively balloon your storage needs.

I think the only way to use Atlas to investigate a question like the one you describe would be to trace through the lineage of datasets and look in detail at the Process Entities which form the links between the Dataset Entities - specifically examining the SQL or other logic recorded there, and considering how it would affect the particular record(s) you are investigating. That would be a manual process - I'm not aware of any tools to "replay" a piece of data through an Atlas lineage. And likewise, I'm afraid I have never seen anything so clever as to be able to follow logic and cross-reference between lineage in Atlas and provenance in NiFi.
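If you do go down that manual route, the raw material would be the JSON that Atlas's lineage REST endpoint (GET /api/atlas/v2/lineage/{guid}) returns for your final dataset. Just as a rough sketch of the kind of helper script I mean - not tested against a live Atlas, and the "is this a process?" check below is only a heuristic over typeName, not anything official - you could collect each Process Entity together with the datasets on either side of it, then go read the logic attached to each one:

```python
# Sketch: given a parsed Atlas v2 lineage response, list each Process
# entity with the datasets feeding into and out of it, so you can then
# inspect the SQL/logic each process applies to your records.
# Assumes the standard response shape: "guidEntityMap" mapping guid ->
# entity, and "relations" as fromEntityGuid/toEntityGuid edges.

def processes_in_lineage(lineage):
    entities = lineage["guidEntityMap"]

    def name(guid):
        # Fall back to the raw guid if qualifiedName is absent.
        ent = entities.get(guid, {})
        return ent.get("attributes", {}).get("qualifiedName", guid)

    def is_process(guid):
        # Heuristic: hive_process, spark_process, nifi_flow_path, etc.
        return "process" in entities.get(guid, {}).get("typeName", "").lower()

    result = {}
    for rel in lineage.get("relations", []):
        src, dst = rel["fromEntityGuid"], rel["toEntityGuid"]
        if is_process(dst):    # dataset -> process edge
            result.setdefault(dst, {"inputs": [], "outputs": []})["inputs"].append(name(src))
        elif is_process(src):  # process -> dataset edge
            result.setdefault(src, {"inputs": [], "outputs": []})["outputs"].append(name(dst))
    # Key the result by the process's qualifiedName for readability.
    return {name(guid): io for guid, io in result.items()}
```

Feed it the lineage JSON for your end table and you'd get back something like {"filter_orders_job": {"inputs": [...], "outputs": [...]}} for each hop - then the actual investigation (reading each process's query and asking "would this drop my record?") is still on you, I'm afraid.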
I guess most folks only need to get into the weeds of tracking individual records like this fairly rarely - hence no-one has invested the time to build tooling around the process. If this comes up often enough for you that you start to automate various parts of the process, I'm sure the community would be interested in any scripts etc. you may feel able to share 🙂

Couple of other thoughts:

- Is your overall pipeline driven by some central orchestrator, e.g. Airflow? If so, perhaps that would be another useful source of truth for tracking records from one place to another?
- Do you have any configs you can change in order to stop temporary tables/datasets from being dropped for a specific run? That way you could search through each step in the overall pipeline (start in the middle and work outwards) to see where your target records have been cut out.

And of course, hopefully I am wrong and someone else will chime in with a simple solution to your challenge!

Kind regards,
--Ryan

On Sat, Apr 20, 2024 at 4:54 PM Eric Secules <[email protected]> wrote:

> Hello,
>
> I'm researching tools to store and visualize data lineage to help debug
> issues in my ETL pipeline, for instance, why a particular piece of data got
> filtered out prior to loading into the database or why it didn't get merged
> into an output CSV file. Does Atlas have any limitations in visualizing or
> searching lineage where a data rows has either been split or merged from/to
> a large batch file containing like 500,000 rows and I want to find only the
> rows that got filtered in a certain direction?
>
> Also would it be possible to join event streams from NiFi provenance with
> higher level business process events that I send to Atlas through Kafka to
> get a full picture of what happened and also to take advantage of the
> split, route and merge events that are already there?
>
> Thanks,
> Eric
>
