Re: Proposal: standard record metadata attributes for data sources

Mike Thomsen Sun, 13 May 2018 11:48:59 -0700

@Joe @Matt

This is related to the point Joe made about provenance in the graph DB
thread. My thought here was that we need some standards for enriching the
metadata about what was fetched, so that no matter how you store the
provenance, you can query it to answer questions like when a data set was
loaded into NiFi or how many records went through a terminating processor.
IMO this could help batch-oriented organizations feel more at ease with
something stream-oriented like NiFi.
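
To make the kind of question concrete, here is a rough sketch of the
summary I have in mind. The event layout (plain dicts with "event_type",
"timestamp" and "attributes" keys) is purely hypothetical and is not NiFi's
provenance schema; I'm also assuming each flowfile carries a record.count
attribute and falling back to 1 when it doesn't.

```python
from collections import defaultdict


def summarize_drops(events):
    """Summarize DROP events per source collection.

    `events` is a hypothetical list of dicts, not NiFi's real provenance
    format: {"event_type": str, "timestamp": int, "attributes": dict}.
    """
    summary = defaultdict(lambda: {"first": None, "last": None, "records": 0})
    for event in events:
        if event["event_type"] != "DROP":
            continue  # only terminating events matter for this question
        attrs = event["attributes"]
        dataset = attrs.get("source.collection")
        if dataset is None:
            continue  # flowfile was not tagged with the proposed attributes
        entry = summary[dataset]
        ts = event["timestamp"]
        entry["first"] = ts if entry["first"] is None else min(entry["first"], ts)
        entry["last"] = ts if entry["last"] is None else max(entry["last"], ts)
        # Assumption: record.count is present; count the flowfile as 1 if not.
        entry["records"] += int(attrs.get("record.count", 1))
    return dict(summary)
```

In practice the same grouping would be a query against whatever store holds
the provenance events; the point is only that the source.* attributes give
you something stable to group on.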


On Fri, Apr 13, 2018 at 4:01 PM Mike Thomsen <mikerthom...@gmail.com> wrote:

> I'd like to propose that all Get/Fetch/Query processors that are neither
> deprecated nor likely to be deprecated get a standard convention for
> attributes that describe things like:
>
> 1. Source system.
> 2. Database/table/index/collection/etc.
> 3. The lookup criteria that were used (similar to the "query attribute"
> some already have).
>
> Using GetMongo as an example, it would add something like this:
>
> source.url=mongodb://localhost:27017
> source.database=testdb
> source.collection=test_collection
> source.query={ "username": "john.smith" }
> source.criteria.username=john.smith
> (GetMongo would parse the query and add that last attribute.)
>
> We have a use case where a team is coming from an extremely batch-oriented
> view and really wants to know when "dataset X" was run. Our solution was to
> extract that from the result set because the dataset name is one of the
> fields in the JSON body.
>
> I think this would help expand what you can do out of the box with
> provenance tracking because it would provide a lot of useful information
> that could be stored in Solr or ES and then queried against terminating
> processors' DROP events to get a solid window into when jobs were run
> historically.
>
> Thoughts?
>

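As a minimal sketch of how the GetMongo example in the quoted proposal could
be derived: the helper name and its parameters below are hypothetical, not
part of any existing NiFi API, and flattening only top-level equality
criteria is my own assumption about what "parse the query" would mean.

```python
import json


def source_attributes(url, database, collection, query):
    """Build the proposed source.* attributes for one fetch.

    Hypothetical helper for illustration; `query` is the lookup document
    as a plain dict.
    """
    attrs = {
        "source.url": url,
        "source.database": database,
        "source.collection": collection,
        "source.query": json.dumps(query),
    }
    # Assumption: only top-level fields with scalar values are flattened.
    # Operators such as $or, and nested documents such as {"$gt": 5},
    # remain visible in source.query only.
    for field, value in query.items():
        if field.startswith("$") or isinstance(value, (dict, list)):
            continue
        attrs["source.criteria." + field] = str(value)
    return attrs
```

Calling it with the values from the example yields a
source.criteria.username attribute of john.smith, while a range query such
as { "age": { "$gt": 30 } } would add no source.criteria.* entry at all.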