[
https://issues.apache.org/jira/browse/ATLAS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000621#comment-16000621
]
ernie ostic edited comment on ATLAS-1765 at 5/8/17 9:30 PM:
------------------------------------------------------------
Initial thoughts on search and query use types, based on various use cases that
are often seen with Infosphere Information Governance Catalog (IGC).
Search/Queries against the repository.
Here are various patterns we see frequently with IGC. The categories below
are loose, but correspond to the user's objective, their level of
experience with the tool, and whether they are in the role of "governance team"
vs "regular enterprise user". Breaking them up here just to aid with further
discussion. Each of these comes up in three "access" modes, fairly equally:
(1) online, using the GUI; (2) batch, via the command line, for extraction to a
.csv or other export file structure; (3) via the REST API. Each also typically
allows a list of "properties" to be simply selected along with said "asset"
(name, description, internal identifier, date created, etc.)
This is more a listing of "syntax examples" than pure business use cases, but
each should be easily backed into a persona or business use case as necessary.
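To make the "select a list of properties along with the asset" idea concrete for the batch/export mode, here is a minimal sketch. The asset records, their field names, and the `export_csv` helper are all hypothetical; this is not the IGC or Atlas export interface, only an illustration of projecting chosen properties into a .csv structure.

```python
import csv
import io

# Hypothetical asset records; the field names are illustrative only.
ASSETS = [
    {"name": "CUSTOMER_ID", "type": "column",
     "description": "Customer key", "created": "2017-01-15"},
    {"name": "Revenue", "type": "term",
     "description": "Recognized revenue", "created": "2017-03-02"},
]


def export_csv(assets, properties):
    """Return CSV text containing only the selected properties of each asset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=properties,
        extrasaction="ignore",   # drop properties that were not selected
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(assets)
    return buffer.getvalue()


if __name__ == "__main__":
    # The caller picks which properties accompany each asset.
    print(export_csv(ASSETS, ["name", "created"]))
```

The same selection list could in principle drive the GUI column picker or a REST response projection; that is an assumption about design, not something the tools are known to share.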
Governance Queries
These are queries that are typically done by the "governance team" or with
management responsibility for the governance project -- and are often the
chassis for a governance dashboard or other reporting system that measures
progress towards governance objectives. A site may have a target that "all
assets in the data lake" be fully governed by "a future target date". These
kinds of queries help measure and move governance teams closer to that goal.
The "list all" kinds of queries are also often the source of a "count" so that
results can be graphed on the dashboard. Searches like (7) below are typically
"validation" queries, sometimes issued in real-time to enforce completeness in
the repository, or perhaps nightly to reject content or force "re-review" for
things that are incomplete. The same "list all" kinds of queries often feed
resulting answer sets into a batch update tool that might perform assignments
or property updates in bulk, perhaps in an offline window.
Ultimately, measuring progress and completeness for asset definitions and their
relationships is about maximizing the value of the repository --- from
providing more-easily-understood names for technical assets, through
establishing responsibility for enterprise assets and their data quality.
1. List out all assets (usually columns) that have not yet been assigned a Term
2. List out all assets (usually columns) that have not yet been assigned a
Steward
3. List out all assets (any kind) that are being managed by Steward <steward>
4. List out all assets that have been modified since <date>
5. List out all assets that are in a particular state (such as "Draft", where
"workflow" in IGC has been implemented)
6. List out all assets based on their time remaining in a particular state
("all terms in draft for more than <n> days")
7. List out all assets (usually Terms) where property <property> is null
[similar to where property <property> is <value>, but called out here
specifically because it is a common "management level" governance query]
8. List out all assets (usually Terms) where relationship <relationship> is
null
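A few of the governance patterns above (1, 2, 4, 6 and 7) can be sketched as simple filters whose result counts feed a dashboard. Everything here is an assumption made for illustration: the `Asset` class, its fields, and the helper names are hypothetical and do not reflect the IGC or Atlas data model or query syntax.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Asset:
    """Hypothetical, simplified asset record."""
    name: str
    kind: str                                   # e.g. "column", "table", "term"
    steward: Optional[str] = None
    terms: list = field(default_factory=list)   # assigned business terms
    state: str = "Published"
    state_since: date = date(2017, 1, 1)
    modified: date = date(2017, 1, 1)
    properties: dict = field(default_factory=dict)


def without_term(assets, kind="column"):
    # Pattern 1: assets not yet assigned a Term.
    return [a for a in assets if a.kind == kind and not a.terms]


def without_steward(assets, kind="column"):
    # Pattern 2: assets not yet assigned a Steward.
    return [a for a in assets if a.kind == kind and a.steward is None]


def modified_since(assets, since):
    # Pattern 4: assets modified on or after the given date.
    return [a for a in assets if a.modified >= since]


def in_state_longer_than(assets, state, days, today):
    # Pattern 6: assets that have sat in a state for more than <n> days.
    return [a for a in assets
            if a.state == state and (today - a.state_since).days > days]


def property_is_null(assets, prop, kind="term"):
    # Pattern 7: a missing property is treated the same as an explicit null.
    return [a for a in assets
            if a.kind == kind and a.properties.get(prop) is None]


TODAY = date(2017, 5, 8)
SAMPLE = [
    Asset("CUST_ID", "column", steward="alice", terms=["Customer Identifier"]),
    Asset("CUST_NM", "column", modified=date(2017, 5, 1)),
    Asset("Revenue", "term", state="Draft", state_since=date(2017, 3, 1),
          properties={"definition": None}),
    Asset("Churn", "term", state="Draft", state_since=date(2017, 5, 5),
          properties={"definition": "Customers lost per period"}),
]

if __name__ == "__main__":
    # "List all" results double as counts for a dashboard.
    dashboard = {
        "columns without a term": len(without_term(SAMPLE)),
        "columns without a steward": len(without_steward(SAMPLE)),
        "terms in draft > 30 days": len(
            in_state_longer_than(SAMPLE, "Draft", 30, TODAY)),
        "terms with no definition": len(property_is_null(SAMPLE, "definition")),
    }
    for label, count in dashboard.items():
        print(f"{label}: {count}")
```

The same lists could be handed to a bulk-update step (for example, assigning a default steward), which is the "batch update tool" usage described earlier.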
Research Queries
Often run by an individual user or data researcher, and sometimes by
developers. These queries often exploit a "lineage" relationship, usually with
a particular goal in mind for "that asset" or "that steward", such as finding
the lineage for "one particular" report.
9. List out all assets <a specific type> where property <property> is <value>
[string, between, equal_to, etc.]
10. List out all assets <relationship, such as "owned by"> <steward>
11. Show all assets "written by" <name of process or other data-mover kind of
asset>. For Atlas in its current form, this might be the name of a Sqoop process
12. For <asset> (type and name), show immediate upstream asset (and properties
of that asset...last time it ran, status code, etc.)
13. Show a Term and all of its "history" (particularly important for comments
by reviewers over time)
14. Various complex "set" retrievals, qualified by existence of a particular
instance, such as "dump out all database/table/column details for every
database that contains a schema called <schemaName>" [at times, the qualifier is
just "if it exists" as a child, but all children are still dumped, possibly
requiring multiple requests or additional filtering against the final returned
list]
15. List all transformations (and their sources/targets/processes) where
nullability was changed for a column from null to "not null" [that is a
specific example, but the same could apply to datatype changes, column name
changes, specific mappings or functions, etc.]
16. Requests that exploit multiple relationships for qualification...such as
"list all tables that have a Steward...but only for Stewards who also
manage/own assets in the Risk Collection"
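Patterns 11, 12 and 16 above can be sketched over a toy lineage and stewardship model. The asset names, the process names, and the three lookup structures are hypothetical and exist only to show the shape of each request; a real repository would answer these through its own graph or search interface.

```python
# Hypothetical lineage edges: (source asset, process, target asset).
LINEAGE = [
    ("crm.customers", "sqoop_import_customers", "lake.customers_raw"),
    ("lake.customers_raw", "clean_customers", "lake.customers"),
    ("lake.customers", "build_report", "report.customer_summary"),
]

# Hypothetical stewardship, asset kinds, and collection membership.
STEWARD_OF = {
    "lake.customers": "alice",
    "lake.orders": "bob",
    "risk.exposures": "alice",
}
KIND = {
    "lake.customers": "table",
    "lake.orders": "table",
    "risk.exposures": "table",
}
COLLECTIONS = {"Risk": {"risk.exposures"}}


def written_by(process):
    # Pattern 11: assets written by a given process or data mover.
    return [target for (_, proc, target) in LINEAGE if proc == process]


def immediate_upstream(asset):
    # Pattern 12: one hop upstream, with the process that produced the asset.
    return [(source, proc) for (source, proc, target) in LINEAGE
            if target == asset]


def tables_with_steward_in_collection(collection):
    # Pattern 16: tables that have a steward, restricted to stewards who
    # also manage at least one asset in the named collection.
    members = COLLECTIONS.get(collection, set())
    qualifying = {STEWARD_OF[a] for a in members if a in STEWARD_OF}
    return sorted(asset for asset, steward in STEWARD_OF.items()
                  if KIND.get(asset) == "table" and steward in qualifying)


if __name__ == "__main__":
    print(written_by("sqoop_import_customers"))
    print(immediate_upstream("report.customer_summary"))
    print(tables_with_steward_in_collection("Risk"))
```

Note that pattern 16 needs two passes (first find the qualifying stewards, then the tables), which is the kind of multi-relationship qualification that may take more than one request against a simpler search API.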
was (Author: eostic):
Initial thoughts on search and query use types, based on various use cases that
are often seen with Infosphere Information Governance Catalog (IGC).
Search/Queries against the repository.
Here are various patterns we see frequently with IGC. The categories below
are loose, but correspond to the user's objective, their level of
experience with the tool, and whether they are in the role of "governance team"
vs "regular enterprise user". Breaking them up here just to aid with further
discussion. Each of these comes up in three "access" modes, fairly equally:
[(1) online, using the GUI; (2) batch, via the command line, for extraction to a
.csv or other export file structure; (3) via the REST API]. Each also typically
allows a list of "properties" to be simply selected along with said "asset"
(name, description, internal identifier, date created, etc.)
This is more a listing of "syntax examples" than pure business use cases, but
each should be easily backed into a persona or business use case as necessary.
Governance Queries
List out all assets (usually columns) that have not yet been assigned a Term
List out all assets (usually columns) that have not yet been assigned a Steward
List out all assets (any kind) that are being managed by Steward <steward>
List out all assets that have been modified since <date>
List out all assets that are in a particular state (such as "Draft", where
"workflow" in IGC has been implemented)
List out all assets based on their time remaining in a particular state ("all
terms in draft for more than <n> days")
List out all assets (usually Terms) where property <property> is null
[similar to where property <property> is <value>, but called out here
specifically because it is a common "management level" governance query]
List out all assets (usually Terms) where relationship <relationship> is null
Research Queries
Often run by an individual user or data researcher, and sometimes by
developers, often exploiting a "lineage" relationship
List out all assets <a specific type> where property <property> is <value>
[string, between, equal_to, etc.]
List out all assets <relationship, such as "owned by"> <steward>
Show all assets "written by" <name of process or other data-mover kind of
asset>. For Atlas in its current form, this might be the name of a Sqoop process
For <asset> (type and name), show immediate upstream asset (and properties of
that asset...last time it ran, status code, etc.)
Show a Term and all of its "history" (particularly important for comments by
reviewers over time)
Various complex "set" retrievals, qualified by existence of a particular
instance, such as "dump out all database/table/column details for every
database that contains a schema called <schemaName>" [at times, the qualifier is
just "if it exists" as a child, but all children are still dumped, possibly
requiring multiple requests or additional filtering against the final returned
list]
List all transformations (and their sources/targets/processes) where
nullability was changed for a column from null to "not null" [that is a
specific example, but the same could apply to datatype changes, column name
changes, specific mappings or functions, etc.]
Requests that exploit multiple relationships for qualification...such as "list
all tables that have a Steward...but only for Stewards who also manage/own
assets in the Risk Collection"
> Self-Service Catalog Search and Data Preview
> --------------------------------------------
>
> Key: ATLAS-1765
> URL: https://issues.apache.org/jira/browse/ATLAS-1765
> Project: Atlas
> Issue Type: New Feature
> Components: atlas-webui
> Affects Versions: 0.9-incubating
> Reporter: Mandy Chessell
> Assignee: Mandy Chessell
> Labels: Self-Service-UIs, VirtualDataConnector
>
> This JIRA covers the development of the catalog search and preview of data
> for data scientists and business users. It supports the search of the Atlas
> metadata repository, display of search results, additional filtering and
> drill down into details of the data sources, including a data preview option
> if the end user has access permission.