[
https://issues.apache.org/jira/browse/ATLAS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000621#comment-16000621
]
ernie ostic edited comment on ATLAS-1765 at 5/10/17 1:16 PM:
-------------------------------------------------------------
Initial thoughts on search and query types, based on various use cases that
are often seen with InfoSphere Information Governance Catalog (IGC).
Search/Queries against the repository.
Here are various patterns we see frequently with IGC. The categories below
are loose, but correspond to the user's objective, their level of
experience with the tool, and whether they are in the role of "governance team"
vs "regular enterprise user". Breaking them up here just to aid with further
discussion. Each of these comes up in three "access" modes, fairly equally:
(1) online, using the GUI (2) batch, via command line, for extraction to a
.csv or other export file structure (3) via REST API. Each also typically
allows a list of "properties" to be simply selected along with said "entity"
(name, description, internal identifier, date created, etc.)
This is more a listing of "syntax examples" than pure business use cases, but
each should be easily backed into a persona or business use case as necessary.
Note: It is expected that most of the search use cases below are (nearly
always) restricted by some definition of "scope" --- a "department", an
"owner", a particular database or region, a specific schema, a "category" of
Terms, etc.
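
As a rough illustration of the "scope" restriction and the selectable
"properties" described above, here is a minimal Python sketch. It runs against
a hypothetical in-memory list of entities, not against IGC or Atlas; the field
names (type, schema, created, ...) and the helper names search() and to_csv()
are assumptions made only for this example.

```python
import csv
import io

# Hypothetical in-memory entities; field names are illustrative only.
ENTITIES = [
    {"type": "column", "name": "CUST_ID", "schema": "SALES",
     "description": "Customer key", "created": "2017-03-01"},
    {"type": "column", "name": "ORDER_TS", "schema": "SALES",
     "description": "", "created": "2017-04-12"},
    {"type": "column", "name": "EMP_ID", "schema": "HR",
     "description": "Employee key", "created": "2017-02-20"},
]


def search(entities, scope, properties):
    """Return the selected properties of every entity matching all scope pairs."""
    return [
        {p: e.get(p) for p in properties}
        for e in entities
        if all(e.get(k) == v for k, v in scope.items())
    ]


def to_csv(rows, properties):
    """Render result rows as CSV text (the "batch" export access mode)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=properties, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Scope the search to one schema and select two properties.
rows = search(ENTITIES, scope={"schema": "SALES"}, properties=["name", "created"])
print(to_csv(rows, ["name", "created"]))
```

The same search() result could just as well back a GUI list or a REST
response; only the rendering step differs between the three access modes.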
Governance Queries
These are queries that are typically done by the "governance team" or by those
with management responsibility for the governance project -- and are often the
chassis for a governance dashboard or other reporting system that measures
progress towards governance objectives. A site may have a target that "all
entities in the data lake" be fully governed by "a future target date". These
kinds of queries help measure and move governance teams closer to that goal.
The "list all" kinds of queries are also often the source of a "count" so that
results can be graphed on the dashboard. Searches like (7) below are typically
"validation" queries, sometimes issued in real-time to enforce completeness in
the repository, or perhaps nightly to reject content or force "re-review" for
things that are incomplete. The same "list all" kinds of queries often feed
resulting answer sets into a batch update tool that might perform assignments
or property updates in bulk, perhaps in an offline window.
Ultimately, measuring progress and completeness for entity definitions and
their relationships is about maximizing the value of the repository --- from
providing more-easily-understood names for technical entities, through
establishing responsibility for enterprise entities and their data quality.
1. List out all entities (usually columns) that have not yet been assigned a
Term
2. List out all entities (usually columns) that have not yet been assigned a
Steward [*]
3. List out all entities (any kind) that are being managed by Steward <steward>
4. List out all entities that have been modified since <date>
5. List out all entities that are in a particular state (such as "Draft" [*],
where "workflow" [*] in IGC has been implemented)
6. List out all entities based on their time remaining in a particular state
("all terms in draft for more than <n> days")
7. List out all entities (usually Terms) where property <property> is null
[similar to where property <property> is <value>, but called out here
specifically because it is a common "management level" governance query]
8. List out all entities (usually Terms) where relationship <relationship> is
null
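
To make a few of the governance patterns above concrete (the "is null" checks
in 1, 2, 7 and 8, and the time-in-state check in 6), here is a small Python
sketch over hypothetical in-memory records. The field names term, steward,
state and state_since, and both helper functions, are assumptions for
illustration; they are not IGC or Atlas APIs. The lengths of the result lists
are the "counts" a dashboard might graph.

```python
from datetime import date

# Hypothetical records; all field names are illustrative only.
ENTITIES = [
    {"type": "column", "name": "CUST_ID", "term": "Customer Identifier",
     "steward": "alice", "state": "Published", "state_since": date(2017, 1, 10)},
    {"type": "column", "name": "ORDER_TS", "term": None,
     "steward": None, "state": "Published", "state_since": date(2017, 2, 1)},
    {"type": "term", "name": "Net Revenue", "term": None,
     "steward": "bob", "state": "Draft", "state_since": date(2017, 3, 1)},
    {"type": "term", "name": "Churn Rate", "term": None,
     "steward": None, "state": "Draft", "state_since": date(2017, 5, 5)},
]


def where_null(entities, field, entity_type=None):
    """Patterns 1, 2, 7, 8: entities whose property or relationship is unset."""
    return [
        e for e in entities
        if e.get(field) is None
        and (entity_type is None or e["type"] == entity_type)
    ]


def in_state_longer_than(entities, state, days, today):
    """Pattern 6: entities that have been in `state` for more than `days` days."""
    return [
        e for e in entities
        if e["state"] == state and (today - e["state_since"]).days > days
    ]


today = date(2017, 5, 10)
no_term = where_null(ENTITIES, "term", entity_type="column")
no_steward = where_null(ENTITIES, "steward")
stale_drafts = in_state_longer_than(ENTITIES, "Draft", 30, today)

# Counts of this kind are what a governance dashboard would typically plot.
print(len(no_term), len(no_steward), len(stale_drafts))
```

The same result lists could also be handed to a bulk update step, as described
earlier for batch assignment of Terms or Stewards.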
Research Queries
Often by an individual user, data research person....sometimes also performed
by developers, often exploiting a "lineage" relationship --- often with a
particular goal in mind for "that entity" or "that steward" --- such as finding
the lineage for "one particular" report, or process.
9. List out all entities <a specific type> where property <property> is
<value> [string, between, equal_to, etc.]
10. List out all entities <relationship, such as "owned by"> <steward>
11. Show all entities "written by" <name of process or other data-mover kind
of asset>. For Atlas in its current form, this might be the name of a SQOOP
process
12. For <entity> (type and name), show immediate upstream entity (and
properties of that entity...last time it ran, status code, etc.)
13. Show a Term and all of its "history" (particularly important for comments
by reviewers over time)
14. Various complex "set" retrievals, qualified by existence of a particular
instance...such as "dump out all database/table/column details for every
database that contains a schema called <schemaName>" [at times, the qualifier is
just "if it exists" as a child but still dump all children....possibly
requiring multiple requests or additional filtering against the final returned
list]
15. List all transformations (and their sources/targets/processes) where
nullability was changed for a column from null to "not null". [that is a
specific example, but could exist for datatype changes, column name changes,
specific mappings or functions, etc.]
16. Requests that exploit multiple relationships for qualification...such as
"list all tables that have a Steward...but only for Stewards who also
manage/own assets in the Risk Collection"
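
Three of the research patterns above (11, 12 and 16) can be sketched in Python
against a hypothetical lineage edge list. Everything here is an assumption for
illustration: the (source, process, target) edge shape, the sample asset and
process names, and the STEWARD_OF and COLLECTIONS lookups are not taken from
IGC or Atlas.

```python
# Hypothetical lineage edges: (source entity, process, target entity).
LINEAGE = [
    ("raw.orders", "sqoop_import_orders", "staging.orders"),
    ("staging.orders", "etl_build_mart", "mart.sales"),
    ("staging.customers", "etl_build_mart", "mart.sales"),
]
# Hypothetical process properties (last run date, status code).
PROCESS_INFO = {
    "sqoop_import_orders": {"last_run": "2017-05-09", "status": 0},
    "etl_build_mart": {"last_run": "2017-05-10", "status": 0},
}
# Hypothetical steward assignments and collection membership.
STEWARD_OF = {"staging.orders": "alice", "mart.sales": "bob", "risk.exposure": "alice"}
COLLECTIONS = {"Risk": {"risk.exposure"}}


def written_by(process):
    """Pattern 11: all entities written by the named process."""
    return sorted({tgt for _, proc, tgt in LINEAGE if proc == process})


def immediate_upstream(entity):
    """Pattern 12: direct upstream entities plus properties of the feeding process."""
    return [
        {"source": src, "process": proc, **PROCESS_INFO[proc]}
        for src, proc, tgt in LINEAGE
        if tgt == entity
    ]


def stewarded_by_collection_owners(collection):
    """Pattern 16: entities whose Steward also owns an asset in `collection`."""
    qualifying = {
        steward for asset, steward in STEWARD_OF.items()
        if asset in COLLECTIONS[collection]
    }
    return sorted(a for a, s in STEWARD_OF.items() if s in qualifying)


print(written_by("sqoop_import_orders"))
print(immediate_upstream("mart.sales"))
print(stewarded_by_collection_owners("Risk"))
```

Pattern 16 is resolved here in two passes (first the qualifying Stewards, then
their assets), which mirrors the "multiple requests or additional filtering"
remark made for the complex set retrievals.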
[*] ...the concepts of "Steward" and "Workflow" are not yet defined or
proposed for Atlas but are expected to be the subject of a future JIRA.
== placeholder == Search use cases for "Searching for Relationships"
> Self-Service Catalog Search and Data Preview
> --------------------------------------------
>
> Key: ATLAS-1765
> URL: https://issues.apache.org/jira/browse/ATLAS-1765
> Project: Atlas
> Issue Type: New Feature
> Components: atlas-webui
> Affects Versions: 0.9-incubating
> Reporter: Mandy Chessell
> Assignee: Mandy Chessell
> Labels: Self-Service-UIs, VirtualDataConnector
>
> This JIRA covers the development of the catalog search and preview of data
> for data scientists and business users. It supports the search of the Atlas
> metadata repository, display of search results, additional filtering and
> drill down into details of the data sources, including a data preview option
> if the end user has access permission.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)