[jira] [Comment Edited] (ATLAS-1765) Self-Service Catalog Search and Data Preview

ernie ostic (JIRA) Tue, 09 May 2017 11:58:16 -0700

    [ 
https://issues.apache.org/jira/browse/ATLAS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000621#comment-16000621
 ]


ernie ostic edited comment on ATLAS-1765 at 5/9/17 6:57 PM:
------------------------------------------------------------

Initial thoughts on search and query use types, based on various use cases that 
are often seen with InfoSphere Information Governance Catalog (IGC).

Search/Queries against the repository.   

Here are various patterns we see frequently with IGC.   The categories below 
are loose, but they correspond to the user's objective, their level of 
experience with the tool, and whether they are in the role of "governance team" 
vs "regular enterprise user".   Breaking them up here just to aid with further 
discussion.     Each of these comes up in three "access" modes, fairly equally: 
(1) online, using the GUI  (2) batch, via command line, for extraction to a 
.csv or other export file structure (3) via REST API.    Each also typically 
allows a list of "properties" to be selected for return along with said 
"entity" (name, description, internal identifier, date created, etc.)   
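
As a rough illustration of the "select a list of properties, then export" 
pattern in the batch access mode, here is a minimal Python sketch.  It runs 
against a hypothetical in-memory list of entities; the field names and the 
select_properties / to_csv helpers are assumptions made for this example and 
are not part of IGC or Atlas.

```python
import csv
import io

# Hypothetical in-memory entities; the field names are illustrative only.
ENTITIES = [
    {"id": "c1", "type": "column", "name": "CUST_ID",
     "description": "Customer key", "created": "2017-01-15"},
    {"id": "t1", "type": "term", "name": "Customer",
     "description": "A buying party", "created": "2016-11-02"},
]


def select_properties(entities, properties):
    """Return one dict per entity containing only the requested properties."""
    return [{p: e.get(p) for p in properties} for e in entities]


def to_csv(rows, properties):
    """Serialize the selected rows as CSV text (the batch/export access mode)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=properties, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


wanted = ["name", "type", "created"]
rows = select_properties(ENTITIES, wanted)
csv_text = to_csv(rows, wanted)
```

The same selected rows could just as easily be rendered in a GUI table or 
returned as JSON from a REST call; only the last step differs per access mode.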

This is more a listing of "syntax examples" than pure business use cases, but 
each should be easily mapped back to a persona or business use case as 
necessary. 

Note:  It is expected that most of the search use cases below are (nearly 
always) restricted by some definition of "scope" --- a "department", an 
"owner"....a particular database or region, a specific schema, a "category" of 
Terms, etc.
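
One way to picture the "scope" restriction is as a filter applied before any 
of the queries below.  The sketch here is only an assumption about how scope 
might be modelled (a dict of property/value pairs that must all match); the 
entity fields and the in_scope / scoped helpers are hypothetical.

```python
# Hypothetical entities; "database", "schema" and "owner" stand in for
# whatever scoping properties a real repository exposes.
ENTITIES = [
    {"name": "CUST_ID", "type": "column", "database": "SALES",
     "schema": "PUBLIC", "owner": "finance"},
    {"name": "ORDER_ID", "type": "column", "database": "SALES",
     "schema": "STAGING", "owner": "finance"},
    {"name": "EMP_ID", "type": "column", "database": "HR",
     "schema": "PUBLIC", "owner": "hr"},
]


def in_scope(entity, scope):
    """True when every key in `scope` matches the entity's value for that key."""
    return all(entity.get(key) == value for key, value in scope.items())


def scoped(entities, scope):
    """Restrict a candidate set to the given scope; an empty scope keeps all."""
    return [e for e in entities if in_scope(e, scope)]


sales_public = scoped(ENTITIES, {"database": "SALES", "schema": "PUBLIC"})
```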



Governance Queries

These are queries that are typically done by the "governance team" or by those 
with management responsibility for the governance project -- and are often the 
chassis for a governance dashboard or other reporting system that measures 
progress towards governance objectives. A site may have a target that "all 
entities in the data lake" be fully governed by "a future target date".  These 
kinds of queries help measure and move governance teams closer to that goal.  

The "list all" kinds of queries are also often the source of a "count" so that 
results can be graphed on the dashboard.  Searches like (7) below are typically 
"validation" queries, sometimes issued in real-time to enforce completeness in 
the repository, or perhaps nightly to reject content or force "re-review" for 
things that are incomplete.    The same "list all" kinds of queries often feed 
resulting answer sets into a batch update tool that might perform assignments 
or property updates in bulk, perhaps in an offline window.
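
The "count" and "validation" ideas above can be sketched in a few lines of 
Python.  This is an illustration only: the TERMS data is made up, the helper 
names are hypothetical, and treating an empty string the same as null is an 
assumption that a real repository may not share.

```python
from collections import Counter

# Hypothetical Terms; only the properties needed for the example are shown.
TERMS = [
    {"name": "Customer", "description": "A buying party", "state": "Published"},
    {"name": "Account", "description": None, "state": "Draft"},
    {"name": "Region", "description": "", "state": "Draft"},
]


def missing_property(entities, prop):
    """Validation-style query: entities whose property is null or empty.

    Treating "" like null is an assumption made for this sketch.
    """
    return [e for e in entities if e.get(prop) in (None, "")]


def completeness(entities, prop):
    """Fraction of entities with a non-empty value for prop (dashboard metric)."""
    if not entities:
        return 1.0
    return 1 - len(missing_property(entities, prop)) / len(entities)


# "List all" result that could feed a nightly re-review or a bulk update tool.
incomplete = missing_property(TERMS, "description")

# Counts suitable for graphing on a dashboard.
counts_by_state = Counter(t["state"] for t in TERMS)
```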

Ultimately, measuring progress and completeness for entity definitions and 
their relationships is about maximizing the value of the repository --- from 
providing more-easily-understood names for technical entities, through 
establishing responsibility for enterprise entities and their data quality.

1.  List out all entities (usually columns) that have not yet been assigned a 
Term
2.  List out all entities (usually columns) that have not yet been assigned a 
Steward
3.  List out all entities (any kind) that are being managed by Steward 
<steward>*
4.  List out all entities that have been modified since <date>
5.  List out all entities that are in a particular state (such as "Draft", 
where "workflow"* in IGC has been implemented)
6.  List out all entities based on their time remaining in a particular state 
("all terms in draft for more than <n> days")
7.  List out all entities (usually Terms) where property <property> is null   
[similar to where property <property> is <value>, but called out here 
specifically because it is a common "management level" governance query]
8.  List out all entities (usually Terms) where relationship <relationship> is 
null  
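
To make the governance queries above a little more concrete, here is a small 
Python sketch of several of them as filters over in-memory data.  Everything 
in it is assumed for illustration: the COLUMNS and TERMS records, the 
"steward", "state" and "state_since" fields, and the helper names.  It does 
not reflect the IGC or Atlas query syntax.

```python
from datetime import date

# Hypothetical columns and terms with just enough properties for the queries.
COLUMNS = [
    {"name": "CUST_ID", "term": "Customer", "steward": "alice",
     "modified": date(2017, 5, 1)},
    {"name": "ACCT_NO", "term": None, "steward": "alice",
     "modified": date(2017, 3, 10)},
    {"name": "REGION_CD", "term": None, "steward": None,
     "modified": date(2017, 4, 20)},
]
TERMS = [
    {"name": "Customer", "state": "Published", "state_since": date(2017, 1, 5)},
    {"name": "Account", "state": "Draft", "state_since": date(2017, 2, 1)},
    {"name": "Region", "state": "Draft", "state_since": date(2017, 5, 1)},
]


def unassigned(entities, relationship):
    """Queries 1, 2 and 8: entities whose relationship is not populated."""
    return [e for e in entities if e.get(relationship) is None]


def managed_by(entities, steward):
    """Query 3: entities managed by a given steward."""
    return [e for e in entities if e.get("steward") == steward]


def modified_since(entities, since):
    """Query 4: entities modified on or after a date."""
    return [e for e in entities if e["modified"] >= since]


def in_state_longer_than(entities, state, days, today):
    """Queries 5 and 6: entities that have sat in a state for more than n days."""
    return [e for e in entities
            if e["state"] == state and (today - e["state_since"]).days > days]


today = date(2017, 5, 9)
no_term = unassigned(COLUMNS, "term")
no_steward = unassigned(COLUMNS, "steward")
alices = managed_by(COLUMNS, "alice")
recent = modified_since(COLUMNS, date(2017, 4, 1))
stale_drafts = in_state_longer_than(TERMS, "Draft", 30, today)
```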



 Research Queries

These are often performed by an individual user or data researcher, and 
sometimes by developers.  They frequently exploit a "lineage" relationship, 
usually with a particular goal in mind for "that entity" or "that steward", 
such as finding the lineage for "one particular" report or process.

9.  List out all entities <a specific type> where property <property> is 
<value> [string match, between, equal_to, etc.]
10.  List out all entities <relationship, such as "owned by"> <steward> 
11.  Show all entities "written by" <name of process or other data-mover kind 
of asset>.   For Atlas in its current form, this might be the name of a Sqoop 
process
12.  For <entity> (type and name), show immediate upstream entity (and 
properties of that entity...last time it ran, status code, etc.)
13.  Show a Term and all of its "history" (particularly important for comments 
by reviewers over time)
14.  Various complex "set" retrievals, qualified by existence of a particular 
instance...such as "dump out all database/table/column details for every 
database that contains a schema called <schemaName>" [at times, the qualifier 
is just "if it exists" as a child but still dump all children....possibly 
requiring multiple requests or additional filtering against the final returned 
list]
15.  List all transformations (and their sources/targets/processes) where 
nullability was changed for a column from null to "not null".   [that is a 
specific example, but the same could apply to datatype changes, column name 
changes, specific mappings or functions, etc.]
16.  Requests that exploit multiple relationships for qualification...such as 
"list all tables that have a Steward...but only for Stewards who also 
manage/own assets in the Risk Collection" 
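
A few of the research queries above (11, 12 and 16) can be sketched against a 
toy lineage graph.  This is a sketch under stated assumptions, not a proposal 
for the Atlas API: the edge list, the process names, the PROCESS_INFO 
properties, the "Risk" collection contents and all helper names are 
hypothetical.

```python
# Hypothetical lineage edges: (source entity, process, target entity).
LINEAGE = [
    ("raw.orders", "sqoop_import_orders", "stage.orders"),
    ("stage.orders", "etl_clean_orders", "mart.orders"),
    ("raw.customers", "sqoop_import_customers", "stage.customers"),
]
# Hypothetical run-time properties of each process.
PROCESS_INFO = {
    "etl_clean_orders": {"last_run": "2017-05-08", "status": 0},
}
# Hypothetical tables and a collection of assets, each with an optional steward.
TABLES = [
    {"name": "mart.orders", "steward": "bob"},
    {"name": "stage.orders", "steward": "alice"},
    {"name": "raw.orders", "steward": None},
]
COLLECTIONS = {
    "Risk": [
        {"name": "risk_report", "steward": "bob"},
        {"name": "risk_scratch", "steward": None},
    ],
}


def written_by(process):
    """Query 11: entities written by a given process."""
    return sorted(t for _, p, t in LINEAGE if p == process)


def immediate_upstream(entity):
    """Query 12: direct sources of an entity, with the feeding process's info."""
    return [(s, p, PROCESS_INFO.get(p, {})) for s, p, t in LINEAGE if t == entity]


def tables_with_steward_in_collection(tables, collections, collection):
    """Query 16: tables whose steward also manages assets in a collection."""
    stewards = {a["steward"] for a in collections.get(collection, [])
                if a.get("steward")}
    return [t["name"] for t in tables if t.get("steward") in stewards]


sqoop_targets = written_by("sqoop_import_orders")
upstream = immediate_upstream("mart.orders")
risk_tables = tables_with_steward_in_collection(TABLES, COLLECTIONS, "Risk")
```

Query 16 is the interesting one here: the qualification runs through a second 
relationship (steward to collection), so it takes two passes over the data 
rather than one simple property filter.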

* ...the concepts of "Steward" and "Workflow" are not yet defined or proposed 
for Atlas but are expected to be the subject of a future JIRA.

 == placeholder ==  Search use cases for "Searching for Relationships"







> Self-Service Catalog Search and Data Preview
> --------------------------------------------
>
>                 Key: ATLAS-1765
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1765
>             Project: Atlas
>          Issue Type: New Feature
>          Components: atlas-webui
>    Affects Versions: 0.9-incubating
>            Reporter: Mandy Chessell
>            Assignee: Mandy Chessell
>              Labels: Self-Service-UIs, VirtualDataConnector
>
> This JIRA covers the development of the catalog search and preview of data 
> for data scientists and business users.  It supports the search of the Atlas 
> metadata repository, display of search results, additional filtering and 
> drill down into details of the data sources, including a data preview option 
> if the end user has access permission.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
