[jira] [Comment Edited] (ATLAS-1765) Self-Service Catalog Search and Data Preview

ernie ostic (JIRA) Mon, 08 May 2017 14:31:32 -0700

    [ 
https://issues.apache.org/jira/browse/ATLAS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000621#comment-16000621
 ]


ernie ostic edited comment on ATLAS-1765 at 5/8/17 9:30 PM:
------------------------------------------------------------

Initial thoughts on search and query usage types, based on various use cases 
that are often seen with InfoSphere Information Governance Catalog (IGC).

Search/Queries against the repository.   

Here are various patterns we see frequently with IGC.  The categories below 
are loose, but they correspond to the user's objective, their level of 
experience with the tool, and whether they are in the role of "governance team" 
or "regular enterprise user".  They are broken up here just to aid further 
discussion.  Each of these comes up in three "access" modes, fairly equally: 
(1) online, using the GUI; (2) batch, via the command line, for extraction to a 
.csv or other export file structure; (3) via the REST API.  Each mode also 
typically allows a list of "properties" (name, description, internal 
identifier, date created, etc.) to be selected along with each "asset".
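As a rough illustration of the batch mode with property selection, here is a 
minimal Python sketch.  It assumes assets are plain dictionaries; the function 
name export_csv, the sample assets and the property names are all hypothetical 
and are not part of the IGC or Atlas APIs.

```python
import csv
import io


def export_csv(assets, properties):
    """Write only the selected properties of each asset as CSV text.

    `assets` is a list of plain dicts (a hypothetical stand-in for
    repository results); properties missing on an asset become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=properties, lineterminator="\n")
    writer.writeheader()
    for asset in assets:
        writer.writerow({p: asset.get(p, "") for p in properties})
    return buffer.getvalue()


# Hypothetical sample data, for illustration only.
assets = [
    {"name": "CUST_ID", "description": "Customer key", "id": "a1",
     "created": "2017-05-01"},
    {"name": "CUST_NM", "id": "a2", "created": "2017-03-15"},
]

print(export_csv(assets, ["name", "description", "created"]))
```

The same projection step would apply whichever access mode delivers the 
result; only the output format changes.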

This is more a listing of "syntax examples" than pure business use cases, but 
each should map back easily to a persona or business use case as necessary. 


Governance Queries

These are queries that are typically run by the "governance team" or by those 
with management responsibility for the governance project, and they are often 
the basis for a governance dashboard or other reporting system that measures 
progress towards governance objectives.  A site may have a target that "all 
assets in the data lake" be fully governed by "a future target date".  These 
kinds of queries help measure progress and move governance teams closer to 
that goal.  

The "list all" kinds of queries are also often the source of a "count" so that 
results can be graphed on the dashboard.  Searches like (7) below are typically 
"validation" queries, sometimes issued in real-time to enforce completeness in 
the repository, or perhaps nightly to reject content or force "re-review" for 
things that are incomplete.    The same "list all" kinds of queries often feed 
resulting answer sets into a batch update tool that might perform assignments 
or property updates in bulk, perhaps in an offline window.

Ultimately, measuring progress and completeness for asset definitions and their 
relationships is about maximizing the value of the repository --- from 
providing more-easily-understood names for technical assets, through 
establishing responsibility for enterprise assets and their data quality.

1.  List out all assets (usually columns) that have not yet been assigned a Term
2.  List out all assets (usually columns) that have not yet been assigned a 
Steward
3.  List out all assets (any kind) that are being managed by Steward <steward>
4.  List out all assets that have been modified since <date>
5.  List out all assets that are in a particular state (such as "Draft", where 
"workflow" in IGC has been implemented)
6.  List out all assets based on how long they have remained in a particular 
state ("all terms in draft for more than <n> days")
7.  List out all assets (usually Terms) where property <property> is null 
[similar to where property <property> is <value>, but called out here 
specifically because it is a common "management level" governance query]
8.  List out all assets (usually Terms) where relationship <relationship> is 
null  
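To make the shape of these queries concrete, here is a minimal Python sketch 
of queries 1, 3, 4, 6 and 7 as filters over an in-memory list.  Everything in 
it is an assumption made for illustration: the Asset class, its field names 
and the sample catalog are hypothetical and do not reflect the Atlas or IGC 
data models.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class Asset:
    # Hypothetical, simplified asset record; not the Atlas or IGC model.
    name: str
    asset_type: str                        # e.g. "column" or "term"
    term: Optional[str] = None             # assigned Term, if any
    steward: Optional[str] = None          # assigned Steward, if any
    state: str = "Published"               # workflow state, e.g. "Draft"
    state_since: date = date(2017, 1, 1)   # when the asset entered its state
    modified: date = date(2017, 1, 1)      # last modification date
    properties: dict = field(default_factory=dict)


def without_term(assets, asset_type="column"):
    """Query 1: assets of one type that have no Term assigned."""
    return [a for a in assets if a.asset_type == asset_type and a.term is None]


def managed_by(assets, steward):
    """Query 3: assets of any kind managed by the given Steward."""
    return [a for a in assets if a.steward == steward]


def modified_since(assets, since):
    """Query 4: assets modified on or after the given date."""
    return [a for a in assets if a.modified >= since]


def in_state_longer_than(assets, state, days, today):
    """Query 6: assets that entered `state` more than `days` days ago."""
    cutoff = today - timedelta(days=days)
    return [a for a in assets if a.state == state and a.state_since < cutoff]


def property_is_null(assets, prop, asset_type="term"):
    """Query 7: assets of one type where a property is missing or None."""
    return [a for a in assets
            if a.asset_type == asset_type and a.properties.get(prop) is None]


# Hypothetical sample catalog, for illustration only.
catalog = [
    Asset("CUST_ID", "column", term="Customer Identifier", steward="alice",
          modified=date(2017, 5, 1)),
    Asset("CUST_NM", "column", modified=date(2017, 3, 15)),
    Asset("Churn Rate", "term", steward="bob", state="Draft",
          state_since=date(2017, 2, 1), properties={"definition": None}),
]

today = date(2017, 5, 8)
stale_drafts = in_state_longer_than(catalog, "Draft", 30, today)
print(len(without_term(catalog)), [a.name for a in stale_drafts])
```

Taking len() of any of these lists gives the "count" that would feed a 
dashboard, and the lists themselves could be handed to a bulk update step.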



Research Queries

These are often run by an individual user or data researcher, and sometimes by 
developers.  They often exploit a "lineage" relationship, usually with a 
particular goal in mind for "that asset" or "that steward", such as finding 
the lineage for "one particular" report.

9.  List out all assets <of a specific type> where property <property> is 
<value> [string, between, equal_to, etc.]
10.  List out all assets <relationship, such as "owned by"> <steward> 
11.  Show all assets "written by" <name of process or other data-mover kind of 
asset>.  For Atlas in its current form, this might be the name of a Sqoop 
process
12.  For <asset> (type and name), show immediate upstream asset (and properties 
of that asset...last time it ran, status code, etc.)
13.  Show a Term and all of its "history" (particularly important for comments 
by reviewers over time)
14.  Various complex "set" retrievals, qualified by the existence of a 
particular instance, such as "dump out all database/table/column details for 
every database that contains a schema called <schemaName>" [at times, the 
qualifier is just "if it exists" as a child but still dump all children, 
possibly requiring multiple requests or additional filtering against the final 
returned list]
15.  List all transformations (and their sources/targets/processes) where 
nullability was changed for a column from null to "not null"  [that is a 
specific example, but the same could apply to datatype changes, column name 
changes, specific mappings or functions, etc.]
16.  Requests that exploit multiple relationships for qualification...such as 
"list all tables that have a Steward...but only for Stewards who also 
manage/own assets in the Risk Collection" 
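The relationship-driven queries above (11, 12 and 16, plus the "lineage for 
one particular report" case) can be sketched in Python over a small in-memory 
graph.  This is only an illustration under stated assumptions: the edge list, 
the steward and collection tables, and all asset and process names are 
hypothetical and are not taken from Atlas or IGC.

```python
from collections import deque

# Hypothetical lineage edges: (source asset, process, target asset).
LINEAGE = [
    ("crm.customers", "sqoop_import_customers", "lake.customers_raw"),
    ("lake.customers_raw", "clean_customers", "lake.customers"),
    ("lake.customers", "build_churn_report", "reports.churn"),
]

# Hypothetical relationships: Steward per asset, and collection membership.
STEWARD_OF = {
    "lake.customers": "alice",
    "lake.orders": "bob",
    "risk.exposures": "alice",
}
COLLECTIONS = {"Risk": {"risk.exposures"}}


def written_by(process):
    """Query 11: assets written by the given process."""
    return [target for _, proc, target in LINEAGE if proc == process]


def immediate_upstream(asset):
    """Query 12: (upstream asset, process) pairs one hop before `asset`."""
    return [(src, proc) for src, proc, target in LINEAGE if target == asset]


def full_upstream(asset):
    """All upstream assets of `asset`, nearest first (breadth-first walk)."""
    seen, order, queue = {asset}, [], deque([asset])
    while queue:
        current = queue.popleft()
        for src, _ in immediate_upstream(current):
            if src not in seen:
                seen.add(src)
                order.append(src)
                queue.append(src)
    return order


def assets_of_stewards_in_collection(collection):
    """Query 16: stewarded assets, limited to Stewards who also manage
    at least one asset in the given collection."""
    members = COLLECTIONS.get(collection, set())
    qualifying = {s for a, s in STEWARD_OF.items() if a in members}
    return sorted(a for a, s in STEWARD_OF.items() if s in qualifying)


print(written_by("sqoop_import_customers"))
print(full_upstream("reports.churn"))
print(assets_of_stewards_in_collection("Risk"))
```

Query 16 needs two passes over the relationship data (first find the 
qualifying Stewards, then their assets), which is the kind of multi-step 
qualification a single property filter cannot express.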







> Self-Service Catalog Search and Data Preview
> --------------------------------------------
>
>                 Key: ATLAS-1765
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1765
>             Project: Atlas
>          Issue Type: New Feature
>          Components: atlas-webui
>    Affects Versions: 0.9-incubating
>            Reporter: Mandy Chessell
>            Assignee: Mandy Chessell
>              Labels: Self-Service-UIs, VirtualDataConnector
>
> This JIRA covers the development of the catalog search and preview of data 
> for data scientists and business users.  It supports the search of the Atlas 
> metadata repository, display of search results, additional filtering and 
> drill down into details of the data sources, including a data preview option 
> if the end user has access permission.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
