[ 
https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493356#comment-15493356
 ] 

Dan Markwat edited comment on ATLAS-1161 at 9/15/16 1:46 PM:
-------------------------------------------------------------

[~davidrad], yes this is involved very much in large ETL pipelines.  Other 
times, it may be that someone simply needs to throw the table away - as part of 
debugging, trying new things, break-fix, etc.  In any event within a short span 
of time (days to months) - for anyone using Hive as a data warehouse or shared, 
structured storage tool in any way - what I've seen so far is that nobody will 
likely be creating *and tagging* a table within a database with the same name 
as a table a few days prior without the two tables being closely related in 
some way (but I could totally be wrong with this assertion).  This is certainly 
true for the use cases I've been exposed to in data warehousing type setups for 
Hive.  This may not be true in all cases, but it lends weight to the idea that 
we at least need some way of keeping these tags applied to named objects after 
the object has been dropped and recreated, and of putting the responsibility 
for keeping these tags up to date in the hands of a user.  Does that sound 
right to you?

Having the ETL jobs be responsible for tagging appears to defeat the perceived 
purpose of Atlas which would be to persist the tags and become a book of record 
for them.  (This could also be a perception problem or misunderstanding of the 
purpose of tagging on our part!).  If a job is responsible for tagging, all 
jobs collectively become the definitive record of what is tagged whereas Atlas 
merely reports that at one time or another something was tagged which is not 
quite what we're looking for...but I'm open to suggestions about alternate ways 
to do this!

Trying to play devil's advocate for myself and maybe see if what we're doing is 
a better fit elsewhere, I had a tangential thought:
- the business catalog/taxonomy: how is that structured?  Do those remain 
tagged to new incarnations of objects after the originally-tagged object is 
dropped?  Do they get pushed to Ranger as "tags" (/ can they be if they do 
persist?)

[~shwethags], yup I know the tags are kept and there are only "soft deletes" 
with new entities and GUIDs being created every time; but the people I work 
with and I are interested in a notion of more existential tagging that isn't 
tied to a (potentially) volatile object.  We are also trying to find a unified 
path to auditing anything that shares a name across multiple incarnations, 
which Atlas currently doesn't appear to have.  We would query for all the 
objects with a given name, then audit them individually.  Maybe that's the 
only way there ever will be to do it?  Do you have other suggestions?
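
To make the "audit everything that shares a name" idea concrete, here is a 
minimal sketch in plain Python.  It is not an Atlas API: the AuditEvent type, 
the audit_trail_by_name function, and the sample GUIDs and names are all 
hypothetical, and it only assumes that each audit event carries the entity's 
GUID and its qualified name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical audit record; field names are assumptions, not Atlas's."""
    guid: str            # changes with every incarnation of the object
    qualified_name: str  # stays the same across incarnations
    timestamp: int
    action: str


def audit_trail_by_name(events, qualified_name):
    """Merge the events of every incarnation of one name, oldest first."""
    matching = (e for e in events if e.qualified_name == qualified_name)
    return sorted(matching, key=lambda e: e.timestamp)


# Made-up sample: two incarnations of the same table plus an unrelated one.
events = [
    AuditEvent("guid-2", "default.tablea", 30, "ENTITY_CREATE"),
    AuditEvent("guid-1", "default.tablea", 10, "ENTITY_CREATE"),
    AuditEvent("guid-9", "default.other", 15, "ENTITY_CREATE"),
    AuditEvent("guid-1", "default.tablea", 20, "ENTITY_DELETE"),
]
trail = audit_trail_by_name(events, "default.tablea")
```

The result is a single timeline keyed by name, which is the unified view 
described above, instead of one timeline per GUID.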

And yes, rule-based classification (similar to Ranger) would likely be totally 
acceptable; whatever decouples the tagging from the lifecycle of the object.  I 
do expect there are some people who want to use tagging in a more volatile 
manner, but from what I've seen totally user-driven tagging/untagging might 
work for most use cases.  And could it be that anyone needing a fresh slate for 
every incarnation of an object could simply "reset" the tags applied to it 
(perhaps?) and then proceed as they currently would?
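
As a rough illustration of what decoupling tags from the object lifecycle 
could look like, here is a small plain-Python sketch.  The NameBoundTags class 
and its methods are hypothetical and do not reflect how Atlas stores tags; the 
only assumption is that tags are keyed by a fully-qualified name rather than by 
an entity GUID.

```python
from collections import defaultdict


class NameBoundTags:
    """Hypothetical registry binding tags to a qualified name, not a GUID."""

    def __init__(self):
        self._tags = defaultdict(set)

    def tag(self, qualified_name, tag):
        self._tags[qualified_name].add(tag)

    def untag(self, qualified_name, tag):
        # Explicit user action: the only way a single tag goes away.
        self._tags[qualified_name].discard(tag)

    def reset(self, qualified_name):
        # "Fresh slate" for users who want every incarnation untagged.
        self._tags.pop(qualified_name, None)

    def tags_for(self, qualified_name):
        # Return a copy so callers cannot mutate the registry by accident.
        return set(self._tags.get(qualified_name, set()))


registry = NameBoundTags()
registry.tag("default.tablea", "Tag1")
registry.tag("default.tablea", "Tag2")

# Dropping and re-creating the table never touches the registry, so the
# new incarnation sees the same tags as the old one.
after_recreate = registry.tags_for("default.tablea")

# A user who wants a clean table resets the name explicitly.
registry.reset("default.tablea")
after_reset = registry.tags_for("default.tablea")
```

Here after_recreate still holds both tags and after_reset is empty, which is 
the light-switch behaviour: tags change only when a user changes them.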


was (Author: dmarkwat):
[~davidrad], yes this is involved very much in large ETL pipelines.  Other 
times, it may be that someone simply needs to throw the table away - as part of 
debugging, trying new things, break-fix, etc.  In any event within a short span 
of time (days to months) - for anyone using Hive as a data warehouse or shared, 
structured storage tool in any way - nobody will likely be creating *and 
tagging* a table within a database with the same name as a table a few days 
prior without the two tables being closely related in some way (as far as I can 
tell).  This is certainly true for the use cases I've been exposed to in data 
warehousing type setups for Hive.  This may not be true in all cases (but maybe 
it mostly is?), but it lends to the idea that we at least need some way of 
keeping these tags applied to named objects after the object has been dropped 
and recreated, and putting the responsibility of keeping these tags up to date 
in the hands of a user.

Having the ETL jobs be responsible for tagging appears to defeat the perceived 
purpose of Atlas which would be to persist the tags and become a book of record 
for them.  (This could also be a perception problem or misunderstanding of the 
purpose of tagging on our part!).  If a job is responsible for tagging, all 
jobs collectively become the definitive record of what is tagged whereas Atlas 
merely reports that at one time or another something was tagged.

Trying to play devil's advocate for myself and maybe see if what we're doing is 
a better fit elsewhere, I had a tangential thought:
- the business catalog/taxonomy: how is that structured?  Do those remain 
tagged to new incarnations of objects after the originally-tagged object is 
dropped?  Do they get pushed to Ranger as "tags" (/ can they be if they do 
persist?)

[~shwethags], yup I know the tags are kept and there are only "soft deletes" 
with new entities and GUIDs being created every time; but the people I work 
with and I are interested in a notion of more existential tagging that isn't 
tied to a (potentially) volatile object.  We are also trying to find a unified 
path to auditing anything that shares a name across multiple incarnations, 
which Atlas currently doesn't appear to have.  We would query for all the 
objects with a given name, then audit them individually.  Maybe that's the 
only way there ever will be to do it?  Do you have other suggestions?

And yes, rule-based classification (similar to Ranger) would likely be totally 
acceptable; whatever decouples the tagging from the lifecycle of the object.  I 
do expect there are some people who want to use tagging in a more volatile 
manner, but from what I've seen totally user-driven tagging/untagging might 
work for most use cases.  And could it be that anyone needing a fresh slate for 
every incarnation of an object could simply "reset" the tags applied to it 
(perhaps?) and then proceed as they currently would?

> Tags should be bound to an object's name and remain bound to all incarnations 
> of that name
> ------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-1161
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1161
>             Project: Atlas
>          Issue Type: Improvement
>    Affects Versions: trunk, 0.7-incubating
>            Reporter: Dan Markwat
>
> As a user I would like tags I ascribe to an object in Atlas to carry over to 
> the next incarnation of that object.  In effect, tags would be ascribed to a 
> fully-qualified object name, and all incarnations of that name would have 
> the tags applied to them.  (Not unlike Ranger and the way it applies 
> policies to objects).
> Example: I create a Hive table, TableA.  I tag TableA with tags, Tag1 and 
> Tag2.  I drop TableA.
> In the current Atlas world, if I create TableA again, Tag1 and Tag2 need to 
> be re-applied to TableA.  In the ideal governance/security world, if I 
> re-create TableA I should not have to re-tag it with Tag1 and Tag2; instead, 
> I should be required to *untag* TableA if I desire TableA to be clean and 
> untagged.  This effectively functions like a light switch: user turns on 
> light, just because the bulb is swapped out doesn't mean the switch turned 
> off - the user must explicitly turn the switch off, just as they did to turn 
> it on.  Think also about Ranger: just because I deleted an object doesn't 
> mean that policy goes away.
> By effectively deleting the binding of Tag1 and Tag2 to the name TableA 
> whenever TableA is deleted, Atlas ceases to be a book of record for tags 
> associated with TableA, as those tags would need to be applied again.  This 
> is bad in a world where creating/dropping objects and tagging objects are 
> part of 2 independent and asynchronous processes - one carried out by an 
> engineer, the other carried out by a governance/security administrator.  The 
> issue is compounded by the fact that tags can have security policies 
> associated with them in Ranger; and any object missing its tag at re-creation 
> of that object now is missing security policies previously attached to it.
> This is an especially annoying issue for organizations that have large 
> ingestion pipelines where tables are sometimes deleted or modified in ways 
> not easily accomplished through updating table metadata.  Not to mention 
> (and this is probably a new feature), easily accessible records of what was 
> tagged with what - even if the object has been dropped or deleted - are 
> especially important for organizations that require auditing or have 
> security controls based on tag-based policies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
