[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name

2016-09-15 Thread Dan Markwat (JIRA)

[ 
https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493356#comment-15493356
 ] 

Dan Markwat commented on ATLAS-1161:


[~davidrad], yes this is involved very much in large ETL pipelines.  Other 
times, it may be that someone simply needs to throw the table away - as part of 
debugging, trying new things, break-fix, etc.  In any event within a short span 
of time (days to months) - for anyone using Hive as a data warehouse or shared, 
structured storage tool in any way - nobody will likely be creating *and 
tagging* a table within a database with the same name as a table a few days 
prior without the two tables being closely related in some way (as far as I can 
tell).  This is certainly true for the use cases I've been exposed to in data 
warehousing type setups for Hive.  This may not be true in all cases (but maybe 
it mostly is?), but it lends weight to the idea that we at least need some way 
of keeping these tags applied to named objects after the object has been 
dropped and recreated, and of putting the responsibility for keeping these tags 
up to date in the hands of a user.

Having the ETL jobs be responsible for tagging appears to defeat the perceived 
purpose of Atlas which would be to persist the tags and become a book of record 
for them.  (This could also be a perception problem or misunderstanding of the 
purpose of tagging on our part!).  If a job is responsible for tagging, all 
jobs collectively become the definitive record of what is tagged whereas Atlas 
merely reports that at one time or another something was tagged.

Trying to play devil's advocate for myself and maybe see if what we're doing is 
a better fit elsewhere, I had a tangential thought:
- the business catalog/taxonomy: how is that structured?  Do those remain 
tagged to new incarnations of objects after the originally-tagged object is 
dropped?  Do they get pushed to Ranger as "tags" (/ can they be if they do 
persist?)

[~shwethags], yup I know the tags are kept and there are only "soft deletes" 
with new entities and guids being created every time; but the people I work 
with and I are interested in a notion of more existential tagging that isn't 
tied to a (potentially) volatile object.  We are also trying to find a unified 
path to auditing anything that shares a name across multiple incarnations, 
which Atlas currently doesn't appear to have.  We would query for all the 
objects named a certain thing then audit them individually.  Maybe that's the 
only way there ever will be to do it?  Do you have other suggestions?
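
To make the "query by name, then audit each incarnation" workaround concrete, 
here is a minimal sketch.  It does not use any Atlas API: the event tuples, the 
guid strings and the function names are all hypothetical, and the audit data is 
assumed to have been exported already.

```python
# Hypothetical exported audit records: (qualified_name, guid, timestamp, event).
# None of these names come from Atlas; they only illustrate the idea.
AUDIT_EVENTS = [
    ("default.TableA", "guid-1", 1, "ENTITY_CREATE"),
    ("default.TableA", "guid-1", 2, "TAG_ADD Tag1"),
    ("default.TableA", "guid-1", 3, "ENTITY_DELETE"),
    ("default.TableB", "guid-2", 4, "ENTITY_CREATE"),
    ("default.TableA", "guid-3", 5, "ENTITY_CREATE"),
]


def incarnations(events, qualified_name):
    """Return the guids that have carried this name, oldest incarnation first."""
    guids = []
    for name, guid, _timestamp, _event in sorted(events, key=lambda e: e[2]):
        if name == qualified_name and guid not in guids:
            guids.append(guid)
    return guids


def audit_by_name(events, qualified_name):
    """Merge the audit trails of every incarnation sharing a name, oldest first."""
    matching = [e for e in events if e[0] == qualified_name]
    return sorted(matching, key=lambda e: e[2])


for event in audit_by_name(AUDIT_EVENTS, "default.TableA"):
    print(event)
```

The merge is done on the client side here, which is exactly the extra step a 
name-level audit view would remove.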

And yes, rule-based classification (similar to Ranger) would likely be totally 
acceptable; whatever decouples the tagging from the lifecycle of the object.  I 
do expect there are some people who want to use tagging in a more volatile 
manner, but from what I've seen totally user-driven tagging/untagging might 
work for most use cases.  And could it be that anyone needing a fresh slate for 
every incarnation of an object could simply "reset" the tags applied to it 
(perhaps?) and then proceed as they currently would?

> Tags should be bound to an object's name and remain bound to all incarnations 
> of that name
> --
>
> Key: ATLAS-1161
> URL: https://issues.apache.org/jira/browse/ATLAS-1161
> Project: Atlas
>  Issue Type: Improvement
>Affects Versions: trunk, 0.7-incubating
>Reporter: Dan Markwat
>
> As a user I would like tags I ascribe to an object in Atlas carry to the next 
> incarnation of that object.  In effect, tags would be ascribed to a 
> fully-qualified object name and all incarnations of that name would have the 
> tags apply to it.  (Not unlike Ranger and the way it applies policies to 
> objects).
> Example: I create a Hive table, TableA.  I tag TableA with tags, Tag1 and 
> Tag2.  I drop TableA.
> In the current Atlas world, if I create TableA again, Tag1 and Tag2 need to 
> be re-applied to TableA.  In the ideal governance/security world, if I 
> re-create TableA I should not have to re-tag it with Tag1 and Tag2; instead, 
> I should be required to *untag* TableA if I desire TableA to be clean and 
> untagged.  This effectively functions like a light switch: user turns on 
> light, just because the bulb is swapped out doesn't mean the switch turned 
> off - the user must explicitly turn the switch off, just as they did to turn 
> it on.  Think also about Ranger: just because I deleted an object doesn't 
> mean that policy goes away.
> By effectively deleting the binding of Tag1 and Tag2 to the name TableA 
> whenever TableA is deleted, Atlas ceases to be a book of record for tags 
> associated with TableA, as those tags would need to be applied again.  This 
> is bad in a world where creating/dropping objects and tagging 

[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name

2016-09-13 Thread Shwetha G S (JIRA)

[ 
https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487402#comment-15487402
 ] 

Shwetha G S commented on ATLAS-1161:


{quote}
In the current Atlas world, if I create TableA again, Tag1 and Tag2 need to be 
re-applied to TableA
{quote}
I think it makes sense not to carry over the tags from the old entity. The new 
entity, even though it has the same name as the old entity, might have been 
derived in a different way and mean something else. I also think rule-based 
classification would be a better fit here.

{quote}
By effectively deleting the binding of Tag1 and Tag2 to the name TableA 
whenever TableA is deleted, Atlas ceases to be a book of record for tags 
associated with TableA, as those tags would need to be applied again.
{quote}
By default, entity deletes are soft deletes. So, when TableA is deleted, it's 
just marked as deleted; all the tag associations still remain, and search 
retrieves the deleted entities as well. When TableA is re-created, Atlas 
creates another entity (with a new guid) and with no tags. So, both the old and 
new entities exist and it's possible to look at both of them for audit. 
Additionally, each entity has an audit trail that records tag associations, 
which can be viewed for both active and deleted entities.
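
The behaviour described above can be illustrated with a small in-memory model. 
This is only a sketch of the semantics as stated in this comment, not Atlas 
code: the `Entity` and `Catalog` classes and the audit event strings are 
hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical stand-in for a catalog entity."""
    qualified_name: str
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))
    tags: set = field(default_factory=set)
    deleted: bool = False
    audit: list = field(default_factory=list)


class Catalog:
    """Toy catalog with soft deletes; every create yields a new guid."""

    def __init__(self):
        self.entities = []

    def create(self, qualified_name):
        entity = Entity(qualified_name)
        entity.audit.append("ENTITY_CREATE")
        self.entities.append(entity)
        return entity

    def tag(self, entity, tag):
        entity.tags.add(tag)
        entity.audit.append(f"TAG_ADD:{tag}")

    def soft_delete(self, entity):
        # The entity is only marked as deleted; tags and audit trail remain.
        entity.deleted = True
        entity.audit.append("ENTITY_DELETE")

    def search(self, qualified_name):
        # Search returns deleted entities as well as active ones.
        return [e for e in self.entities if e.qualified_name == qualified_name]


catalog = Catalog()
old = catalog.create("default.TableA")
catalog.tag(old, "Tag1")
catalog.tag(old, "Tag2")
catalog.soft_delete(old)
new = catalog.create("default.TableA")  # same name, new guid, no tags
```

After this sequence, a search for `default.TableA` returns two entities: the 
deleted one still carrying Tag1 and Tag2, and the new one with no tags.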

> Tags should be bound to an object's name and remain bound to all incarnations 
> of that name
> --
>
> Key: ATLAS-1161
> URL: https://issues.apache.org/jira/browse/ATLAS-1161
> Project: Atlas
>  Issue Type: Improvement
>Affects Versions: trunk, 0.7-incubating
>Reporter: Dan Markwat
>
> As a user I would like tags I ascribe to an object in Atlas carry to the next 
> incarnation of that object.  In effect, tags would be ascribed to a 
> fully-qualified object name and all incarnations of that name would have the 
> tags apply to it.  (Not unlike Ranger and the way it applies policies to 
> objects).
> Example: I create a Hive table, TableA.  I tag TableA with tags, Tag1 and 
> Tag2.  I drop TableA.
> In the current Atlas world, if I create TableA again, Tag1 and Tag2 need to 
> be re-applied to TableA.  In the ideal governance/security world, if I 
> re-create TableA I should not have to re-tag it with Tag1 and Tag2; instead, 
> I should be required to *untag* TableA if I desire TableA to be clean and 
> untagged.  This effectively functions like a light switch: user turns on 
> light, just because the bulb is swapped out doesn't mean the switch turned 
> off - the user must explicitly turn the switch off, just as they did to turn 
> it on.  Think also about Ranger: just because I deleted an object doesn't 
> mean that policy goes away.
> By effectively deleting the binding of Tag1 and Tag2 to the name TableA 
> whenever TableA is deleted, Atlas ceases to be a book of record for tags 
> associated with TableA, as those tags would need to be applied again.  This 
> is bad in a world where creating/dropping objects and tagging objects are 
> part of 2 independent and asynchronous processes - one carried out by an 
> engineer, the other carried out by a governance/security administrator.  The 
> issue is compounded by the fact that tags can have security policies 
> associated with them in Ranger; and any object missing its tag at re-creation 
> of that object now is missing security policies previously attached to it.
> This is an especially annoying issue for organizations that have large 
> ingestion pipelines where tables are sometimes deleted or modified in ways 
> not easily accomplished through updating table metadata.  Not to mention, 
> (probably a new feature: ) easily-accessible records of what was tagged with 
> what - even if the object has been dropped or deleted - is especially 
> important for organizations that require auditing or have security controls 
> based on tag-based policies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name

2016-09-13 Thread David Radley (JIRA)

[ 
https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486749#comment-15486749
 ] 

David Radley commented on ATLAS-1161:
-

Interesting - thanks [~dmarkwat] 
- It seems that the asset in question is like a transient file. I assume that 
the sort of use case you are thinking of is delta ETL / map reduce jobs, 
where you accumulate daily updates in a table. You may be deleting and 
recreating files as part of this process. A way of doing this would be to use 
the time stamp in the file name or the folder name / namespace. It seems to me 
that a more complete solution would be to use rule-based classification / 
tagging - e.g. everything in this folder or with this namespace regex could be 
automatically tagged / classified; this could catch renames as well. It might 
be that where a file lives determines its classification - so one way of 
changing the classification of an asset would be to move it to a new location. 
- I guess your suggested solution is a useful improvement, though I think the 
company setting this up needs to buy into this sort of behaviour with a config 
option, a new API or a new Atlas type, so that there are no unintended 
consequences for other use cases. Usual practice would be to not change 
defaults, but I guess in an incubator if we feel this default is more useful 
then this could be the default behavior. 
- Another way of doing this classification would be that the ETL / map reduce 
job does the classification of the target table. It could then also do more 
granular column-based tagging as well. The responsibility for the 
classification then lies with the job that creates the file. 
- I suspect we would often not tag a table as PII - more likely a column. 
Though we might tag a table as "customer data" or "for testing" or the like. 
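
As a rough illustration of the rule-based idea in the first bullet, the sketch 
below derives tags from a namespace regex instead of storing them per entity. 
The rule table, the namespace patterns and the `classify` function are 
assumptions made for this example; Atlas is not claimed to offer such an API.

```python
import re

# Hypothetical rules: (pattern over the fully qualified name, tag to apply).
RULES = [
    (re.compile(r"^staging\..*"), "for testing"),
    (re.compile(r"^crm\.customer_.*"), "customer data"),
]


def classify(qualified_name, rules=RULES):
    """Return every tag whose pattern matches the qualified name."""
    return {tag for pattern, tag in rules if pattern.match(qualified_name)}


print(classify("crm.customer_orders"))
print(classify("staging.tmp_20160913"))
print(classify("finance.ledger"))
```

Because the tags are computed from the name, a dropped and recreated table gets 
the same tags again, and moving an asset to another namespace changes its 
classification, as suggested above.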



[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name

2016-09-12 Thread Dan Markwat (JIRA)

[ 
https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484294#comment-15484294
 ] 

Dan Markwat commented on ATLAS-1161:


[~davidrad], my thinking is - and it sounds something like what you describe - 
there would be a separate page in the app that takes a tag-sourced approach to 
managing or visualizing information (similar to the business taxonomy).  From 
this page, a user would have a view of all tags applied in the system with 
visual cues provided for which ones are tagged to live objects and which are 
not.  Administrators could then make any updates as they see fit.  
Additionally, this would facilitate easy navigation from a tag to any 
associated objects throughout history.

As for handling renames, I can't say... I was on a very similar project that 
dealt with this exact problem, but the solution I used makes little sense 
here.  Would it be possible to support this as part of a "rename" operation?  
(Last I checked, it didn't look like Atlas had a [micro]-service style 
architecture, so not sure how easily that fits in?)  Since each hook would know 
what the physical operation in its system was, it stands to reason that the 
hook knows the Atlas-equivalent updates and operations needed to represent it.  
This might involve reverse lookups on the tags and then relinking to the 
newly-named object?  I'm just spit-balling here..  I suppose the tag could 
still bind to a GUID, but then an interesting query involving the name would 
still need to be done to accomplish the same thing.  A concept of "last object 
named X" is required to make this work, as far as I can tell.
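
To make the spit-balling above a bit more concrete, here is a minimal sketch of 
tags bound to a name, with a "last object named X" pointer and a rename step 
that relinks the tags.  Everything in it (the `NameBoundTags` class, its method 
names, the idea of a hook calling it) is hypothetical and only illustrates the 
proposal; it is not how Atlas works today.

```python
class NameBoundTags:
    """Toy registry that binds tags to qualified names rather than guids."""

    def __init__(self):
        self.tags_by_name = {}  # qualified name -> set of tags
        self.latest_guid = {}   # qualified name -> guid of the last object so named

    def tag(self, name, tag):
        self.tags_by_name.setdefault(name, set()).add(tag)

    def untag(self, name, tag):
        # Tags go away only when a user explicitly removes them.
        self.tags_by_name.get(name, set()).discard(tag)

    def on_create(self, name, guid):
        """Called by a (hypothetical) hook; returns the tags the new object inherits."""
        self.latest_guid[name] = guid
        return set(self.tags_by_name.get(name, set()))

    def on_rename(self, old_name, new_name):
        """Relink tags and the 'last object' pointer to the new name."""
        moved = self.tags_by_name.pop(old_name, set())
        self.tags_by_name.setdefault(new_name, set()).update(moved)
        if old_name in self.latest_guid:
            self.latest_guid[new_name] = self.latest_guid.pop(old_name)


registry = NameBoundTags()
registry.tag("default.TableA", "Tag1")
first = registry.on_create("default.TableA", "guid-1")
second = registry.on_create("default.TableA", "guid-2")  # drop and re-create
registry.on_rename("default.TableA", "default.TableB")
```

In this model both incarnations of TableA inherit Tag1, and after the rename 
the tag and the latest guid follow the object to its new name.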



[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name

2016-09-12 Thread David Radley (JIRA)

[ 
https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483992#comment-15483992
 ] 

David Radley commented on ATLAS-1161:
-

Another thought. If we use the fully qualified name for this, how do we handle 
renames? Often we like to use a guid as the unique identifier to facilitate 
renames and moves.



[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name

2016-09-12 Thread David Radley (JIRA)

[ 
https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483640#comment-15483640
 ] 

David Radley commented on ATLAS-1161:
-

I can see that the relationship between an asset and a tag needs to be under 
the control of governance lifecycles and not tied to the asset lifetime. 

The advantage of the current approach is that if a table is dropped then its 
associated tag relationships go as well. So we are not left with ghost 
tag-to-asset mappings to a non-existent asset. Also, if TableA is recreated 
with a different shape, we are not left with out-of-date tag-to-asset mappings. 

If we make a change like this for the reasons you mention, we need to ensure 
that ghost mappings and out-of-sync mappings are handled in the fix design. If 
this use case is important, I suggest introducing an asset type or 
asset-to-tag type that has a tag lifetime rather than an asset lifetime.  
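
As a sketch of what "handling ghost and out-of-sync mappings" could mean in 
practice, the snippet below flags name-level tag bindings whose asset no longer 
exists, or whose shape has changed since it was tagged.  The data layout and 
the `review_bindings` function are assumptions made for illustration only.

```python
def review_bindings(bindings, live_schemas):
    """Split tag bindings into ghosts (no live asset) and stale (shape changed).

    bindings:     name -> (tags, column tuple recorded when the tags were applied)
    live_schemas: name -> column tuple of the asset that currently exists
    """
    ghosts, stale = [], []
    for name in sorted(bindings):
        _tags, tagged_schema = bindings[name]
        if name not in live_schemas:
            ghosts.append(name)
        elif live_schemas[name] != tagged_schema:
            stale.append(name)
    return ghosts, stale


# Hypothetical example data.
bindings = {
    "default.TableA": ({"Tag1", "Tag2"}, ("id", "email")),
    "default.TableB": ({"Tag1"}, ("id",)),
    "default.TableC": ({"Tag2"}, ("id", "name")),
}
live_schemas = {
    "default.TableA": ("id", "email", "phone"),  # recreated with a different shape
    "default.TableC": ("id", "name"),            # unchanged
}

ghosts, stale = review_bindings(bindings, live_schemas)
print("ghost mappings:", ghosts)
print("out-of-sync mappings:", stale)
```

A governance administrator could then review the flagged bindings rather than 
having them silently dropped or silently carried over.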
  

   
