[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name
[ https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493356#comment-15493356 ]

Dan Markwat commented on ATLAS-1161:

[~davidrad], yes, this is very much involved in large ETL pipelines. Other times, it may be that someone simply needs to throw the table away - as part of debugging, trying new things, break-fix, etc. In any event, within a short span of time (days to months) - for anyone using Hive as a data warehouse or shared, structured storage tool in any way - nobody is likely to create *and tag* a table in a database with the same name as a table from a few days prior without the two tables being closely related in some way (as far as I can tell). This is certainly true for the use cases I've been exposed to in data-warehousing-type setups for Hive. It may not be true in all cases (but maybe it mostly is?), but it lends weight to the idea that we at least need some way of keeping these tags applied to named objects after the object has been dropped and recreated, and of putting the responsibility for keeping these tags up to date in the hands of a user.

Having the ETL jobs be responsible for tagging appears to defeat the perceived purpose of Atlas, which would be to persist the tags and become a book of record for them. (This could also be a perception problem or a misunderstanding of the purpose of tagging on our part!) If a job is responsible for tagging, all jobs collectively become the definitive record of what is tagged, whereas Atlas merely reports that at one time or another something was tagged.

Trying to play devil's advocate for myself, and maybe to see if what we're doing is a better fit elsewhere, I had a tangential thought:
- the business catalog/taxonomy: how is that structured? Do those remain tagged to new incarnations of objects after the originally-tagged object is dropped? Do they get pushed to Ranger as "tags" (or can they be, if they do persist)?
[~shwethags], yup, I know the tags are kept and there are only "soft deletes", with new entities and guids being created every time; but the people I work with and I are interested in a notion of more existential tagging that isn't tied to a (potentially) volatile object. We are also trying to find a unified path to auditing anything that shares a name across multiple incarnations, which Atlas currently doesn't appear to have. We would query for all the objects named a certain thing, then audit them individually. Maybe that's the only way there ever will be to do it? Do you have other suggestions?

And yes, rule-based classification (similar to Ranger) would likely be totally acceptable; whatever decouples the tagging from the lifecycle of the object. I do expect there are some people who want to use tagging in a more volatile manner, but from what I've seen, totally user-driven tagging/untagging might work for most use cases. And could it be that anyone needing a fresh slate for every incarnation of an object could simply "reset" the tags applied to it (perhaps?) and then proceed as they currently would?

> Tags should be bound to an object's name and remain bound to all incarnations of that name
> --
>
>                 Key: ATLAS-1161
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1161
>             Project: Atlas
>          Issue Type: Improvement
>    Affects Versions: trunk, 0.7-incubating
>            Reporter: Dan Markwat
>
> As a user I would like tags I ascribe to an object in Atlas carry to the next incarnation of that object. In effect, tags would be ascribed to a fully-qualified object name and all incarnations of that name would have the tags apply to it. (Not unlike Ranger and the way it applies policies to objects).
> Example: I create a Hive table, TableA. I tag TableA with tags, Tag1 and Tag2. I drop TableA.
> In the current Atlas world, if I create TableA again, Tag1 and Tag2 need to be re-applied to TableA.
> In the ideal governance/security world, if I re-create TableA I should not have to re-tag it with Tag1 and Tag2; instead, I should be required to *untag* TableA if I desire TableA to be clean and untagged. This effectively functions like a light switch: user turns on light, just because the bulb is swapped out doesn't mean the switch turned off - the user must explicitly turn the switch off, just as they did to turn it on. Think also about Ranger: just because I deleted an object doesn't mean that policy goes away.
> By effectively deleting the binding of Tag1 and Tag2 to the name TableA whenever TableA is deleted, Atlas ceases to be a book of record for tags associated with TableA, as those tags would need to be applied again. This is bad in a world where creating/dropping objects and tagging
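The "light switch" behaviour requested in the issue can be pictured with a small Python sketch. This is only a toy model under stated assumptions: `NameBoundTags`, its methods, and the qualified name `default.TableA` are hypothetical illustrations, not part of Atlas or Ranger. The point it shows is that a binding keyed on the fully-qualified name is unaffected by dropping and re-creating the object, and goes away only on an explicit untag.

```python
from collections import defaultdict


class NameBoundTags:
    """Toy registry (hypothetical, not an Atlas API) that binds tags to a
    fully-qualified name rather than to a particular entity."""

    def __init__(self):
        # qualified name -> set of tag names
        self._tags = defaultdict(set)

    def tag(self, qualified_name, tag):
        self._tags[qualified_name].add(tag)

    def untag(self, qualified_name, tag):
        # The explicit "switch off": the only operation that removes a binding.
        self._tags[qualified_name].discard(tag)

    def tags_for(self, qualified_name):
        # Return a copy so callers cannot mutate the registry by accident.
        return set(self._tags.get(qualified_name, set()))


registry = NameBoundTags()
registry.tag("default.TableA", "Tag1")
registry.tag("default.TableA", "Tag2")

# Dropping and re-creating TableA never touches the registry,
# so the next incarnation of the name sees the same tags.
tags_after_recreate = registry.tags_for("default.TableA")

# Only an explicit untag changes the binding.
registry.untag("default.TableA", "Tag1")
tags_after_untag = registry.tags_for("default.TableA")
```

A user who wants a clean slate for each incarnation would, in this model, call `untag` for every tag on the name before proceeding.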
[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name
[ https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487402#comment-15487402 ]

Shwetha G S commented on ATLAS-1161:

{quote}
In the current Atlas world, if I create TableA again, Tag1 and Tag2 need to be re-applied to TableA
{quote}
I think it makes sense not to carry over the tags from the old entity. The new entity, even though it has the same name as the old entity, might have been derived in a different way and mean something else. I also think rule-based classification would be a better fit here.

{quote}
By effectively deleting the binding of Tag1 and Tag2 to the name TableA whenever TableA is deleted, Atlas ceases to be a book of record for tags associated with TableA, as those tags would need to be applied again.
{quote}
By default, entity deletes are soft deletes. So, when TableA is deleted, it's just marked as deleted; all the tag associations still remain, and search retrieves the deleted entities as well. When TableA is re-created, Atlas creates another entity (with a new guid) and with no tags. So, both the old and new entities exist, and it's possible to look at both of them for audit. Additionally, each entity has an audit trail that records tag associations, which can be viewed for both active and deleted entities.

> Tags should be bound to an object's name and remain bound to all incarnations of that name
> --
>
>                 Key: ATLAS-1161
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1161
>             Project: Atlas
>          Issue Type: Improvement
>    Affects Versions: trunk, 0.7-incubating
>            Reporter: Dan Markwat
>
> As a user I would like tags I ascribe to an object in Atlas carry to the next incarnation of that object. In effect, tags would be ascribed to a fully-qualified object name and all incarnations of that name would have the tags apply to it. (Not unlike Ranger and the way it applies policies to objects).
> Example: I create a Hive table, TableA. I tag TableA with tags, Tag1 and Tag2. I drop TableA.
> In the current Atlas world, if I create TableA again, Tag1 and Tag2 need to be re-applied to TableA.
> In the ideal governance/security world, if I re-create TableA I should not have to re-tag it with Tag1 and Tag2; instead, I should be required to *untag* TableA if I desire TableA to be clean and untagged. This effectively functions like a light switch: user turns on light, just because the bulb is swapped out doesn't mean the switch turned off - the user must explicitly turn the switch off, just as they did to turn it on. Think also about Ranger: just because I deleted an object doesn't mean that policy goes away.
> By effectively deleting the binding of Tag1 and Tag2 to the name TableA whenever TableA is deleted, Atlas ceases to be a book of record for tags associated with TableA, as those tags would need to be applied again. This is bad in a world where creating/dropping objects and tagging objects are part of 2 independent and asynchronous processes - one carried out by an engineer, the other carried out by a governance/security administrator. The issue is compounded by the fact that tags can have security policies associated with them in Ranger; and any object missing its tag at re-creation of that object now is missing security policies previously attached to it.
> This is an especially annoying issue for organizations that have large ingestion pipelines where tables are sometimes deleted or modified in ways not easily accomplished through updating table metadata. Not to mention, (probably a new feature: ) easily-accessible records of what was tagged with what - even if the object has been dropped or deleted - is especially important for organizations that require auditing or have security controls based on tag-based policies.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
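The soft-delete behaviour described in the comment above can be illustrated with a toy model in Python. It is a sketch only: the `Entity` and `Catalog` classes, the status strings, and the audit event names are assumptions made for illustration and do not reproduce the Atlas data model or API. It shows the three properties the comment relies on: the deleted entity keeps its tags, the re-created entity gets a new guid and no tags, and both remain findable by name for audit.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "ACTIVE"                      # illustrative status values
    tags: set = field(default_factory=set)
    audit: list = field(default_factory=list)   # per-entity audit trail


class Catalog:
    """Hypothetical in-memory catalog with soft deletes."""

    def __init__(self):
        self.entities = []

    def create(self, name):
        entity = Entity(name=name)
        entity.audit.append("ENTITY_CREATE")
        self.entities.append(entity)
        return entity

    def add_tag(self, entity, tag):
        entity.tags.add(tag)
        entity.audit.append(f"TAG_ADD:{tag}")

    def soft_delete(self, entity):
        # Mark as deleted; tag associations and audit trail are kept.
        entity.status = "DELETED"
        entity.audit.append("ENTITY_DELETE")

    def find_by_name(self, name):
        # Search returns deleted entities as well as active ones.
        return [e for e in self.entities if e.name == name]


catalog = Catalog()
old = catalog.create("default.TableA")
catalog.add_tag(old, "Tag1")
catalog.add_tag(old, "Tag2")
catalog.soft_delete(old)

new = catalog.create("default.TableA")   # re-creation: new guid, no tags
incarnations = catalog.find_by_name("default.TableA")
```

Auditing a name across incarnations then amounts to iterating over `incarnations` and reading each entity's `audit` list, which is the per-object approach discussed in the thread.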
[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name
[ https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486749#comment-15486749 ]

David Radley commented on ATLAS-1161:

- Interesting - thanks [~dmarkwat] - it seems that the asset in question is like a transient file. I assume that the sort of use case you are thinking of is delta ETL / map reduce jobs, where you accumulate daily updates in a table. You may be deleting and recreating files as part of this process. A way of doing this would be to use the time stamp in the file name or the folder name / namespace. It seems to me that a more complete solution would be to use rule-based classification / tagging - e.g. everything in this folder or with this namespace regex could be automatically tagged / classified; this could catch renames as well. It might be that where a file lives determines its classification - so one way of changing the classification of an asset would be to move it to a new location.
- I guess your suggested solution is a useful improvement, though I think the company setting this up needs to buy into this sort of behaviour with a config option, a new API or a new Atlas type, so that there are no unintended consequences for other use cases. Usual practice would be to not change defaults, but I guess in an incubator, if we feel this default is more useful, then this could be the default behaviour.
- Another way of doing this classification would be for the ETL / map reduce job to classify the target table. It could then also do more granular column-based tagging. The responsibility for the classification then lies with the job that creates the file.
- I suspect we would often not tag a table as PII - more likely a column. Though we might tag a table as "customer data" or "for testing" or the like.
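The rule-based classification idea raised in this comment ("everything in this folder or with this namespace regex could be automatically tagged") can be sketched in a few lines of Python. The rule table, the `classify` function, and the namespaces `sales.` and `staging.tmp_` are hypothetical; only the tag names "customer data" and "for testing" come from the comment. Because tags are derived from the name on demand, a dropped and re-created table, or a table moved into a matching namespace, picks up its tags without any stored binding.

```python
import re

# Hypothetical rules: (pattern over the qualified name, tag to apply).
RULES = [
    (re.compile(r"^sales\..*"), "customer data"),
    (re.compile(r"^staging\.tmp_.*"), "for testing"),
]


def classify(qualified_name):
    """Return the set of tags whose rule matches the qualified name."""
    return {tag for pattern, tag in RULES if pattern.match(qualified_name)}


tags_orders = classify("sales.orders")
tags_scratch = classify("staging.tmp_load")
tags_unmatched = classify("hr.people")
```

Changing an asset's classification by moving it, as suggested above, corresponds here to giving it a name that matches a different rule.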
[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name
[ https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484294#comment-15484294 ]

Dan Markwat commented on ATLAS-1161:

[~davidrad], my thinking is - and it sounds something like what you describe - that there would be a separate page in the app that takes a tag-sourced approach to managing or visualizing information (similar to the business taxonomy). From this page, a user would have a view of all tags applied in the system, with visual cues showing which ones are tagged to live objects and which are not. Administrators could then make any updates they see fit. Additionally, this would facilitate easy navigation from a tag to any associated objects throughout history.

As for handling renames, I can't say... I was on a very similar project that dealt with this exact problem, but the solution I used makes little sense here. Would it be possible to support this as part of a "rename" operation? (Last I checked, it didn't look like Atlas had a [micro]-service style architecture, so I'm not sure how easily that fits in.) Since each hook would know what the physical operation in its system was, it stands to reason that the hook knows the Atlas-equivalent updates and operations needed to represent it. This might involve reverse lookups on the tags and then relinking to the newly-named object? I'm just spitballing here.

I suppose the tag could still bind to a GUID, but then an interesting query involving the name would still need to be done to accomplish the same thing. A concept of "last object named X" is required to make this work, as far as I can tell.
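The "last object named X" lookup mentioned at the end of this comment could look roughly like the following Python sketch, in which tags stay bound to a GUID and are copied forward when a new incarnation of the name is created. Everything here is an assumption for illustration: the `Incarnation` record, the integer `created_at` ordering, and both helper functions are hypothetical and are not Atlas hook or API code.

```python
from dataclasses import dataclass, field


@dataclass
class Incarnation:
    guid: str
    name: str
    created_at: int                 # hypothetical creation order (larger = newer)
    tags: set = field(default_factory=set)


def last_named(incarnations, name):
    """Return the most recently created incarnation with this name, or None."""
    matches = [i for i in incarnations if i.name == name]
    return max(matches, key=lambda i: i.created_at, default=None)


def create_with_inherited_tags(incarnations, guid, name, created_at):
    """Create a new incarnation and copy tags from the last object of that name."""
    previous = last_named(incarnations, name)
    inherited = set(previous.tags) if previous else set()
    new = Incarnation(guid, name, created_at, inherited)
    incarnations.append(new)
    return new


history = [Incarnation("g1", "default.TableA", 1, {"Tag1", "Tag2"})]
recreated = create_with_inherited_tags(history, "g2", "default.TableA", 2)
```

A rename could be handled along the same lines by having the hook update `name` on the existing record instead of creating a new one, so the GUID and its tags are untouched.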
[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name
[ https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483992#comment-15483992 ]

David Radley commented on ATLAS-1161:

Another thought. If we use the fully qualified name for this, how do we handle renames? Often we like to use a guid as the unique identifier to facilitate renames and moves.
[jira] [Commented] (ATLAS-1161) Tags should be bound to an object's name and remain bound to all incarnations of that name
[ https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483640#comment-15483640 ]

David Radley commented on ATLAS-1161:

I can see that the relationship between an asset and a tag needs to be under the control of governance lifecycles and not tied to the asset lifetime. The advantage of the current approach is that if a table is dropped, then its associated tag relationships go as well, so we are not left with ghost tag-to-asset mappings that point to a nonexistent asset. Also, if TableA is recreated with a different shape, we do not end up with out-of-date tag-to-asset mappings. If we make a change like this for the reasons you mention, we need to ensure that ghost mappings and out-of-sync mappings are handled in the fix design. If this use case is important, I suggest introducing an asset type or an asset-to-tag type which has a tag lifetime rather than an asset lifetime.
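The two hazards named in this comment, ghost mappings and out-of-sync mappings, can be made concrete with a short Python sketch. It assumes tags are stored against names (as the issue proposes) and interprets "a different shape" as a changed column set, which is an assumption; the function names and sample data are hypothetical and not taken from Atlas.

```python
def find_ghost_mappings(name_tags, live_names):
    """Names that still carry tags although no live asset has that name."""
    return {name: tags for name, tags in name_tags.items()
            if tags and name not in live_names}


def find_out_of_sync(column_tags, live_columns):
    """Column-level tags pointing at columns the re-created table lacks."""
    stale = {}
    for table, cols in column_tags.items():
        if table not in live_columns:
            continue  # no live table at all: that is a ghost mapping instead
        missing = {c: t for c, t in cols.items() if c not in live_columns[table]}
        if missing:
            stale[table] = missing
    return stale


# Hypothetical state: TableA was dropped, TableB was re-created without "ssn".
name_tags = {"default.TableA": {"Tag1"}, "default.TableB": {"Tag2"}}
live_names = {"default.TableB"}
column_tags = {"default.TableB": {"ssn": "PII", "city": "customer data"}}
live_columns = {"default.TableB": {"city", "zip"}}

ghosts = find_ghost_mappings(name_tags, live_names)
stale = find_out_of_sync(column_tags, live_columns)
```

A design that gives tags their own lifetime would need some equivalent of these checks, whether run periodically or surfaced to administrators for manual cleanup.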