[ https://issues.apache.org/jira/browse/ATLAS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493356#comment-15493356 ]
Dan Markwat edited comment on ATLAS-1161 at 9/15/16 1:46 PM:
-------------------------------------------------------------

[~davidrad], yes, this is very much involved in large ETL pipelines. Other times, it may be that someone simply needs to throw the table away - as part of debugging, trying new things, break-fix, etc.

In any event, within a short span of time (days to months), and for anyone using Hive as a data warehouse or shared, structured storage tool in any way, what I've seen so far is that nobody is likely to create *and tag* a table within a database with the same name as a table from a few days prior without the two tables being closely related in some way (but I could totally be wrong with this assertion). This is certainly true for the use cases I've been exposed to in data warehousing type setups for Hive. This may not be true in all cases, but it lends support to the idea that we at least need some way of keeping these tags applied to named objects after the object has been dropped and recreated, and of putting the responsibility for keeping these tags up to date in the hands of a user. Does that sound right to you?

Having the ETL jobs be responsible for tagging appears to defeat the perceived purpose of Atlas, which would be to persist the tags and become a book of record for them. (This could also be a perception problem or a misunderstanding of the purpose of tagging on our part!) If a job is responsible for tagging, all jobs collectively become the definitive record of what is tagged, whereas Atlas merely reports that at one time or another something was tagged, which is not quite what we're looking for... but I'm open to suggestions about alternate ways to do this!

Trying to play devil's advocate for myself and maybe see if what we're doing is a better fit elsewhere, I had a tangential thought:
- the business catalog/taxonomy: how is that structured? Do those remain tagged to new incarnations of objects after the originally-tagged object is dropped?
Do they get pushed to Ranger as "tags" (or can they be, if they do persist)?

[~shwethags], yup, I know the tags are kept and there are only "soft deletes", with new entities and guids being created every time; but the people I work with and I are interested in a notion of more existential tagging that isn't tied to a (potentially) volatile object. We are also trying to find a unified path to auditing anything that shares a name across multiple incarnations, which Atlas currently doesn't appear to have. We would query for all the objects named a certain thing, then audit them individually. Maybe that's the only way there ever will be to do it? Do you have other suggestions?

And yes, rule-based classification (similar to Ranger) would likely be totally acceptable; whatever decouples the tagging from the lifecycle of the object. I do expect there are some people who want to use tagging in a more volatile manner, but from what I've seen, totally user-driven tagging/untagging might work for most use cases. And could it be that anyone needing a fresh slate for every incarnation of an object could simply "reset" the tags applied to it (perhaps?) and then proceed as they currently would?


was (Author: dmarkwat):
[~davidrad], yes this is involved very much in large ETL pipelines. Other times, it may be that someone simply needs to throw the table away - as part of debugging, trying new things, break-fix, etc.

In any event within a short span of time (days to months) - for anyone using Hive as a data warehouse or shared, structured storage tool in any way - nobody will likely be creating *and tagging* a table within a database with the same name as a table a few days prior without the two tables being closely related in some way (as far as I can tell). This is certainly true for the use cases I've been exposed to in data warehousing type setups for Hive.
This may not be true in all cases (but maybe it mostly is?), but it lends to the idea that we at least need some way of keeping these tags applied to named objects after the object has been dropped and recreated, and putting the responsibility of keeping these tags up to date in the hands of a user.

Having the ETL jobs be responsible for tagging appears to defeat the perceived purpose of Atlas which would be to persist the tags and become a book of record for them. (This could also be a perception problem or misunderstanding of the purpose of tagging on our part!) If a job is responsible for tagging, all jobs collectively become the definitive record of what is tagged whereas Atlas merely reports that at one time or another something was tagged.

Trying to play devil's advocate for myself and maybe see if what we're doing is a better fit elsewhere, I had a tangential thought:
- the business catalog/taxonomy: how is that structured? Do those remain tagged to new incarnations of objects after the originally-tagged object is dropped? Do they get pushed to Ranger as "tags" (/ can they be if they do persist?)

[~shwethags], yup I know the tags are kept and there are only "soft deletes" with new entities and guids being created every time; but myself and the people I work with are interested in a notion of more existential tagging that isn't tied to a (potentially) volatile object. We are also trying to find a unified path to auditing anything that shares a name across multiple incarnations which Atlas currently doesn't appear to have. We would query for all the objects named a certain thing then audit them individually. Maybe that's the only way there ever will be to do it? Do you have other suggestions?

And yes, rule-based classification (similar to Ranger) would likely be totally acceptable; whatever decouples the tagging from the lifecycle of the object.
I do expect there are some people who want to use tagging in a more volatile manner, but from what I've seen totally user-driven tagging/untagging might work for most use cases. And could it be that anyone needing a fresh slate for every incarnation of an object could simply "reset" the tags applied to it (perhaps?) and then proceed as they currently would?


> Tags should be bound to an object's name and remain bound to all incarnations of that name
> ------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-1161
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1161
>             Project: Atlas
>          Issue Type: Improvement
>    Affects Versions: trunk, 0.7-incubating
>            Reporter: Dan Markwat
>
> As a user I would like tags I ascribe to an object in Atlas to carry over to
> the next incarnation of that object. In effect, tags would be ascribed to a
> fully-qualified object name, and the tags would apply to all incarnations of
> that name. (Not unlike Ranger and the way it applies policies to objects.)
>
> Example: I create a Hive table, TableA. I tag TableA with tags, Tag1 and
> Tag2. I drop TableA.
>
> In the current Atlas world, if I create TableA again, Tag1 and Tag2 need to
> be re-applied to TableA. In the ideal governance/security world, if I
> re-create TableA I should not have to re-tag it with Tag1 and Tag2; instead,
> I should be required to *untag* TableA if I desire TableA to be clean and
> untagged. This effectively functions like a light switch: user turns on
> light, just because the bulb is swapped out doesn't mean the switch turned
> off - the user must explicitly turn the switch off, just as they did to turn
> it on. Think also about Ranger: just because I deleted an object doesn't
> mean that policy goes away.
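The TableA example and the light-switch analogy above can be illustrated with a small in-memory model. This is only a sketch of the requested semantics, not how Atlas stores or exposes tags: the class `NameBoundTagStore`, its methods, and the qualified name `default.TableA` are all hypothetical and exist purely for illustration.

```python
import uuid
from collections import defaultdict


class NameBoundTagStore:
    """Toy model (hypothetical, not an Atlas API): tags bind to a
    qualified name instead of to one entity GUID."""

    def __init__(self):
        self._tags = defaultdict(set)       # qualified name -> tags
        self._live = {}                     # qualified name -> current GUID
        self._history = defaultdict(list)   # qualified name -> every GUID created

    def create(self, name):
        # Each incarnation gets a fresh GUID, as with soft deletes today.
        guid = str(uuid.uuid4())
        self._live[name] = guid
        self._history[name].append(guid)
        return guid

    def drop(self, name):
        # Only the incarnation goes away; the name-to-tag binding stays.
        self._live.pop(name, None)

    def tag(self, name, *tags):
        self._tags[name].update(tags)

    def untag(self, name, *tags):
        # The explicit "switch off" step from the light-switch analogy.
        self._tags[name].difference_update(tags)

    def tags_for(self, name):
        return set(self._tags.get(name, ()))

    def incarnations(self, name):
        # All GUIDs that ever carried this name, useful for auditing by name.
        return list(self._history.get(name, ()))


store = NameBoundTagStore()
first = store.create("default.TableA")
store.tag("default.TableA", "Tag1", "Tag2")
store.drop("default.TableA")
second = store.create("default.TableA")

# The new incarnation has a different GUID but the same tags.
print(first != second, sorted(store.tags_for("default.TableA")))
```

In this model a clean, untagged TableA is only obtained by calling `untag` explicitly, and `incarnations` gives one place to look up every entity that has carried the name.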
> By effectively deleting the binding of Tag1 and Tag2 to the name TableA
> whenever TableA is deleted, Atlas ceases to be a book of record for tags
> associated with TableA, as those tags would need to be applied again. This
> is bad in a world where creating/dropping objects and tagging objects are
> part of 2 independent and asynchronous processes - one carried out by an
> engineer, the other carried out by a governance/security administrator. The
> issue is compounded by the fact that tags can have security policies
> associated with them in Ranger; any object missing its tag at re-creation
> is now also missing the security policies previously attached to it.
>
> This is an especially annoying issue for organizations that have large
> ingestion pipelines where tables are sometimes deleted or modified in ways
> not easily accomplished through updating table metadata. Not to mention
> (probably a new feature:) easily-accessible records of what was tagged with
> what - even if the object has been dropped or deleted - are especially
> important for organizations that require auditing or have security controls
> based on tag-based policies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)