[ https://issues.apache.org/jira/browse/ATLAS-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162871#comment-15162871 ]
Hemanth Yamijala commented on ATLAS-512:
----------------------------------------

There are a couple of ways of doing this. I am listing the approaches here with their pros and cons to get feedback:

*Option 1: Move model registration out of the hooks and into an independent tool / script*
* Mechanics:
** Every integrating component will provide an implementation that supplies the serialized model and a 'signature type'. A utility in Atlas will take these and call the create type API.
** This utility will encode the current logic of checking for a type before registering it. The 'signature type' is used for this purpose.
** This utility should be called before any entity creation happens from the hooks - essentially a setup step for Atlas. There are a couple of ways of arranging this as well.
* Pros:
** The chief advantage is that model registration becomes a one-time activity done in a controlled environment.
** Because we are using an API, we get feedback on the success or failure of the model registration and have the chance to act on it.
** In the shorter term, this means fewer changes to the Atlas system. In particular, it can be set up to not touch the Atlas server side at all.
* Cons:
** Depending on the implementation, there is a possibility that this registration does not happen (mostly due to human error), and entity creates/updates could then fail because of that.
** Depends on the Atlas server running for this to work. (In its defense, this is a one-time activity.)

*Option 2: Write type creations through Kafka from the hooks, instead of the API*
* Mechanics:
** A hook will send type creations as notifications to the ATLAS_HOOK topic of Atlas.
** The hook will not check whether the type is already registered. This implies client-side hooks like Storm would write it multiple times (unless they maintain state independently).
** The hook consumer of Atlas should be extended to process this new kind of message (like TYPE_CREATE, which is already there). The consumer should check whether the type is already registered; if so, it should not act on the message - a log entry would be helpful to audit such an event. Otherwise, it calls the create type API.
* Pros:
** This retains the spirit of hooks auto-registering models, so the chance of errors is minimized.
** We remove the dependency on the Atlas server for everything.
* Cons:
** We cannot give feedback on type registration in this mechanism.
** Since client-side hooks (which don't maintain state) can write types multiple times, there could be some load on Kafka. For things like the Hive CLI, this could be a non-trivial impact, but certainly not something that Kafka cannot handle - it just feels wasteful to do so.
** This change is more intrusive, as the Atlas server will need some modifications for this to work.

Personally, moving to Kafka seems like it will work well, chiefly because it retains the current spirit of auto-registration and does not introduce a setup step that could become error-prone for users managing Atlas. The only concern is client components writing the type definitions multiple times. If this really becomes an issue, we may need the clients to maintain state in some persistent store. Thoughts from others?
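To make the check-before-register step in Option 1 concrete, here is a minimal sketch of the utility's core logic. All names (`TypeRegistrationUtil`, `registerIfAbsent`) are hypothetical, and the in-memory map stands in for the Atlas typesystem; a real utility would query and call the create type API instead:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the Option 1 setup utility (illustrative names, not actual Atlas APIs).
// Each integrating component supplies a serialized model plus a 'signature type';
// registration happens only when the signature type is absent, so rerunning the
// tool is harmless.
public class TypeRegistrationUtil {
    // Stand-in for the Atlas typesystem; a real utility would hit the
    // create type API over HTTP rather than write to a map.
    private final Map<String, String> registeredTypes = new HashMap<>();

    /** Registers the serialized model if its signature type is not yet known.
     *  Returns true only when a registration actually happened. */
    public boolean registerIfAbsent(String signatureType, String serializedModel) {
        if (registeredTypes.containsKey(signatureType)) {
            return false; // already registered: skip, encoding today's check
        }
        registeredTypes.put(signatureType, serializedModel);
        return true;
    }

    public boolean isRegistered(String signatureType) {
        return registeredTypes.containsKey(signatureType);
    }
}
```

Because the return value reports whether anything was created, the surrounding script can surface success or failure to the operator - the feedback advantage noted under the pros.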
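The server-side half of Option 2 can be sketched similarly. This is not the actual Atlas hook consumer; the class and method names are illustrative, and the set stands in for a typesystem lookup. The point is the dedup-and-audit behavior: a stateless client-side hook resending a type definition on every run costs at most a log line:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the extended hook consumer for Option 2 (hypothetical names).
// TYPE_CREATE-style messages arriving on ATLAS_HOOK are checked against the
// already-registered types before the create type API is invoked.
public class TypeCreateConsumer {
    private final Set<String> knownTypes = new HashSet<>();

    /** Handles one type-creation message from the ATLAS_HOOK topic.
     *  Returns true if a create type call was actually made. */
    public boolean onTypeCreate(String typeName, String typeDefinition) {
        if (knownTypes.contains(typeName)) {
            // Duplicate from a stateless client-side hook (e.g. Hive CLI):
            // log it for auditing and drop the message.
            System.out.println("Skipping already-registered type: " + typeName);
            return false;
        }
        // A real consumer would call the create type API here.
        knownTypes.add(typeName);
        return true;
    }
}
```

Note that the hook side sends fire-and-forget notifications, which is exactly why this path cannot return registration feedback to the client - the con listed above.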
> Decouple currently integrating components from availability of Atlas service for raising metadata events
> ---------------------------------------------------------------------------------------------------------
>
> Key: ATLAS-512
> URL: https://issues.apache.org/jira/browse/ATLAS-512
> Project: Atlas
> Issue Type: Sub-task
> Reporter: Hemanth Yamijala
> Assignee: Hemanth Yamijala
>
> The components that currently integrate with Atlas (Hive, Sqoop, Falcon, Storm) all communicate their metadata events using Kafka as a messaging layer. This effectively decouples these components from the Atlas server. However, all of these components have some initialization that checks if their respective models are registered with Atlas. For components that integrate on the server, like HiveServer2 and Falcon, this initialization is a one-time check and hence is manageable. Others, like Sqoop, Storm and the Hive CLI, are client-side components, so the initialization happens for every run or session of these components. Invoking the initialization (and the one-time check) every time like this effectively means that the Atlas server should always be available.
> This JIRA is to try and remove this dependency and thus truly decouple these components.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)