Hello,

I’d like to propose that SQL based authorization (or something similar) also be applied and enforced in the metastore service as part of the initiative to extract HMS as an independent project. While any such implementation cannot be ‘system complete’ like HiveServer2 (HS2), since HMS has no scope to intercept operations applied to table data, only metadata, it would be a significant step forward for controlling the operations that can be actioned by the many non-HS2 clients in the Hive ecosystem.
I believe this is a good time to consider this option as there is currently much discussion in the Hive community on the future directions of HMS, and greater recognition that HMS is now seen as general data platform infrastructure and not simply an internal Hive component. Further details are below. I’d be grateful for any feedback, thoughts, and suggestions on how this could move forward.

*Problem*

At this time, Hive’s SQL based authorization feature is the recommended approach for controlling which operations may be performed on what by whom. This feature is applied in the HS2 component. However, a large number of platforms that integrate with Hive do not do so via HS2, instead talking to the metastore service directly and so bypassing authorization. Because they circumvent the authorization logic in HS2, they can perform destructive operations such as a table drop even though the permissions declared in the metastore may explicitly forbid it.

In short, there seems to be a lack of encapsulation of authorization in the metastore; HMS owns the metadata, is responsible for performing actions on metadata and for maintaining permissions on what actions are permissible by whom, and yet has no means to use the information it has to protect the data it owns.

*Workarounds*

Common workarounds to this deficiency include falling back to storage based authorization or running read only metastore instances. However, both of these approaches have significant drawbacks:

- Storage (file) based auth does not function when using object stores such as S3 and so is not usable in cloud deployments of Hive, a pattern that is seeing significant growth.
- Read only metastores incur significant infrastructure and operational overheads, requiring a separate set of server instances, while delivering little functionality and blunt authorization capabilities. You cannot, for example, restrict a particular operation type, by a certain user, on a specific table. In effect, you are blocking all writes by directing different user groups to different network endpoints.

*Anti-patterns*

It might be tempting to simply suggest using HS2 for all access to Hive data. However, while this is conceptually appealing, it’s not practical to apply on large, rich, and diverse data platforms where tool interoperability and broad compatibility are required. Additionally, it can be argued that the API exposed by HS2, while useful for analytical tools, is not fit for use by large ETL processes; for example, using a “SELECT *” over JDBC as a source for a large Spark job doesn’t scale.

*High level implementation notes*

I believe that the HMS requires little (if any) refactoring to support the implementation of SQL based auth in the metastore. It currently maintains all of the necessary metadata that describes the authorization rules that should be applied. It also has access to the principal wishing to perform a certain action via the UGI mechanism. Finally, there is an existing hook mechanism to intercept metadata operations and apply authorization.

In deployments that use HS2 exclusively, the proposed metastore resident SQL based auth could either be disabled or used harmlessly in conjunction with the HS2 implementation.

Thanks,

Elliot.

Elliot West
Senior Engineer
Data Platform Team
Hotels.com
