Hello,

I’d like to propose that SQL based authorization (or something similar)
also be applied and enforced in the metastore service as part of the
initiative to extract HMS as an independent project. While any such
implementation cannot be ’system complete’ like HiveServer2 (HS2) (HMS has
no scope to intercept operations applied to table data, only metadata), it
would be a significant step forward for controlling the operations that can
be actioned by the many non-HS2 clients in the Hive ecosystem.

I believe this is a good time to consider this option as there is currently
much discussion in the Hive community on the future directions of HMS and
growing recognition that HMS is general data platform infrastructure and
not simply an internal Hive component.

Further details are below. I’d be grateful for any feedback, thoughts, and
suggestions on how this could move forward.

*Problem*
At this time, Hive’s SQL based authorization feature is the recommended
approach for controlling which operations may be performed on what by whom.
This feature is applied in the HS2 component. However, a large number of
platforms that integrate with Hive do not do so via HS2; they talk to the
metastore service directly and so bypass authorization. Because they
circumvent the authorization logic in HS2, they can perform destructive
operations such as a table drop even when the permissions declared in the
metastore explicitly forbid it.

In short, there seems to be a lack of encapsulation with authorization in
the metastore; HMS owns the metadata, is responsible for performing actions
on metadata, for maintaining permissions on what actions are permissible by
whom, and yet has no means to use the information it has to protect the
data it owns.

*Workarounds*
Common workarounds to this deficiency include falling back to storage based
authorization or running read only metastore instances. However, both of
these approaches have significant drawbacks:

   - Storage based auth does not function when using object stores such as
   S3 and so is not usable in cloud deployments of Hive, a pattern that is
   seeing significant growth.
   - Read only metastores incur significant infrastructure and operational
   overheads, requiring a separate set of server instances, while delivering
   little functionality and blunt authorization capabilities. You cannot,
   for example, restrict a particular operation type, by a certain user, on
   a specific table. You are simply blocking all writes by directing
   different user groups to different network endpoints.

*Anti-patterns*
It might be tempting to simply suggest using HS2 for all access to Hive
data. However, while this is conceptually appealing, it’s not practical to
apply on large, rich, and diverse data platforms where tool
interoperability and broad compatibility are required. Additionally, it can
be argued that the API exposed by HS2, while useful for analytical tools,
is not fit for use by large ETL processes. For example, using a “SELECT *”
over JDBC as a source for a large Spark job doesn’t scale.

*High level implementation notes*
I believe that the HMS requires little (if any) refactoring to support the
implementation of SQL based auth in the metastore. It currently maintains
all of the necessary metadata that describes the authorization rules that
should be applied. It also has access to the principal wishing to perform a
certain action via the UGI mechanism. Finally, there is an existing hook
mechanism to intercept metadata operations and apply authorization.
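
To make the three ingredients above concrete (stored grants, a known
principal, and a hook that runs before each metadata operation), here is a
minimal, self-contained sketch. It is illustrative only: every type and
method name in it (PrivilegeStore, AuthorizingPreHook, beforeOperation, and
so on) is hypothetical and not part of the Hive codebase. A real
implementation would read grants from the metastore's own tables, take the
principal from UGI, and plug into the existing hook mechanism.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these types are Hive classes. */
public class MetastoreAuthSketch {

    /** Illustrative subset of metadata operations a hook might see. */
    enum Operation { CREATE_TABLE, ALTER_TABLE, DROP_TABLE, READ_TABLE }

    /** Raised when the caller lacks the privilege for an operation. */
    static class AuthorizationException extends RuntimeException {
        AuthorizationException(String message) { super(message); }
    }

    /** Stand-in for the grants the metastore already persists. */
    static class PrivilegeStore {
        // Key is "principal|table"; value is the set of granted operations.
        private final Map<String, Set<Operation>> grants = new HashMap<>();

        void grant(String principal, String table, Operation op) {
            grants.computeIfAbsent(principal + "|" + table,
                    k -> EnumSet.noneOf(Operation.class)).add(op);
        }

        boolean isAllowed(String principal, String table, Operation op) {
            Set<Operation> ops = grants.get(principal + "|" + table);
            return ops != null && ops.contains(op);
        }
    }

    /** Stand-in for a hook invoked before a metadata operation is applied. */
    static class AuthorizingPreHook {
        private final PrivilegeStore store;

        AuthorizingPreHook(PrivilegeStore store) { this.store = store; }

        /** Throws if the principal holds no grant for op on the table. */
        void beforeOperation(String principal, String table, Operation op) {
            if (!store.isAllowed(principal, table, op)) {
                throw new AuthorizationException(
                        principal + " may not " + op + " on " + table);
            }
        }
    }

    public static void main(String[] args) {
        PrivilegeStore store = new PrivilegeStore();
        store.grant("etl_user", "sales.orders", Operation.ALTER_TABLE);
        AuthorizingPreHook hook = new AuthorizingPreHook(store);

        // Granted: passes through the hook without an exception.
        hook.beforeOperation("etl_user", "sales.orders", Operation.ALTER_TABLE);
        System.out.println("alter permitted");

        // Not granted: the hook rejects the drop before it is applied.
        try {
            hook.beforeOperation("etl_user", "sales.orders", Operation.DROP_TABLE);
        } catch (AuthorizationException e) {
            System.out.println("denied: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that the check is keyed on operation type, user,
and table together, which is exactly the granularity that a read only
metastore cannot offer.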

In deployments that use HS2 exclusively, the proposed metastore resident
SQL based auth could either be disabled or used harmlessly in conjunction
with the HS2 implementation.

Thanks,

Elliot.

Elliot West
Senior Engineer
Data Platform Team
Hotels.com
