haizhou-zhao commented on code in PR #6045:
URL: https://github.com/apache/iceberg/pull/6045#discussion_r1005069022
##########
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##########
@@ -422,11 +425,22 @@ private Table newHmsTable(TableMetadata metadata) {
Preconditions.checkNotNull(metadata, "'metadata' parameter can't be null");
final long currentTimeMillis = System.currentTimeMillis();
+ String defaultUser;
+ // default ownership is determined from Hadoop.UGI.currentUser
+ // and the owner type is default to individual user
+ try {
+ // getShortUserName will remove the realm part of the whole username
+ defaultUser = UserGroupInformation.getCurrentUser().getShortUserName();
Review Comment:
I see there're some similar comments on UGI, so I'll just put my
reply-to-all here.
To the best of my understanding, most production Hive Metastore is protected
with AuthN mechanism. For hive2&3, Kerberos is the only choice; for hive4,
token exchange is supported. I'm not familiar with the new AuthN in hive4, and
it is not used by Iceberg, so I'm focusing on Kerberos in this case,
specifically Hive Metastore server with Kerberos enabled.
Context/Assumption
1. `System.getProperty("user.name")` is OS username. OS security and
Kerberos realm (KDC) recognizes different set of users/principals, so let's
assume your OS username is not going to be recognizable by Kerberos KDC.
2. If you are connecting to Hive Metastore in unsecured mod, then you do not
need to do kerberos init as in `UGI.loginUser()`. In this case,
`UGI.getCurrentUser()` will default to OS username (aka.
`System.getProperty("user.name")`) or whatever your environment variable
`HADOOP_USER_NAME` is set to. For now, let's talk about production Hive
Metastore cluster with Kerberos enabled and set aside the unsecured mode is out
of scope. The unsecured mode is still important, I'll come back to it later.
3. This is my understanding of the life cycle of IcebergHiveCatalog:
a. user code: `UGI.loginUser()` (this is like kinit, that initialize
your initial ticket granting ticket with Kerberos KDC)
b. user code: call `CatalogUtil.loadCatalog(HiveCatalog.class,
'myCatalog', prop, hadooConf)` (this creates a new IcebergHiveCatalog object)
c. under the hood, Iceberg code will init Hive Metastore Client during
init of IcebergHiveCatalog object. Since we are talking about connecting to a
kerberized Hive Metastore server, this will fail if user do not call
`UGI.loginUser()` preemptively
d. users call things like `createNamespace` or `createTable` with the
initialized IcebergHiveMetastore
Now, the original way of registering OS username as the default owner of
iceberg table with Hive won't cause problem even when connecting to a secured
Hive Metastore server, because for Hive, you can set your owner to any string.
**The real problem with using OS username comes with AuthZ**. And here,
since the only option of AuthN with hive2&hive3 is kerberos, let's assume the
AuthZ policies on Hive Metastore ties privileges to Kerberos User Principals
(aka. identities that is only recognizable to Kerberos KDC, not local OS
security).
I would guess the intention of the original author of this code is: "if the
user of the Iceberg API does not want to set any ownership informations on
create, then by default the user running this request should be the owner of
the resource created. And later this user has get admin access to this
particular db/table". However, treating the OS user of the current process as
owner is not well hooked with Kerberos Realm and subsequently, the
AuthZ/Privilege system that depends on Kerberos Principals. In a sense, the
practice of setting OS user as the owner is similar to setting the owner of the
resource to any string that is not recognizable as an identity by the Kerberos
Realm.
**I saw there's comment regarding the performance** of Hadoop
`UGI.getCurrentUser`. My understanding: `UGI.getCurrentUser` conceptually
should not be more time consuming than `System.getProperty("user.name")`,
because before you call `UGI.getCurrentUser`, you've already done with kerberos
init (getting your initial Ticket Granting Ticket), or otherwise you cannot
even connect to Hive Metastore. It is when you invoke `UGI.loginUser`, there's
nasty business of getting TGT and doing things like sha256 decryption and
encryption under the hood. Since kerberos cache only needs to be updated every
time your cache expires (likely your TGT has 24 hours of validity or more),
overall, it's not taking too much toll on your system.
(under the hood, HADOOP UGI still depends on some javax.* or sun.* packages
for Kerberos related things, so the exact encryption/decryption implementation
may differ based on your JDK version)
**Now back to the case you are just connecting to an unsecured Hive
Metastore** in your PoC or testing environment. In that case, you would not
call `UGI.loginUser` beforehand, then when calling `UGI.getCurrentUser`, it's
essentially equivalent of calling `System.getProperty("user.name")`.
Effectively, I understand `UGI.getCurrentUser` as doing this for me:
```
if (connecting to a secured Hive Metastore) {
get the current kerberos cache, find the principal of the user currently
logging in
}
else { // not connecting to a secured Hive Metastore, not messing around
with Kerberos
get the OS user that is running this process, or fetch environment
variable 'HADOOP_USER_NAME'
}
```
(Again, this implementation that Hadoop depends on is deep down inside
javax.security.* packages, may differ as you use different JDKs. But I think
that's a good thing because your get free updates from JDK regarding all these
security things)
In that regard, I think `UGI.getCurrentUser` is more robust and
comprehensive than `System.getProperty("user.name")`.
**Finally, how reliable is `UGI.getCurrentUser`? When does it through
`IOException`s?** I can only say I personally have not encountered issue with
`UGI.getCurrentUser`, and the Hadoop's UGI utility is actually widely used.
However, when I tried to search around, bugs like this do turn up against UGI
utilities: https://issues.apache.org/jira/browse/HADOOP-13805. Though I still
want to say, if you are not dealing with a secured Hive Metastore or Kerberos
(which is likely the status quo of the current Iceberg code), and only seeking
to get the current OS user, I see very low chance that this can fail.
To summarize, I believe it's possible to not change the underlying Iceberg
API code, and ask user of Iceberg API to handle authN and authZ by explicitly
setting the table attributes related to ownership. However, I think we should
maintain the principle that "if the user of the Iceberg API does not want to
set any ownership informations on create, then by default the current user
running this request should be the owner of the resource created". And it would
be a good user experience if we can figure out the correct "current user
running this request".
Now I'm not gonna pretend I'm an expert of Hadoop or Kerberos, the above are
to the best of my knowledge, maybe not entirely correct. So definitely welcome
HADOOP/Kerberos expert from the community to correct any of my words above, and
that also includes you (@gaborkaszab) as well! Let me know what you think.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]