haizhou-zhao commented on code in PR #6045:
URL: https://github.com/apache/iceberg/pull/6045#discussion_r1005069022


##########
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##########
@@ -422,11 +425,22 @@ private Table newHmsTable(TableMetadata metadata) {
     Preconditions.checkNotNull(metadata, "'metadata' parameter can't be null");
     final long currentTimeMillis = System.currentTimeMillis();
 
+    String defaultUser;
+    // default ownership is determined from Hadoop.UGI.currentUser
+    // and the owner type is default to individual user
+    try {
+      // getShortUserName will remove the realm part of the whole username
+      defaultUser = UserGroupInformation.getCurrentUser().getShortUserName();

Review Comment:
   I see there're some similar comments on UGI, so I'll just put my 
reply-to-all here.
   
   To the best of my understanding, most production Hive Metastore is protected 
with AuthN mechanism. For hive2&3, Kerberos is the only choice; for hive4, 
token exchange is supported. I'm not familiar with the new AuthN in hive4, and 
it is not used by Iceberg, so I'm focusing on Kerberos in this case, 
specifically Hive Metastore server with Kerberos enabled.
   
   Context/Assumption
   1. `System.getProperty("user.name")` is OS username. OS security and 
Kerberos realm (KDC) recognizes different set of users/principals, so let's 
assume your OS username is not going to be recognizable by Kerberos KDC.
   2. If you are connecting to Hive Metastore in unsecured mod, then you do not 
need to do kerberos init as in `UGI.loginUser()`. In this case, 
`UGI.getCurrentUser()` will default to OS username (aka. 
`System.getProperty("user.name")`) or whatever your environment variable 
`HADOOP_USER_NAME` is set to. For now, let's talk about production Hive 
Metastore cluster with Kerberos enabled and set aside the unsecured mode is out 
of scope. The unsecured mode is still important, I'll come back to it later.
   3. This is my understanding of the life cycle of IcebergHiveCatalog:
   
       a. user code: `UGI.loginUser()` (this is like kinit, that initialize 
your initial ticket granting ticket with Kerberos KDC)
       b. user code: call `CatalogUtil.loadCatalog(HiveCatalog.class, 
'myCatalog', prop, hadooConf)` (this creates a new IcebergHiveCatalog object)
       c. under the hood, Iceberg code will init Hive Metastore Client during 
init of IcebergHiveCatalog object. Since we are talking about connecting to a 
kerberized Hive Metastore server, this will fail if user do not call 
`UGI.loginUser()` preemptively
       d. users call things like `createNamespace` or `createTable` with the 
initialized IcebergHiveMetastore
   
   Now, the original way of registering OS username as the default owner of 
iceberg table with Hive won't cause problem even when connecting to a secured 
Hive Metastore server, because for Hive, you can set your owner to any string.
   
   **The real problem with using OS username comes with AuthZ**. And here, 
since the only option of AuthN with hive2&hive3 is kerberos, let's assume the 
AuthZ policies on Hive Metastore ties privileges to Kerberos User Principals 
(aka. identities that is only recognizable to Kerberos KDC, not local OS 
security).
   
   I would guess the intention of the original author of this code is: "if the 
user of the Iceberg API does not want to set any ownership informations on 
create, then by default the user running this request should be the owner of 
the resource created. And later this user has get admin access to this 
particular db/table". However, treating the OS user of the current process as 
owner is not well hooked with Kerberos Realm and subsequently, the 
AuthZ/Privilege system that depends on Kerberos Principals. In a sense, the 
practice of setting OS user as the owner is similar to setting the owner of the 
resource to any string that is not recognizable as an identity by the Kerberos 
Realm.
   
   **I saw there's comment regarding the performance** of Hadoop 
`UGI.getCurrentUser`. My understanding: `UGI.getCurrentUser` conceptually 
should not be more time consuming than `System.getProperty("user.name")`, 
because before you call `UGI.getCurrentUser`, you've already done with kerberos 
init (getting your initial Ticket Granting Ticket), or otherwise you cannot 
even connect to Hive Metastore. It is when you invoke `UGI.loginUser`, there's 
nasty business of getting TGT and doing things like sha256 decryption and 
encryption under the hood. Since kerberos cache only needs to be updated every 
time your cache expires (likely your TGT has 24 hours of validity or more), 
overall, it's not taking too much toll on your system.
   
   (under the hood, HADOOP UGI still depends on some javax.* or sun.* packages 
for Kerberos related things, so the exact encryption/decryption implementation 
may differ based on your JDK version)
   
   **Now back to the case you are just connecting to an unsecured Hive 
Metastore** in your PoC or testing environment. In that case, you would not 
call `UGI.loginUser` beforehand, then when calling `UGI.getCurrentUser`, it's 
essentially equivalent of calling `System.getProperty("user.name")`. 
Effectively, I understand `UGI.getCurrentUser` as doing this for me:
   ```
   if (connecting to a secured Hive Metastore) {
       get the current kerberos cache, find the principal of the user currently 
logging in
   }
   else { // not connecting to a secured Hive Metastore, not messing around 
with Kerberos
       get the OS user that is running this process, or fetch environment 
variable 'HADOOP_USER_NAME'
   }
   ```
   (Again, this implementation that Hadoop depends on is deep down inside 
javax.security.* packages, may differ as you use different JDKs. But I think 
that's a good thing because your get free updates from JDK regarding all these 
security things)
   
   In that regard, I think `UGI.getCurrentUser` is more robust and 
comprehensive than `System.getProperty("user.name")`.
   
   **Finally, how reliable is `UGI.getCurrentUser`? When does it through 
`IOException`s?** I can only say I personally have not encountered issue with 
`UGI.getCurrentUser`, and the Hadoop's UGI utility is actually widely used. 
However, when I tried to search around, bugs like this do turn up against UGI 
utilities: https://issues.apache.org/jira/browse/HADOOP-13805. Though I still 
want to say, if you are not dealing with a secured Hive Metastore or Kerberos 
(which is likely the status quo of the current Iceberg code), and only seeking 
to get the current OS user, I see very low chance that this can fail.
   
   To summarize, I believe it's possible to not change the underlying Iceberg 
API code, and ask user of Iceberg API to handle authN and authZ by explicitly 
setting the table attributes related to ownership. However, I think we should 
maintain the principle that "if the user of the Iceberg API does not want to 
set any ownership informations on create, then by default the current user 
running this request should be the owner of the resource created". And it would 
be a good user experience if we can figure out the correct "current user 
running this request".
   
   Now I'm not gonna pretend I'm an expert of Hadoop or Kerberos, the above are 
to the best of my knowledge, maybe not entirely correct. So definitely welcome 
HADOOP/Kerberos expert from the community to correct any of my words above, and 
that also includes you (@gaborkaszab) as well! Let me know what you think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to