[ https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759201#comment-17759201 ]
YUBI LEE commented on SPARK-44976:
----------------------------------

https://github.com/apache/spark/pull/42690

> Utils.getCurrentUserName should return the full principal name
> --------------------------------------------------------------
>
>                 Key: SPARK-44976
>                 URL: https://issues.apache.org/jira/browse/SPARK-44976
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.3, 3.3.3, 3.4.1
>            Reporter: YUBI LEE
>            Priority: Major
>
> SPARK-6558 changed the behavior of {{Utils.getCurrentUserName()}} to use the
> short name instead of the full principal name.
> Because of this, it does not respect the {{hadoop.security.auth_to_local}}
> rules on the side of a non-kerberized HDFS namenode.
> For example, I use two HDFS clusters. One is kerberized, the other is not.
> On the non-kerberized cluster I added a rule that prefixes the username when
> someone accesses it from the kerberized cluster:
> {code}
> <property>
>   <name>hadoop.security.auth_to_local</name>
>   <value xml:space="preserve">
> RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
> RULE:[2:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
> DEFAULT</value>
> </property>
> {code}
> However, if I submit a Spark job with the keytab & principal options, the
> ownership of the HDFS directory and files is not coherent.
> (I changed some words for privacy.)
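The effect of the first rule above can be sketched with a small Python model. This is a simplified re-implementation of how an {{auth_to_local}} rule matches and rewrites a one-component principal, not the actual Hadoop code:

```python
import re

def apply_auth_to_local(principal: str) -> str:
    """Simplified sketch of the rule
    RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/  (not Hadoop's parser)."""
    # [1:$1@$0] applies to one-component principals and rewrites
    # "user@REALM" into the candidate string "user@REALM".
    m = re.fullmatch(r"([^/@]+)@([^/@]+)", principal)
    if not m:
        return principal  # DEFAULT: other names pass through (simplified)
    candidate = f"{m.group(1)}@{m.group(2)}"
    # (.*@EXAMPLE.COM) is the acceptance filter on the candidate.
    if re.fullmatch(r".*@EXAMPLE\.COM", candidate):
        # s/(.+)@.*/_ex_$1/ drops the realm and prepends "_ex_".
        return re.sub(r"(.+)@.*", r"_ex_\1", candidate)
    return principal

print(apply_auth_to_local("eub@EXAMPLE.COM"))  # -> _ex_eub
print(apply_auth_to_local("eub"))              # no realm: rule never fires
```

Note the second call: a name that has already been reduced to its short form carries no realm, so the rule cannot match it. This is exactly the situation the bug report describes.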
> {code}
> $ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23
> Found 52 items
> -rw-rw-rw-   3 _ex_eub hdfs         0 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/_SUCCESS
> -rw-r--r--   3 eub     hdfs 134418857 2023-05-11 00:15 hdfs:///user/eub/some/path/20230510/23/part-00000-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> -rw-r--r--   3 eub     hdfs 153410049 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00001-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> -rw-r--r--   3 eub     hdfs 157260989 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00002-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> -rw-r--r--   3 eub     hdfs 156222760 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00003-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> {code}
> Another interesting point: if I submit a Spark job without the keytab and
> principal options, but authenticate to Kerberos with {{kinit}}, it does not
> follow the {{hadoop.security.auth_to_local}} rules at all.
> {code}
> $ hdfs dfs -ls hdfs:///user/eub/output/
> Found 3 items
> -rw-rw-r--+  3 eub hdfs   0 2023-08-25 12:31 hdfs:///user/eub/output/_SUCCESS
> -rw-rw-r--+  3 eub hdfs 512 2023-08-25 12:31 hdfs:///user/eub/output/part-00000.gz
> -rw-rw-r--+  3 eub hdfs 574 2023-08-25 12:31 hdfs:///user/eub/output/part-00001.gz
> {code}
> I finally found that if I submit a Spark job with the {{--principal}} and
> {{--keytab}} options, the UGI will be different
> (refer to
> https://github.com/apache/spark/blob/2583bd2c16a335747895c0843f438d0966f47ecd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L905).
> Only the file ({{_SUCCESS}}) and the output directory created by the driver
> (application master side) respect {{hadoop.security.auth_to_local}} on the
> non-kerberized namenode, and only if the {{--principal}} and {{--keytab}}
> options are provided.
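The mismatch in the listings above can be sketched as a two-step model (simplified Python, not Spark or Hadoop code): the client side reduces the principal to its short name, so by the time the name reaches the remote non-kerberized namenode, the realm needed by its {{auth_to_local}} rule is already gone.

```python
import re

def short_user_name(principal: str) -> str:
    # Simplified model of the client-side short name: locally,
    # "eub@EXAMPLE.COM" has already been mapped to "eub".
    return principal.split("@", 1)[0]

def remote_auth_to_local(name: str) -> str:
    # Simplified model of the non-kerberized namenode's rule, which only
    # fires on names that still carry the @EXAMPLE.COM realm.
    if re.fullmatch(r"(.+)@EXAMPLE\.COM", name):
        return "_ex_" + name.split("@", 1)[0]
    return name  # DEFAULT: the name passes through unchanged

principal = "eub@EXAMPLE.COM"

# Executor side: the short name is shipped, the remote rule never matches.
print(remote_auth_to_local(short_user_name(principal)))  # -> eub

# What the report asks for: ship the full principal so the rule applies.
print(remote_auth_to_local(principal))                   # -> _ex_eub
```

This is why files written by executors show up as {{eub}} while the driver-side {{_SUCCESS}} file, created under a different UGI, shows up as {{_ex_eub}}.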
> No matter whether HDFS files or directories are created by an executor or by
> the driver, they should respect the {{hadoop.security.auth_to_local}} rules
> and should be the same.
> A workaround is to pass an additional argument to change {{SPARK_USER}} on
> the executor side, e.g. {{--conf spark.executorEnv.SPARK_USER=_ex_eub}}.
> {{--conf spark.yarn.appMasterEnv.SPARK_USER=_ex_eub}} will cause an error:
> there is logic that appends environment values with {{:}} (colon) as a
> separator.
> - https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L893
> - https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L52

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
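The colon-append behavior referenced by the two links can be sketched like this (a simplified Python model of the linked YARN client logic, which treats environment values as path-like lists; it assumes {{SPARK_USER}} is already present in the AM environment, which the reported error suggests):

```python
def add_path_to_environment(env: dict, key: str, value: str) -> None:
    # Simplified model of the linked YarnSparkHadoopUtil logic: an existing
    # value is extended with ':' as a separator rather than replaced.
    env[key] = env[key] + ":" + value if key in env else value

# SPARK_USER is assumed to be set already by the submission machinery.
env = {"SPARK_USER": "eub"}
add_path_to_environment(env, "SPARK_USER", "_ex_eub")
print(env["SPARK_USER"])  # -> eub:_ex_eub  (not a valid user name)
```

Under this model, the appMasterEnv override does not replace the user name but produces a colon-joined value, which would explain why the workaround only behaves on the executor side.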