[jira] [Updated] (SPARK-44976) Utils.getCurrentUserName should return the full principal name

2023-08-25 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-44976:
-
Description: 
SPARK-6558 changed the behavior of {{Utils.getCurrentUserName()}} to use the 
short name instead of the full principal name.
Because of this, the {{hadoop.security.auth_to_local}} rule on the side of a 
non-kerberized HDFS namenode is not respected.
For example, I use two HDFS clusters. One is kerberized and the other is not.
I made a rule that adds a prefix to the username on the non-kerberized cluster 
when someone accesses it from the kerberized cluster.


{code}
  
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
RULE:[2:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
DEFAULT
  </value>
</property>
  
{code}
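
To make the effect of the rule above concrete, here is a minimal Python sketch that imitates it. It is a simplified model and not Hadoop's own name-mapping code; the function name {{to_local}} and the example names are hypothetical. It shows why a full principal gets the {{_ex_}} prefix while a bare short name passes through untouched.

```python
import re

# Simplified model of the two RULE lines and DEFAULT from the config above.
# Each entry: (number of name components, match regex, sed pattern, replacement).
# The Java-style "$1" in the replacement is written as "\g<1>" for Python.
RULES = [
    (1, r".*@EXAMPLE.COM", r"(.+)@.*", r"_ex_\g<1>"),
    (2, r".*@EXAMPLE.COM", r"(.+)@.*", r"_ex_\g<1>"),
]


def to_local(name: str) -> str:
    """Map a user name to a local name under the simplified rules."""
    if "@" not in name:
        # A name without a realm gives the rules nothing to match,
        # so in this model it is used unchanged.
        return name
    primary, realm = name.split("@", 1)
    components = primary.split("/")
    for count, match, pattern, replacement in RULES:
        if len(components) != count:
            continue
        # The format "$1@$0" means: first component, "@", realm.
        base = f"{components[0]}@{realm}"
        if re.fullmatch(match, base):
            return re.sub(pattern, replacement, base, count=1)
    # DEFAULT, simplified: fall back to the first component.
    return components[0]


print(to_local("eub@EXAMPLE.COM"))  # full principal
print(to_local("eub"))              # short name only
```

In this model the full principal maps to {{_ex_eub}}, while the short name stays {{eub}}. That matches the mixed ownership described in this issue when one side sends the full principal and the other sends only the short name.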

However, if I submit a Spark job with the keytab and principal options, the 
ownership of the HDFS directory and files is inconsistent.

(I changed some words for privacy.)

{code}
$ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23
Found 52 items
-rw-rw-rw-   3 _ex_eub hdfs  0 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/_SUCCESS
-rw-r--r--   3 eub  hdfs  134418857 2023-05-11 00:15 hdfs:///user/eub/some/path/20230510/23/part-0-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r--   3 eub  hdfs  153410049 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-1-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r--   3 eub  hdfs  157260989 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-2-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r--   3 eub  hdfs  156222760 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-3-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
{code}

Another interesting point is that if I submit a Spark job without the keytab 
and principal options, authenticating with {{kinit}} instead, the 
{{hadoop.security.auth_to_local}} rule is not followed at all.

{code}
$ hdfs dfs -ls  hdfs:///user/eub/output/
Found 3 items
-rw-rw-r--+  3 eub hdfs    0 2023-08-25 12:31 hdfs:///user/eub/output/_SUCCESS
-rw-rw-r--+  3 eub hdfs  512 2023-08-25 12:31 hdfs:///user/eub/output/part-0.gz
-rw-rw-r--+  3 eub hdfs  574 2023-08-25 12:31 hdfs:///user/eub/output/part-1.gz
{code}


I finally found that if I submit a Spark job with the {{--principal}} and 
{{--keytab}} options, the UGI is different.
(Refer to 
https://github.com/apache/spark/blob/2583bd2c16a335747895c0843f438d0966f47ecd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L905.)

Only the file ({{_SUCCESS}}) and the output directory created by the driver 
(application master side) respect {{hadoop.security.auth_to_local}} on the 
non-kerberized namenode, and only if the {{--principal}} and {{--keytab}} 
options are provided.

Whether HDFS files or directories are created by an executor or by the driver, 
they should respect the {{hadoop.security.auth_to_local}} rule and have the 
same owner.


A workaround is to pass an additional argument that changes {{SPARK_USER}} on 
the executor side, 
e.g. {{--conf spark.executorEnv.SPARK_USER=_ex_eub}}

{{--conf spark.yarn.appMasterEnv.SPARK_USER=_ex_eub}} causes an error. There 
is logic that appends the environment value with {{:}} (colon) as a separator.

- https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L893
- https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L52
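
The failure with {{spark.yarn.appMasterEnv.SPARK_USER}} can be pictured with a small Python sketch. It assumes, as described above, that a variable which is already set is extended with a colon separator rather than overwritten. The function name {{add_to_environment}} and the initial value {{eub}} are hypothetical and are not taken from Spark's code.

```python
def add_to_environment(env: dict, key: str, value: str, sep: str = ":") -> None:
    """Append value to env[key] using sep if key is already set, else set it."""
    if key in env:
        env[key] = env[key] + sep + value
    else:
        env[key] = value


# Assumption: SPARK_USER already holds the short name before the
# user-supplied appMasterEnv entries are merged in.
am_env = {"SPARK_USER": "eub"}
add_to_environment(am_env, "SPARK_USER", "_ex_eub")
print(am_env["SPARK_USER"])  # prints "eub:_ex_eub"
```

Under that assumption the combined value is no longer a single user name, which is consistent with the error reported above.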



This issue is related to https://issues.apache.org/jira/browse/SPARK-6558.

