[ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maziyar PANAHI updated SPARK-26101:
-----------------------------------
Description:

Hello,

I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my Spark session (Zeppelin, shell, or spark-submit), my real username is impersonated successfully. That allows YARN to assign the job to the right queue based on the username, and HDFS enforces the correct permissions.

Example (running Spark as user `panahi`):

```
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(*panahi*); groups with modify permissions: Set()
...
18/11/17 13:55:52 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: root.multivac
     start time: 1542459353040
     final status: UNDEFINED
     tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
     user: *panahi*
```

However, when I use Spark's RDD pipe(), the external command is executed as the `yarn` user. This makes it impossible to use a C/C++ application that needs read/write access to HDFS, because the `yarn` user does not have permissions on the user's directory.

How to reproduce this issue:

```scala
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
```

Result:

```
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
c: Array[String] = Array(*yarn*)
```

I believe that since Spark is the actor invoking this execution inside the YARN cluster, Spark should respect the actual/current username. Alternatively, there may be another configuration for impersonation between Spark and YARN that covers this situation, but I haven't found one.

Many thanks.
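For context on why `whoami` prints `yarn`: pipe() forks the external command inside each executor JVM, so the child process inherits the OS user that owns the YARN container rather than Spark's impersonated user. Below is a simplified paraphrase of that mechanism, not Spark's actual source; `pipePartition` is a hypothetical name used only for illustration:

```scala
import scala.io.Source

// Roughly what RDD.pipe() does for each partition: fork the command with
// java.lang.ProcessBuilder, feed the partition's elements to its stdin,
// and read its stdout back as the new partition. The child process runs
// as the OS user of the executor JVM, i.e. the YARN container user.
def pipePartition(command: Seq[String], input: Iterator[String]): Iterator[String] = {
  val process = new ProcessBuilder(command: _*).start()
  // Spark's real implementation writes stdin from a separate thread to
  // avoid deadlocking on full pipe buffers; this sketch keeps it inline.
  val stdin = new java.io.PrintWriter(process.getOutputStream)
  input.foreach(stdin.println)
  stdin.close()
  Source.fromInputStream(process.getInputStream).getLines()
}
```

Because the fork happens inside the container, the child's identity is decided by YARN's container executor: with the DefaultContainerExecutor all containers run as the NodeManager user (`yarn`), whereas the LinuxContainerExecutor (commonly used on Kerberized clusters) launches containers as the submitting user.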
> Spark Pipe() executes the external app by yarn user not the real user
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26101
>                 URL: https://issues.apache.org/jira/browse/SPARK-26101
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.3.0
>            Reporter: Maziyar PANAHI
>            Priority: Major
>
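A quick way to observe the two identities diverging from inside a task (a minimal sketch, assuming a spark-shell on YARN where `sc` is defined; Hadoop's `UserGroupInformation` is on the executor classpath in a YARN deployment):

```scala
import org.apache.hadoop.security.UserGroupInformation

// Run a single task and report both identities from inside the executor:
// the Hadoop-level (impersonated) user and the OS-level container user.
val identities = sc.parallelize(Seq(1), 1).map { _ =>
  val hadoopUser = UserGroupInformation.getCurrentUser.getUserName // e.g. "panahi"
  val osUser     = System.getProperty("user.name")                 // e.g. "yarn"
  s"hadoopUser=$hadoopUser, osUser=$osUser"
}
identities.collect().foreach(println)
```

Anything launched via pipe() sees only the OS-level user, which is why HDFS permission checks fail for the piped C/C++ application.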
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org