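If you are using Kerberized HDFS, the spark principal (or whoever is running the
cluster) has to be declared as a proxy user on the HDFS side (the
hadoop.proxyuser.<name>.hosts and hadoop.proxyuser.<name>.groups settings in
core-site.xml).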

Once that is done, you create a UGI for the user you want to impersonate:

val ugi =  UserGroupInformation.createProxyUser("joe",

That user is then used to create the FS:

import java.security.PrivilegedExceptionAction
val proxyFS = ugi.doAs(new PrivilegedExceptionAction[FileSystem] {
  override def run(): FileSystem =
    FileSystem.newInstance(new URI("hdfs://nn1/home/user/"), conf)
})

The proxyFS will then do all its IO as the given user, even when done
outside a doAs clause, e.g.

proxyFS.mkdirs(new Path("/home/user/alice/"))
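
Putting the steps above together, here is a minimal sketch rather than a definitive implementation. The object and method names (ProxyFsSketch, openAs) are made up for illustration. It also assumes the process's login user is the one declared as a proxy user, which is why it passes UserGroupInformation.getLoginUser as the real user.

```scala
import java.net.URI
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

// Hypothetical helper, not part of Hadoop or Spark.
object ProxyFsSketch {

  /** Open a new FileSystem instance whose IO is performed as `user`. */
  def openAs(user: String, uri: URI, conf: Configuration): FileSystem = {
    // Assumption: the login user (e.g. the spark principal) is the declared proxy user.
    val realUser = UserGroupInformation.getLoginUser
    val ugi = UserGroupInformation.createProxyUser(user, realUser)

    // The UGI that is current when the FileSystem is created sticks to the instance.
    ugi.doAs(new PrivilegedExceptionAction[FileSystem] {
      override def run(): FileSystem = FileSystem.newInstance(uri, conf)
    })
  }
}
```

With this, ProxyFsSketch.openAs("joe", new URI("hdfs://nn1/home/user/"), conf) would return a filesystem on which calls like mkdirs() run as joe, and which the caller is expected to close() when done.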

FileSystem.get() also works on a per-UGI basis, so calling
FileSystem.get(new URI("hdfs://nn1"), conf) inside ugi.doAs() returns a
different FS instance than the same call outside of the clause.
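
A small sketch of that per-UGI behaviour, assuming the filesystem cache is enabled for the scheme in question (the default). FsCacheSketch and getAs are hypothetical names used only for this example.

```scala
import java.net.URI
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

// Hypothetical helper, not part of Hadoop or Spark.
object FsCacheSketch {

  /** Call FileSystem.get() with `ugi` as the current user. */
  def getAs(ugi: UserGroupInformation, uri: URI, conf: Configuration): FileSystem =
    ugi.doAs(new PrivilegedExceptionAction[FileSystem] {
      // get() looks up the cache using the current UGI, which is `ugi` in here.
      override def run(): FileSystem = FileSystem.get(uri, conf)
    })
}
```

Comparing FsCacheSketch.getAs(ugi, uri, conf) with a plain FileSystem.get(uri, conf) using reference equality (eq) should show two distinct instances, while two getAs calls with the same ugi should hand back the same cached one.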

Once you are done with the FS, close it. If you know you are completely
done with the user across all threads, you can release them all


This closes all filesystems for that user. This is critical in long-lived
processes, as otherwise you'll run out of memory or threads.
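
The call meant here is presumably Hadoop's FileSystem.closeAllForUGI(ugi). One possible way to make sure it always runs is sketched below. The wrapper name withProxyUser is invented for illustration, and the login user is again assumed to be the declared proxy user.

```scala
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

// Hypothetical helper, not part of Hadoop or Spark.
object ProxyCleanupSketch {

  /** Run `body` with a proxy UGI for `user`, then close every FS created for that UGI. */
  def withProxyUser[T](user: String)(body: UserGroupInformation => T): T = {
    // Assumption: the login user is allowed to impersonate `user`.
    val ugi = UserGroupInformation.createProxyUser(user, UserGroupInformation.getLoginUser)
    try body(ugi)
    finally FileSystem.closeAllForUGI(ugi) // releases the filesystems held for this UGI
  }
}
```

The body is expected to obtain its filesystems inside ugi.doAs(...), and should only be wrapped like this when no other thread is still working as that user.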

> I wanted to make unknown users create HDFS files, not the OS user who
> executes the spark application.
> And I thought it would be possible using
> UserGroupInformation.createRemoteUser("other").doAs(…)
> However, the files are created by the OS user who launched the spark
> application in Spark Executors.
> Although I’ve tested it on Spark Standalone and Yarn, I got the same
> results.
> Is it impossible to impersonate a Spark job user using
> UserGroupInformation.doAs?
> PS. In fact, I posted a similar question on the Spark user mailing list,
> but I didn’t get the answer I wanted.

You may use `--proxy-user` to impersonate another user.

For example,

bin/spark-shell --proxy-user kent
scala> spark.sparkContext.sparkUser
res0: String = kent

res1: String = kent

res2: String = kentyao


