Hello Experts,

I am required to use a specific user id to save files on a remote HDFS
cluster. Remote in the sense that the Spark jobs run on EMR and write to a
CDH cluster, so I cannot change hdfs-site.xml etc. to point to the
destination cluster. As a result I am using WebHDFS to save the files to
it.
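
For reference, the write currently looks roughly like this; the namenode
IP, port, and paths below are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("emr-to-cdh").getOrCreate()
    val df = spark.read.parquet("s3://my-bucket/input")  // placeholder input

    // Write straight to the remote CDH cluster over WebHDFS. The address
    // has to be the active namenode's (see challenge 1 below).
    df.write.mode("overwrite").parquet("webhdfs://10.1.2.3:50070/user/etl/output")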

There are a few challenges I have with this approach:
1. I cannot use the nameservice of the namenode and have to specify the IP
address of the active namenode, which is risky if there is a failover.

2. I cannot change the owner/group of the files being written by Spark. I
see no option to provide an owner for the files being written (
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
); a post-write workaround I am considering is sketched after this list.

3. Using JDBC, so that I can specify the user name and password, would
mean I end up creating managed tables only. This is not acceptable for our
use case.
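
For challenge 2, the workaround I have been considering is to chown the
output after the Spark write finishes, using the Hadoop FileSystem API over
WebHDFS. A minimal sketch, assuming the connecting user is allowed to
change ownership (normally only the HDFS superuser can), with placeholder
IP, path, owner, and group:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Connect to the remote cluster's active namenode over WebHDFS.
    val fs = FileSystem.get(new URI("webhdfs://10.1.2.3:50070"), new Configuration())

    // setOwner maps to the WebHDFS SETOWNER operation; it requires
    // superuser privileges on the destination cluster.
    fs.setOwner(new Path("/user/etl/output"), "targetuser", "targetgroup")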

Is there a way to change the owner of files written by Spark?

regards
Sunita
