[ https://issues.apache.org/jira/browse/SPARK-22528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260727#comment-16260727 ]

paul mackles commented on SPARK-22528:
--------------------------------------

In case anyone else bumps into this, I received some feedback from the 
data-lake team at MSFT:

This is expected behavior, since Hadoop supports Kerberos-based identity 
whereas Data Lake supports OAuth2 via Azure Active Directory (AAD). The 
bridge/mapping between Kerberos and AAD OAuth2 is supported only in Azure 
HDInsight clusters today.
 
OAuth2 support in Hadoop is a non-trivial task and is in progress: 
https://issues.apache.org/jira/browse/HADOOP-11744

The workaround for this limitation (specific to Data Lake) is to add the 
following to core-site.xml:
{code}
<property>
  <name>adl.debug.override.localuserasfileowner</name>
  <value>true</value>
</property>
{code}

What does this configuration do?
FileStatus contains the user/group information associated with the object ID 
from AAD. With this setting, the Hadoop ADL driver replaces that object ID 
with the local Hadoop user under whose context the Hadoop process runs. The 
actual file ownership information in Data Lake remains unchanged; it is only 
shadowed behind the local Hadoop user.
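
A quick way to see the effect is to inspect the reported owner/group through 
the standard Hadoop FileSystem API. A minimal sketch (the store name and path 
are placeholders, and it assumes ADL credentials are already configured):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object OwnerCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Same setting as the core-site.xml entry above
    conf.set("adl.debug.override.localuserasfileowner", "true")
    // Placeholder ADL store and path
    val path = new Path("adl://yourstore.azuredatalakestore.net/spark-events")
    val fs = FileSystem.get(path.toUri, conf)
    val status = fs.getFileStatus(path)
    // With the override enabled, this should print the local Hadoop user
    // rather than an AAD object ID
    println(s"owner=${status.getOwner} group=${status.getGroup}")
  }
}
{code}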


> History service and non-HDFS filesystems
> ----------------------------------------
>
>                 Key: SPARK-22528
>                 URL: https://issues.apache.org/jira/browse/SPARK-22528
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: paul mackles
>            Priority: Minor
>
> We are using Azure Data Lake (ADL) to store our event logs. This worked fine 
> in 2.1.x but in 2.2.0, the event logs are no longer visible to the history 
> server. I tracked it down to the call to:
> {code}
> SparkHadoopUtil.get.checkAccessPermission()
> {code}
> which was added to "FSHistoryProvider" in 2.2.0.
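> For reference, "checkAccessPermission()" does a POSIX-style owner/group/other 
> comparison against the FileStatus. A rough sketch of the logic (not the 
> actual Spark source; the shape is an assumption based on the call above):
> {code}
> import org.apache.hadoop.fs.FileStatus
> import org.apache.hadoop.fs.permission.FsAction
> import org.apache.hadoop.security.UserGroupInformation
> 
> // Sketch: allow access if the owner, group, or "other" permission bits
> // imply the requested action. On ADL, getOwner returns an AAD object ID,
> // so the owner branch never matches the local Hadoop user and the check
> // falls through to the "other" bits.
> def checkAccessPermission(status: FileStatus, mode: FsAction): Boolean = {
>   val perm = status.getPermission
>   val ugi = UserGroupInformation.getCurrentUser
>   if (ugi.getShortUserName == status.getOwner) {
>     perm.getUserAction.implies(mode)
>   } else if (ugi.getGroupNames.contains(status.getGroup)) {
>     perm.getGroupAction.implies(mode)
>   } else {
>     perm.getOtherAction.implies(mode)
>   }
> }
> {code}
> That would be consistent with the workarounds below: world-readable files 
> satisfy the "other" branch, and matching the current user to the file's 
> owner satisfies the first.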
> I was able to work around it by:
> * setting the files on ADL to world-readable
> * setting HADOOP_PROXY to the Azure objectId of the service principal that 
> owns the file
> Neither of those workarounds is particularly desirable in our environment. 
> That said, I am not sure how this should be addressed:
> * Is this an issue with the Azure/Hadoop bindings not setting up the user 
> context correctly so that the "checkAccessPermission()" call succeeds without 
> having to use the username under which the process is running?
> * Is this an issue with "checkAccessPermission()" not really accounting for 
> all of the possible FileSystem implementations? If so, I would imagine that 
> there are similar issues when using S3.
> In spite of this check, I know the files are accessible through the 
> underlying FileSystem object, so it feels like the latter, but I don't think 
> the FileSystem object alone could be used to implement this check.
> Any thoughts [~jerryshao]?


