[
https://issues.apache.org/jira/browse/HADOOP-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292357#comment-14292357
]
Chris Nauroth commented on HADOOP-11509:
----------------------------------------
Thank you, Xuan and Jian.
To provide a bit more background: Xuan found that streaming jobs using files
in Azure Storage were not able to override {{fs.azure.block.size}} from the
command line. The root cause he identified is that {{validateFiles}} checks
for the existence of files against a {{FileSystem}} instance, and that
instance is obtained before the -D options are handled. As a result, the
{{FileSystem}} cache holds an instance that was created without the -D
options set in the {{Configuration}}. Later, MapReduce job split calculation
uses that cached instance, which does not have the override of
{{fs.azure.block.size}}.
I agree with the change here, because the expectation is that the command line
arguments take precedence. However, I don't think we should move the -D
handling all the way to the top of the method. Right now, -D options take
precedence over -fs and -jt, because they are handled later. The current patch
would reverse that. I don't know if anyone depends on that behavior, but we
can avoid changing it by doing the -D handling in between the handling of -conf
and the handling of -libjars. I'd be +1 for the patch with that change if you
test it and it still works for overriding {{fs.azure.block.size}}.
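To make the ordering argument concrete, here is a minimal sketch in plain Java.
It is a toy model, not the real {{GenericOptionsParser}}: the class
{{OptionOrderSketch}}, the step names, and the sample values are all
hypothetical, and it assumes that -fs writes {{fs.defaultFS}} and that the
-libjars step is the point where a {{FileSystem}} would be created and cached
(modeled here as a snapshot of the configuration). The -jt and -conf steps are
left out for brevity.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of option-handling order. Not the real GenericOptionsParser. */
class OptionOrderSketch {

    /** What the -libjars step saw, plus the configuration after all steps ran. */
    static final class Result {
        final Map<String, String> seenByLibjars;
        final Map<String, String> finalConf;

        Result(Map<String, String> seenByLibjars, Map<String, String> finalConf) {
            this.seenByLibjars = seenByLibjars;
            this.finalConf = finalConf;
        }
    }

    /** Applies the named steps in the given order; later writes win. */
    static Result process(List<String> order, String fsValue, Map<String, String> dProps) {
        Map<String, String> conf = new LinkedHashMap<>();
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (String step : order) {
            switch (step) {
                case "fs":
                    // Assumption: -fs is modeled as setting fs.defaultFS.
                    conf.put("fs.defaultFS", fsValue);
                    break;
                case "D":
                    conf.putAll(dProps);
                    break;
                case "libjars":
                    // Stand-in for validateFiles obtaining (and caching) a FileSystem.
                    snapshot = new LinkedHashMap<>(conf);
                    break;
                default:
                    throw new IllegalArgumentException("unknown step: " + step);
            }
        }
        return new Result(snapshot, conf);
    }

    public static void main(String[] args) {
        Map<String, String> dProps = new LinkedHashMap<>();
        dProps.put("fs.azure.block.size", "4096");      // sample value only
        dProps.put("fs.defaultFS", "wasb://from-D");    // hypothetical URI
        String fsValue = "wasb://from-fs";              // hypothetical URI

        List<List<String>> orders = Arrays.asList(
            Arrays.asList("fs", "libjars", "D"),   // existing order
            Arrays.asList("D", "fs", "libjars"),   // -D moved to the top
            Arrays.asList("fs", "D", "libjars"));  // -D just before -libjars

        for (List<String> order : orders) {
            Result r = process(order, fsValue, dProps);
            System.out.println(order
                + " libjars sees block size: " + r.seenByLibjars.get("fs.azure.block.size")
                + ", final fs.defaultFS: " + r.finalConf.get("fs.defaultFS"));
        }
    }
}
```

In this model, only the third ordering both lets the -libjars step see the
block size override and keeps -D winning over -fs.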
bq. Should the API Path.getFileSystem(Configuration conf) be that the returned
file system object always apply the up-to-date conf ?
This is a long-standing weakness of the {{FileSystem}} cache. It has been
discussed in other jiras, but I can't find those now. The {{FileSystem}} cache
key is composed of scheme, authority, and {{UserGroupInformation}}. However,
the {{FileSystem#get}} API is phrased in terms of a whole {{Configuration}}.
Various other configuration properties can tune the behavior of a
{{FileSystem}}, but if you get a cached instance, then these configuration
properties might not be applied. OTOH, it would be too costly to make the
whole {{Configuration}} part of the cache key.
This is an existing problem, unrelated to the current patch.
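For anyone less familiar with the cache weakness described above, the
following is a rough sketch of the effect in plain Java. It does not use or
reproduce Hadoop's {{FileSystem}} code: {{FsCacheSketch}}, {{ToyFileSystem}},
and {{Key}} are hypothetical names, the configuration is a plain map, and the
default of 1024 is an arbitrary placeholder rather than the real default of
{{fs.azure.block.size}}.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Toy cache keyed only by scheme, authority, and user. Not Hadoop code. */
class FsCacheSketch {

    /** Hypothetical stand-in for a FileSystem: it reads one setting at creation. */
    static final class ToyFileSystem {
        final long blockSize;

        ToyFileSystem(Map<String, String> conf) {
            // 1024 is a placeholder default for this sketch only.
            this.blockSize = Long.parseLong(conf.getOrDefault("fs.azure.block.size", "1024"));
        }
    }

    /** Cache key: no configuration value takes part in equals or hashCode. */
    static final class Key {
        final String scheme;
        final String authority;
        final String user;

        Key(String scheme, String authority, String user) {
            this.scheme = scheme;
            this.authority = authority;
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return scheme.equals(other.scheme)
                && authority.equals(other.authority)
                && user.equals(other.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority, user);
        }
    }

    private final Map<Key, ToyFileSystem> cache = new HashMap<>();

    /** Returns the cached instance if the key matches; conf is used only on a miss. */
    ToyFileSystem get(String scheme, String authority, String user, Map<String, String> conf) {
        return cache.computeIfAbsent(new Key(scheme, authority, user), k -> new ToyFileSystem(conf));
    }

    public static void main(String[] args) {
        FsCacheSketch cache = new FsCacheSketch();

        // First lookup happens before the override is applied.
        Map<String, String> before = new HashMap<>();
        ToyFileSystem first = cache.get("wasb", "container@account", "alice", before);

        // Second lookup passes the override, but the key is unchanged.
        Map<String, String> after = new HashMap<>();
        after.put("fs.azure.block.size", "4096");
        ToyFileSystem second = cache.get("wasb", "container@account", "alice", after);

        System.out.println("same instance: " + (first == second));
        System.out.println("block size seen by second caller: " + second.blockSize);
    }
}
```

The second lookup gets back the instance built from the first configuration,
so the override is silently ignored. Putting the whole configuration into
{{Key}} would avoid that in this toy, at the cost mentioned above of comparing
and hashing every property on each lookup.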
> change parsing sequence in GenericOptionsParser to parse -D parameters first
> ----------------------------------------------------------------------------
>
> Key: HADOOP-11509
> URL: https://issues.apache.org/jira/browse/HADOOP-11509
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Attachments: HADOOP-11509.1.patch
>
>
> In GenericOptionsParser, we need to parse -D parameter first. In that case,
> the user input parameter (through -D) can be set into configuration object
> earlier and used to process other parameters.