[ 
https://issues.apache.org/jira/browse/HADOOP-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292357#comment-14292357
 ] 

Chris Nauroth commented on HADOOP-11509:
----------------------------------------

Thank you, Xuan and Jian.

Just to provide a bit more background on this, Xuan found that streaming jobs 
using files in Azure Storage were not able to override the setting of 
{{fs.azure.block.size}} from the command line.  The root cause he found is 
that {{validateFiles}} checks for existence of files against a 
{{FileSystem}} instance, but this {{FileSystem}} instance is obtained before 
the -D options are handled.  That leaves an instance sitting in the 
{{FileSystem}} cache that was created without the -D options set in the 
{{Configuration}}.  Later, MapReduce job split calculation uses that cached 
instance, which doesn't have the override of {{fs.azure.block.size}}.
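
To make the sequence concrete, here is a minimal self-contained sketch of the 
problem.  It is a toy model, not Hadoop code: {{ToyFileSystem}}, the map-based 
configuration, the cache keyed by a plain string, and the URI and block size 
values are all hypothetical stand-ins.  It also assumes the instance reads its 
block size once, when it is created.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the ordering problem; none of these types are Hadoop classes. */
class StaleCacheSketch {

    /** Stand-in for a FileSystem that snapshots one setting when it is created. */
    static final class ToyFileSystem {
        final String blockSize;

        ToyFileSystem(Map<String, String> conf) {
            this.blockSize = conf.getOrDefault("fs.azure.block.size", "default");
        }
    }

    /** Stand-in for the FileSystem cache: the key ignores the configuration. */
    static final Map<String, ToyFileSystem> CACHE = new HashMap<>();

    static ToyFileSystem get(String schemeAndAuthority, Map<String, String> conf) {
        // On a cache hit the conf argument is never consulted.
        return CACHE.computeIfAbsent(schemeAndAuthority, k -> new ToyFileSystem(conf));
    }

    /** Returns the block size seen by the later lookup for the given ordering. */
    static String run(boolean applyDashDBeforeValidation) {
        CACHE.clear();
        Map<String, String> conf = new HashMap<>();
        String fsUri = "wasb://container@account"; // hypothetical value

        if (applyDashDBeforeValidation) {
            conf.put("fs.azure.block.size", "134217728"); // simulated -D handling
        }
        get(fsUri, conf); // validateFiles-style existence check populates the cache
        if (!applyDashDBeforeValidation) {
            conf.put("fs.azure.block.size", "134217728"); // -D handled too late
        }
        return get(fsUri, conf).blockSize; // later lookup, e.g. split calculation
    }

    public static void main(String[] args) {
        System.out.println("-D after validation:  " + run(false)); // default
        System.out.println("-D before validation: " + run(true));  // 134217728
    }
}
```

In this model the override only reaches the cached instance when the -D 
handling runs before the first lookup.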

I agree with the change here, because the expectation is that the command line 
arguments take precedence.  However, I don't think we should move the -D 
handling all the way to the top of the method.  Right now, -D options are 
handled after -fs and -jt, so -D takes precedence over them.  The current 
patch would reverse that.  I don't know if anyone depends on that behavior, 
but we can avoid changing it by doing the -D handling in between the handling 
of -conf and the handling of -libjars.  I'd be +1 for the patch with that 
change if you test it and it still works for overriding 
{{fs.azure.block.size}}.
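
The precedence concern can be sketched as "the last handler to write a key 
wins".  The code below is a toy model under stated assumptions, not the 
{{GenericOptionsParser}} implementation: the handler lists, the map-based 
configuration, and the two URI values are hypothetical, and it assumes a 
command line that passes both -fs and a -D override of {{fs.defaultFS}}.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: options are applied in handling order, and a later write wins. */
class OptionOrderSketch {

    // Ordering proposed by the current patch: -D moved to the top.
    static final List<String> PATCH_ORDER =
            Arrays.asList("-D", "-fs", "-jt", "-conf", "-libjars");

    // Ordering suggested in this comment: -D between -conf and -libjars.
    static final List<String> SUGGESTED_ORDER =
            Arrays.asList("-fs", "-jt", "-conf", "-D", "-libjars");

    /** Simulates a command line with both -fs and -D fs.defaultFS=... set. */
    static String resolveDefaultFs(List<String> handlingOrder) {
        Map<String, String> conf = new HashMap<>();
        for (String option : handlingOrder) {
            switch (option) {
                case "-fs":
                    conf.put("fs.defaultFS", "hdfs://from-fs-flag");
                    break;
                case "-D":
                    conf.put("fs.defaultFS", "hdfs://from-dash-d");
                    break;
                default:
                    break; // other options do not touch this key in the toy model
            }
        }
        return conf.get("fs.defaultFS");
    }

    public static void main(String[] args) {
        System.out.println(resolveDefaultFs(PATCH_ORDER));     // hdfs://from-fs-flag
        System.out.println(resolveDefaultFs(SUGGESTED_ORDER)); // hdfs://from-dash-d
    }
}
```

With the suggested ordering, -D still overrides -fs, while the -D handling 
happens before -libjars.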

bq. Should the API Path.getFileSystem(Configuration conf) be that the returned 
file system object always apply the up-to-date conf ?

This is a long-standing weakness of the {{FileSystem}} cache.  It has been 
discussed in other jiras, but I can't find those now.  The {{FileSystem}} cache 
key is composed of scheme, authority, and {{UserGroupInformation}}.  However, 
the {{FileSystem#get}} API is phrased in terms of a whole {{Configuration}}.  
Various other configuration properties can tune the behavior of a 
{{FileSystem}}, but if you get a cached instance, then these configuration 
properties might not be applied.  OTOH, it would be too costly to make the 
whole {{Configuration}} part of the cache key.
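
As an illustration of why a changed {{Configuration}} does not produce a new 
instance, here is a minimal sketch of a key built only from scheme, authority, 
and user.  The {{Key}} class, the {{keyFor}} helper, the user names, and the 
sample values are hypothetical and do not reproduce the actual Hadoop classes.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Objects;

/** Toy cache key: equality depends on scheme, authority, and user only. */
class CacheKeySketch {

    static final class Key {
        final String scheme;
        final String authority;
        final String user;

        Key(String scheme, String authority, String user) {
            this.scheme = scheme;
            this.authority = authority;
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return scheme.equals(k.scheme)
                    && authority.equals(k.authority)
                    && user.equals(k.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority, user);
        }
    }

    /** The conf parameter is accepted, as in a get-style API, but never used. */
    static Key keyFor(String scheme, String authority, String user,
                      Map<String, String> conf) {
        return new Key(scheme, authority, user);
    }

    public static void main(String[] args) {
        Map<String, String> plain = Collections.emptyMap();
        Map<String, String> tuned =
                Collections.singletonMap("fs.azure.block.size", "134217728");

        Key a = keyFor("wasb", "container@account", "alice", plain);
        Key b = keyFor("wasb", "container@account", "alice", tuned);
        Key c = keyFor("wasb", "container@account", "bob", tuned);

        System.out.println(a.equals(b)); // true: different conf, same key
        System.out.println(a.equals(c)); // false: different user
    }
}
```

Two lookups that differ only in configuration collide on the same key, so the 
second one gets whatever instance the first one created.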

This is an existing problem, unrelated to the current patch.

> change parsing sequence in GenericOptionsParser to parse -D parameters first
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-11509
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11509
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Xuan Gong
>            Assignee: Xuan Gong
>         Attachments: HADOOP-11509.1.patch
>
>
> In GenericOptionsParser, we need to parse -D parameter first. In that case, 
> the user input parameter (through -D) can be set into configuration object 
> earlier and used to process other parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
