[ 
https://issues.apache.org/jira/browse/HADOOP-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661647#action_12661647
 ] 

Doug Cutting commented on HADOOP-4952:
--------------------------------------

> I guess I am missing this library argument [ ... ]

What I mean is that there isn't a clear line between application and library 
code.  Folks might wish, e.g., to start running jobs on two clusters at once 
without having to refactor everything to keep track of the config.  We should 
not encourage a style that makes this kind of thing more difficult, but rather 
one that makes it easy.

> Most file system interfaces are dead simple. I think Hadoop's API can be made 
> equally simple [ ... ]

I agree.  But I also think we need to pass the configuration explicitly, and 
not depend on a static configuration.  Our current convention is that only 
command-line tools create all-new configurations and I think there are good 
reasons to stick with that.
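
As a rough illustration of the explicit style, here is a minimal sketch in 
plain Java.  It does not use Hadoop's classes: java.util.Properties stands in 
for the configuration object, and the class name, the helper methods, and the 
cluster URIs are all made up for the example.  The point is only that code 
which takes its configuration as an argument can serve two clusters at once 
without any refactoring.

```java
import java.util.Properties;

public class ExplicitConfigSketch {
    // Key borrowed from Hadoop's naming; here it is just a string in a Properties object.
    static final String DEFAULT_FS = "fs.default.name";

    // Library-style code: it never consults global state, only the config it is handed.
    static String qualify(Properties conf, String path) {
        return conf.getProperty(DEFAULT_FS, "file://") + path;
    }

    // Hypothetical helper that builds a per-cluster configuration.
    static Properties clusterConf(String defaultFs) {
        Properties conf = new Properties();
        conf.setProperty(DEFAULT_FS, defaultFs);
        return conf;
    }

    public static void main(String[] args) {
        // Two independent configurations live side by side in one process.
        Properties clusterA = clusterConf("hdfs://cluster-a:8020");
        Properties clusterB = clusterConf("hdfs://cluster-b:8020");
        System.out.println(qualify(clusterA, "/logs/today"));
        System.out.println(qualify(clusterB, "/logs/today"));
    }
}
```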

> In most file systems, the config state is provided by the underlying OS;

The configuration is the equivalent of the unix environment.  Using per-process 
environment variables makes sense when you start new processes for new 
operations, the unix convention, but that's not the way folks generally work in 
Java.  We could use a static configuration (in fact we used to, long ago), but 
that caused lots of problems.  In other projects I've frequently seen folks 
start out using Java system properties, then switch to something that's passed 
dynamically.  That sort of switch is painful and something we should avoid and 
encourage others to avoid.

> In my proposal the library writer will have to add a line 
> "Files.importConfig(configArg)". 

That does not work for multi-threaded applications, where different threads are 
configured differently.  Consider, e.g., a Nutch-based system that has 
different search zones (like http://www.archive-it.org/).  This is a real 
and common use case.  Some developers use Hadoop's APIs directly, but as many, 
if not more, layer things on top of them (Hive, HBase, Pig, Jaql, Nutch).  It's 
reasonable for any of these, or just about any Hadoop application, to be 
repackaged as a multi-user service.  Folks should not have to refactor to do 
that.
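
To make the problem concrete, here is a small sketch in plain Java.  It is 
not the proposed Files class: importConfig below is a hypothetical stand-in 
that simply stores a process-wide configuration, java.util.Properties plays 
the role of the configuration object, and the "search.zone" key and zone names 
are invented for the example.  With a shared static configuration the last 
import wins for every thread, while tasks that are handed their configuration 
explicitly each see their own.

```java
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadConfigSketch {
    // Hypothetical process-wide configuration, shared by every thread.
    static volatile Properties shared = new Properties();

    // Static style: importing a config replaces it for the whole process.
    static void importConfig(Properties conf) { shared = conf; }
    static String zoneFromShared() { return shared.getProperty("search.zone"); }

    // Explicit style: the caller says which configuration applies.
    static String zoneFrom(Properties conf) { return conf.getProperty("search.zone"); }

    // Hypothetical helper that builds the configuration for one search zone.
    static Properties zone(String name) {
        Properties conf = new Properties();
        conf.setProperty("search.zone", name);
        return conf;
    }

    public static void main(String[] args) throws Exception {
        Properties news = zone("news");
        Properties archive = zone("archive");

        // Two zones import their configs; the second silently replaces the first.
        importConfig(news);
        importConfig(archive);
        System.out.println("static:   " + zoneFromShared());

        // Each task carries its own config, so thread scheduling does not matter.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Callable<String> newsTask = () -> zoneFrom(news);
        Callable<String> archiveTask = () -> zoneFrom(archive);
        Future<String> a = pool.submit(newsTask);
        Future<String> b = pool.submit(archiveTask);
        System.out.println("explicit: " + a.get() + ", " + b.get());
        pool.shutdown();
    }
}
```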


> Improved files system interface for the application writer.
> -----------------------------------------------------------
>
>                 Key: HADOOP-4952
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4952
>             Project: Hadoop Core
>          Issue Type: Improvement
>    Affects Versions: 0.21.0
>            Reporter: Sanjay Radia
>            Assignee: Sanjay Radia
>         Attachments: Files.java
>
>
> Currently the FileSystem interface serves two purposes:
> - an application writer's interface for using the Hadoop file system
> - a file system implementer's interface (e.g. hdfs, local file system, kfs, 
> etc)
> This Jira proposes that we provide a simpler interface for the application 
> writer and leave the FileSystem interface for the implementer of a 
> filesystem.
> - The FileSystem interface has a confusing set of methods for the application 
> writer
> - We could make it easier to take advantage of the URI file naming
> ** Current approach is to get FileSystem instance by supplying the URI and 
> then access that name space. It is consistent for the FileSystem instance to 
> not accept URIs for other schemes, but we can do better.
> ** The special copyFromLocalFile can be generalized as a copyFile where the 
> src or target can be any URI, including a local one.
> ** The proposed scheme (below) simplifies this.
> -     The client side config can be simplified. 
> ** New config() by default uses the default config. Since this is the common 
> usage pattern, one should not need to always pass the config as a parameter 
> when accessing the file system.  
> ** It does not handle multiple file systems too well. Today a site.xml is 
> derived from a single Hadoop cluster. This does not make sense for multiple 
> Hadoop clusters which may have different defaults.
> ** Further one should need very little to configure the client side:
> *** Default file system.
> *** Block size 
> *** Replication factor
> *** Scheme to class mapping
> ** It should be possible to take block size and replication factor defaults 
> from the target file system, rather than the client-side config.  I am not 
> suggesting we don't allow setting client-side defaults, but most clients do 
> not care and would find it simpler to take the defaults for their systems 
> from the target file system. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
