[
https://issues.apache.org/jira/browse/HADOOP-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750779#action_12750779
]
Sanjay Radia commented on HADOOP-4952:
--------------------------------------
No one has commented on my proposal on the config issue in this jira. As a
result, over the last 2 days, I have had a set of discussions with a number of
folks at Yahoo, including Doug, and with Dhruba. Here is roughly the set of
opinions:
- Most felt that our config management is a mess and confusing.
- Everyone likes the notion of server-side (SS) defaults, especially when you
consider federated clusters and a URI-based file namespace as explained in this
Jira.
- Some folks were confused about the URI filesystem and how the FileContext
lets us deal with URIs in a first class way. But in the end most felt that it
was a good idea. The unix and scp analogy helped get this across.
- All agreed that most folks will use the SS defaults most of the time. But
there are apps that will specify, for example, the blockSize to override the SS
default. They liked that the create() call had a parameter to do that.
- There were a couple folks who felt strongly that one needs to be able to
specify the bytesPerChecksum on the client side (see the related HDFS-578);
strongly enough to -1 a proposal that did not allow it. Some felt that we
should add an additional parameter to the create call while others felt that we
should add an options parameter to the create call.
- There needs to be an undocumented way to override the SS defaults so that one
could test new parameters for SS defaults without reconfiguring the clusters.
(Dhruba's suggestion)
Based on the feedback, a proposal is described below. Note that for some folks
parts of this proposal represent a compromise, but one they could live with.
The 0.21 deadline is very close and we need to get this in or we will miss it.
FileContext contains the following items derived from the config:
* Default fs - /
* Working dir (derived indirectly via the default file system - details are
below)
* Umask.
One creates a FileContext as described in the patch (the patch is not up to
date with the proposal in this comment).
* fc = FileContext.getFC()
* fc = FileContext.getFC(defaultFsUri), etc.
*NO other config parameters are read from the config*: The fs client side
config contains only two things: your / and your umask; all defaults will come
from SS. However, users will be able to override these defaults through the
options parameter in the create() call when creating a file. So in this
proposal there is no way to set application defaults in the config file.
(Note: we may end up having some undocumented config variables to handle the SS
override for testing purposes (Dhruba's request); the exact mechanism is to be
determined, and I will file a separate jira to discuss it.)
So the basic calls are:
- fc.mkdirs(path, perms)
- fc.create(path, perms, createOpt ...) // note the use of varArgs
- fc.open(path, bufSize)
Examples of create using varargs:
fc.create(path, perms) // all SS defaults
fc.create(path, perms, CreateOpt.blocksize(4096), CreateOpt.repFac(4))
Roughly: CreateOpt is a class with several subclasses, one per option
(Blocksize, RepFactor etc) and a static factory method for each of them such as
CreateOpt.blocksize(long).
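
To make the "one subclass per option plus a static factory" idea concrete, here is a minimal sketch. It is not the code in the attached patches: the nested class names, the getters, and the describeCreate() stand-in for fc.create() are assumptions made purely for illustration. Only CreateOpt.blocksize and CreateOpt.repFac come from the examples above.

```java
// Hypothetical sketch of the varargs option design; not the patch's API.
abstract class CreateOpt {
    private CreateOpt() {}

    /** Option carrying a requested block size in bytes. */
    static final class BlockSize extends CreateOpt {
        private final long value;
        private BlockSize(long value) { this.value = value; }
        long getValue() { return value; }
    }

    /** Option carrying a requested replication factor. */
    static final class RepFac extends CreateOpt {
        private final int value;
        private RepFac(int value) { this.value = value; }
        int getValue() { return value; }
    }

    // Static factories, so call sites read CreateOpt.blocksize(4096).
    static CreateOpt blocksize(long bytes) { return new BlockSize(bytes); }
    static CreateOpt repFac(int replicas) { return new RepFac(replicas); }
}

final class CreateOptDemo {
    // Stand-in for fc.create(path, perms, CreateOpt...): it only reports which
    // values the caller supplied and which would fall back to the SS default.
    static String describeCreate(String path, CreateOpt... opts) {
        String blockSize = "SS default";
        String repFac = "SS default";
        for (CreateOpt opt : opts) {
            if (opt instanceof CreateOpt.BlockSize) {
                blockSize = Long.toString(((CreateOpt.BlockSize) opt).getValue());
            } else if (opt instanceof CreateOpt.RepFac) {
                repFac = Integer.toString(((CreateOpt.RepFac) opt).getValue());
            }
        }
        return path + ": blockSize=" + blockSize + ", repFac=" + repFac;
    }

    public static void main(String[] args) {
        System.out.println(describeCreate("/tmp/a"));
        System.out.println(describeCreate("/tmp/b",
                CreateOpt.blocksize(4096), CreateOpt.repFac(4)));
    }
}
```

With varargs, adding a new option later means adding one subclass and one factory; existing call sites that pass no options keep compiling unchanged.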
Here is the list of options that one will be able to set through the
createOptions:
- progressable - default is null => progress not reported
** (i.e. a spec default, not an SS default)
** Shall we remove progressable?
- iobufferSize // The rest of the createOptions use SS default if not set
- replicationFactor
- blockSize - must be a multiple of bytesPerChecksum and writePacketSize
- bytesPerChecksum
The following SS variable is *not* settable via the createOption.
- writePacketSize - the SS default is always used.
If the application desires a particular property it will set it in the
createOpt parameters. There is *no automatic support* to read these app
defaults from a config file; *this was a deliberate choice*.
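
As a rough illustration of the rules above (options fall back to the SS default when unset, writePacketSize always comes from the server, and blockSize must be a multiple of both bytesPerChecksum and writePacketSize), the resolution step could look like the sketch below. The ServerDefaults holder, the resolve() method, and the numeric values are all hypothetical; how the client actually obtains the SS defaults is not specified here.

```java
// Hypothetical sketch of resolving create() options against SS defaults.
final class CreateDefaultsSketch {
    /** Assumed holder for defaults fetched from the target file system. */
    static final class ServerDefaults {
        final long blockSize;
        final int bytesPerChecksum;
        final int writePacketSize;

        ServerDefaults(long blockSize, int bytesPerChecksum, int writePacketSize) {
            this.blockSize = blockSize;
            this.bytesPerChecksum = bytesPerChecksum;
            this.writePacketSize = writePacketSize;
        }
    }

    /**
     * Returns {blockSize, bytesPerChecksum, writePacketSize}.
     * A null argument means the application did not set that option.
     */
    static long[] resolve(ServerDefaults ss, Long blockSize, Integer bytesPerChecksum) {
        long bs = (blockSize != null) ? blockSize.longValue() : ss.blockSize;
        int bpc = (bytesPerChecksum != null) ? bytesPerChecksum.intValue()
                                             : ss.bytesPerChecksum;
        int wps = ss.writePacketSize; // never overridable by the application

        if (bpc <= 0 || wps <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        if (bs % bpc != 0 || bs % wps != 0) {
            throw new IllegalArgumentException("blockSize " + bs
                    + " must be a multiple of bytesPerChecksum " + bpc
                    + " and writePacketSize " + wps);
        }
        return new long[] {bs, bpc, wps};
    }

    public static void main(String[] args) {
        // Example values only; real SS defaults depend on the cluster.
        ServerDefaults ss = new ServerDefaults(64L * 1024 * 1024, 512, 64 * 1024);
        long[] r = resolve(ss, 128L * 1024, null);
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```

Validating on the client against the server's writePacketSize is one possible placement of the check; the server could equally reject the request itself.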
The actual mechanism for createOpts is still to be determined but I am
strongly leaning towards varargs rather than an options object with setters and
getters.
So please comment on this proposal ASAP. The above proposal was derived after
looking at several alternatives and lots of discussion; thanks to all those who
participated.
------
Some details on how wd and home dirs are derived.
The wd is derived from the default fs; e.g. if the defaultFS is localFS, the wd
of the process is used to initialize the wd. So HDFS could have an SS default
for its wd, which would be set to the user's home directory in that cluster.
Similarly, the homedir is derived from the defaultFS using server-side config.
(Note: we could have the homedir set on the client side by config vars, but I
like the way we currently do this for the local filesystem and it would be
consistent to derive it from the SS; hence the home dir in a cluster becomes a
property of the cluster's deployment. This also means fewer client-side config
variables.)
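
A small sketch of the wd derivation described above, under the assumption that the local file system is identified by the "file" URI scheme and that the server-side home directory has already been fetched by some means not shown here. The class and method names are hypothetical.

```java
import java.net.URI;

// Hypothetical sketch: pick the initial working directory from the default fs.
final class WorkingDirSketch {
    /**
     * @param defaultFs         URI of the default file system (the client's "/")
     * @param serverSideHomeDir home directory reported by the target cluster
     *                          (assumed to be obtained from SS config)
     */
    static String initialWorkingDir(URI defaultFs, String serverSideHomeDir) {
        if ("file".equals(defaultFs.getScheme())) {
            // Local file system: start in the process's current directory.
            return System.getProperty("user.dir");
        }
        // Any other file system: start in the user's home dir on that cluster.
        return serverSideHomeDir;
    }

    public static void main(String[] args) {
        System.out.println(initialWorkingDir(URI.create("file:///"), "/user/alice"));
        System.out.println(initialWorkingDir(
                URI.create("hdfs://nn.example.com:8020/"), "/user/alice"));
    }
}
```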
> Improved files system interface for the application writer.
> -----------------------------------------------------------
>
> Key: HADOOP-4952
> URL: https://issues.apache.org/jira/browse/HADOOP-4952
> Project: Hadoop Common
> Issue Type: Improvement
> Affects Versions: 0.21.0
> Reporter: Sanjay Radia
> Assignee: Sanjay Radia
> Attachments: FileContext3.patch, FileContext5.patch,
> FileContext6.patch, FileContext7.patch, Files.java, Files.java,
> FilesContext1.patch, FilesContext2.patch
>
>
> Currently the FileSystem interface serves two purposes:
> - an application writer's interface for using the Hadoop file system
> - a file system implementer's interface (e.g. hdfs, local file system, kfs,
> etc)
> This Jira proposes that we provide a simpler interface for the application
> writer and leave the FileSystem interface for the implementer of a
> filesystem.
> - The FileSystem interface has a confusing set of methods for the application
> writer
> - We could make it easier to take advantage of the URI file naming
> ** Current approach is to get FileSystem instance by supplying the URI and
> then access that name space. It is consistent for the FileSystem instance to
> not accept URIs for other schemes, but we can do better.
> ** The special copyFromLocalFile can be generalized as a copyFile where the
> src or target can be generalized to any URI, including the local one.
> ** The proposed scheme (below) simplifies this.
> - The client side config can be simplified.
> ** New config() by default uses the default config. Since this is the common
> usage pattern, one should not need to always pass the config as a parameter
> when accessing the file system.
> -
> ** It does not handle multiple file systems too well. Today a site.xml is
> derived from a single Hadoop cluster. This does not make sense for multiple
> Hadoop clusters which may have different defaults.
> ** Further one should need very little to configure the client side:
> *** Default file system.
> *** Block size
> *** Replication factor
> *** Scheme to class mapping
> ** It should be possible to take block size and replication factor defaults
> from the target file system, rather than the client-side config. I am not
> suggesting we don't allow setting client side defaults, but most clients do
> not care and would find it simpler to take the defaults for their systems
> from the target file system.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.