[
https://issues.apache.org/jira/browse/HADOOP-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751763#action_12751763
]
Todd Lipcon commented on HADOOP-4952:
-------------------------------------
I looked through FileContext9.patch in detail last night. My notes are below.
Obviously this is still a preliminary patch, but I wasn't sure I'd have time to
look again for a few weeks, so figured I'd give a detailed review now - feel
free to ignore notes about things that are preliminary bits and won't be in the
final patch.
FileContext.java;
nits:
- apache license headers
- typos:
FileContext.java:22 operationS
"imples" -> implies
"imples a default" (extra space)
- check for more than 80 columns (eg FileContext.java:33, 39)
- "file related" -> "file-related" (line 56)
- "all file systems instance" -> "all file system instances"
- overall "filesystem" seems better to me than "file system"
- inconsistent use of "you" vs "one" becomes confusing to read
- Extraneous bit of javadoc left right before FileContext class definition
- looks like notes to self.
- indentation on initFromConfig method is too deep
- javadoc for getFSofPath is not filled in
- extra space before "theConfig" on line 195
- extra ';' on line 209
- new member variables defined in the middle of the class (LOCAL_FS_URI and
defaultPerm)
- typo: "relavent" line 267
- javadoc for setWorkingDirectory refers to "newWdir" but the param is
called "p". The description should be under the @param p section.
- indentation on line 513
- Javadoc for setTimes: I'd rather see "since the epoch" instead of "since
Jan 1, 1970".
code:
- "conf" is the common name rather than "theConfig"
- rather than FileSystem.getInitialWorkingDirectory() returning null by
default and having the check in FileContext, have it default to just passing
through to getHomeDirectory() in FileSystem.java
- is defaultFsURI.equals(defaultFS.getUri()) an invariant? If so, why are
there two separate member variables? If not, when is that not the case?
- could makeAbsolute be made public? It seems generally useful
- style: "isNotSchemeWithRelative" should be called
"checkNotSchemeWithRelative" - otherwise it looks like it should return a
boolean
- also, that function claims to throw IOException if it's of the wrong
type, but in fact it throws IllegalArgumentException.
- are these kinds of paths *ever* legal in Hadoop? If not, can this check
go into the Path constructor such that we can never end up with an invalid
object?
- getFSofPath seems like a hackish implementation, but unfortunately that's
the API that FileSystem gives. Could that function in FileSystem be refactored
a bit so there is a more sensical way of checking ownership of a path without
having to throw/catch an exception like this?
- JavaDoc for FileContext() constructor - again this is the normal behavior
for hadoop configuration. I think you can leave it at just "using a default
Configuration". In fact, I am sort of in favor of removing this constructor
entirely and forcing the user to explicitly choose to construct a new
Configuration().
- the static factory methods (getFileContext, getLocalFSFileContext) seem
confusing in that they are not singleton instances. I can see this tripping
people up - if you think they should be new instances every time it might be
worth renaming the methods to "createFileContext" or "newFileContext".
- I'm not sure I see the point of the factory methods that call
setDefaultFileSystem for you - again the "get" name makes it seem like it would
be the same instance every time if you call it multiple times on the same FS.
- Why allow the user to pass either URI or FileSystem instances? There's
less code if you just provide one, and the user can always go from one to the
other. I'm in favor of fewer code paths where possible.
- setWorkingDirectory - why not call makeAbsolute here?
- CreatOpts should be CreateOptions. I think Ken Thompson once remarked
that if he had to do anything differently in his past he would have spelled
O_CREAT right :)
- ReflactionFac -> ReplicationFactor
- I'm not sold on the varargs/CreatOpts solution. What about having a class
using boxed Integers, where "null" means "not set by the user". Then use the
method chaining pattern so you can do new
CreateOptions().setDefaultReplication(3).setBlockSize(64*1024*1024).etc.etc.etc.
Given this structure we could refactor the general FileSystem interface to
take that single parameter, rather than the hard-to-migrate
lots-of-arguments-to-open route we've got now.
- If left that way, at least change the sequence of "ifs" to "else if's,
and add an else clause at the end with "throw new IllegalArgumentException(...)"
- mkdirs: this javadoc claims that it returns false if the directory
already exists, but in my experience that isn't true, at least with HDFS. I
vaguely recall you opened another JIRA about that elsewhere, so maybe
out-of-scope for this one.
- mkdirs: either don't accept null as an argument, or document that null
means defaults in the javadoc
- style: abs_dir -> absDir
- delete: another maybe out-of-scope thing: I believe delete with
recursive=false will still fail even on an empty directory. Would be nice to
clear that up before 0.21.
- rename: would be nice to clarify these semantics as well. Again I think I
saw another JIRA about this.
- rename: rather than comparing URIs for equality, would it be possible to
implement FileSystem.equals to default to using URIs, but allow other classes
to override?
- setOwner: some kind of assertion that (username != null || groupname !=
null)?
- setVerifyChecksum(bool) - since this isn't Path-specific, it seems odd
that it's present here. The pattern imho should be anything that takes a Path
should be present in FileContext and forward through to the appropriate FS, but
non-Path-specific stuff shouldn't be delegated.
- the "Not implemented yet" method should just be removed, imo
- getFSStatus might be better named getFsStatus to match the type. Also I
think better to call the no-param getStatus() rather than getStatus(null),
stylewise
- getFSStatus() doesn't makeAbsolute the path
- Why does the inner class have a redundant reference thisFC? Since it's
not a static inner class this reference already exists implicitly. The purpose
of this inner class also isn't clear to me.
- Agree with the "TBD not very clear" on line 1263 - this code path is
really hard to follow.
- Line 1263: should use srcFS.equals(dstFS) here
overall:
- I'm not 100% sold that your working directory can be on a different
filesystem than the default filesystem. In particular, I think it's unexpected
that "foo" and "/foo" might be on different filesystems. Could this be changed
such that there's a concept of a "home directory" and "working directory", and
you can always "cd home". Then absolute paths and relative paths would always
be on the same filesystem as the working directory.
- I'm unclear of the definition of "Server Side Defaults" from the JavaDoc -
are they in fact server side? They're just configurations, right? In general
the discussion of Configuration in FileContext seems somewhat extraneous - the
behavior here is the same as it is everywhere else in Hadoop, until HDFS-578 is
done. If HDFS-578 is a true pre-requisite to this, can you link the tickets?
- the isNotSchemeWithRelative function is only called in a few places - it's
unclear to me whether this is on purpose or not. Maybe this should just be part
of makeAbsolute?
- There's already some confusion/inclarity in the Hadoop docs about what
"qualified" vs "absolute" means in terms of paths. If the JavaDocs here could
help clarify that, I think that would be a big plus. (what's the difference
between makeQualified and makeAbsolute?)
- The "TBD"s should be changed to "TODOs" so that they flag on the Hudson
test-patch. And, of course, they should be fixed before this gets committed.
- A lot of the stuff in util is copied straight from FileSystem.java. This
code duplication should be avoided. This also includes DEFAULT_FILTER,
GlobFilter below
FsConfig.java:
nits:
- typo: line 10: static method*s*
- missing apache license
- "see the note on the javdoc" -> jav*a*doc
code:
- the "NOTE" section of the class javadoc doesn't make sense to me. What
Files class?
- What is SERVER_DEFAULT(-1) referring to? I see neither the constant nor a
literal -1 anywhere. [edit: looks like this is part of HDFS-578, see above]
- FS_HOME_DIR default isn't quite correct.. it would be /user/<foo> but I
don't think you can do a ${...} style substitution from this context. So the
calculation of the default would have to be done in code, or else just not
provide a default at all and throw an exception if for some reason it can't be
grabbed from the Configuration object.
- Would be nice to have javadocs for each of the static getters
- getImplClass: check that the URI has a scheme. If it has no scheme, throw
an IllegalArgumentException? Add javadoc to that effect.
- The class javadoc indicates that there can be instances of this class, but
it is entirely made of static methods. Remove that section of the javadoc and
make the class abstract and final.
FileSystem.java:
- line 544 typo agains -> against
- typo: absolutPermission -> absolutePermission (perhaps this is a new brand
of Hadoop sponsored vodka! +1!)
FileContextBaseTest.java:
- line 102: should duplicate the assertion from line 103 before setting the
working directory here too - make sure we ended up at the right absolute path
by means of the relative "cd"s
- should test setWorkingDirectory on a nonexistent dir
- also, all of the setWorkingDirectory calls are called with
already-qualified paths. Should try some with non-qualified paths too.
- test on line 117 confirms my earlier comment that mkdirs returns true if
you call it on an already-existing directory
> Improved files system interface for the application writer.
> -----------------------------------------------------------
>
> Key: HADOOP-4952
> URL: https://issues.apache.org/jira/browse/HADOOP-4952
> Project: Hadoop Common
> Issue Type: Improvement
> Affects Versions: 0.21.0
> Reporter: Sanjay Radia
> Assignee: Sanjay Radia
> Attachments: FileContext3.patch, FileContext5.patch,
> FileContext6.patch, FileContext7.patch, FileContext9.patch, Files.java,
> Files.java, FilesContext1.patch, FilesContext2.patch
>
>
> Currently the FIleSystem interface serves two purposes:
> - an application writer's interface for using the Hadoop file system
> - a file system implementer's interface (e.g. hdfs, local file system, kfs,
> etc)
> This Jira proposes that we provide a simpler interfaces for the application
> writer and leave the FilsSystem interface for the implementer of a
> filesystem.
> - Filesystem interface has a confusing set of methods for the application
> writer
> - We could make it easier to take advantage of the URI file naming
> ** Current approach is to get FileSystem instance by supplying the URI and
> then access that name space. It is consistent for the FileSystem instance to
> not accept URIs for other schemes, but we can do better.
> ** The special copyFromLocalFIle can be generalized as a copyFile where the
> src or target can be generalized to any URI, including the local one.
> ** The proposed scheme (below) simplifies this.
> - The client side config can be simplified.
> ** New config() by default uses the default config. Since this is the common
> usage pattern, one should not need to always pass the config as a parameter
> when accessing the file system.
> -
> ** It does not handle multiple file systems too well. Today a site.xml is
> derived from a single Hadoop cluster. This does not make sense for multiple
> Hadoop clusters which may have different defaults.
> ** Further one should need very little to configure the client side:
> *** Default files system.
> *** Block size
> *** Replication factor
> *** Scheme to class mapping
> ** It should be possible to take Blocksize and replication factors defaults
> from the target file system, rather then the client size config. I am not
> suggesting we don't allow setting client side defaults, but most clients do
> not care and would find it simpler to take the defaults for their systems
> from the target file system.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.