[
https://issues.apache.org/jira/browse/MAPREDUCE-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503287#comment-16503287
]
Steve Loughran commented on MAPREDUCE-7101:
-------------------------------------------
* as noted by Wangda, S3 doesn't have a meaningful directory modtime, so
probing for changes that way is out
* and as noted by Vinod, big HDFS clusters need this
I don't want FS-specific algorithms (yet...), because it complicates testing &
integration. I'd prefer some control here. Making the check something you can
optionally disable is good for testing, as you can test it locally.
w.r.t. FS-specific algorithms, there's not much they can do differently
other than:
* poll for some explicitly set file (e.g. .UPDATE) which kicks off the scan,
adding a new problem: how does that file get created?
* integrate with cloud event sources which are set up to add a new event when a
file is added to a container. Lots of integration, testing & scale fun there:
not something I'd be in a rush to do.
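A minimal sketch of the marker-file option, using java.nio on a local directory as a stand-in for a Hadoop FileSystem. The class and method names are hypothetical; only the .UPDATE name comes from the example above. It does not answer the "how to create?" question: something still has to write the marker after placing files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of marker-file polling; not part of the JHS. */
class MarkerFileProbe {
  // Marker name taken from the ".UPDATE" example; any agreed name would do.
  static final String MARKER = ".UPDATE";

  /**
   * Returns true if a marker was present, deleting it so that the next
   * poll reports false until a writer creates the marker again.
   */
  static boolean rescanRequested(Path dir) throws IOException {
    return Files.deleteIfExists(dir.resolve(MARKER));
  }
}
```

A poller would call rescanRequested(dir) on each interval and scan only when it returns true.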
One thing to consider though: if the scanning includes subdirectories then
listFiles(path, recursive=true) is orders of magnitude more efficient on S3A
(and any other connector which can do bulk listings): we want to use that for
any recursive polling.
Proposed: initially, just let us turn off the directory timestamp check, and
maybe switch to listFiles() for the probe, with a recurse option to do the
recursive scan.
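A rough sketch of what that proposal could look like, written against java.nio on a local directory so it runs without Hadoop on the classpath. Everything here is an assumption for illustration: the class name, the two boolean options and the returned list are hypothetical, not the JHS code. In Hadoop the listing step would presumably be FileSystem.listFiles(path, recursive) rather than Files.walk()/Files.list().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch; option names are illustrative, not real JHS settings. */
class DirScanner {
  private final boolean checkModTime; // false = scan on every call
  private final boolean recursive;    // true = include subdirectories
  private long lastModTime = -1;

  DirScanner(boolean checkModTime, boolean recursive) {
    this.checkModTime = checkModTime;
    this.recursive = recursive;
  }

  /** Returns the files found, or an empty list when the scan was skipped. */
  synchronized List<Path> scanIfNeeded(Path dir) throws IOException {
    if (checkModTime) {
      long newModTime = Files.getLastModifiedTime(dir).toMillis();
      if (newModTime == lastModTime) {
        return Collections.emptyList(); // directory timestamp unchanged: skip
      }
      lastModTime = newModTime;
    }
    // A single listing call serves as the probe, flat or recursive.
    try (Stream<Path> files = recursive ? Files.walk(dir) : Files.list(dir)) {
      return files.filter(Files::isRegularFile).sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Paths.get(args.length > 0 ? args[0] : ".");
    // Timestamp check off, recursion on: the configuration proposed above.
    new DirScanner(false, true).scanIfNeeded(dir).forEach(System.out::println);
  }
}
```

With checkModTime set to false the scanner never consults the directory timestamp, so it behaves the same whether or not the store maintains one.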
> Revisit behavior of JHS scan file behavior
> ------------------------------------------
>
> Key: MAPREDUCE-7101
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7101
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Wangda Tan
> Priority: Critical
>
> Currently, the JHS scans the directory only if the modification time of the
> *directory* has changed:
> {code}
> public synchronized void scanIfNeeded(FileStatus fs) {
> long newModTime = fs.getModificationTime();
> if (modTime != newModTime) {
> <... omitted some logics ...>
> // reset scanTime before scanning happens
> scanTime = System.currentTimeMillis();
> Path p = fs.getPath();
> try {
> scanIntermediateDirectory(p);
> {code}
> This logic relies on an assumption that, the directory's modification time
> will be updated if a file got placed under the directory.
> However, the semantic of directory's modification time is not consistent in
> different FS implementations. For example, MAPREDUCE-6680 fixed some issues
> of truncated modification time. And HADOOP-12837 mentioned on S3, the
> directory's modification time is always 0.
> I think we need to revisit the behavior of this logic to make it work more
> robustly on different file systems.