[ https://issues.apache.org/jira/browse/MAPREDUCE-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503287#comment-16503287 ]
Steve Loughran commented on MAPREDUCE-7101:
-------------------------------------------

* as noted by Wangda, S3 doesn't have a directory modtime, so probing for changes that way is out
* and as noted by Vinod, big HDFS clusters need this

I don't want FS-specific algorithms (yet...), because it complicates testing & integration. I'd prefer some control here. Making the check something you can optionally disable is good for testing, as you can test it locally.

w.r.t. FS-specific algorithms, there's not much they can do differently there other than
* poll for some explicitly set file (e.g. .UPDATE), which kicks it off, adding a new problem: how to create it?
* integrate with cloud event sources which are set up to add a new event when a file is added to a container. Lots of integration, testing & scale fun there: not something I'd be in a rush to do.

One thing to consider though: if the scanning includes subdirectories then listFiles(path, recursive=true) is orders of magnitude more efficient on S3A (and any other connector which can do bulk listings): we want to use that for any recursive polling.

Proposed: initially, just let us turn off the directory timestamp check, maybe switch to listFiles() for the probe, with a recurse option to do the recursive scan

> Revisit behavior of JHS scan file behavior
> ------------------------------------------
>
>                 Key: MAPREDUCE-7101
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7101
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Priority: Critical
>
> Currently, the JHS scans the directory only if the modification time of the *directory* has changed:
> {code}
> public synchronized void scanIfNeeded(FileStatus fs) {
>   long newModTime = fs.getModificationTime();
>   if (modTime != newModTime) {
>     <... omitted some logics ...>
>     // reset scanTime before scanning happens
>     scanTime = System.currentTimeMillis();
>     Path p = fs.getPath();
>     try {
>       scanIntermediateDirectory(p);
> {code}
> This logic relies on the assumption that the directory's modification time
> will be updated if a file is placed under the directory.
> However, the semantics of a directory's modification time are not consistent
> across different FS implementations. For example, MAPREDUCE-6680 fixed some
> issues with truncated modification times, and HADOOP-12837 mentioned that on
> S3 the directory's modification time is always 0.
> I think we need to revisit the behavior of this logic to make it work more
> robustly on different file systems.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
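
The proposal in the comment above (make the directory timestamp check optional, and probe with a file listing that can recurse) could look roughly like the following. This is only a sketch under stated assumptions, not the JobHistoryServer code: the class name ScanSketch and its two flags are hypothetical, and java.nio.file stands in for Hadoop's FileSystem API so that the example runs without a cluster.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration only; names and flags are not from Hadoop. */
public class ScanSketch {
    // Hypothetical switch: when false, the directory timestamp is ignored
    // and every poll triggers a scan.
    private final boolean checkDirModTime;
    // Hypothetical switch: when true, the probe descends into subdirectories.
    private final boolean recursive;
    // Last directory modification time seen; -1 means "never scanned".
    private long modTime = -1L;

    public ScanSketch(boolean checkDirModTime, boolean recursive) {
        this.checkDirModTime = checkDirModTime;
        this.recursive = recursive;
    }

    /** Decide whether this poll should rescan, given the reported directory modtime. */
    public synchronized boolean shouldScan(long newModTime) {
        if (!checkDirModTime) {
            return true; // do not trust the directory modtime at all
        }
        if (modTime != newModTime) {
            modTime = newModTime;
            return true;
        }
        return false;
    }

    /** List the regular files under dir, recursing only if configured to. */
    public List<Path> listFiles(Path dir) throws IOException {
        try (Stream<Path> entries = recursive ? Files.walk(dir) : Files.list(dir)) {
            return entries.filter(Files::isRegularFile)
                          .sorted()
                          .collect(Collectors.toList());
        }
    }
}
```

With the check enabled and a store that always reports a directory modtime of 0, shouldScan(0) returns true on the first poll and false on every later one, which is the failure described in the issue. With the check disabled, every poll lists the directory, so the cost of the listing call becomes the thing to watch.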