[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507097#comment-16507097
 ] 

Wangda Tan commented on MAPREDUCE-7101:
---------------------------------------

[~ste...@apache.org] / [~ehiggs] / [~rohithsharma]

Thanks for your suggestions.  To me, there are two major options.

1. As mentioned by [~rohithsharma], make the behavior (skipping the directory 
timestamp check) configurable, to avoid surprising users of HDFS-backed 
clusters. This config can be marked as private/unstable so that we can change 
it in the future. 

2. Considering that cloud storage systems are not identical, as mentioned, 
another approach is to add an option that turns the whole 
{{scanIntermediateDirectory}} into a pluggable folder-scan policy, with 
{{UserLogDir}} private to the policy. We could then list with recursive=true 
for some FS and recursive=false for others, poll special files, etc.
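
To make option 2 more concrete, here is a rough sketch of what a pluggable 
folder-scan policy could look like. It is only an illustration under stated 
assumptions: {{FolderScanPolicy}}, {{ModTimePolicy}}, {{AlwaysScanPolicy}} and 
{{PolicyFactory}} are hypothetical names, not existing Hadoop classes, and the 
scheme-to-policy mapping is made up for the example. Plain strings and longs 
stand in for {{Path}} and {{FileStatus}} so the sketch runs without Hadoop on 
the classpath.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical policy interface; not an existing Hadoop API. */
interface FolderScanPolicy {
    /** Decides whether a directory should be rescanned, given its reported modification time. */
    boolean shouldScan(String dir, long reportedModTime);
}

/** For file systems that update a directory's modification time when a child is added. */
final class ModTimePolicy implements FolderScanPolicy {
    // Last modification time seen per directory; this state stays private to the policy.
    private final Map<String, Long> lastSeen = new HashMap<>();

    @Override
    public synchronized boolean shouldScan(String dir, long reportedModTime) {
        Long prev = lastSeen.put(dir, reportedModTime);
        // Scan on first sight, or whenever the reported time differs from the last one.
        return prev == null || prev != reportedModTime;
    }
}

/** For stores whose directory modification time carries no information: always scan. */
final class AlwaysScanPolicy implements FolderScanPolicy {
    @Override
    public boolean shouldScan(String dir, long reportedModTime) {
        return true;
    }
}

final class PolicyFactory {
    /** Picks a policy by URI scheme. The mapping is illustrative only. */
    static FolderScanPolicy forScheme(String scheme) {
        return "hdfs".equals(scheme) ? new ModTimePolicy() : new AlwaysScanPolicy();
    }
}

class PolicyDemo {
    public static void main(String[] args) {
        FolderScanPolicy hdfs = PolicyFactory.forScheme("hdfs");
        FolderScanPolicy objectStore = PolicyFactory.forScheme("s3a");
        // Same directory observed twice with an unchanged modification time.
        System.out.println("hdfs first:   " + hdfs.shouldScan("/user/alice", 100L));
        System.out.println("hdfs second:  " + hdfs.shouldScan("/user/alice", 100L));
        System.out.println("store second: " + objectStore.shouldScan("/user/alice", 0L));
    }
}
```

A real policy would probably also own the listing strategy (recursive or not) 
and any polling of special files; those parts are left out here.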

I would prefer not to {{turn off the directory timestamp check}}, and I prefer 
#1 over #2. 
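
For reference, option 1 could be as small as a flag in front of the existing 
comparison. The sketch below is a simplified stand-in, not the actual JHS 
code: {{ScanGate}} and {{alwaysScan}} are hypothetical names, the real config 
key is not decided here, and the initial {{modTime}} of 0 is an assumption of 
the sketch. A counter stands in for the call to 
{{scanIntermediateDirectory}}.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ScanGate {
    // Hypothetical switch; in a real patch this would be read from a
    // private/unstable configuration key.
    private final boolean alwaysScan;
    // Last directory modification time seen. Starting at 0 is an assumption of this sketch.
    private long modTime = 0L;
    private final AtomicInteger scans = new AtomicInteger();

    ScanGate(boolean alwaysScan) {
        this.alwaysScan = alwaysScan;
    }

    /** Returns true if this observation triggered a scan. */
    synchronized boolean scanIfNeeded(long newModTime) {
        // With the flag off this is the timestamp check as it exists today;
        // with the flag on the check is skipped and every call scans.
        if (alwaysScan || modTime != newModTime) {
            modTime = newModTime;
            scans.incrementAndGet(); // stand-in for scanIntermediateDirectory(p)
            return true;
        }
        return false;
    }

    int scanCount() {
        return scans.get();
    }

    public static void main(String[] args) {
        ScanGate checkOn = new ScanGate(false);
        ScanGate checkOff = new ScanGate(true);
        // A store that always reports modification time 0 for directories.
        System.out.println("check on:  " + checkOn.scanIfNeeded(0L));
        System.out.println("check off: " + checkOff.scanIfNeeded(0L));
    }
}
```

With the flag off, a store that always reports 0 never triggers a scan in this 
sketch, which is the failure mode under discussion; with the flag on, every 
call scans, which is why the default should stay off for HDFS-backed clusters.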

Thoughts? Can we reach a conclusion on this so that we can start fixing the 
problem sooner, if possible?

 

> Revisit behavior of JHS scan file behavior
> ------------------------------------------
>
>                 Key: MAPREDUCE-7101
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7101
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Priority: Critical
>
> Currently, the JHS scans a directory only if the modification time of the 
> *directory* has changed: 
> {code} 
>     public synchronized void scanIfNeeded(FileStatus fs) {
>       long newModTime = fs.getModificationTime();
>       if (modTime != newModTime) {
>         <... some logic omitted ...>
>         // reset scanTime before scanning happens
>         scanTime = System.currentTimeMillis();
>         Path p = fs.getPath();
>         try {
>           scanIntermediateDirectory(p);
> {code}
> This logic relies on the assumption that a directory's modification time 
> is updated whenever a file is placed under the directory.
> However, the semantics of a directory's modification time are not consistent 
> across FS implementations. For example, MAPREDUCE-6680 fixed some issues 
> with truncated modification times, and HADOOP-12837 mentions that on S3 the 
> directory's modification time is always 0.
> I think we need to revisit this logic to make it work more robustly on 
> different file systems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
