> On June 14, 2016, 7:24 p.m., Mike Percy wrote:
> > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirMatcher.java,
> >  line 179
> > <https://reviews.apache.org/r/48161/diff/2-3/?file=1417012#file1417012line179>
> >
> >     Why negate? Why not write this part as:
> >     
> >       currentParentDirMTime >= lastCheckedTime?
> >     
> >     Also I am confused about why you are rounding this down to the nearest 
> > second. Would you mind explaining the reasoning behind your logic?
> >     
> >     I don't understand what the intention is, but I don't think it's 
> > working as expected. Below is an example of why I think it's not working.
> >     
> >     Example:
> >     
> >     Iteration 1)
> >     
> >     System.currentTimeMillis() returns 1465931328009 so we round this down 
> > to 1465931328000 and assign that to 'now'.
> >     
> >     currentParentDirMTime = 1465931328005 which is before the after 'now' 
> > so we process the directory. We also set lastSeenParentDirMTime = 
> > 1465931328005
> >     
> >     We assign lastCheckedTime = 1465931328000
> >     
> >     Iteration 2)
> >     
> >     System.currentTimeMillis() returns 1465931328500 so we round this down 
> > to 1465931328000 and assign that to now (turns out this is the same as 
> > before).
> >     
> >     currentParentDirMTime is still 1465931328005 because it hasn't changed 
> > since iteration 1.
> >     
> >     lastCheckedTime is 1465931328000 which is less than 
> > currentParentDirMTime so we process the directory again.
> >     
> >     Expected result: We don't process the same directory again 
> > unnecessarily.
> 
> Attila Simon wrote:
>     Negate: my intention was to express that "(system) clock hasn't passed 
> that second yet" 
>     
>     Why checking more: I guess this was stated in the comment as well. 
>     Your point is correct. The logic might double check even if nothing 
> changed (or changed but you cannot tell the difference). This was the price 
> which had to be paid to make sure that every change has been captured for 
> sure. Also please note that on many systems mtime reported by 
> File.lastModified() is truncated to seconds (but would cause similar problems 
> if it would report mtime in millis only the opportunity window would be 
> smaller: 1 millisec).
>     Basically you don't want to cache a state which might be invalid. You 
> want to cache a state which is valid for sure.
>     
>     Change 1 happens at  1465931328009 which FS rounds down to 1465931328000
>     Change 2 may or may not happen at 1465931328509 which FS rounds down to 
> 1465931328000 as well
>     
>     If iteration happened after 1465931329000 then all good. You can be sure 
> that now checking the content of the dir can be cached as it is all settled 
> (mtime is from prev second). But if iteration happened anytime between 
> 8000-8999 then you cannot be sure whether it cached the state after change 1 
> or change 2. If it cached the file list as 1465931328000 then it may not 
> include change 2 at all. And this is critical: further invocations would only 
> look at 1465931328000 ?= mtime (which would return true in case of no other 
> changes) and happily return cached content which still might or might not 
> include change 2. Which would be sometimes wrong (badly broken). Logic has to 
> be sure that clock passed the second of mtime and if that is not the case 
> then it has to check the dir for updates.
>     
>     Without this some of the junit tests fail. (Checks and changes happen in 
> the very same sec there)
>     
>     Changed javadoc to clarify a bit more.

Alright, thanks for clarifying. I wouldn't personally prefer to write the 
negation like but the logic makes sense. I am not sure which filesystems only 
offer second granularity but it's reasonable to cater to the lowest common 
denominator and support a wide range of systems.


> On June 14, 2016, 7:24 p.m., Mike Percy wrote:
> > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirMatcher.java,
> >  line 225
> > <https://reviews.apache.org/r/48161/diff/2-3/?file=1417012#file1417012line225>
> >
> >     Make this return void and specify that 'files' is sorted in-place
> 
> Attila Simon wrote:
>     Could you please explain it why?
>     
>     I guess this is a very similar debate to the Collection.sort() signature. 
> I understand the points of that debate still I found that every time when I 
> see Collection.sort in code it cause me mental overhead whether it returns 
> something which wasn't assigned. Yes I usually I check the documentation and 
> it is clear, but I have to check the docs.
>     On the other hand if I see a sort call which was assigned to a varible I 
> know for sure that I can read further as nothing problematic is here. 
> (Compiler checks that a List is returned so it must have been the sorted 
> version)
>     
>     I think knowing whether the sorting was in place or not should happen at 
> the time when selecting a sort implementation amongst other available 
> implementations thus looking at the docs/code is required there.

The reason I think it's clearer is because if the function returns a collection 
then the reader may expect it to act like Python's sorted() function, which 
doesn't mutate its argument but returns a sorted copy. On the other hand, if 
it's got a void return like Collections.sort() then it's obvious to the reader 
that it's sorting the array in place, so you don't have to worry about 
potentially keeping mental track of the unsorted and sorted variables being 
used after the call. That said, if you feel strongly about it then we can leave 
it. It's a private API anyway.


- Mike


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48161/#review137566
-----------------------------------------------------------


On June 19, 2016, 11:45 p.m., Attila Simon wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48161/
> -----------------------------------------------------------
> 
> (Updated June 19, 2016, 11:45 p.m.)
> 
> 
> Review request for Flume.
> 
> 
> Bugs: FLUME-2918
>     https://issues.apache.org/jira/browse/FLUME-2918
> 
> 
> Repository: flume-git
> 
> 
> Description
> -------
> 
> The way TailDir source checks which files should be tracked was improved. 
> Existing implementation caused unneccessary high CPU usage for huge (+50K 
> files) directories. This fix allows users to eliminate continous listing of 
> parent directory (on each Source.process invocation) and introduce a more 
> performant method for listing&matching files.
> 
> used java.nio.file.DirectoryStream to filter files
> made pattern match calculation optionally cached
> added junit tests
> added javadoc
> added license
> 
> 
> Diffs
> -----
> 
>   
> flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/ReliableTaildirEventReader.java
>  5b6d465 
>   
> flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TailFile.java
>  eabd357 
>   
> flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirMatcher.java
>  PRE-CREATION 
>   
> flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirSource.java
>  8816327 
>   
> flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirSourceConfigurationConstants.java
>  6165276 
>   
> flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirMatcher.java
>  PRE-CREATION 
>   
> flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirSource.java
>  f9e614c 
> 
> Diff: https://reviews.apache.org/r/48161/diff/
> 
> 
> Testing
> -------
> 
> mvn clean install -DskipTests -> built
> junit tests for flume-taildir-source module -> passed
> 
> 
> Thanks,
> 
> Attila Simon
> 
>

Reply via email to