> On June 14, 2016, 7:24 p.m., Mike Percy wrote: > > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirMatcher.java, > > line 98 > > <https://reviews.apache.org/r/48161/diff/2-3/?file=1417012#file1417012line98> > > > > actually this should be "consists of" but I'll just fix it on commit if > > I remember to do it > > Mike Percy wrote: > actually if you don't mind making it "consists of" please go ahead
donedone ;) > On June 14, 2016, 7:24 p.m., Mike Percy wrote: > > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirMatcher.java, > > line 225 > > <https://reviews.apache.org/r/48161/diff/2-3/?file=1417012#file1417012line225> > > > > Make this return void and specify that 'files' is sorted in-place Could you please explain it why? I guess this is a very similar debate to the Collection.sort() signature. I understand the points of that debate still I found that every time when I see Collection.sort in code it cause me mental overhead whether it returns something which wasn't assigned. Yes I usually I check the documentation and it is clear, but I have to check the docs. On the other hand if I see a sort call which was assigned to a varible I know for sure that I can read further as nothing problematic is here. (Compiler checks that a List is returned so it must have been the sorted version) I think knowing whether the sorting was in place or not should happen at the time when selecting a sort implementation amongst other available implementations thus looking at the docs/code is required there. > On June 14, 2016, 7:24 p.m., Mike Percy wrote: > > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirMatcher.java, > > line 179 > > <https://reviews.apache.org/r/48161/diff/2-3/?file=1417012#file1417012line179> > > > > Why negate? Why not write this part as: > > > > currentParentDirMTime >= lastCheckedTime? > > > > Also I am confused about why you are rounding this down to the nearest > > second. Would you mind explaining the reasoning behind your logic? > > > > I don't understand what the intention is, but I don't think it's > > working as expected. Below is an example of why I think it's not working. > > > > Example: > > > > Iteration 1) > > > > System.currentTimeMillis() returns 1465931328009 so we round this down > > to 1465931328000 and assign that to 'now'. > > > > currentParentDirMTime = 1465931328005 which is before the after 'now' > > so we process the directory. We also set lastSeenParentDirMTime = > > 1465931328005 > > > > We assign lastCheckedTime = 1465931328000 > > > > Iteration 2) > > > > System.currentTimeMillis() returns 1465931328500 so we round this down > > to 1465931328000 and assign that to now (turns out this is the same as > > before). > > > > currentParentDirMTime is still 1465931328005 because it hasn't changed > > since iteration 1. > > > > lastCheckedTime is 1465931328000 which is less than > > currentParentDirMTime so we process the directory again. > > > > Expected result: We don't process the same directory again > > unnecessarily. Negate: my intention was to express that "(system) clock hasn't passed that second yet" Why checking more: I guess this was stated in the comment as well. Your point is correct. The logic might double check even if nothing changed (or changed but you cannot tell the difference). This was the price which had to be paid to make sure that every change has been captured for sure. Also please note that on many systems mtime reported by File.lastModified() is truncated to seconds (but would cause similar problems if it would report mtime in millis only the opportunity window would be smaller: 1 millisec). Basically you don't want to cache a state which might be invalid. You want to cache a state which is valid for sure. Change 1 happens at 1465931328009 which FS rounds down to 1465931328000 Change 2 may or may not happen at 1465931328509 which FS rounds down to 1465931328000 as well If iteration happened after 1465931329000 then all good. You can be sure that now checking the content of the dir can be cached as it is all settled (mtime is from prev second). But if iteration happened anytime between 8000-8999 then you cannot be sure whether it cached the state after change 1 or change 2. If it cached the file list as 1465931328000 then it may not include change 2 at all. And this is critical: further invocations would only look at 1465931328000 ?= mtime (which would return true in case of no other changes) and happily return cached content which still might or might not include change 2. Which would be sometimes wrong (badly broken). Logic has to be sure that clock passed the second of mtime and if that is not the case then it has to check the dir for updates. Without this some of the junit tests fail. (Checks and changes happen in the very same sec there) Changed javadoc to clarify a bit more. > On June 14, 2016, 7:24 p.m., Mike Percy wrote: > > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirMatcher.java, > > line 227 > > <https://reviews.apache.org/r/48161/diff/2-3/?file=1417012#file1417012line227> > > > > style nit: Maintain spacing according to the Java code conventions: > > http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-142311.html#454 > > > > This should read like: > > > > for (File f : files) { done - Attila ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/48161/#review137566 ----------------------------------------------------------- On June 14, 2016, 4:03 p.m., Attila Simon wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/48161/ > ----------------------------------------------------------- > > (Updated June 14, 2016, 4:03 p.m.) > > > Review request for Flume. > > > Bugs: FLUME-2918 > https://issues.apache.org/jira/browse/FLUME-2918 > > > Repository: flume-git > > > Description > ------- > > The way TailDir source checks which files should be tracked was improved. > Existing implementation caused unneccessary high CPU usage for huge (+50K > files) directories. This fix allows users to eliminate continous listing of > parent directory (on each Source.process invocation) and introduce a more > performant method for listing&matching files. > > used java.nio.file.DirectoryStream to filter files > made pattern match calculation optionally cached > added junit tests > added javadoc > added license > > > Diffs > ----- > > > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/ReliableTaildirEventReader.java > 5b6d465 > > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TailFile.java > eabd357 > > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirMatcher.java > PRE-CREATION > > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirSource.java > 8816327 > > flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirSourceConfigurationConstants.java > 6165276 > > flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirMatcher.java > PRE-CREATION > > flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirSource.java > f9e614c > > Diff: https://reviews.apache.org/r/48161/diff/ > > > Testing > ------- > > mvn clean install -DskipTests -> built > junit tests for flume-taildir-source module -> passed > > > Thanks, > > Attila Simon > >
