[ https://issues.apache.org/jira/browse/NIFI-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093997#comment-16093997 ]
ASF GitHub Bot commented on NIFI-4205: -------------------------------------- Github user ijokarumawak commented on the issue: https://github.com/apache/nifi/pull/2021 Thanks @pvillard31 for your comment. I agree with that, the situation having the same content after rollover occurred should be a very strange edge case, as files being tailed usually timestamps in it. > TailFile can produce duplicated data when it wrongly assumes a file is rotated > ------------------------------------------------------------------------------ > > Key: NIFI-4205 > URL: https://issues.apache.org/jira/browse/NIFI-4205 > Project: Apache NiFi > Issue Type: Bug > Components: Extensions > Affects Versions: 1.1.0 > Reporter: Koji Kawamura > Assignee: Koji Kawamura > > TailFile checks whether a file being tailed is rotated by following lines of > code, and if it determines so, it resets the reader and state for the file: > https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L693 > {code} > // Check if file has rotated > if (rolloverOccurred > || (timestamp <= file.lastModified() && length > > file.length()) > || (timestamp < file.lastModified() && length >= > file.length())) { > // Since file has rotated, we close the reader, create a new one, > and then reset our state. > try { > reader.close(); > getLogger().debug("Closed FileChannel {}", new > Object[]{reader, reader}); > } catch (final IOException ioe) { > getLogger().warn("Failed to close reader for {} due to {}", > new Object[]{file, ioe}); > } > reader = createReader(file, 0L); > position = 0L; > checksum.reset(); > } > {code} > The third condition, newer timestamp but the same file size can work > negatively in some situations. For example: > # If an already fully tailed file is 'touched' and last modified timestamp is > updated. This is the easiest way to produce duplicated content. > # On Windows, if a file is being tailed and updated by an app that writes > logs or some data to it consistently and frequently, then the last modified > timestamp can be delayed to be updated compared to file size. I couldn't find > canonical docs for this behavior, but testing on Windows consistently > produces duplicated data. And the 3rd condition becomes true when such > duplication occurs. > TailFile updates the file timestamp and length when it reads some data from > the file, specifically at these lines: > https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L765 > {code} > timestamp = Math.max(state.getTimestamp(), file.lastModified()); > length = file.length(); > {code} > As mentioned at the 2nd case above, file.lastModified can return a stale > timestamp (or maybe just not being flushed yet) while length is replaced by > the latest value. After this happens, at the next onTrigger cycle, the 3rd > condition becomes true because it detects a newer timestamp. > These conditions are added by NIFI-1170 and NIFI-1959. > A simple flow, TailFile -> SplitText -> (FlowFiles are queued) -> > UpdateAttribute(Stopped) can reproduce this, with a command-line to simulate > frequently updated log file: > {code} > $ for i in `seq 1 10000`; do echo $i >> test.log; done > {code} > The expected result is having 10000 generated FlowFiles queued at the > relationship between SplitText and UpdateAttribute. But on Windows, more > FlowFiles are generated. > By enabling debug level log for TailFile, following log messages can be > confirmed: > {code} > Add this to conf/logback.xml > <logger name="org.apache.nifi.processors.standard.TailFile" level="DEBUG"/> > 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] > o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-0 > 15d-1000-d6c2-c57a61e58a80] Recovering Rolled Off Files; total number of > files rolled off = 0 > 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] > o.a.nifi.processors.standard.TailFile > TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Closed FileChannel > sun.nio.ch.FileChannelImpl@6d2a1eaf > 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] > o.a.nifi.processors.standard.TailFile > TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Created FileChannel > sun.nio.ch.FileChannelImpl@4aefddb3 for C:\logs\test.log > 2017-07-19 10:22:07,150 DEBUG [Timer-Driven Process Thread-3] > o.a.nifi.processors.standard.TailFile > TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Reading lines starting at > position 0 > {code} > The 3rd condition should be removed to avoid having these duplicated data > ingested. Or if there's any specific need, we should discuss about it and > implement additional solution. -- This message was sent by Atlassian JIRA (v6.4.14#64029)