[ 
https://issues.apache.org/jira/browse/NIFI-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Kawamura updated NIFI-4205:
--------------------------------
    Status: Patch Available  (was: Open)

> TailFile can produce duplicated data when it wrongly assumes a file is rotated
> ------------------------------------------------------------------------------
>
>                 Key: NIFI-4205
>                 URL: https://issues.apache.org/jira/browse/NIFI-4205
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.1.0
>            Reporter: Koji Kawamura
>            Assignee: Koji Kawamura
>
> TailFile checks whether a file being tailed is rotated by following lines of 
> code, and if it determines so, it resets the reader and state for the file:
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L693
> {code}
>         // Check if file has rotated
>         if (rolloverOccurred
>                 || (timestamp <= file.lastModified() && length > 
> file.length())
>                 || (timestamp < file.lastModified() && length >= 
> file.length())) {
>             // Since file has rotated, we close the reader, create a new one, 
> and then reset our state.
>             try {
>                 reader.close();
>                 getLogger().debug("Closed FileChannel {}", new 
> Object[]{reader, reader});
>             } catch (final IOException ioe) {
>                 getLogger().warn("Failed to close reader for {} due to {}", 
> new Object[]{file, ioe});
>             }
>             reader = createReader(file, 0L);
>             position = 0L;
>             checksum.reset();
>         }
> {code}
> The third condition, newer timestamp but the same file size can work 
> negatively in some situations. For example:
> # If an already fully tailed file is 'touched' and last modified timestamp is 
> updated. This is the easiest way to produce duplicated content.
> # On Windows, if a file is being tailed and updated by an app that writes 
> logs or some data to it consistently and frequently, then the last modified 
> timestamp can be delayed to be updated compared to file size. I couldn't find 
> canonical docs for this behavior, but testing on Windows consistently 
> produces duplicated data. And the 3rd condition becomes true when such 
> duplication occurs.
> TailFile updates the file timestamp and length when it reads some data from 
> the file, specifically at these lines:
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L765
> {code}
>             timestamp = Math.max(state.getTimestamp(), file.lastModified());
>             length = file.length();
> {code}
> As mentioned at the 2nd case above, file.lastModified can return a stale 
> timestamp (or maybe just not being flushed yet) while length is replaced by 
> the latest value. After this happens, at the next onTrigger cycle, the 3rd 
> condition becomes true because it detects a newer timestamp.
> These conditions are added by NIFI-1170 and NIFI-1959.
> A simple flow, TailFile -> SplitText -> (FlowFiles are queued) -> 
> UpdateAttribute(Stopped) can reproduce this, with a command-line to simulate 
> frequently updated log file:
> {code}
> $ for i in `seq 1 10000`; do echo $i >> test.log; done
> {code}
> The expected result is having 10000 generated FlowFiles queued at the 
> relationship between SplitText and UpdateAttribute. But on Windows, more 
> FlowFiles are generated.
> By enabling debug level log for TailFile, following log messages can be 
> confirmed:
> {code}
> Add this to conf/logback.xml
> <logger name="org.apache.nifi.processors.standard.TailFile" level="DEBUG"/>
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
> o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-0
> 15d-1000-d6c2-c57a61e58a80] Recovering Rolled Off Files; total number of 
> files rolled off = 0
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
> o.a.nifi.processors.standard.TailFile 
> TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Closed FileChannel 
> sun.nio.ch.FileChannelImpl@6d2a1eaf
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
> o.a.nifi.processors.standard.TailFile 
> TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Created FileChannel 
> sun.nio.ch.FileChannelImpl@4aefddb3 for C:\logs\test.log
> 2017-07-19 10:22:07,150 DEBUG [Timer-Driven Process Thread-3] 
> o.a.nifi.processors.standard.TailFile 
> TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Reading lines starting at 
> position 0
> {code}
> The 3rd condition should be removed to avoid having these duplicated data 
> ingested. Or if there's any specific need, we should discuss about it and 
> implement additional solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to