[ 
https://issues.apache.org/jira/browse/NIFI-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Kawamura updated NIFI-4205:
--------------------------------
    Description: 
TailFile checks whether a file being tailed is rotated by following lines of 
code, and if it determines so, it resets the reader and state for the file:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L693
{code}
        // Check if file has rotated
        if (rolloverOccurred
                || (timestamp <= file.lastModified() && length > file.length())
                || (timestamp < file.lastModified() && length >= 
file.length())) {

            // Since file has rotated, we close the reader, create a new one, 
and then reset our state.
            try {
                reader.close();
                getLogger().debug("Closed FileChannel {}", new Object[]{reader, 
reader});
            } catch (final IOException ioe) {
                getLogger().warn("Failed to close reader for {} due to {}", new 
Object[]{file, ioe});
            }

            reader = createReader(file, 0L);
            position = 0L;
            checksum.reset();
        }
{code}

The third condition, newer timestamp but the same file size can work negatively 
in some situations. For example:

# If an already fully tailed file is 'touched' and last modified timestamp is 
updated. This is the easiest way to produce duplicated content.
# On Windows, if a file is being tailed and updated by an app that writes logs 
or some data to it consistently and frequently, then the last modified 
timestamp can be delayed to be updated compared to file size. I couldn't find 
canonical docs for this behavior, but testing on Windows consistently produces 
duplicated data. And the 3rd condition becomes true when such duplication 
occurs.

TailFile updates the file timestamp and length when it reads some data from the 
file, specifically at these lines:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L765
{code}
            timestamp = Math.max(state.getTimestamp(), file.lastModified());
            length = file.length();
{code}

As mentioned at the 2nd case above, file.lastModified can return a stale 
timestamp (or maybe just not being flushed yet) while length is replaced by the 
latest value. After this happens, at the next onTrigger cycle, the 3rd 
condition becomes true because it detects a newer timestamp.

These conditions are added by NIFI-1170 and NIFI-1959.

A simple flow, TailFile -> SplitText -> (FlowFiles are queued) -> 
UpdateAttribute(Stopped) can reproduce this, with a command-line to simulate 
frequently updated log file:
{code}
$ for i in `seq 1 10000`; do echo $i >> test.log; done
{code}
The expected result is having 10000 generated FlowFiles queued at the 
relationship between SplitText and UpdateAttribute. But on Windows, more 
FlowFiles are generated.

By enabling debug level log for TailFile, following log messages can be 
confirmed:
{code}
Add this to conf/logback.xml
<logger name="org.apache.nifi.processors.standard.TailFile" level="DEBUG"/>

2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-0
15d-1000-d6c2-c57a61e58a80] Recovering Rolled Off Files; total number of files 
rolled off = 0
2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
o.a.nifi.processors.standard.TailFile 
TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Closed FileChannel 
sun.nio.ch.FileChannelImpl@6d2a1eaf
2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
o.a.nifi.processors.standard.TailFile 
TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Created FileChannel 
sun.nio.ch.FileChannelImpl@4aefddb3 for C:\logs\test.log
2017-07-19 10:22:07,150 DEBUG [Timer-Driven Process Thread-3] 
o.a.nifi.processors.standard.TailFile 
TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Reading lines starting at 
position 0
{code}

The 3rd condition should be removed to avoid having these duplicated data 
ingested. Or if there's any specific need, we should discuss about it and 
implement additional solution.

Optionally, the 2nd condition should be included in the 'rolloverOccurred' flag 
to make it more readable.

  was:
TailFile checks whether a file being tailed is rotated by following lines of 
code, and if it determines so, it resets the reader and state for the file:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L693
{code}
        // Check if file has rotated
        if (rolloverOccurred
                || (timestamp <= file.lastModified() && length > file.length())
                || (timestamp < file.lastModified() && length >= 
file.length())) {

            // Since file has rotated, we close the reader, create a new one, 
and then reset our state.
            try {
                reader.close();
                getLogger().debug("Closed FileChannel {}", new Object[]{reader, 
reader});
            } catch (final IOException ioe) {
                getLogger().warn("Failed to close reader for {} due to {}", new 
Object[]{file, ioe});
            }

            reader = createReader(file, 0L);
            position = 0L;
            checksum.reset();
        }
{code}

The third condition, newer timestamp but the same file size can work negatively 
in some situations. For example:

# If an already fully tailed file is 'touched' and last modified timestamp is 
updated. This is the easiest way to produce duplicated content.
# On Windows, if a file is being tailed and updated by an app that writes logs 
or some data to it consistently and frequently, then the last modified 
timestamp can be delayed to be updated compared to file size. I couldn't find 
canonical docs for this behavior, but testing on Windows consistently produces 
duplicated data. And the 3rd condition becomes true when such duplication 
occurs.

TailFile updates the file timestamp and length when it reads some data from the 
file, specifically at these lines:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L765
{code}
            timestamp = Math.max(state.getTimestamp(), file.lastModified());
            length = file.length();
{code}

As mentioned at the 2nd case above, file.lastModified returns stale timestamp 
(or maybe just not being flushed yet) while length is replaced by the latest 
value. Once this happens, at the next onTrigger cycle, the 3rd condition 
becomes true because it detects newer timestamp.

These conditions are added by NIFI-1170 and NIFI-1959.

A simple flow, TailFile -> SplitText -> (FlowFiles are queued) -> 
UpdateAttribute(Stopped) can reproduce this, with a command-line to simulate 
frequently updated log file:
{code}
$ for i in `seq 1 10000`; do echo $i >> test.log; done
{code}
The expected result is having generated FlowFiles queued at the relationship 
between SplitText and UpdateAttribute. But on Windows, more FlowFiles are 
generated.

By enabling debug level log for TailFile, following log messages can be 
confirmed:
{code}
Add this to conf/logback.xml
<logger name="org.apache.nifi.processors.standard.TailFile" level="DEBUG"/>

2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-0
15d-1000-d6c2-c57a61e58a80] Recovering Rolled Off Files; total number of files 
rolled off = 0
2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
o.a.nifi.processors.standard.TailFile 
TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Closed FileChannel 
sun.nio.ch.FileChannelImpl@6d2a1eaf
2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
o.a.nifi.processors.standard.TailFile 
TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Created FileChannel 
sun.nio.ch.FileChannelImpl@4aefddb3 for C:\logs\test.log
2017-07-19 10:22:07,150 DEBUG [Timer-Driven Process Thread-3] 
o.a.nifi.processors.standard.TailFile 
TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Reading lines starting at 
position 0
{code}

The 3rd condition should be removed to avoid having these duplicated data 
ingested. Or if there's any specific need, we should discuss about it and 
implement additional solution.

Optionally, the 2nd condition should be included in the 'rolloverOccurred' flag 
to make it more readable.


> TailFile can produce duplicated data when it wrongly assumes a file is rotated
> ------------------------------------------------------------------------------
>
>                 Key: NIFI-4205
>                 URL: https://issues.apache.org/jira/browse/NIFI-4205
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.1.0
>            Reporter: Koji Kawamura
>
> TailFile checks whether a file being tailed is rotated by following lines of 
> code, and if it determines so, it resets the reader and state for the file:
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L693
> {code}
>         // Check if file has rotated
>         if (rolloverOccurred
>                 || (timestamp <= file.lastModified() && length > 
> file.length())
>                 || (timestamp < file.lastModified() && length >= 
> file.length())) {
>             // Since file has rotated, we close the reader, create a new one, 
> and then reset our state.
>             try {
>                 reader.close();
>                 getLogger().debug("Closed FileChannel {}", new 
> Object[]{reader, reader});
>             } catch (final IOException ioe) {
>                 getLogger().warn("Failed to close reader for {} due to {}", 
> new Object[]{file, ioe});
>             }
>             reader = createReader(file, 0L);
>             position = 0L;
>             checksum.reset();
>         }
> {code}
> The third condition, newer timestamp but the same file size can work 
> negatively in some situations. For example:
> # If an already fully tailed file is 'touched' and last modified timestamp is 
> updated. This is the easiest way to produce duplicated content.
> # On Windows, if a file is being tailed and updated by an app that writes 
> logs or some data to it consistently and frequently, then the last modified 
> timestamp can be delayed to be updated compared to file size. I couldn't find 
> canonical docs for this behavior, but testing on Windows consistently 
> produces duplicated data. And the 3rd condition becomes true when such 
> duplication occurs.
> TailFile updates the file timestamp and length when it reads some data from 
> the file, specifically at these lines:
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L765
> {code}
>             timestamp = Math.max(state.getTimestamp(), file.lastModified());
>             length = file.length();
> {code}
> As mentioned at the 2nd case above, file.lastModified can return a stale 
> timestamp (or maybe just not being flushed yet) while length is replaced by 
> the latest value. After this happens, at the next onTrigger cycle, the 3rd 
> condition becomes true because it detects a newer timestamp.
> These conditions are added by NIFI-1170 and NIFI-1959.
> A simple flow, TailFile -> SplitText -> (FlowFiles are queued) -> 
> UpdateAttribute(Stopped) can reproduce this, with a command-line to simulate 
> frequently updated log file:
> {code}
> $ for i in `seq 1 10000`; do echo $i >> test.log; done
> {code}
> The expected result is having 10000 generated FlowFiles queued at the 
> relationship between SplitText and UpdateAttribute. But on Windows, more 
> FlowFiles are generated.
> By enabling debug level log for TailFile, following log messages can be 
> confirmed:
> {code}
> Add this to conf/logback.xml
> <logger name="org.apache.nifi.processors.standard.TailFile" level="DEBUG"/>
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
> o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-0
> 15d-1000-d6c2-c57a61e58a80] Recovering Rolled Off Files; total number of 
> files rolled off = 0
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
> o.a.nifi.processors.standard.TailFile 
> TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Closed FileChannel 
> sun.nio.ch.FileChannelImpl@6d2a1eaf
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] 
> o.a.nifi.processors.standard.TailFile 
> TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Created FileChannel 
> sun.nio.ch.FileChannelImpl@4aefddb3 for C:\logs\test.log
> 2017-07-19 10:22:07,150 DEBUG [Timer-Driven Process Thread-3] 
> o.a.nifi.processors.standard.TailFile 
> TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Reading lines starting at 
> position 0
> {code}
> The 3rd condition should be removed to avoid having these duplicated data 
> ingested. Or if there's any specific need, we should discuss about it and 
> implement additional solution.
> Optionally, the 2nd condition should be included in the 'rolloverOccurred' 
> flag to make it more readable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to