[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981169#comment-14981169 ]
Joe Skora commented on NIFI-994:
--------------------------------

That seems reasonable in general, and I really am trying to help. :-D

I'm not trying to be argumentative, but I don't want you to put a big effort into trying to reach 100% if it is impossible. I'd rather have a simpler processor that makes a best effort, and make sure users know about the potential problems.

Of the many possible scenarios, I picked the following 4. Scenario #2 results in lost content and cannot be fixed even with checksumming. Scenario #4 is not distinguishable from #3 without checksumming the whole file, and it could have additional lost data if there was a log write between #4/T1 and #4/T3.

* Scenario #1 - file grows but no rotation occurs - no data loss
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #2 - rotation truncates file - data written after last processing but before truncation is lost
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2 (**LOST WRITE, UNFIXABLE**)
*# T3 - logger truncates file => len=0, timestamp=T3
*# T4 - logger writes 1K to file => len=1K, timestamp=T4
*# T5 - tail processor processes 0-1K, stores checksum(T5) and timestamp=T4
* Scenario #3 - file grows but no rotation occurs
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #4 - rotation occurs but file size exceeds size at last processing
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - (**log write here would be lost**)
*# T3 - logger rotates file => len=0, timestamp=T3
*# T4 - logger writes 4K to file => len=4K, timestamp=T4 (**PARTIALLY LOST WRITE**) (**LOOKS LIKE #3/T2**)
*# T5 - tail processor processes 2K-4K, stores checksum(T5) and timestamp=T4

As long as the file can change outside of NiFi's control (and could change quickly in some cases), I think it is impossible to design a lossless approach without copying the data, and even that could be impossible depending on volume and load.

Thoughts?

> Processor to tail files
> -----------------------
>
>                 Key: NIFI-994
>                 URL: https://issues.apache.org/jira/browse/NIFI-994
>             Project: Apache NiFi
>          Issue Type: New Feature
>    Affects Versions: 0.4.0
>            Reporter: Joseph Percivall
>            Assignee: Mark Payne
>             Fix For: 0.4.0
>
>         Attachments: 0001-NIFI-994-Initial-import-of-TailFile.patch,
> 0002-NIFI-994-Ensure-that-processor-is-not-valid-due-to-t.patch
>
>
> It's a very common data ingest situation to want to input text into the
> system by "tailing" a file, most commonly log files. Currently we don't have
> an easy way to do this.
> A simple processor to tail a file would benefit many users. There would need
> to be an option to not just tail a file but pick up where the processor left
> off if it is interrupted.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
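
The scenarios in the comment above can be replayed with a small in-memory simulation. This is only a sketch under stated assumptions, not the TailFile patch attached to the issue: the names `TailState`, `poll`, `scenario_2` and `scenario_4` are hypothetical, the "file" is a plain `bytes` value, and SHA-256 over the already-processed prefix stands in for whatever checksum a real processor would store.

```python
import hashlib


class TailState:
    """What the tailer remembers between polls (hypothetical layout)."""

    def __init__(self):
        self.position = 0                             # bytes already processed
        self.checksum = hashlib.sha256(b"").digest()  # digest of content[:position]


def poll(content, state, verify_checksum):
    """Return the bytes a tailer would emit for the current file content."""
    # A file shorter than the stored position has certainly been truncated or rotated.
    rotated = len(content) < state.position
    if not rotated and verify_checksum:
        # Re-read the processed prefix; a different digest means it is a new file.
        prefix_digest = hashlib.sha256(content[:state.position]).digest()
        rotated = prefix_digest != state.checksum
    if rotated:
        state.position = 0
    new_data = content[state.position:]
    state.position = len(content)
    state.checksum = hashlib.sha256(content).digest()
    return new_data


def scenario_2(verify_checksum):
    """Truncation after an unprocessed write: the 'x' bytes are never seen."""
    state = TailState()
    log = b"a" * 2048                              # T0: logger writes 2K
    emitted = poll(log, state, verify_checksum)    # T1: tailer processes 0-2K
    log += b"x" * 2048                             # T2: write that nobody polls
    log = b""                                      # T3: logger truncates
    log += b"c" * 1024                             # T4: logger writes 1K
    emitted += poll(log, state, verify_checksum)   # T5: tailer processes 0-1K
    return emitted


def scenario_4(verify_checksum):
    """Rotation where the new file is already longer than the old position."""
    state = TailState()
    log = b"a" * 2048                              # T0: logger writes 2K
    emitted = poll(log, state, verify_checksum)    # T1: tailer processes 0-2K
    log = b""                                      # T3: logger rotates
    log += b"b" * 4096                             # T4: logger writes 4K
    emitted += poll(log, state, verify_checksum)   # T5: tailer polls again
    return emitted
```

In this sketch, `scenario_4(False)` emits only 4096 bytes because the tailer resumes at offset 2K of the new file, while `scenario_4(True)` emits all 6144 bytes because the prefix digest no longer matches. `scenario_2` never emits the `x` bytes with either setting, since the tailer never had a chance to read them.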