[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001086#comment-15001086 ] ASF subversion and git services commented on NIFI-994: -- Commit 7b9c8df6c593059d063770095ab9efcf3c82467e in nifi's branch refs/heads/master from [~markap14] [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=7b9c8df ] NIFI-994: Fixed issue that could result in data duplication if more than 1 rollover of tailed file has occurred on restart of Processor > Processor to tail files > --- > > Key: NIFI-994 > URL: https://issues.apache.org/jira/browse/NIFI-994 > Project: Apache NiFi > Issue Type: New Feature >Affects Versions: 0.4.0 >Reporter: Joseph Percivall >Assignee: Mark Payne > Fix For: 0.4.0 > > Attachments: 0001-NIFI-994-Initial-import-of-TailFile.patch, > 0002-NIFI-994-Ensure-that-processor-is-not-valid-due-to-t.patch, > 0003-NIFI-994-Fixed-issue-that-could-result-in-data-dupli.patch > > > It's a very common data ingest situation to want to input text into the > system by "tailing" a file, most commonly log files. Currently we don't have > an easy way to do this. > A simple processor to tail a file would benefit many users. There would need > to be an option to not just tail a file but pick up where the processor left > off if it is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001089#comment-15001089 ] Mark Payne commented on NIFI-994: - Thanks, Bryan! Merged into master!
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001087#comment-15001087 ] ASF subversion and git services commented on NIFI-994: -- Commit 7a165b62cc4c46f92b4b4ed7c233f464cf63b3ef in nifi's branch refs/heads/master from [~markap14] [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=7a165b6 ] Merge branch 'NIFI-994'
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001085#comment-15001085 ] ASF subversion and git services commented on NIFI-994: -- Commit bfa9e450798591db11a7b520cd01388f8819d865 in nifi's branch refs/heads/master from [~markap14] [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=bfa9e45 ] NIFI-994: Ensure that processor is not valid due to the tail file not yet existing
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001084#comment-15001084 ] ASF subversion and git services commented on NIFI-994: -- Commit 31f0909bd315af43936b844327454ba2c48611e4 in nifi's branch refs/heads/master from [~markap14] [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=31f0909 ] NIFI-994: Initial import of TailFile
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000981#comment-15000981 ] Bryan Bende commented on NIFI-994: -- +1 Latest changes look good, I banged on this for a while with the same test I did before and can no longer reproduce the scenario, always getting consistent results now.
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999408#comment-14999408 ] Mark Payne commented on NIFI-994: - [~bbende] - great catch! I was able to create a unit test that replicated the issue. There were a couple of places where the checksum could have been messed up:
- We needed to ensure that if we changed the position of the RandomAccessFile, we did not count the bytes that we "unread" toward the checksum.
- There was a bug where we did not keep the correct checksum after a processor was stopped and restarted.
The unit test 'testMultipleRolloversAfterHavingReadAllData' was added and failed because it pulled in duplicate data just like you were seeing. With the new patch, this has been resolved. I attached a new patch separately that should be applied on top of the others, rather than squashing all the commits, to make it easier for you to see what changed. Please review whenever you get a chance and ensure that all looks good. Thanks! -Mark
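The first checksum fix described above can be illustrated with a minimal Java sketch. It is not the code from the patch: the class name `LineAlignedChecksum`, the method `consumeCompleteLines`, the 64 KB read cap and the use of CRC32 are all assumptions made for illustration. The idea is that only bytes that are actually consumed (complete lines) are fed to the checksum, and the file pointer is moved back over any trailing partial line.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

// Hypothetical helper, not taken from the NIFI-994 patches.
class LineAlignedChecksum {

    /**
     * Reads from the current file position (up to 64 KB), consumes only complete
     * lines, seeks back over any trailing partial line, and updates the checksum
     * with the consumed bytes only. Returns the number of bytes consumed.
     */
    static int consumeCompleteLines(RandomAccessFile raf, CRC32 checksum) throws IOException {
        long start = raf.getFilePointer();
        long remaining = Math.max(0L, raf.length() - start);
        byte[] buf = new byte[(int) Math.min(remaining, 64 * 1024)];
        raf.readFully(buf);

        // Find the end of the last complete line in the buffer.
        int consumed = 0;
        for (int i = buf.length - 1; i >= 0; i--) {
            if (buf[i] == '\n') {
                consumed = i + 1;
                break;
            }
        }

        // "Unread" the partial line by moving the file pointer back.
        raf.seek(start + consumed);
        // Count only the consumed bytes toward the checksum.
        checksum.update(buf, 0, consumed);
        return consumed;
    }
}
```

With a file containing `line1\nline2\npart`, a call consumes the first 12 bytes, leaves the pointer at offset 12, and the checksum covers only those 12 bytes, so the partial line is not counted twice once it is completed and read again.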
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994316#comment-14994316 ] Bryan Bende commented on NIFI-994: -- I've been testing this processor for the past two days and overall it is awesome! I created one scenario that I have reproduced a couple of times where it seems like the processor re-reads some lines from the last rolled file that it has already read. I added some logging to the processor to see what was going on in recoverRolledFiles() and here is what prints out when I see the problem:
{code}
2015-11-06 14:08:56,882 INFO [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.TailFile TailFile[id=6b24b195-9fc6-4783-957f-13f891236de0] RECOVERED ROLLED FILES WITH STATE TIMESTAMP OF 1446836931000
2015-11-06 14:08:56,882 INFO [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.TailFile TailFile[id=6b24b195-9fc6-4783-957f-13f891236de0] RECOVERED ROLLED FILE solr.log.1 WITH LAST MODIFIED TIME OF 1446836931000
2015-11-06 14:08:56,882 INFO [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.TailFile TailFile[id=6b24b195-9fc6-4783-957f-13f891236de0] RECOVERED - firstFile LENGTH IS 262621 AND state.getPosition() IS 260201
2015-11-06 14:08:56,883 INFO [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.TailFile TailFile[id=6b24b195-9fc6-4783-957f-13f891236de0] RECOVERED - EXPECTED RECOVERY CHECKSUM IS 3912972977 AND CHECKSUM RESULT IS 1100203812
{code}
I had TailFile stopped when solr.log rolled and started it shortly after, so it picks up solr.log.1 correctly and determines that new data was written to it since the last run, because the file length is > state.getPosition(). It then calculates the checksum, which ends up not matching the expected checksum. I can't figure out why the checksum doesn't match, but since it doesn't, the processor leaves that file in the list to be processed in full.
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982784#comment-14982784 ] Mark Payne commented on NIFI-994: - Just replaced the 0002 patch, as I found a typo in the documentation and a minor one-line fix that needed to happen.
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982764#comment-14982764 ] Joe Skora commented on NIFI-994: [~markap14] I agree Scenario #4 is unlikely, but I fully expect Scenario #2 will happen. Since it will only occur at the frequency of the log rotation, it will probably go unnoticed and it might be impossible to detect. However, I can envision OS-aware versions that use file system features like inotify or understand logrotate so that they can keep on processing the data even as the "logs roll". ;-) Regardless, this is going to be a useful processor and probably pretty popular. It will be good to have it in the toolkit!
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982563#comment-14982563 ] Mark Payne commented on NIFI-994: - [~jskora] - I totally understand you're not being argumentative - the problem with online communication is that talking through scenarios often does feel argumentative. But I think I know you well enough to know you're more interested in making NiFi awesome than in arguing your ideas :)
I agree with the logic that you've laid out here. It won't be guaranteed against every possible corner case. However, for the 99.9% use case, it should get all of the data. I don't think Scenario #2 is going to happen 99.9% of the time - if the producer is just trashing its own data, well... not much we can do :) And I think Scenario #4 is possible but *extremely* rare, especially for a logging case: you would have to replace an entire log file in a tiny amount of time with more content than was in the previous log file. Possible but rare.
The checksum really serves only one purpose, as it is implemented now. If the Processor (or NiFi) is stopped for a while, we need to know where we left off. Since the filename will have changed if the log rolled over, we need to figure out which file it was that is already half-consumed so that we don't re-consume the first half.
I expect that this Processor will undergo some iteration in the future as it is field-tested, and we'll make it much better over time. As simple as the description of this processor sounds, it's really complicated with all the weird edge cases that you run into when consuming data that keeps changing with no unique identifier :(
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981169#comment-14981169 ] Joe Skora commented on NIFI-994: That seems reasonable in general, and I really am trying to help. :-D I'm not trying to be argumentative, but I don't want you to put a big effort into trying to reach 100% if it is impossible. I'd rather have a simpler processor that makes a best effort, and make sure users know about the potential problems.
Of the many possible scenarios, I picked the following 4. Scenario #2 results in lost content and cannot be fixed even with checksumming. Scenario #4 is not distinguishable from #2 without checksumming the whole file and it could have additional lost data if there was a log write between #4/T1 and #3/T2.
* Scenario #1 - file grows but no rotation occurs - no data loss
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #2 - rotation truncates file - data written after last processing but before truncation is lost
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2 (**LOST WRITE, UNFIXABLE**)
*# T3 - logger truncates file => len=0, timestamp=T3
*# T4 - logger writes 1K to file => len=1K, timestamp=T4
*# T5 - tail processor processes 0-1K, stores checksum(T5) and timestamp=T4
* Scenario #3 - file grows but no rotation occurs
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #4 - rotation occurs but file size exceeds size at last processing
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - (**log write here would be lost**)
*# T3 - logger rotates file => len=0, timestamp=T3
*# T4 - logger writes 4K to file => len=4K, timestamp=T4 (**PARTIALLY LOST WRITE**) (**LOOKS LIKE #3/T2**)
*# T5 - tail processor processes 2K-4K, stores checksum(T5) and timestamp=T4
As long as the file can change outside of NiFi's control (and could change quickly in some cases), I think it is impossible to design a lossless approach without copying the data, and even that could be impossible depending on volume and load. Thoughts?
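Scenarios #2 and #4 above can be replayed with a small Java simulation. This is only a sketch: `PositionTailer` is a hypothetical tailer that tracks nothing but a byte position and treats a shrinking file as a rotation, which is the behaviour the scenarios assume, not the TailFile implementation.

```java
// Hypothetical model of a tailer that remembers only how far it has read.
class RotationScenarios {

    static final class PositionTailer {
        long position = 0;  // offset of the next unread byte in the current file
        long consumed = 0;  // total bytes handed downstream so far

        /** Polls a file of the given length and consumes whatever appears to be new. */
        void poll(long fileLength) {
            if (fileLength < position) {
                // File shrank: assume it was truncated or rotated and restart at 0.
                position = 0;
            }
            consumed += fileLength - position;
            position = fileLength;
        }
    }
}
```

Replaying Scenario #2 (poll at 2K, then an unpolled write to 4K, truncation, a 1K write, poll at 1K) yields 3K consumed out of 5K written. Replaying Scenario #4 (poll at 2K, rotation, 4K written, poll at 4K) yields 4K consumed out of 6K written, because the first 2K of the new file look like data that has already been read.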
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980981#comment-14980981 ] Mark Payne commented on NIFI-994: - [~jskora] - the algorithm implemented uses the lastModifiedTime and the length of the file in addition to the checksum, so there is no need to re-read the whole file. It would re-read one file only when the processor is stopped and restarted (or NiFi is restarted). I'm OK with the cost of re-reading a file when the user stops/starts the processor or when NiFi is restarted. Certainly if we had to continually read from the start, I would not use that type of approach.
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980928#comment-14980928 ] Joe Skora commented on NIFI-994: [~markap14] Logging is a very likely use case for this processor, creating the possibility of the log rolling over before the processor reaches the end, losing the unprocessed portion if it isn't duplicated before processing. That being the case, I'm inclined to favor performance and simplicity over accuracy. Calculating the checksum while reading a file won't be bad, but re-reading the whole file on subsequent triggers could get expensive. For example, processing a 6MB log file in 6 parts could mean processing 21MB of data (1+2+3+4+5+6) and it grows geometrically (IIRC) from there. It will be important to make sure people know they may not be getting all the data if only the open log file can be processed. The only sure way I see to get 100% coverage of a log file is to only process files that have rotated out and are no longer active. My 2 cents. YMMV.
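The 21MB figure above can be checked with a short Java sketch. `RereadCost` and `totalRead` are hypothetical names; the model assumes the file grows by one equal-sized part between triggers and is re-read from the start on every trigger.

```java
// Hypothetical helper for the re-read cost arithmetic.
class RereadCost {

    /**
     * Total units read when a file that grows to `parts` units is re-read
     * from the beginning after each unit is appended: 1 + 2 + ... + parts.
     */
    static long totalRead(int parts) {
        long total = 0;
        for (int i = 1; i <= parts; i++) {
            total += i; // the i-th trigger re-reads all i units written so far
        }
        return total; // same value as parts * (parts + 1) / 2
    }
}
```

`totalRead(6)` gives the 21 units mentioned above. Since the sum equals parts * (parts + 1) / 2, the cost under this model grows with the square of the number of parts.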
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980744#comment-14980744 ] Mark Payne commented on NIFI-994: - [~jskora] - I agree that the Linux "tail" application may not guarantee every bit of content will be seen. However, for our purposes, we should certainly strive to ensure that we always obtain every bit of data, if possible. I definitely like the idea of the checksums. But I think it's prudent to perform the checksum across the entire file, not just a few bytes. If performance were to become a concern, then we can certainly look at other options, but for at least the initial pass I think this is the correct approach.
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980734#comment-14980734 ] Mark Payne commented on NIFI-994: - I attached a second patch that addresses these. The 0002-* patch should be applied after the 0001-* patch.
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978745#comment-14978745 ] Joe Skora commented on NIFI-994: In general, I don't think the contract of "tail" guarantees every bit of content will be seen. The GNU Tail source mentions in [this comment|http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/tail.c;h=f916d7460395f0cee52c592bc3d160ac94697e73;hb=HEAD#l1199] that if the file size shrinks tail will restart from the beginning, but if the file is truncated and regrows past the last size check it appears that tail will not detect the change and only return content beyond the last size check. I share the concerns about using checksums, even though I brought them up. Logs and such are highly repetitive, which could be a problem for the "last N bytes" approach unless the checksum window size is large enough to cover a typical line or record length. It would be great to be able to set the window size and have an option for a 0 byte window size that altogether eliminates the checksum processing. Regards, Joe
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978546#comment-14978546 ] Mark Payne commented on NIFI-994: - Andre, Yes, we are currently using a Checksum in order to ensure that if the file changes that we have all of the data necessary. If the checksum is different, then we know that the file has not been ingested and pull in all of the data. If the checksums are the same then we know that our position within the file is good.
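The checksum comparison described above might look roughly like the following Java sketch. It is an illustration under assumptions, not the TailFile code: the names `PrefixChecksum`, `checksumOfPrefix` and `resumeOffset` are hypothetical, and CRC32 is chosen here only because it is readily available in the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical helper, not taken from the NIFI-994 patches.
class PrefixChecksum {

    /** CRC32 over the first `length` bytes of the file (fewer if the file is shorter). */
    static long checksumOfPrefix(Path file, long length) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[8192];
        long remaining = length;
        try (InputStream in = Files.newInputStream(file)) {
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // reached end of file before `length` bytes
                }
                crc.update(buf, 0, n);
                remaining -= n;
            }
        }
        return crc.getValue();
    }

    /**
     * Returns the saved position if the bytes before it still match the saved
     * checksum, or 0 if the file looks like a different file and should be
     * pulled in from the start.
     */
    static long resumeOffset(Path file, long savedPosition, long savedChecksum) throws IOException {
        if (Files.size(file) < savedPosition) {
            return 0L; // file is shorter than what was already read
        }
        return checksumOfPrefix(file, savedPosition) == savedChecksum ? savedPosition : 0L;
    }
}
```

If a file has only been appended to since the position was saved, `resumeOffset` returns the saved position; if its leading bytes have changed, it returns 0 and the whole file is read again.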
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978534#comment-14978534 ] Mark Payne commented on NIFI-994: - We should make sure that we also address the following scenarios:
- If the file to tail does not exist, we should generate a WARN message but not make the Processor invalid. It will be common to tail the same files on several systems and just because a log file, for example, may not yet exist doesn't mean the processor should be invalid. Should just warn, yield, and try again after the yield period.
- Should have a property that indicates where to start tailing. Valid options would be: Beginning of Time (pull in all the rolled over files that you can!), Beginning of Tailed File, Now (Do not pull "historical" data). Need to ensure that this takes effect only when we begin tailing a new file. I.e., after we start tailing a file, we should continue to tail from wherever we left off.
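The two requirements above can be sketched in Java as follows. The enum constants mirror the three options named in the comment, but the class `StartPositionSketch`, the method `initialOffset` and the convention of returning -1 for a missing file are assumptions for illustration, not the processor's actual property handling. Recovering rolled-over files for the "Beginning of Time" option is left out.

```java
import java.io.File;

// Hypothetical sketch of the "where to start tailing" decision.
class StartPositionSketch {

    enum StartPosition { BEGINNING_OF_TIME, BEGINNING_OF_FILE, NOW }

    /**
     * Returns the offset at which to start reading the tailed file, or -1 if
     * the file does not exist yet (the caller would warn, yield and retry
     * instead of treating the configuration as invalid).
     */
    static long initialOffset(File tailed, StartPosition start, Long savedPosition) {
        if (!tailed.exists()) {
            return -1L;
        }
        if (savedPosition != null) {
            // Already tailing this file: continue from where we left off,
            // regardless of the configured start position.
            return savedPosition;
        }
        switch (start) {
            case NOW:
                return tailed.length(); // skip "historical" data
            case BEGINNING_OF_TIME:     // rolled-over files would be read first (omitted here)
            case BEGINNING_OF_FILE:
            default:
                return 0L;
        }
    }
}
```

The start position only matters on the first call for a file; once a position has been saved, it takes precedence.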
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977680#comment-14977680 ]

Andre commented on NIFI-994:

[~markap14]

Since version 1.7 (snapshot), Flume has had a [taildir source|https://issues.apache.org/jira/browse/FLUME-2498].

The way they currently keep track of the files is a JSON position sidecar file whose content describes the log file, its inode, and the position of the tail within that file:

{code}
[{"inode":13209775,"pos":13771668368,"file":"/mnt/logs/logfilename.log"}]
{code}

It is not fault proof, as the process tends to fail to detect changes to a file that result in the exact same size. For example, suppose the tail last queried a file with the following state:

{code}
$ cat log.log
{code}

Updating it with similar content

{code}
$ echo > log.log
{code}

would not trigger a new tail.

A more robust alternative would be to use checksums as suggested by [~jskora], but instead of checksumming the processed content, one would checksum a fixed number of bytes preceding the saved seek position. More or less like (apologies for my weird pseudo-code):

{code}
IF SEEK_POSITION >= 8 BYTES AND FILESIZE >= SEEK_POSITION
    lf = OPEN logfile
    SEEK lf AT SEEK_POSITION - 8 BYTES
    SHA256(READ 8 BYTES FROM lf)
{code}

What do you think?

> Processor to tail files
> ---
>
> Key: NIFI-994
> URL: https://issues.apache.org/jira/browse/NIFI-994
> Project: Apache NiFi
> Issue Type: New Feature
> Affects Versions: 0.4.0
> Reporter: Joseph Percivall
> Assignee: Mark Payne
> Fix For: 0.4.0
>
> Attachments: 0001-NIFI-994-Initial-import-of-TailFile.patch
>
>
> It's a very common data ingest situation to want to input text into the
> system by "tailing" a file, most commonly log files. Currently we don't have
> an easy way to do this.
> A simple processor to tail a file would benefit many users. There would need
> to be an option to not just tail a file but pick up where the processor left
> off if it is interrupted.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
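Andre's pseudo-code above might translate to Java along these lines. This is only a sketch of the idea as proposed in the comment, not what TailFile does; the class name, method names, and the 8-byte window constant are illustrative assumptions.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of the "hash the bytes before the seek position" idea. */
final class SeekWindowFingerprint {

    /** Number of bytes hashed before the saved position (8 in the pseudo-code). */
    static final int WINDOW = 8;

    /** Reads the WINDOW bytes preceding seekPosition, or returns null if they do not exist. */
    static byte[] readWindow(String path, long seekPosition) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (seekPosition < WINDOW || raf.length() < seekPosition) {
                return null; // file is too short or has shrunk below the saved position
            }
            byte[] window = new byte[WINDOW];
            raf.seek(seekPosition - WINDOW);
            raf.readFully(window);
            return window;
        }
    }

    /** SHA-256 digest of the window. */
    static byte[] sha256(byte[] window) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(window);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True if the bytes now preceding the saved position hash to the saved digest. */
    static boolean positionStillValid(byte[] currentWindow, byte[] savedDigest) {
        return currentWindow != null
                && MessageDigest.isEqual(sha256(currentWindow), savedDigest);
    }
}
```

A caller would store `sha256(readWindow(path, pos))` alongside `pos`, and on the next pass treat the file as new whenever `positionStillValid` returns false. A short window keeps the check cheap, at the cost of missing a rewrite that happens to leave those particular bytes unchanged.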
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972213#comment-14972213 ] Mark Payne commented on NIFI-994: - All, I uploaded an initial implementation of TailFile. Please feel free to try out/test/review and provide any feedback. We can iterate as necessary. Thanks -Mark > Processor to tail files > --- > > Key: NIFI-994 > URL: https://issues.apache.org/jira/browse/NIFI-994 > Project: Apache NiFi > Issue Type: New Feature >Affects Versions: 0.4.0 >Reporter: Joseph Percivall >Assignee: Mark Payne > Fix For: 0.4.0 > > Attachments: 0001-NIFI-994-Initial-import-of-TailFile.patch > > > It's a very common data ingest situation to want to input text into the > system by "tailing" a file, most commonly log files. Currently we don't have > an easy way to do this. > A simple processor to tail a file would benefit many users. There would need > to be an option to not just tail a file but pick up where the processor left > off if it is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938244#comment-14938244 ] Joe Skora commented on NIFI-994: I think we are on the same page, but I left out some details. The key is that the processor always starts at the beginning when it finds a file but discards content it thinks was previously committed downstream. One approach could be storing a checksum of processed content with the other state when content is committed downstream. Files are always handled from the start, but those that exist when the processor starts are checked against the stored state. If the file has the same checksum at the same offset as the state, the content up to the offset is discarded and the file is processed from there on. If the checksum at the offset is different, all the content is processed. Any content that ages off while the Processor is stopped will be lost, but I don't see a way around that. That said, it might be possible to recognize some log rolling scenarios and finish processing rolled out files that were previously in process while the regular behaviors pick up the new file. > Processor to tail files > --- > > Key: NIFI-994 > URL: https://issues.apache.org/jira/browse/NIFI-994 > Project: Apache NiFi > Issue Type: New Feature >Affects Versions: 0.4.0 >Reporter: Joseph Percivall >Assignee: Joseph Percivall > > It's a very common data ingest situation to want to input text into the > system by "tailing" a file, most commonly log files. Currently we don't have > an easy way to do this. > A simple processor to tail a file would benefit many users. There would need > to be an option to not just tail a file but pick up where the processor left > off if it is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
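The checksum-at-offset rule described in the comment above can be illustrated with a small in-memory sketch. The comment says only "checksum"; the use of CRC32 here, the class name `ChecksumResume`, and working on a byte array rather than a file are all assumptions made for brevity.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

/** Hypothetical sketch of resuming from a saved (offset, checksum) pair. */
final class ChecksumResume {

    /** CRC32 over the first length bytes of content. */
    static long checksum(byte[] content, int length) {
        CRC32 crc = new CRC32();
        crc.update(content, 0, length);
        return crc.getValue();
    }

    /**
     * Returns the part of the file that still has to be processed.
     * If the bytes before savedOffset hash to savedChecksum they were already
     * committed and are discarded; otherwise the whole content is returned.
     */
    static byte[] unprocessedContent(byte[] content, int savedOffset, long savedChecksum) {
        boolean samePrefix = savedOffset <= content.length
                && checksum(content, savedOffset) == savedChecksum;
        int from = samePrefix ? savedOffset : 0;
        return Arrays.copyOfRange(content, from, content.length);
    }
}
```

If the saved state covers `"line 1\nline 2\n"` and the file has since grown by `"line 3\n"`, only the new line is returned. If the file was replaced or is now shorter than the saved offset, everything is returned.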
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14937018#comment-14937018 ]

Joseph Percivall commented on NIFI-994:
---

Adding an email chain that relates to this processor to the comments:

For a NiFi processor, I think the "tail -F" makes more sense. As opposed to the normal behavior that follows existing file descriptors, "tail -F" follows on filename (or pattern) so it tracks the current instance of a file, letting it handle new files during the run, log rotations, etc. I definitely agree that it should take a regex or a fixed filename.

I think the biggest question is granularity. Though tail is normally a line oriented operation, in NiFi it should probably be "chunk" oriented with each pass creating a new flow file with whatever new full lines are available.

Joe Skora

---

Joe,

The problem with "tail -F" is that if NiFi is restarted and then we do essentially "tail -F" we may have missed a lot of data that was written to the log file while NiFi was down. The idea behind this Processor is to be able to recover that data, even if it was written to a log file (or any other sort of file) while NiFi was not running or while the Processor was not running.

I agree that it should be "chunk oriented" - likely would need a property that indicates how long to tail for a single chunk. E.g., tail for 1 second and create a FlowFile with the content received.

-Mark

> Processor to tail files
> ---
>
> Key: NIFI-994
> URL: https://issues.apache.org/jira/browse/NIFI-994
> Project: Apache NiFi
> Issue Type: New Feature
> Affects Versions: 0.4.0
> Reporter: Joseph Percivall
> Assignee: Joseph Percivall
>
> It's a very common data ingest situation to want to input text into the
> system by "tailing" a file, most commonly log files. Currently we don't have
> an easy way to do this.
> A simple processor to tail a file would benefit many users. There would need
> to be an option to not just tail a file but pick up where the processor left
> off if it is interrupted.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
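The "chunk oriented, full lines only" granularity discussed in the email chain above could be sketched as follows: each pass emits everything up to the last newline read so far and holds back any trailing partial line for the next pass. The class and method names are hypothetical, and `\n` as the only line terminator is an assumption.

```java
/** Hypothetical sketch of cutting a chunk at the last complete line. */
final class FullLineChunker {

    /**
     * Number of leading bytes in buffer that end on a line boundary.
     * Returns 0 when the buffer holds no complete line yet.
     */
    static int completeLinesLength(byte[] buffer) {
        for (int i = buffer.length - 1; i >= 0; i--) {
            if (buffer[i] == '\n') {
                return i + 1; // include the newline itself in the chunk
            }
        }
        return 0;
    }
}
```

For a buffer holding `"a\nb\npartial"`, the first 4 bytes would go into the flow file and `"partial"` would be carried over until its newline arrives.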
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936888#comment-14936888 ] Mark Payne commented on NIFI-994: - Agreed. I'd recommend we allow the filename to tail to contain a * so that as things roll over we can still process the data. We could sort on last modified time to know the ordering of the files, and if we keep an offset into a file plus the timestamp when we pulled that file, that should help us to know which file it came from (the one with the smallest Last Modified timestamp >= our timestamp) and then we know which offset we left off at. If the data rolls off then you're right - there's nothing we can do about that. Would recommend we mention in the @CapabilityDescription that we expect logs to be kept around long enough to recover from outages. > Processor to tail files > --- > > Key: NIFI-994 > URL: https://issues.apache.org/jira/browse/NIFI-994 > Project: Apache NiFi > Issue Type: New Feature >Affects Versions: 0.4.0 >Reporter: Joseph Percivall >Assignee: Joseph Percivall > > It's a very common data ingest situation to want to input text into the > system by "tailing" a file, most commonly log files. Currently we don't have > an easy way to do this. > A simple processor to tail a file would benefit many users. There would need > to be an option to not just tail a file but pick up where the processor left > off if it is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
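The file-selection rule suggested in the comment above (order candidates by last-modified time, then resume in the one with the smallest last-modified timestamp that is >= the saved timestamp) might look roughly like this. `RolloverLocator`, `Candidate`, and `filesToRecover` are hypothetical names, and the candidates are assumed to have been listed already from the wildcard pattern.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of choosing which rolled-over files to read after a restart. */
final class RolloverLocator {

    /** A file matching the tail pattern, with its last-modified time in milliseconds. */
    static final class Candidate {
        final String name;
        final long lastModified;

        Candidate(String name, long lastModified) {
            this.name = name;
            this.lastModified = lastModified;
        }
    }

    /**
     * Files still to be read, oldest first. The first entry is the file the
     * saved offset belongs to; later entries are read from their beginning.
     */
    static List<String> filesToRecover(List<Candidate> candidates, long savedTimestamp) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingLong((Candidate c) -> c.lastModified));
        List<String> result = new ArrayList<>();
        for (Candidate c : sorted) {
            if (c.lastModified >= savedTimestamp) {
                result.add(c.name);
            }
        }
        return result;
    }
}
```

Given `app.log.2` (modified at 100), `app.log.1` (300) and `app.log` (400) with a saved timestamp of 250, the saved offset applies to `app.log.1`, after which `app.log` is read from the start.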
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907269#comment-14907269 ] Aldrin Piri commented on NIFI-994: -- I think we can make a best effort at this, but I don't think there are any guarantees that we have all the data. A lot of this comes down to logging provider configuration. As an example, perhaps there are constraints on size or time that cause records to be rotated off. It could take long outages for these environments to develop, but when the data has been rolled off, it is gone. Depending on how markers and such work, this brings up some interesting cases to consider when implementing. As another point of consideration, it would be nice to have a property/properties that provide handling for rolling log formats. Consider logback and log4j with their date formatted log names. > Processor to tail files > --- > > Key: NIFI-994 > URL: https://issues.apache.org/jira/browse/NIFI-994 > Project: Apache NiFi > Issue Type: New Feature >Affects Versions: 0.4.0 >Reporter: Joseph Percivall > > It's a very common data ingest situation to want to input text into the > system by "tailing" a file, most commonly log files. Currently we don't have > an easy way to do this. > A simple processor to tail a file would benefit many users. There would need > to be an option to not just tail a file but pick up where the processor left > off if it is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905759#comment-14905759 ]

Randy Gelhausen commented on NIFI-994:
---

A good test case is:

Start tailing processor pointing at /var/log/app.log
Stop tailing processor
Allow logrotate to rotate /var/log/app.log
Start tailing processor

Expected Result:
Emit line 1
Emit line 2
Emit line 3
etc.

Essentially the processor needs to resume emitting from where it left off in the sequence of log-lines. It should handle recognizing where it left off, read any available rotation archives (app.log.1, app.log.2, etc.) in order, and then catch back up emitting from the live app.log file.

> Processor to tail files
> ---
>
> Key: NIFI-994
> URL: https://issues.apache.org/jira/browse/NIFI-994
> Project: Apache NiFi
> Issue Type: New Feature
> Affects Versions: 0.4.0
> Reporter: Joseph Percivall
>
> It's a very common data ingest situation to want to input text into the
> system by "tailing" a file, most commonly log files. Currently we don't have
> an easy way to do this.
> A simple processor to tail a file would benefit many users. There would need
> to be an option to not just tail a file but pick up where the processor left
> off if it is interrupted.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905741#comment-14905741 ] Adis Cesir commented on NIFI-994: - We could possibly look at implementing http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/Tailer.html and extending it further for additional features like tracking state. > Processor to tail files > --- > > Key: NIFI-994 > URL: https://issues.apache.org/jira/browse/NIFI-994 > Project: Apache NiFi > Issue Type: New Feature >Affects Versions: 0.4.0 >Reporter: Joseph Percivall > > It's a very common data ingest situation to want to input text into the > system by "tailing" a file, most commonly log files. Currently we don't have > an easy way to do this. > A simple processor to tail a file would benefit many users. There would need > to be an option to not just tail a file but pick up where the processor left > off if it is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)