[jira] [Commented] (NIFI-994) Processor to tail files

2015-11-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001086#comment-15001086
 ] 

ASF subversion and git services commented on NIFI-994:
--

Commit 7b9c8df6c593059d063770095ab9efcf3c82467e in nifi's branch 
refs/heads/master from [~markap14]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=7b9c8df ]

NIFI-994: Fixed issue that could result in data duplication if more than 1 
rollover of tailed file has occurred on restart of Processor


> Processor to tail files
> ---
>
> Key: NIFI-994
> URL: https://issues.apache.org/jira/browse/NIFI-994
> Project: Apache NiFi
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Joseph Percivall
>Assignee: Mark Payne
> Fix For: 0.4.0
>
> Attachments: 0001-NIFI-994-Initial-import-of-TailFile.patch, 
> 0002-NIFI-994-Ensure-that-processor-is-not-valid-due-to-t.patch, 
> 0003-NIFI-994-Fixed-issue-that-could-result-in-data-dupli.patch
>
>
> It's a very common data ingest situation to want to input text into the 
> system by "tailing" a file, most commonly log files. Currently we don't have 
> an easy way to do this. 
> A simple processor to tail a file would benefit many users. There would need 
> to be an option to not just tail a file but pick up where the processor left 
> off if it is interrupted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NIFI-994) Processor to tail files

2015-11-11 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001089#comment-15001089
 ] 

Mark Payne commented on NIFI-994:
-

Thanks, Bryan! Merged into master!



[jira] [Commented] (NIFI-994) Processor to tail files

2015-11-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001087#comment-15001087
 ] 

ASF subversion and git services commented on NIFI-994:
--

Commit 7a165b62cc4c46f92b4b4ed7c233f464cf63b3ef in nifi's branch 
refs/heads/master from [~markap14]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=7a165b6 ]

Merge branch 'NIFI-994'




[jira] [Commented] (NIFI-994) Processor to tail files

2015-11-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001085#comment-15001085
 ] 

ASF subversion and git services commented on NIFI-994:
--

Commit bfa9e450798591db11a7b520cd01388f8819d865 in nifi's branch 
refs/heads/master from [~markap14]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=bfa9e45 ]

NIFI-994: Ensure that processor is not valid due to the tail file not yet 
existing




[jira] [Commented] (NIFI-994) Processor to tail files

2015-11-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001084#comment-15001084
 ] 

ASF subversion and git services commented on NIFI-994:
--

Commit 31f0909bd315af43936b844327454ba2c48611e4 in nifi's branch 
refs/heads/master from [~markap14]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=31f0909 ]

NIFI-994: Initial import of TailFile




[jira] [Commented] (NIFI-994) Processor to tail files

2015-11-11 Thread Bryan Bende (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000981#comment-15000981
 ] 

Bryan Bende commented on NIFI-994:
--

+1. The latest changes look good. I banged on this for a while with the same 
test I did before and can no longer reproduce the scenario; I now get 
consistent results.



[jira] [Commented] (NIFI-994) Processor to tail files

2015-11-10 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999408#comment-14999408
 ] 

Mark Payne commented on NIFI-994:
-

[~bbende] - great catch! I was able to create a unit test that replicated the 
issue. There were a couple of places where the checksum could have been messed 
up:

- We needed to ensure that if we changed the position of the RandomAccessFile, 
we did not count the bytes that we "unread" toward the checksum.
- There was a bug where we did not keep the correct checksum after a processor 
was stopped and restarted.

The unit test 'testMultipleRolloversAfterHavingReadAllData' was added and 
failed because it pulled in duplicate data just like you were seeing. With the 
new patch, this has been resolved.
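
The first point above can be illustrated with a small Python sketch. This is 
not the TailFile code; the function name and the use of CRC32 are assumptions 
made for illustration. The reader takes a chunk, keeps only the complete 
lines, seeks back over the partial last line, and updates the checksum with 
the consumed bytes only.

```python
import io
import zlib


def read_complete_lines(f, checksum, chunk_size=4096):
    """Read one chunk, keep only complete lines, and rewind over the rest.

    Returns (consumed_bytes, updated_checksum). The bytes that are "unread"
    by the seek are deliberately left out of the checksum.
    """
    start = f.tell()
    chunk = f.read(chunk_size)
    end = chunk.rfind(b"\n") + 1          # 0 when the chunk has no newline
    consumed = chunk[:end]
    f.seek(start + end)                   # "unread" the partial last line
    return consumed, zlib.crc32(consumed, checksum)


if __name__ == "__main__":
    f = io.BytesIO(b"one\ntwo\nthr")
    data, crc = read_complete_lines(f, 0)
    print(data, f.tell())
```

If the checksum were updated with the whole chunk instead, the partial line 
would be counted twice once it is read again, and the stored value would no 
longer describe the bytes before the saved position.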

I attached a new patch separately that should be applied on top of the others 
in order to make it easier for you to understand what changed, vs. squashing 
all the commits.

Please review whenever you get a chance and ensure that all looks good.

Thanks!
-Mark



[jira] [Commented] (NIFI-994) Processor to tail files

2015-11-06 Thread Bryan Bende (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994316#comment-14994316
 ] 

Bryan Bende commented on NIFI-994:
--

I've been testing this processor for the past two days and overall it is 
awesome! 

I created one scenario that I have reproduced a couple of times where it seems 
like the processor re-reads some lines from the last rolled file that it has 
already read. I added some logging to the processor to see what was going on in 
recoverRolledFiles() and here is what prints out when I see the problem:

{code}
2015-11-06 14:08:56,882 INFO [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.TailFile TailFile[id=6b24b195-9fc6-4783-957f-13f891236de0] RECOVERED ROLLED FILES WITH STATE TIMESTAMP OF 1446836931000
2015-11-06 14:08:56,882 INFO [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.TailFile TailFile[id=6b24b195-9fc6-4783-957f-13f891236de0] RECOVERED ROLLED FILE solr.log.1 WITH LAST MODIFIED TIME OF 1446836931000
2015-11-06 14:08:56,882 INFO [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.TailFile TailFile[id=6b24b195-9fc6-4783-957f-13f891236de0] RECOVERED - firstFile LENGTH IS 262621 AND state.getPosition() IS 260201
2015-11-06 14:08:56,883 INFO [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.TailFile TailFile[id=6b24b195-9fc6-4783-957f-13f891236de0] RECOVERED - EXPECTED RECOVERY CHECKSUM IS 3912972977 AND CHECKSUM RESULT IS 1100203812
{code}

I had TailFile stopped when solr.log rolled and started it shortly after. It 
picks up solr.log.1 correctly and determines that new data was written to it 
since the last run, because the file length is > state.getPosition(). It then 
calculates the checksum, which ends up not matching the expected checksum. I 
can't figure out why the checksums don't match, but because they don't, the 
processor leaves that file in the list to be processed in full.
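
The recovery decision described above can be sketched roughly as follows in 
Python. This is an illustration only, not the code in recoverRolledFiles(): 
the function name is hypothetical, the file is modelled as an in-memory byte 
string, and CRC32 is assumed as the checksum.

```python
import zlib


def bytes_to_recover(rolled: bytes, position: int, expected_checksum: int) -> bytes:
    """Return the part of a rolled-over file that still has to be ingested."""
    prefix_matches = (
        len(rolled) >= position
        and zlib.crc32(rolled[:position]) == expected_checksum
    )
    if prefix_matches:
        # Same file we were tailing before the rollover: skip what was consumed.
        return rolled[position:]
    # Checksum mismatch: treat the file as unseen and process it in full.
    return rolled


if __name__ == "__main__":
    consumed = b"line1\n"
    rolled = consumed + b"line2\n"
    print(bytes_to_recover(rolled, len(consumed), zlib.crc32(consumed)))
```

With a correct stored checksum only the unread tail comes back. With a wrong 
one the whole file is returned, which is how a checksum mismatch turns into 
re-read lines.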



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-30 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982784#comment-14982784
 ] 

Mark Payne commented on NIFI-994:
-

I just replaced the 0002 patch, as I found a typo in the documentation and a 
minor 1-line fix that needed to happen.



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-30 Thread Joe Skora (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982764#comment-14982764
 ] 

Joe Skora commented on NIFI-994:


[~markap14] I agree Scenario #4 is unlikely, but I fully expect Scenario #2 
will happen.  Since it will only occur at the frequency of the log rotation it 
will probably go unnoticed and it might be impossible to detect.

However, I can envision OS-aware versions that use file system features like 
inotify or understand logrotate so that they can keep on processing the data 
even as the "logs roll".  ;-)

Regardless, this is going to be a useful processor and probably pretty popular. 
 It will be good to have it in the toolkit!



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-30 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982563#comment-14982563
 ] 

Mark Payne commented on NIFI-994:
-

[~jskora] - I totally understand you're not being argumentative - the problem 
with online communication is that talking through scenarios often does feel 
argumentative. But I think I know you well enough to know you're more 
interested in making NiFi awesome than in arguing your ideas :)

I agree with the logic that you've laid out here. It won't be guaranteed 
against every possible corner case. However, for the 99.9% use case, it should 
get all of the data. For that 99.9%, I don't think Scenario #2 is going to 
happen - if the producer is just trashing its own data, well... not much we can 
do :) And I think Scenario #4 is possible but *extremely* rare, especially for 
a logging case: you would have to replace an entire log file in a tiny amount 
of time with more content than was in the previous log file. Possible but rare.

The checksum really serves only one purpose, as it is implemented now. If the 
Processor (or NiFi) is stopped for a while, we need to know where we left off. 
Since the filename will have changed if the log rolled over, we need to figure 
out which file it was that is already half-consumed so that we don't re-consume 
the first half.

I expect that this Processor will undergo some iteration in the future as it is 
field-tested, and we'll make it much better over time. As simple as the 
description of this processor sounds, it's really complicated with all the 
weird edge cases that you run into when consuming data that keeps changing with 
no unique identifier :(



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-29 Thread Joe Skora (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981169#comment-14981169
 ] 

Joe Skora commented on NIFI-994:


That seems reasonable in general, and I really am trying to help.  :-D

I'm not trying to be argumentative, but I don't want you to put a big effort in 
trying to reach 100% if it is impossible.  I'd rather have a simpler processor 
that makes a best effort, and make sure users know about the potential problems.

Of the many possible scenarios, I picked the following 4.  Scenario #2 results 
in lost content and cannot be fixed even with checksumming.  Scenario #4 is not 
distinguishable from #2 without checksumming the whole file and it could have 
additional lost data if there was a log write between #4/T1 and #3/T2.
* Scenario #1 - file grows but no rotation occurs - no data loss
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #2 - rotation truncates file - data written after last processing 
but before truncation is lost
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2 (**LOST WRITE, 
UNFIXABLE**)
*# T3 - logger truncates file => len=0, timestamp=T3
*# T4 - logger writes 1K to file => len=1K, timestamp=T4
*# T5 - tail processor processes 0-1K, stores checksum(T5) and timestamp=T4
* Scenario #3 - file grows but no rotation occurs
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #4 - rotation occurs but file size exceeds size at last processing
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - (**log write here would be lost**)
*# T3 - logger rotates file => len=0, timestamp=T3
*# T4 - logger writes 4K to file => len=4K, timestamp=T4  (**PARTIALLY LOST 
WRITE**)  (**LOOKS LIKE #3/T2**)
*# T5 - tail processor processes 2K-4K, stores checksum(T5) and timestamp=T4
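
Scenarios #2 and #4 can be replayed with a small Python simulation. It models 
a hypothetical tailer that tracks only its last read offset (it is not the 
TailFile implementation), treats the file as an in-memory byte string, and 
assumes 1K = 1024 bytes. Each function returns the number of written bytes 
that the tailer never saw.

```python
class LengthOnlyTailer:
    """Naive tailer that remembers only how far it has read."""

    def __init__(self):
        self.position = 0

    def poll(self, content: bytes) -> bytes:
        if len(content) < self.position:
            self.position = 0             # file shrank: assume it was truncated
        new = content[self.position:]
        self.position = len(content)
        return new


def scenario_2() -> int:
    tailer = LengthOnlyTailer()
    seen = tailer.poll(b"a" * 2048)       # T0/T1: 2K written and processed
    # T2: 2K more written, T3: file truncated, T4: 1K written to the new file
    seen += tailer.poll(b"c" * 1024)      # T5: tailer sees only the new 1K
    written = 2048 + 2048 + 1024
    return written - len(seen)


def scenario_4() -> int:
    tailer = LengthOnlyTailer()
    seen = tailer.poll(b"a" * 2048)       # T0/T1: 2K written and processed
    # T3: file rotated, T4: 4K written to the new file
    seen += tailer.poll(b"d" * 4096)      # T5: looks like plain growth to 4K
    written = 2048 + 4096
    return written - len(seen)


if __name__ == "__main__":
    print(scenario_2(), scenario_4())     # 2048 2048
```

In Scenario #2 the missing bytes are the T2 write, which no checksum can bring 
back. In Scenario #4 they are the first 2K of the new file, which the tailer 
skips because the new length exceeds its stored offset.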

As long as the file can change outside of NiFi's control (and could change 
quickly in some cases), I think it is impossible to design a lossless approach 
without copying the data, and even that could be impossible depending on volume 
and load.

Thoughts?



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-29 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980981#comment-14980981
 ] 

Mark Payne commented on NIFI-994:
-

[~jskora] - the algorithm implemented uses the lastModifiedTime and the length 
of the file in addition to the checksum, so there is no need to re-read the 
whole file. It would re-read one file only when the processor is stopped and 
restarted (or NiFi is restarted). I'm OK with the cost of re-reading a file 
when the user stops/starts the processor or when NiFi is restarted. Certainly 
if we had to continually read from the start, I would not use that type of 
approach.
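
One way to avoid re-reading from the start on every trigger is to keep a 
running checksum next to the position and feed it only the newly read bytes. 
The Python sketch below shows that idea under assumptions: the TailState class 
and function name are hypothetical, CRC32 is assumed as the checksum, and the 
lastModifiedTime and length checks mentioned above are left out.

```python
import zlib
from dataclasses import dataclass


@dataclass
class TailState:
    position: int = 0     # number of bytes consumed so far
    checksum: int = 0     # CRC32 of the bytes in [0, position)


def consume_new_bytes(path: str, state: TailState, chunk_size: int = 8192) -> bytes:
    """Read only the bytes after state.position and extend the running CRC32."""
    out = bytearray()
    with open(path, "rb") as f:
        f.seek(state.position)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 accepts the previous value, so no re-read is needed.
            state.checksum = zlib.crc32(chunk, state.checksum)
            out.extend(chunk)
    state.position += len(out)
    return bytes(out)
```

After any number of calls, state.checksum equals the CRC32 of everything 
consumed so far, so a full re-read is only needed to verify a file after a 
restart.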



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-29 Thread Joe Skora (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980928#comment-14980928
 ] 

Joe Skora commented on NIFI-994:


[~markap14]  Logging is a very likely use case for this processor, creating the 
possibility of the log rolling over before the processor reaches the end, 
losing the unprocessed portion if it isn't duplicated before processing.  That 
being the case, I'm inclined to favor performance and simplicity over accuracy. 
 

Calculating the checksum while reading a file won't be bad, but re-reading the 
whole file on subsequent triggers could get expensive.  For example, processing 
a 6MB log file in 6 parts could mean processing 21MB of data (1+2+3+4+5+6), and 
the total grows quadratically with the number of parts.
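
The arithmetic can be checked with a few lines of Python (the function names 
are made up for illustration). Re-reading from the start on every trigger 
costs 1 + 2 + ... + n = n(n+1)/2 parts, while reading only the new part each 
time costs n parts.

```python
def mb_read_with_full_rereads(part_mb: int, parts: int) -> int:
    """Total MB read when every trigger re-reads the file from the start."""
    return sum(part_mb * k for k in range(1, parts + 1))


def mb_read_incrementally(part_mb: int, parts: int) -> int:
    """Total MB read when every trigger reads only the newly written part."""
    return part_mb * parts


if __name__ == "__main__":
    print(mb_read_with_full_rereads(1, 6))   # 21
    print(mb_read_incrementally(1, 6))       # 6
```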

It will be important to make sure people know they may not be getting all the 
data if only the open log file can be processed.  The only sure way I see to 
get 100% coverage of a log file is to only process files that have rotated out 
and are no longer active.

My 2 cents.  YMMV.



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-29 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980744#comment-14980744
 ] 

Mark Payne commented on NIFI-994:
-

[~jskora] - I agree that the Linux "tail" application may not guarantee every 
bit of content will be seen. However, for our purposes, we should certainly 
strive to ensure that we always obtain every bit of data, if possible.

I definitely like the idea of the checksums. But I think it's prudent to 
perform the checksum across the entire file, not just a few bytes. If 
performance were to become a concern, then we can certainly look at other 
options, but for at least the initial pass I think this is the correct approach.



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-29 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980734#comment-14980734
 ] 

Mark Payne commented on NIFI-994:
-

I attached a second patch that addresses these. The 0002-* patch should be 
applied after the 0001-* patch.



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-28 Thread Joe Skora (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978745#comment-14978745
 ] 

Joe Skora commented on NIFI-994:


In general, I don't think the contract of "tail" guarantees every bit of 
content will be seen.  The GNU Tail source mentions in [this 
comment|http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/tail.c;h=f916d7460395f0cee52c592bc3d160ac94697e73;hb=HEAD#l1199]
 that if the file size shrinks tail will restart from the beginning, but if the 
file is truncated and regrows past the last size check it appears that tail 
will not detect the change and only return content beyond the last size check.

I share the concerns about using checksums, even though I brought them up.  
Logs and such are highly repetitive, which could be a problem for the "last N 
bytes" approach unless the checksum window size is large enough to cover a 
typical line or record length.  It would be great to be able to set the window 
size and have an option for a 0-byte window size that altogether eliminates 
the checksum processing.
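
The concern about repetitive content can be made concrete with a short Python 
sketch of the "last N bytes" idea. The function name, the sample log lines, 
and the use of CRC32 are assumptions for illustration; no such property exists 
in the patch discussed here.

```python
import zlib
from typing import Optional


def window_checksum(data: bytes, position: int, window: int) -> Optional[int]:
    """CRC32 of the last `window` bytes before `position`.

    A window of 0 disables checksumming and returns None.
    """
    if window == 0:
        return None
    start = max(0, position - window)
    return zlib.crc32(data[start:position])


# Two different files that happen to end with the same 15-byte log line.
OLD = b"GET /a 200\nGET /index 200\n"
NEW = b"GET /b 200\nGET /index 200\n"

if __name__ == "__main__":
    pos = len(OLD)
    print(window_checksum(OLD, pos, 15) == window_checksum(NEW, pos, 15))
    print(window_checksum(OLD, pos, 26) == window_checksum(NEW, pos, 26))
```

With a 15-byte window the two files are indistinguishable, because the window 
covers only the shared last line. A window that reaches back into the 
differing line tells them apart.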

Regards,
Joe



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-28 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978546#comment-14978546
 ] 

Mark Payne commented on NIFI-994:
-

Andre,

Yes, we are currently using a checksum to ensure that, if the file changes, we 
still have all of the necessary data. If the checksum is different, then we 
know that the file has not been ingested, and we pull in all of its data. If 
the checksums are the same, then we know that our position within the file is 
good.



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-28 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978534#comment-14978534
 ] 

Mark Payne commented on NIFI-994:
-

We should make sure that we also address the following scenarios:

- If the file to tail does not exist, we should generate a WARN message but not 
make the Processor invalid. It will be common to tail the same files on several 
systems and just because a log file, for example, may not yet exist doesn't 
mean the processor should be invalid. Should just warn, yield, and try again 
after the yield period.
- Should have a property that indicates where to start tailing. Valid options 
would be: Beginning of Time (pull in all the rolled over files that you can!), 
Beginning of Tailed File, Now (Do not pull "historical" data). Need to ensure 
that this takes effect only when we begin tailing a new file. I.e., after we 
start tailing a file, we should continue to tail from wherever we left off.
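
The second scenario above could be sketched roughly as follows. This is only a Python illustration under stated assumptions, not the TailFile implementation: the names `StartPosition`, `initial_offset`, and `saved_offset` are hypothetical, and the handling of rolled-over files is left out.

```python
import enum
import os
from typing import Optional


class StartPosition(enum.Enum):
    # Hypothetical names for the three options listed above.
    BEGINNING_OF_TIME = "beginning of time"  # rolled-over files first, then the tailed file
    BEGINNING_OF_FILE = "beginning of file"  # only the tailed file, from byte 0
    NOW = "now"                              # skip everything already written


def initial_offset(path: str, start: StartPosition, saved_offset: Optional[int]) -> int:
    """Return the byte offset at which to begin reading `path`."""
    if saved_offset is not None:
        # We have tailed this file before, so the property no longer applies.
        return saved_offset
    if start is StartPosition.NOW:
        return os.path.getsize(path)
    # Both "beginning" options read the tailed file itself from byte 0;
    # BEGINNING_OF_TIME would additionally queue the rolled-over files (not shown).
    return 0
```

The point of the `saved_offset` check is that the property only matters the first time a file is seen; after that, the stored position wins.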



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-27 Thread Andre (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977680#comment-14977680
 ] 

Andre commented on NIFI-994:


[~markap14]

Since version 1.7 (snapshot), Flume has had a [taildir 
source|https://issues.apache.org/jira/browse/FLUME-2498].

It currently keeps track of files using a JSON position sidecar file whose 
content records the file path, inode, and tail position for each file:

{code}
[{"inode":13209775,"pos":13771668368,"file":"/mnt/logs/logfilename.log"}]
{code}
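
For illustration, a record of that shape could be consulted on restart along these lines. This is a hedged Python sketch, not Flume's code: `resume_offset` is a hypothetical helper, and the rule "same inode and file at least as long as the saved position" is an assumption about how such a record might be used.

```python
import json
import os


def resume_offset(position_json: str, path: str) -> int:
    """Return the saved offset for `path`, or 0 if the saved entry no longer matches."""
    for entry in json.loads(position_json):
        if entry["file"] != path:
            continue
        st = os.stat(path)
        # Same inode and the file is at least as long as the saved position:
        # assume it is the file we were tailing and resume from there.
        if entry["inode"] == st.st_ino and st.st_size >= entry["pos"]:
            return entry["pos"]
        return 0
    return 0
```

Note that a file rewritten in place with content of the same size passes both checks, which is why a check on inode and size alone can miss changes.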

It is not foolproof: the process tends to miss changes to a file that leave it 
at exactly the same size. For example, suppose the tail last queried a file in 
the following state:
{code}
$ cat log.log

{code}

Updating it with similar content 
{code}
$ echo  > log.log 
{code}

Would not trigger a new tail.

A more robust alternative would be to use checksums as suggested by [~jskora] 
but instead of checksumming the processed content, one would checksum a fixed 
number of bytes preceding the saved seek position.

More or less like (apologies for my weird pseudo-code):
{code}
IF SEEK_POSITION >= 8 BYTES AND FILESIZE >= 8 BYTES
   lf = OPEN logfile
   SEEK lf AT SEEK_POSITION - 8 BYTES
   SHA256(READ 8 BYTES FROM lf)
{code}
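
A runnable version of the same idea might look like the following. It is a minimal Python sketch rather than a proposed implementation: `window_digest`, `can_resume`, and the `window` parameter are hypothetical names, and the 8-byte default simply mirrors the pseudo-code.

```python
import hashlib
from typing import Optional


def window_digest(path: str, position: int, window: int = 8) -> Optional[str]:
    """SHA-256 of the `window` bytes that end at `position`, or None if unavailable."""
    if position < window:
        return None
    with open(path, "rb") as f:
        f.seek(position - window)
        data = f.read(window)
    if len(data) < window:
        # File is now shorter than the saved position: it was truncated or replaced.
        return None
    return hashlib.sha256(data).hexdigest()


def can_resume(path: str, position: int, saved_digest: str, window: int = 8) -> bool:
    """True if the bytes just before `position` still match what was recorded."""
    return window_digest(path, position, window) == saved_digest
```

The digest would be stored next to the seek position whenever state is saved, then checked with `can_resume` before tailing continues from that position.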

What do you think?



[jira] [Commented] (NIFI-994) Processor to tail files

2015-10-23 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972213#comment-14972213
 ] 

Mark Payne commented on NIFI-994:
-

All,

I uploaded an initial implementation of TailFile. Please feel free to try 
out/test/review and provide any feedback. We can iterate as necessary.

Thanks
-Mark



[jira] [Commented] (NIFI-994) Processor to tail files

2015-09-30 Thread Joe Skora (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938244#comment-14938244
 ] 

Joe Skora commented on NIFI-994:


I think we are on the same page, but I left out some details.  The key is that 
the processor always starts at the beginning when it finds a file but discards 
content it thinks was previously committed downstream.

One approach could be storing a checksum of processed content with the other 
state when content is committed downstream.  Files are always handled from the 
start, but those that exist when the processor starts are checked against the 
stored state.  If the file has the same checksum at the same offset as the 
state, the content up to the offset is discarded and the file is processed from 
there on.  If the checksum at the offset is different, all the content is 
processed.
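
To make that concrete, here is a rough Python sketch of the check, assuming the stored state is a byte offset plus a SHA-256 digest of everything before it. The names `prefix_digest` and `start_offset` are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib


def prefix_digest(path: str, length: int, chunk_size: int = 64 * 1024) -> str:
    """SHA-256 of the first `length` bytes of `path` (fewer if the file is shorter)."""
    h = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.hexdigest()


def start_offset(path: str, saved_offset: int, saved_digest: str) -> int:
    """Skip already-committed content if the prefix is unchanged, else start over."""
    if prefix_digest(path, saved_offset) == saved_digest:
        return saved_offset
    return 0
```

The trade-off is that the whole prefix is re-read on every startup, which may be expensive for large files.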

Any content that ages off while the Processor is stopped will be lost, but I 
don't see a way around that.  That said, it might be possible to recognize some 
log rolling scenarios and finish processing rolled out files that were 
previously in process while the regular behavior picks up the new file.

> Processor to tail files
> ---
>
> Key: NIFI-994
> URL: https://issues.apache.org/jira/browse/NIFI-994
> Project: Apache NiFi
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Joseph Percivall
>Assignee: Joseph Percivall
>
> It's a very common data ingest situation to want to input text into the 
> system by "tailing" a file, most commonly log files. Currently we don't have 
> an easy way to do this. 
> A simple processor to tail a file would benefit many users. There would need 
> to be an option to not just tail a file but pick up where the processor left 
> off if it is interrupted.





[jira] [Commented] (NIFI-994) Processor to tail files

2015-09-30 Thread Joseph Percivall (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14937018#comment-14937018
 ] 

Joseph Percivall commented on NIFI-994:
---

Adding an email chain that relates to this processor to the comments:

For a NiFi processor, I think "tail -F" makes more sense.  As opposed
to the normal behavior that follows existing file descriptors, "tail -F"
follows by filename (or pattern) so it tracks the current instance of a
file, letting it handle new files during the run, log rotations, etc.

I definitely agree that it should take a regex or a fixed filename.

I think the biggest question is granularity.  Though tail is normally a
line oriented operation, in NiFi it should probably be "chunk" oriented
with each pass creating a new flow file with whatever new full lines are
available.

Joe Skora

-

Joe,

The problem with "tail -F" is that if NiFi is restarted and we then 
essentially do a "tail -F", we may have missed a lot of data that was written 
to the log file while NiFi was down. The idea behind this Processor is to be 
able to recover that data, even if it was written to a log file (or any other 
sort of file) while NiFi was not running or while the Processor was not 
running.

I agree that it should be "chunk oriented" - it would likely need a property 
that indicates how long to tail for a single chunk. E.g., tail for 1 second 
and create a FlowFile with the content received.

-Mark
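
As a rough illustration of the "chunk oriented" idea from the chain above, one pass could read whatever is new and keep only complete lines. This is a Python sketch under assumptions: `read_full_lines` is a hypothetical helper, lines are assumed to end with "\n", and the time bound on a chunk is not shown.

```python
from typing import Tuple


def read_full_lines(path: str, offset: int) -> Tuple[bytes, int]:
    """Return (content, new_offset) covering only complete lines after `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n")
    if end == -1:
        # No complete line yet; leave the offset where it was.
        return b"", offset
    chunk = data[: end + 1]
    return chunk, offset + len(chunk)
```

Each non-empty chunk would become the content of one FlowFile, and the returned offset is what gets saved as state.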



[jira] [Commented] (NIFI-994) Processor to tail files

2015-09-30 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936888#comment-14936888
 ] 

Mark Payne commented on NIFI-994:
-

Agreed. I'd recommend we allow the filename to tail to contain a * so that as 
things roll over we can still process the data. We could sort on last modified 
time to know the ordering of the files, and if we keep an offset into a file 
plus the timestamp when we pulled that file, that should help us to know which 
file it came from (the one with the smallest Last Modified timestamp >= our 
timestamp) and then we know which offset we left off at.
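
The lookup described above could be sketched like this in Python. It is illustrative only: `file_for_state` and `saved_timestamp` are hypothetical names, and the timestamp is assumed to be in the same units as the file system's last-modified time (seconds since the epoch).

```python
import glob
import os
from typing import Optional


def file_for_state(pattern: str, saved_timestamp: float) -> Optional[str]:
    """Pick the file with the smallest last-modified time >= saved_timestamp."""
    candidates = []
    for p in glob.glob(pattern):
        mtime = os.path.getmtime(p)
        if mtime >= saved_timestamp:
            candidates.append((mtime, p))
    # min() orders by last-modified time first, then by path as a tie-breaker.
    return min(candidates)[1] if candidates else None
```

For example, `file_for_state("/var/log/app.log*", saved_timestamp)` would return the file to which the saved offset applies, or `None` if every matching file is older than the saved timestamp.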

If the data rolls off then you're right - there's nothing we can do about that. 
Would recommend we mention in the @CapabilityDescription that we expect logs to 
be kept around long enough to recover from outages.




[jira] [Commented] (NIFI-994) Processor to tail files

2015-09-24 Thread Aldrin Piri (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907269#comment-14907269
 ] 

Aldrin Piri commented on NIFI-994:
--

I think we can make a best effort at this, but I don't think there are any 
guarantees that we have all the data.  A lot of this comes down to logging 
provider configuration.  As an example, perhaps there are constraints on size 
or time that cause records to be rotated off.  It could take long outages for 
these environments to develop, but when the data has been rolled off, it is 
gone.  Depending on how markers and such work, this brings up some interesting 
cases to consider when implementing.

As another point of consideration, it would be nice to have a 
property/properties that provide handling for rolling log formats.  Consider 
logback and log4j with their date formatted log names.



> Processor to tail files
> ---
>
> Key: NIFI-994
> URL: https://issues.apache.org/jira/browse/NIFI-994
> Project: Apache NiFi
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Joseph Percivall
>
> It's a very common data ingest situation to want to input text into the 
> system by "tailing" a file, most commonly log files. Currently we don't have 
> an easy way to do this. 
> A simple processor to tail a file would benefit many users. There would need 
> to be an option to not just tail a file but pick up where the processor left 
> off if it is interrupted.





[jira] [Commented] (NIFI-994) Processor to tail files

2015-09-23 Thread Randy Gelhausen (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905759#comment-14905759
 ] 

Randy Gelhausen commented on NIFI-994:
--

A good test case is:
Start tailing processor pointing at /var/log/app.log
Stop tailing processor
Allow logrotate to rotate /var/log/app.log
Start tailing processor

Expected Result:
Emit line 1
Emit line 2
Emit line 3
etc.

Essentially the processor needs to resume emitting from where it left off in 
the sequence of log-lines. It should handle recognizing where it left off, read 
any available rotation archives (app.log.1, app.log.2, etc.) in order, and then 
catch back up emitting from the live app.log file.
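
One way to get the read order for such a catch-up is sketched below in Python. It assumes the common logrotate convention in which a higher numeric suffix means an older archive (app.log.2 is older than app.log.1); `rotation_order` is a hypothetical helper, and compressed or date-stamped archives are not handled.

```python
import glob
import os
import re
from typing import List


def rotation_order(live_path: str) -> List[str]:
    """Return existing files oldest-first: app.log.N ... app.log.1, then app.log."""
    archives = []
    for p in glob.glob(glob.escape(live_path) + ".*"):
        # Keep only purely numeric suffixes such as ".1" or ".10".
        m = re.fullmatch(r"\.(\d+)", p[len(live_path):])
        if m:
            archives.append((int(m.group(1)), p))
    ordered = [p for _, p in sorted(archives, reverse=True)]
    if os.path.exists(live_path):
        ordered.append(live_path)
    return ordered
```

The processor would still need to work out which of these files holds its saved position before it starts emitting.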



[jira] [Commented] (NIFI-994) Processor to tail files

2015-09-23 Thread Adis Cesir (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905741#comment-14905741
 ] 

Adis Cesir commented on NIFI-994:
-

We could possibly look at using 
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/Tailer.html
 and extending it further for additional features like tracking state.
