[ 
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981169#comment-14981169
 ] 

Joe Skora commented on NIFI-994:
--------------------------------

That seems reasonable in general, and I really am trying to help.  :-D

I'm not trying to be argumentative, but I don't want you to put a big effort 
into reaching 100% if that is impossible.  I'd rather have a simpler processor 
that makes a best effort, and make sure users know about the potential problems.

Of the many possible scenarios, I picked the following 4.  Scenario #2 results 
in lost content and cannot be fixed even with checksumming.  Scenario #4 is not 
distinguishable from #3 without checksumming the whole file, and it could lose 
additional data if there was a log write between #4/T1 and #4/T3.
* Scenario #1 - file grows but no rotation occurs - no data loss
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #2 - rotation truncates file - data written after last processing 
but before truncation is lost
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2 (**LOST WRITE, 
UNFIXABLE**)
*# T3 - logger truncates file => len=0, timestamp=T3
*# T4 - logger writes 1K to file => len=1K, timestamp=T4
*# T5 - tail processor processes 0-1K, stores checksum(T5) and timestamp=T4
* Scenario #3 - file grows but no rotation occurs - no data loss (same sequence 
as #1, repeated for comparison with #4)
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #4 - rotation occurs but file size exceeds size at last processing
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - (**log write here would be lost**)
*# T3 - logger rotates file => len=0, timestamp=T3
*# T4 - logger writes 4K to file => len=4K, timestamp=T4  (**PARTIALLY LOST 
WRITE**)  (**LOOKS LIKE #3/T2**)
*# T5 - tail processor processes 2K-4K, stores checksum(T5) and timestamp=T4
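
To make Scenario #4 concrete, here is a minimal Python sketch.  It is not the 
TailFile code from the patch; the function names `tail_by_size` and 
`tail_by_checksum` and the `state` dict are hypothetical, and CRC32 is just a 
stand-in for whatever checksum the processor would store.  It assumes the whole 
file fits in memory, which a real processor could not assume.

```python
import zlib


def tail_by_size(data: bytes, state: dict) -> bytes:
    """Size-only heuristic: restart from 0 only if the file shrank."""
    offset = state.get("offset", 0)
    if len(data) < offset:
        offset = 0  # shorter than what we already read: assume rotation
    state["offset"] = len(data)
    return data[offset:]


def tail_by_checksum(data: bytes, state: dict) -> bytes:
    """Also re-checksum the already-read prefix to detect a replaced file."""
    offset = state.get("offset", 0)
    # crc32 of an empty prefix is 0, which matches the default on first run.
    prefix_changed = zlib.crc32(data[:offset]) != state.get("crc", 0)
    if len(data) < offset or prefix_changed:
        offset = 0  # the bytes we read before are gone: start over
    state["offset"] = len(data)
    state["crc"] = zlib.crc32(data)
    return data[offset:]


if __name__ == "__main__":
    before = b"A" * 2048  # Scenario #4, T0: file as seen at T1
    after = b"B" * 4096   # Scenario #4, T4: new file after rotation
    for tail in (tail_by_size, tail_by_checksum):
        state = {}
        tail(before, state)       # T1
        got = tail(after, state)  # T5
        print(tail.__name__, "read", len(got), "of", len(after), "new bytes")
```

With size alone, the T5 read picks up only 2K-4K of the new file, exactly as in 
#3/T3.  Checksumming the previously read prefix catches the rotation, but it 
means re-reading that prefix on every poll, and neither version can recover a 
write that landed between T1 and the rotation (Scenario #2/T2, #4/T2).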

As long as the file can change outside of NiFi's control (and could change 
quickly in some cases), I think it is impossible to design a lossless approach 
without copying the data, and even that could be impossible depending on volume 
and load.

Thoughts?

> Processor to tail files
> -----------------------
>
>                 Key: NIFI-994
>                 URL: https://issues.apache.org/jira/browse/NIFI-994
>             Project: Apache NiFi
>          Issue Type: New Feature
>    Affects Versions: 0.4.0
>            Reporter: Joseph Percivall
>            Assignee: Mark Payne
>             Fix For: 0.4.0
>
>         Attachments: 0001-NIFI-994-Initial-import-of-TailFile.patch, 
> 0002-NIFI-994-Ensure-that-processor-is-not-valid-due-to-t.patch
>
>
> It's a very common data ingest situation to want to input text into the 
> system by "tailing" a file, most commonly log files. Currently we don't have 
> an easy way to do this. 
> A simple processor to tail a file would benefit many users. There would need 
> to be an option to not just tail a file but pick up where the processor left 
> off if it is interrupted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
