[jira] [Commented] (NIFI-3686) EOFException on swap in causes tight loop in polling for flowfiles

2017-04-13 Thread Michael Moser (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968124#comment-15968124
 ] 

Michael Moser commented on NIFI-3686:
-

I can change this ticket from Bug to Improvement, covering both the schema 
change (so that each Record contains one FlowFile) and the handling of partial 
recovery.  I will also lower its priority because it's in the "we don't think 
this could happen and we accept the risk that we are wrong" category.

> EOFException on swap in causes tight loop in polling for flowfiles
> --
>
> Key: NIFI-3686
> URL: https://issues.apache.org/jira/browse/NIFI-3686
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.1.1
> Reporter: Michael Moser
>
> If the flowfile_repository partition fills to 100% while swapping files out 
> to a new swap file, then this swap file becomes corrupt (partially written).  
> When NiFi tries to swap this file in, an EOFException happens and we get the 
> following ERROR, which is nice.
> 2017-04-10 18:02:58,855 ERROR [Timer-Driven Process Thread-3] 
> o.a.n.controller.StandardFlowFileQueue Failed to swap in FlowFiles from Swap 
> File 
> /local/mwmoser/nifi-1.2.0-SNAPSHOT/./flowfile_repository/swap/1491574631605-2840b630-57fc-4f49-615b-0b37d77bec66-5dbc0ad0-921c-483e-a05d-5c65d014fa48.swap;
>  Swap File appears to be corrupt!
> However, once all other dataflow stops, the queue now shows 1 flowfiles 
> in it.  The processor reading from this queue constantly has its onTrigger() 
> called, and session.get() polls the queue and gets 0 files returned.  This 
> happens in a tight loop, with no other errors.
> To a user it appears that the processor is doing lots of work but just not 
> processing those 1 files.  The error message above only appears once in 
> the nifi-app.log, so you don't see anything wrong if you tail the log. 
>  When you restart NiFi, the error message above appears again, but the user 
> experience of 1 files not processing remains.
> The new SchemaSwapDeserializer does not (and perhaps cannot) implement the 
> IncompleteSwapFileException that the old SimpleSwapDeserializer does.  So, 
> reading a swap file is currently all-or-nothing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NIFI-3686) EOFException on swap in causes tight loop in polling for flowfiles

2017-04-13 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967779#comment-15967779
 ] 

Mark Payne commented on NIFI-3686:
--

[~mosermw] The FileSystemSwapManager writes the contents of a swap file to a 
temp file, then performs an fsync, and finally renames the file. So there 
should be no way to get an EOFException unless the file is actually corrupt; 
it should not be due to the contents not being completely written out. I tried 
to replicate the behavior locally by creating a 100 MB partition and putting 
the FlowFile repo there, but I wasn't able to reproduce it. So there may be 
more to this story than simply running out of disk space and not being able to 
finish writing the file.
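
For reference, the write / fsync / rename sequence described above can be 
sketched as follows. This is only an illustration of the general pattern, not 
the FileSystemSwapManager code: the class name AtomicSwapWriter, the ".part" 
suffix, and the method signature are all made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not NiFi's FileSystemSwapManager.
final class AtomicSwapWriter {

    static void write(Path target, byte[] contents) throws IOException {
        // Keep the temp file beside the target so the rename stays on one filesystem.
        Path temp = target.resolveSibling(target.getFileName() + ".part");
        try (FileChannel channel = FileChannel.open(temp,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(contents);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // Equivalent of fsync: flush file data and metadata to the device.
            channel.force(true);
        } catch (IOException e) {
            // A failed write (for example, a full disk) leaves only the temp file,
            // which is discarded; the final name never points at partial data.
            Files.deleteIfExists(temp);
            throw e;
        }
        // Publish the fully written file under its final name in one step.
        Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

With this pattern a reader only ever sees the final file name once the data has 
been fully written and synced, which is why a partially written swap file 
would not be expected from a full disk alone.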

In any case, though, I think when we are swapping it in, we should not assume 
that an EOFException means we have to lose all of the FlowFiles. We need to 
ensure that we are able to recover those FlowFiles that we can. Unfortunately, 
looking at it now, it looks like the schema that we are using has a single 
element named "FlowFiles" and the Swap File is expected to consist of a single 
"Record." We'd need to update the schema so that it allows each FlowFile to be 
written as a separate Record. The downside is that the schema would be 
incompatible. So we could still remain backward compatible but would lose 
"forward compatibility", meaning that if a Swap File gets written in the new 
format, we won't be able to recover that swap file if we rolled back to an old 
version of NiFi...
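
To illustrate why one Record per FlowFile makes partial recovery possible, 
here is a rough sketch. It does not use the real swap schema or the 
SchemaSwapDeserializer; the length-prefixed layout, the class name 
PerRecordSwapFormat, and the use of plain strings in place of FlowFiles are 
all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical format for illustration; not the actual NiFi swap schema.
final class PerRecordSwapFormat {

    // Writes each entry as its own length-prefixed record.
    static byte[] serialize(List<String> flowFileIds) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String id : flowFileIds) {
                byte[] payload = id.getBytes(StandardCharsets.UTF_8);
                out.writeInt(payload.length);
                out.write(payload);
            }
        }
        return bytes.toByteArray();
    }

    // Reads records until the stream ends, keeping every complete record
    // even when the last one is cut off.
    static List<String> recover(InputStream source) throws IOException {
        List<String> recovered = new ArrayList<>();
        DataInputStream in = new DataInputStream(source);
        try {
            while (true) {
                int length = in.readInt();
                if (length < 0) {
                    throw new IOException("Corrupt record length: " + length);
                }
                byte[] payload = new byte[length];
                in.readFully(payload);
                recovered.add(new String(payload, StandardCharsets.UTF_8));
            }
        } catch (EOFException e) {
            // Either a clean end of file or a truncated final record;
            // everything read before this point is still usable.
        }
        return recovered;
    }
}
```

With a single Record holding all FlowFiles, the same EOFException would 
surface before anything could be returned, which is the all-or-nothing 
behavior described in the ticket.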



[jira] [Commented] (NIFI-3686) EOFException on swap in causes tight loop in polling for flowfiles

2017-04-13 Thread Michael Moser (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967640#comment-15967640
 ] 

Michael Moser commented on NIFI-3686:
-

After manually removing the corrupt swap file, NiFi seems to never forget that 
it was there.  Once I let all flowfiles drain from the system, and restart 
NiFi, WriteAheadFlowFileRepository continues to believe it has 1 swap file out 
there.

INFO [pool-9-thread-1] org.wali.MinimalLockingWriteAheadLog 
org.wali.MinimalLockingWriteAheadLog@775429c8 checkpointed with 0 Records and 1 
Swap Files in 48 milliseconds (Stop-the-world time = 23 milliseconds, Clear 
Edit Logs time = 15 millis), max Transaction ID 50011




[jira] [Commented] (NIFI-3686) EOFException on swap in causes tight loop in polling for flowfiles

2017-04-10 Thread Michael Moser (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963355#comment-15963355
 ] 

Michael Moser commented on NIFI-3686:
-

Note: I didn't encounter this on a production system; I simulated it by 
truncating a swap file while NiFi was not running.

I have a simple code patch to StandardFlowFileQueue that will remove the swap 
contents from the swapQueue if the swap summary is valid.  This fixes the user 
experience by logging the EOFException ERROR to the nifi-app.log, then the 
queue size goes to 0 and the processor reading from this queue is not 
triggered.  On the next NiFi restart, if the corrupt swap file is still there, 
the EOFException ERROR happens again.  I'm not sure this is the desired 
approach, though.
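
The idea behind the patch can be shown with a toy model. This is not the 
StandardFlowFileQueue code or the actual patch; the class 
SwapAccountingSketch, the SwapFileReader interface, and the per-file count map 
are hypothetical stand-ins for the queue's real bookkeeping.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model for illustration; not NiFi's StandardFlowFileQueue.
final class SwapAccountingSketch {

    // Hypothetical reader abstraction: returns the FlowFile ids in a swap file.
    interface SwapFileReader {
        List<String> read(String location) throws IOException;
    }

    private final Deque<String> active = new ArrayDeque<>();
    private final Map<String, Integer> swapFileCounts = new LinkedHashMap<>();

    void registerSwapFile(String location, int flowFileCount) {
        swapFileCounts.put(location, flowFileCount);
    }

    // Reported queue size: in-memory FlowFiles plus those counted as swapped out.
    int size() {
        int swapped = swapFileCounts.values().stream().mapToInt(Integer::intValue).sum();
        return active.size() + swapped;
    }

    void swapIn(String location, SwapFileReader reader) {
        if (!swapFileCounts.containsKey(location)) {
            return;
        }
        try {
            active.addAll(reader.read(location));
        } catch (IOException e) {
            System.err.println("Failed to swap in FlowFiles from " + location + ": " + e);
        } finally {
            // On success the FlowFiles are now counted in 'active'; on failure they
            // are unreadable. Either way, stop counting this swap file so the queue
            // does not keep reporting FlowFiles that can never be polled.
            swapFileCounts.remove(location);
        }
    }
}
```

In this model a failed swap-in drops the count for the unreadable file, so the 
reported size falls to 0 instead of staying at a value that session.get() can 
never satisfy. It also means the FlowFiles in that file are given up on, which 
is the trade-off in question.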

[~markap14] if you can ponder this, please let me know if I should submit this 
as a PR or if it should be resolved in another way.  Thanks!
