[jira] [Commented] (NIFI-3686) EOFException on swap in causes tight loop in polling for flowfiles
    [ https://issues.apache.org/jira/browse/NIFI-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968124#comment-15968124 ]

Michael Moser commented on NIFI-3686:
-------------------------------------

I can change this ticket from Bug to Improvement for changing the schema so that each Record contains one FlowFile, and handling partial recovery. And I will lower its priority because it's in the "we don't think this could happen and we accept the risk that we are wrong" category.

> EOFException on swap in causes tight loop in polling for flowfiles
> ------------------------------------------------------------------
>
>                 Key: NIFI-3686
>                 URL: https://issues.apache.org/jira/browse/NIFI-3686
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.1.1
>            Reporter: Michael Moser
>
> If flowfile_repository partition fills 100% while swapping files out to a new
> swap file, then this swap file becomes corrupt (partially written). When
> NiFi tries to swap this file in, EOFException happens and we get following
> ERROR, which is nice.
>
> 2017-04-10 18:02:58,855 ERROR [Timer-Driven Process Thread-3]
> o.a.n.controller.StandardFlowFileQueue Failed to swap in FlowFiles from Swap
> File
> /local/mwmoser/nifi-1.2.0-SNAPSHOT/./flowfile_repository/swap/1491574631605-2840b630-57fc-4f49-615b-0b37d77bec66-5dbc0ad0-921c-483e-a05d-5c65d014fa48.swap;
> Swap File appears to be corrupt!
>
> However, once all other dataflow stops, the queue now shows 1 flowfiles
> in it. The processor reading from this queue constantly has its onTrigger()
> called, and session.get() polls the queue and gets 0 files returned. This
> happens in a tight loop, with no other errors.
>
> To a user it appears that the processor is doing lots of work but just not
> processing those 1 files. The error message above only appears once in
> the nifi-app.log, so you don't see anything wrong if you tail the log.
>
> When you restart NiFi, the error message above appears again, but the user
> experience of 1 files not processing remains.
>
> The new SchemaSwapDeserializer does not (and perhaps cannot) implement the
> IncompleteSwapFileException that the old SimpleSwapDeserializer does. So,
> reading a swap file is currently all-or-nothing.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
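To make the "one FlowFile per Record, with partial recovery" idea from Michael Moser's comment concrete, here is a minimal sketch. It is not NiFi's swap file format or API: the class `PartialSwapRecovery`, its methods, and the use of length-prefixed strings as stand-ins for FlowFile records are all assumptions made for illustration. The point is only that when each entry is written as its own record, a reader can keep everything read before the truncation instead of failing all-or-nothing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not NiFi's swap file format or API. */
class PartialSwapRecovery {

    /** Writes each entry as its own length-prefixed record. */
    static byte[] writeRecords(List<String> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String entry : entries) {
                out.writeUTF(entry); // 2-byte length followed by the payload
            }
        }
        return bytes.toByteArray();
    }

    /** Reads records until the data ends; a truncated trailing record is dropped. */
    static List<String> readRecoverable(byte[] data) throws IOException {
        List<String> recovered = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (true) {
                recovered.add(in.readUTF());
            }
        } catch (EOFException e) {
            // Clean end of data or a partially written record:
            // keep whatever was read completely before this point.
        }
        return recovered;
    }

    public static void main(String[] args) throws IOException {
        byte[] full = writeRecords(Arrays.asList("flowfile-1", "flowfile-2", "flowfile-3"));
        // Simulate a partial write by cutting off the last few bytes.
        byte[] truncated = Arrays.copyOf(full, full.length - 4);
        System.out.println(readRecoverable(truncated));
    }
}
```

In this sketch the truncated input loses only its last record; the first two are still returned. A real implementation would also need to tell a clean end of file apart from a truncated record, for example so it can report how many FlowFiles were lost.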
[jira] [Commented] (NIFI-3686) EOFException on swap in causes tight loop in polling for flowfiles
    [ https://issues.apache.org/jira/browse/NIFI-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967779#comment-15967779 ]

Mark Payne commented on NIFI-3686:
----------------------------------

[~mosermw] The FileSystemSwapManager writes the contents of a swap file to a temp file, then performs an fsync, and finally renames the file. So there should be no way to get an EOFException unless the file is actually corrupt; it should not be due to the contents not being completely written out. I tried to replicate the behavior locally by creating a 100 MB partition and putting the FlowFile repo there, but I wasn't able to replicate it. I mention that only to say there may be more to this story than simply running out of disk space and not being able to finish writing the file.

In any case, though, I think that when we are swapping the file in, we should not assume that an EOFException means we can lose all FlowFiles. We need to ensure that we recover those FlowFiles that we can. Unfortunately, looking at it now, it looks like the schema that we are using has a single element named "FlowFiles" and the Swap File is expected to consist of a single "Record." We'd need to update the schema so that it allows each FlowFile to be written as a separate Record. The downside is that the schema would be incompatible. We could still remain backward compatible but would lose "forward compatibility", meaning that if a Swap File gets written in the new format, we won't be able to recover that swap file if we rolled back to an old version of NiFi...
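The write sequence Mark Payne describes for FileSystemSwapManager (write to a temp file, fsync, then rename) can be sketched as below. This is not the NiFi code: the class `AtomicSwapWrite`, the method `writeAtomically`, and the `.part` suffix are assumptions made for illustration. It shows why a reader should never see a half-written file under the final name: the rename happens only after the bytes have been forced to disk.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration only; not the FileSystemSwapManager code. */
class AtomicSwapWrite {

    static void writeAtomically(Path target, byte[] contents) throws IOException {
        // Write under a temporary name in the same directory (same filesystem).
        Path temp = target.resolveSibling(target.getFileName() + ".part");
        try (FileChannel channel = FileChannel.open(temp,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(contents);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            channel.force(true); // fsync: flush data and metadata to the device
        } catch (IOException e) {
            // For example "No space left on device": discard the partial temp file.
            Files.deleteIfExists(temp);
            throw e;
        }
        // Only a completely written, synced file ever gets the final name.
        Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

With this pattern, a disk-full failure during the write leaves at most a leftover temp file, which is consistent with the comment's reasoning that a truncated file under the final name points to some other cause.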
[jira] [Commented] (NIFI-3686) EOFException on swap in causes tight loop in polling for flowfiles
    [ https://issues.apache.org/jira/browse/NIFI-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967640#comment-15967640 ]

Michael Moser commented on NIFI-3686:
-------------------------------------

After manually removing the corrupt swap file, NiFi seems to never forget that it was there. Once I let all flowfiles drain from the system, and restart NiFi, WriteAheadFlowFileRepository continues to believe it has 1 swap file out there.

INFO [pool-9-thread-1] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@775429c8 checkpointed with 0 Records and 1 Swap Files in 48 milliseconds (Stop-the-world time = 23 milliseconds, Clear Edit Logs time = 15 millis), max Transaction ID 50011
[jira] [Commented] (NIFI-3686) EOFException on swap in causes tight loop in polling for flowfiles
    [ https://issues.apache.org/jira/browse/NIFI-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963355#comment-15963355 ]

Michael Moser commented on NIFI-3686:
-------------------------------------

Note: I didn't encounter this on a production system; I simulated it by truncating a swap file while NiFi was not running.

I have a simple code patch to StandardFlowFileQueue that will remove the swap contents from the swapQueue if the swap summary is valid. This fixes the user experience by logging the EOFException ERROR to the nifi-app.log, then the queue size goes to 0 and the processor reading from this queue is not triggered. On the next NiFi restart, if the corrupt swap file is still there, the EOFException ERROR happens again.

I'm not sure this is the desired approach, though. [~markap14] if you can ponder this, please let me know if I should submit this as a PR or if it should be resolved in another way. Thanks!
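The tight loop in the report comes from a queue that still counts swapped-out FlowFiles it can never deliver. The sketch below illustrates the general shape of the fix Michael Moser describes in his first comment: once a swap file fails to load, stop counting its contents so the reported size can reach 0. It is not the StandardFlowFileQueue code or the actual patch. The class `SwapAwareQueueSketch`, the `SwapReader` interface, and the use of strings as stand-ins for FlowFiles are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration only; not the StandardFlowFileQueue code or the proposed patch. */
class SwapAwareQueueSketch {

    /** Stand-in for whatever deserializes a swap file. */
    interface SwapReader {
        List<String> read(String location) throws IOException;
    }

    /** A swap file location plus the FlowFile count recorded for it. */
    static final class SwapEntry {
        final String location;
        final int count;

        SwapEntry(String location, int count) {
            this.location = location;
            this.count = count;
        }
    }

    private final Deque<String> active = new ArrayDeque<>();
    private final Deque<SwapEntry> swapped = new ArrayDeque<>();
    private final SwapReader reader;
    private int swappedCount;

    SwapAwareQueueSketch(SwapReader reader) {
        this.reader = reader;
    }

    void addSwapFile(String location, int count) {
        swapped.add(new SwapEntry(location, count));
        swappedCount += count;
    }

    /** Size as a scheduler would see it: active plus swapped-out entries. */
    int size() {
        return active.size() + swappedCount;
    }

    String poll() {
        while (active.isEmpty() && !swapped.isEmpty()) {
            SwapEntry entry = swapped.poll();
            // Stop counting this file whether or not it turns out to be readable,
            // so a corrupt file cannot keep the reported size above zero.
            swappedCount -= entry.count;
            try {
                active.addAll(reader.read(entry.location));
            } catch (IOException e) {
                System.err.println("Failed to swap in from " + entry.location + ": " + e);
            }
        }
        return active.poll();
    }
}
```

A design note on the trade-off: dropping the count silently loses the FlowFiles in the corrupt file, which is why the thread goes on to discuss recovering whatever records can still be read.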