[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user devriesb commented on the issue: https://github.com/apache/nifi/pull/2416 @markap14 yes... under this somewhat unusual circumstance, my proposed solution would sacrifice data for consistency. However, if alwaysSync is set false, there isn't a guarantee of no loss anyway. So, we'd really be no worse off than we are, except that the plug getting kicked out of the wall wouldn't kill your repo. Also, I fully agree that an implementation that DOES guarantee no loss of CREATEs would be for the best. My proposal above doesn't address that in any way. It lives inside the "all or nothing" guarantee of the current (/previous?) WriteAheadFFR. ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user devriesb commented on the issue: https://github.com/apache/nifi/pull/2416 Again, I would submit that it isn't violating the data loss guarantees of the original implementation. Yes, some data could potentially be lost, but none that it guaranteed to keep, and in doing so it prevents corruption... in a handful of easily understandable lines, that don't fundamentally alter the implementation. I want to be clear... I'm in no way opposed to improvements. However, for operational data flows there is going to be some trepidation about large changes to such a key component... especially one that we've had previous issues with. The smaller / more incremental the changes, and the more time allowed for evaluation under load, the more comfortable I would imagine people would be. ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user devriesb commented on the issue: https://github.com/apache/nifi/pull/2416 I'm glad there's support for making this opt in. One point on @joewitt 's comment : "The claim of a simple fix being available to close the previous gaps doesn't appear to be backed with a suggested implementation though it does look like you received a good response to why that wasn't feasible." ... the reasons given were reasons why @markap14 's original proposed solution wouldn't work, not why another solution might not. For example (as suggested previously in email): > the immediate fix that I see needing is that when building the transaction map[1], we need to keep track of the *lowest* encountered corrupt transaction. Then, when iterating over the transaction map to rebuild the repo, we need to stop when that transaction is reached... after that we can't trust that there are no missing transactions which could lead to repo corruption or incorrect state. This is the relatively simple fix that I believe would be more appropriate to the WhriteAheadFlowFileRepository, as it would only prevent a known bug from causing corruption, as opposed to a major rewrite. I never got any response suggesting this wouldn't work, rather simply that another solution was preferred... which turned out not to be feasible. At that point, instead of circling back and trying to fix the bug, it appears as though the rewrite began. If there's a reason the author of the original implementation doesn't believe my suggestion would work to prevent the observed corruption, I'd be happy to take another look. [1] https://github.com/apache/nifi/blob/0f2ac39f69c1a744f151f0d924c9978f6790b7f7/nifi-commons/nifi-write-ahead-log/src/main/java/org/wali/MinimalLockingWriteAheadLog.java#L444 ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user markap14 commented on the issue: https://github.com/apache/nifi/pull/2416 @devriesb I believe the issue with this proposed solution is that if we have 256 partitions (the default), for instance, then if one partition is not flushed to disk during a power failure, we would end up losing all data written to the other 255 partitions. Even if we were to use an fsync() for each CREATE event, that data would be lost with the proposed solution. So this could lead to quite a bit of data loss even with fsync enabled. ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user markap14 commented on the issue: https://github.com/apache/nifi/pull/2416 I would not consider that circumstance to be unusual, but rather a common scenario if power is lost, after NIFI-4775 has been implemented. Given that NIFI-4775 was created and that there were no objections, I considered that verification that it is intended to be implemented in the future. Once this is done, it will guarantee no loss of data (though it would allow loss of processing). The proposed solution, however, still results in data loss if power is lost, but also prevents us from implementing NIFI-4775 effectively because once it is implemented it would provide us no real benefit with such a solution, as it would still throw out those fsync'ed CREATE events if another partition was not also fsync'ed. ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user markap14 commented on the issue: https://github.com/apache/nifi/pull/2416 The proposed solution does address the issue that was raised in NIFI-4774, but in doing so introduces a new issue of data loss. This is why I provided the solution that I did in the PR, as I believe that it addresses both of these issues. Again, I do not have a problem with making this new solution one that the user is able to opt out of, though. If you want to submit a PR that also incorporates your proposed solution into the MinimalLockingWirteAheadLog that is okay too and we can also review and incorporate that change as well. ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user devriesb commented on the issue: https://github.com/apache/nifi/pull/2416 I'll grant NIFI-4775 may raise issues with my proposed solution. However, there is a problem right now. My proposed solution addresses the problem right now. Future modification may require adjustments to previous assumptions. That, however, is a problem for the future. In any case, after doing some experimentation, I'm not sure the current version of NIFI-4775 is the correct approach. And whatever the eventual approach is, it may more appropriately be a new implementation (as discussed above). I don't think we should put off correcting current bugs because they may complicate potential future features. ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user markap14 commented on the issue: https://github.com/apache/nifi/pull/2416 I am not opposed to providing a mechanism for opting out. I can look into creating a new PR that will provide that. ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user mattyb149 commented on the issue: https://github.com/apache/nifi/pull/2416 +1 to @joewitt 's comments on allowing opt-out or selection of the implementation. I'm good with this being the default going forward, but I think the other impl(s) should be available to the users. ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user joewitt commented on the issue: https://github.com/apache/nifi/pull/2416 It definitely requires considerable testing. Burgess clearly did testing during his code review and he had favorable findings. Mark obviously did. I personally spent multiple weeks in long running protracted tests on a couple of systems one of which ran uninterrupted at a rate of 120,000 events per second (so WALI update rate was far higher) for 11 days with zero full gcs, perfectly stable performance, and in testing before after with a series of hard stops/restarts all behaviors were favorable. Do you have tests underway with problematic findings? I'll state that based on my experience with this project I am very comfortable with this change and my own personally verified observations and evaluation. It is critical code so it does help that the author that wrote the previous implementation also authored this one and that the problems of the previous one were well stated and the path to write it was clear. That the changes were significant in nature in terms of total lines/files impacted I dont think reflects the improved simplicity this brings and as noted it was reusing much of the core logic of the previous more complicated approach. The claim of a simple fix being available to close the previous gaps doesn't appear to be backed with a suggested implementation though it does look like you received a good response to why that wasn't feasible. You raised some good points and got a really detailed reply from mark a month ago after which others such as myself and Burgess stayed active on this important item. So, in that sense this all seems fine. Now, having said all of this I do share your view that I would have preferred to see this be something users could opt-out since the old implementation even with the known potential issue has been extremely stable for a very long time and is relied upon heavily. I think using this new implementation as the default which closes the gap and which handles automatically reading old repository formatted partitions is correct and letting users that are on the old one stay on it as they wish is perfectly valid. I dont think it is right to revert this change. But I do think it is fair to reopen the JIRA and move toward a model allowing opt-out and leveraging this as a default. @markap14 do you agree? @mattyb149 do you? Thanks ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user devriesb commented on the issue: https://github.com/apache/nifi/pull/2416 So, I was under the impression this was still a WIP. I am a HUGE -1 on this change. As @markap14 stated above, this is a critical section of code. And while the previous version has serious flaws, they are at least somewhat know based on a long period of use. In other words, we know there are problems, but they only bite us every so often. This major rewrite of a critical piece being essentially forced on users, likely without their knowledge unless they are paying close attention, seems less than ideal. This new implementation will need SIGNIFICANT testing before it can be trusted to the same degree as the previous, even with its issues. I would have greatly preferred, and will still advocate for, making this a new implementation vs. a change to the WriteAheadFlowFileRepository (e.g. SequentialWriteAheadFlowFileRepository). Again, there are other issues, but avoiding repo corruption in the previous WriteAheadFlowFileRepository would have been a reasonably simple fix, not requiring this rewrite. While the rewrite may have other benefits, making it a new implementation (even if you were to make it the default...) would give users the opportunity to evaluate and decide for themselves when they are ready to move to the new repo, without forcing them to postpone an upgrade to 1.6.0 which has other worthwhile changes. I know critical sections are change all the time, and that users won't always be aware of the changes. However, the criticality of this section combined with it's history means I think we should tread a little more lightly. ---
[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...
Github user mattyb149 commented on the issue: https://github.com/apache/nifi/pull/2416 +1 LGTM, ran full suite of tests and numerous flows in a running NiFi instance. I did not do performance testing (Joe W did quite a few) but did verify the stability. Thanks for the improvement! Merging to master ---