[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread devriesb
Github user devriesb commented on the issue:

https://github.com/apache/nifi/pull/2416
  
@markap14 yes... under this somewhat unusual circumstance, my proposed 
solution would sacrifice data for consistency.  However, if alwaysSync is set 
false, there isn't a guarantee of no loss anyway.  So, we'd really be no worse 
off than we are, except that the plug getting kicked out of the wall wouldn't 
kill your repo.

Also, I fully agree that an implementation that DOES guarantee no loss of 
CREATEs would be for the best.  My proposal above doesn't address that in any 
way.  It lives inside the "all or nothing" guarantee of the current 
(/previous?) WriteAheadFFR.


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread devriesb
Github user devriesb commented on the issue:

https://github.com/apache/nifi/pull/2416
  
Again, I would submit that it isn't violating the data loss guarantees of 
the original implementation.  Yes, some data could potentially be lost, but 
none that it guaranteed to keep, and in doing so it prevents corruption... in a 
handful of easily understandable lines, that don't fundamentally alter the 
implementation.  

I want to be clear... I'm in no way opposed to improvements.  However, for 
operational data flows there is going to be some trepidation about large 
changes to such a key component... especially one that we've had previous 
issues with.  The smaller / more incremental the changes, and the more time 
allowed for evaluation under load, the more comfortable I would imagine people 
would be.


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread devriesb
Github user devriesb commented on the issue:

https://github.com/apache/nifi/pull/2416
  
I'm glad there's support for making this opt in.  One point on @joewitt 's 
comment : "The claim of a simple fix being available to close the previous gaps 
doesn't appear to be backed with a suggested implementation though it does look 
like you received a good response to why that wasn't feasible." ... the reasons 
given were reasons why @markap14 's original proposed solution wouldn't work, 
not why another solution might not.  For example (as suggested previously in 
email):

> the immediate fix that I see needing is that when building the 
transaction map[1], we need to keep track of the *lowest* encountered corrupt 
transaction. Then, when iterating over the transaction map to rebuild the repo, 
we need to stop when that transaction is reached...  after that we can't trust 
that there are no missing transactions which could lead to repo corruption or 
incorrect state.

This is the relatively simple fix that I believe would be more appropriate 
to the WhriteAheadFlowFileRepository, as it would only prevent a known bug from 
causing corruption, as opposed to a major rewrite.  I never got any response 
suggesting this wouldn't work, rather simply that another solution was 
preferred... which turned out not to be feasible.  At that point, instead of 
circling back and trying to fix the bug, it appears as though the rewrite began.

If there's a reason the author of the original implementation doesn't 
believe my suggestion would work to prevent the observed corruption, I'd be 
happy to take another look.

[1] 
https://github.com/apache/nifi/blob/0f2ac39f69c1a744f151f0d924c9978f6790b7f7/nifi-commons/nifi-write-ahead-log/src/main/java/org/wali/MinimalLockingWriteAheadLog.java#L444



---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread markap14
Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2416
  
@devriesb I believe the issue with this proposed solution is that if we 
have 256 partitions (the default), for instance, then if one partition is not 
flushed to disk during a power failure, we would end up losing all data written 
to the other 255 partitions. Even if we were to use an fsync() for each CREATE 
event, that data would be lost with the proposed solution. So this could lead 
to quite a bit of data loss even with fsync enabled.


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread markap14
Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2416
  
I would not consider that circumstance to be unusual, but rather a common 
scenario if power is lost, after NIFI-4775 has been implemented. Given that 
NIFI-4775 was created and that there were no objections, I considered that 
verification that it is intended to be implemented in the future. Once this is 
done, it will guarantee no loss of data (though it would allow loss of 
processing). The proposed solution, however, still results in data loss if 
power is lost, but also prevents us from implementing NIFI-4775 effectively 
because once it is implemented it would provide us no real benefit with such a 
solution, as it would still throw out those fsync'ed CREATE events if another 
partition was not also fsync'ed.


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread markap14
Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2416
  
The proposed solution does address the issue that was raised in NIFI-4774, 
but in doing so introduces a new issue of data loss. This is why I provided the 
solution that I did in the PR, as I believe that it addresses both of these 
issues.

Again, I do not have a problem with making this new solution one that the 
user is able to opt out of, though. If you want to submit a PR that also 
incorporates your proposed solution into the MinimalLockingWirteAheadLog that 
is okay too and we can also review and incorporate that change as well.


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread devriesb
Github user devriesb commented on the issue:

https://github.com/apache/nifi/pull/2416
  
I'll grant NIFI-4775 may raise issues with my proposed solution. However, 
there is a problem right now.  My proposed solution addresses the problem right 
now.  Future modification may require adjustments to previous assumptions.  
That, however, is a problem for the future.  

In any case, after doing some experimentation, I'm not sure the current 
version of NIFI-4775 is the correct approach.  And whatever the eventual 
approach is, it may more appropriately be a new implementation (as discussed 
above).  I don't think we should put off correcting current bugs because they 
may complicate potential future features.


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread markap14
Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2416
  
I am not opposed to providing a mechanism for opting out. I can look into 
creating a new PR that will provide that.


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread mattyb149
Github user mattyb149 commented on the issue:

https://github.com/apache/nifi/pull/2416
  
+1 to @joewitt 's comments on allowing opt-out or selection of the 
implementation. I'm good with this being the default going forward, but I think 
the other impl(s) should be available to the users.


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread joewitt
Github user joewitt commented on the issue:

https://github.com/apache/nifi/pull/2416
  
It definitely requires considerable testing.  Burgess clearly did testing 
during his code review and he had favorable findings.  Mark obviously did.  I 
personally spent multiple weeks in long running protracted tests on a couple of 
systems one of which ran uninterrupted at a rate of 120,000 events per second 
(so WALI update rate was far higher) for 11 days with zero full gcs, perfectly 
stable performance, and in testing before after with a series of hard 
stops/restarts all behaviors were favorable.  Do you have tests underway with 
problematic findings? I'll state that based on my experience with this project 
I am very comfortable with this change and my own personally verified 
observations and evaluation.

It is critical code so it does help that the author that wrote the previous 
implementation also authored this one and that the problems of the previous one 
were well stated and the path to write it was clear.  That the changes were 
significant in nature in terms of total lines/files impacted I dont think 
reflects the improved simplicity this brings and as noted it was reusing much 
of the core logic of the previous more complicated approach.

The claim of a simple fix being available to close the previous gaps 
doesn't appear to be backed with a suggested implementation though it does look 
like you received a good response to why that wasn't feasible.

You raised some good points and got a really detailed reply from mark a 
month ago after which others such as myself and Burgess stayed active on this 
important item.  So, in that sense this all seems fine.

Now, having said all of this I do share your view that I would have 
preferred to see this be something users could opt-out since the old 
implementation even with the known potential issue has been extremely stable 
for a very long time and is relied upon heavily. I think using this new 
implementation as the default which closes the gap and which handles 
automatically reading old repository formatted partitions is correct and 
letting users that are on the old one stay on it as they wish is perfectly 
valid.

I dont think it is right to revert this change.  But I do think it is fair 
to reopen the JIRA and move toward a model allowing opt-out and leveraging this 
as a default.  @markap14 do you agree?  @mattyb149  do you? 

Thanks


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-21 Thread devriesb
Github user devriesb commented on the issue:

https://github.com/apache/nifi/pull/2416
  
So, I was under the impression this was still a WIP.  I am a HUGE -1 on 
this change.  As @markap14 stated above, this is a critical section of code.  
And while the previous version has serious flaws, they are at least somewhat 
know based on a long period of use.  In other words, we know there are 
problems, but they only bite us every so often.  

This major rewrite of a critical piece being essentially forced on users, 
likely without their knowledge unless they are paying close attention, seems 
less than ideal.  This new implementation will need SIGNIFICANT testing before 
it can be trusted to the same degree as the previous, even with its issues.

I would have greatly preferred, and will still advocate for, making this a 
new implementation vs. a change to the WriteAheadFlowFileRepository (e.g. 
SequentialWriteAheadFlowFileRepository).  Again, there are other issues, but 
avoiding repo corruption in the previous WriteAheadFlowFileRepository would 
have been a reasonably simple fix, not requiring this rewrite.  While the 
rewrite may have other benefits, making it a new implementation (even if you 
were to make it the default...) would give users the opportunity to evaluate 
and decide for themselves when they are ready to move to the new repo, without 
forcing them to postpone an upgrade to 1.6.0 which has other worthwhile changes.

I know critical sections are change all the time, and that users won't 
always be aware of the changes.  However, the criticality of this section 
combined with it's history means I think we should tread a little more lightly.


---


[GitHub] nifi issue #2416: NIFI 4774: Provide alternate implementation of Write-Ahead...

2018-02-19 Thread mattyb149
Github user mattyb149 commented on the issue:

https://github.com/apache/nifi/pull/2416
  
+1 LGTM, ran full suite of tests and numerous flows in a running NiFi 
instance. I did not do performance testing (Joe W did quite a few) but did 
verify the stability.  Thanks for the improvement! Merging to master


---