Oh, sorry for the confusion, I should provide more context. Leader will use on disk txn sync with followers to if the peer zxid is not in it's in memory commit logs, the code is here: Leader on disk txn sync <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L774>. There is bug that potentially there will be gap in the txn files, like after snap sync, etc, so it's possible the peer will miss txns due to this.
The option to disable it is snapshotSizeFactor <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZKDatabase.java#L81>, set it to -1 will disable this feature. On 3.5, it's better to have a PR to set this to -1 by default. It might have more SNAP sync, but from our prod it doesn't seem to be a big problem to me. I can send out the diff to disable it by default on 3.5 if you guys think this is the right way to do. Thanks, Fangmin On Thu, Sep 13, 2018 at 1:58 AM Andor Molnar <an...@apache.org> wrote: > What’s needed to turn it off? > Do we need a PR or it’s just a config option? > Shall we implement a feature switch for that and turn it off by default? > > Sorry I don’t have too much insight on disk txn sync. > > Andor > > > > > On 2018. Sep 13., at 9:16, Fangmin Lv <lvfang...@gmail.com> wrote: > > > > And to be clear, ZOOKEEPER-2418 is actually just one case of > inconsistency > > which could caused by on disk txn sync, as I mentioned in a newer JIRA > > ZOOKEEPER-2846 <https://issues.apache.org/jira/browse/ZOOKEEPER-2846>, > the > > snap sync or txn sync could also leave txns gap in the txn file, which > is a > > more common case could trigger this issue. > > > > I would suggest to turn off the on disk txn sync by default for now to > > avoid this issue, after we finished ZOOKEEPER-3114, we can use that to > > validate the on disk txns during syncing. > > > > Thanks, > > Fangmin > > > > On Wed, Sep 12, 2018 at 9:55 AM Fangmin Lv <lvfang...@gmail.com> wrote: > > > >> Andor, > >> > >> ZOOKEEPER-3114 is about adding real time digest checking to help > detecting > >> inconsistency, it's a new feature with amounts of code change. I'll > start > >> upstream it part by part, but I don't expect it's being merged in the > next > >> few weeks. So yes, it's a nice to have, but definitely not a block for > 3.5. > >> > >> Thanks, > >> Fangmin > >> > >> On Wed, Sep 12, 2018 at 2:55 AM Andor Molnar <an...@apache.org> wrote: > >> > >>> Fangmin, > >>> > >>> Sorry, I just noticed that you want to include the consistency fixes in > >>> the stable version which is fine. Let’s finish the backports and we’ll > be > >>> done with them. > >>> > >>> ZOOKEEPER-3114 is essentially a new feature, I wouldn’t block 3.5 with > >>> that. What do you think? > >>> > >>> Andor > >>> > >>> > >>> > >>>> On 2018. Sep 12., at 11:52, Andor Molnar <an...@apache.org> wrote: > >>>> > >>>> Cool, thanks for the clarification. > >>>> > >>>> The updated list is as follows: > >>>> > >>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol) > >>>> - ZOOKEEPER-1818 (Fix don't care for trunk) > >>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with > >>> leader and follower receiving external connection requests.) > >>>> > >>>> The following are not critical and no blockers for the stable release: > >>>> > >>>> Waiting for to be ported to 3.5: > >>>> - ZOOKEEPER-3104 > >>>> - ZOOKEEPER-3125 > >>>> - ZOOKEEPER-3127 > >>>> > >>>> New feature: > >>>> - ZOOKEEPER-3114 (fixes ZOOKEEPER-2184 too) > >>>> > >>>> Regards, > >>>> Andor > >>>> > >>>> > >>>> > >>>>> On 2018. Sep 12., at 0:42, Fangmin Lv <lvfang...@gmail.com> wrote: > >>>>> > >>>>> Hi Andor, > >>>>> > >>>>> That's the on disk txn feature, which was disabled internally after > we > >>>>> found the potentially inconsistent issue. The only solution we have > >>> for now > >>>>> is waiting for the new digest checking feature I mentioned in > >>>>> ZOOKEEPER-3114. > >>>>> > >>>>> I think there are some other critical consistent issues we just fixed > >>> on > >>>>> master recently: ZOOKEEPER-3104, ZOOKEEPER-3125, ZOOKEEPER-3127, I > >>> think we > >>>>> should include that in the official 3.5 release as well. > >>>>> > >>>>> Thanks, > >>>>> Fangmin > >>>>> > >>>>> On Tue, Sep 11, 2018 at 11:58 AM Andor Molnár <an...@apache.org> > >>> wrote: > >>>>> > >>>>>> Hi Jeelani, > >>>>>> > >>>>>> > >>>>>> Thanks for letting me know. I'm happy to remove it from the list to > >>> get > >>>>>> closer to a stable release. :) > >>>>>> > >>>>>> What's the feature which can be disabled to avoid data > inconsistency? > >>>>>> > >>>>>> > >>>>>> Andor > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 09/10/2018 11:33 PM, Mohamed Jeelani wrote: > >>>>>>> Thanks Andor for compiling this. Should we be ignoring > >>> ZOOKEEPER-2418 as > >>>>>> well? This exists in 3.4 as well and the feature can be disabled. We > >>> are > >>>>>> working on a longer term fix for it in 3.6. > >>>>>>> > >>>>>>> Regards, > >>>>>>> > >>>>>>> Jeelani > >>>>>>> > >>>>>>> On 9/10/18, 5:19 AM, "Andor Molnar" <an...@cloudera.com.INVALID> > >>> wrote: > >>>>>>> > >>>>>>> Fine. > >>>>>>> > >>>>>>> I'm happy to ignore 1549, 2846 and 2930. Still we have the list > of: > >>>>>>> > >>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol) > >>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk) > >>>>>>> - ZOOKEEPER-2418 (txnlog diff sync can skip sending some > >>>>>> transactions to > >>>>>>> followers) > >>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync > >>>>>> with > >>>>>>> leader and follower receiving external connection requests.) > >>>>>>> > >>>>>>> SSL (ZK-236) is a feature which essential for the 3.5 release, > >>> hence > >>>>>> I > >>>>>>> wouldn't leave it out or postpone it for the next stable release. > >>> PR > >>>>>> has > >>>>>>> been out for a long time, get on reviewing please. > >>>>>>> The rest are also long outstanding issues which have been found in > >>>>>> the 3.5 > >>>>>>> branch. > >>>>>>> ZK-1818 is something which was found in 3.4 and fixed in 3.4, but > >>>>>> never has > >>>>>>> been fixed in 3.5. Quite a serious issue if still present. > >>>>>>> > >>>>>>> I think we should at least run some manual testing and see if we > >>>>>> could > >>>>>>> repro any of these issues before going ahead with a stable > release. > >>>>>>> > >>>>>>> Regards, > >>>>>>> Andor > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Fri, Sep 7, 2018 at 3:24 AM, Michael Han <h...@apache.org> > >>> wrote: > >>>>>>> > >>>>>>>> I haven't went through the entire list, but looks like lots of the > >>>>>> JIRA > >>>>>>>> issues listed in this thread, such as ZOOKEEPER-1549, 2846, also > >>>>>> affects > >>>>>>>> 3.4 releases. Should we scope these issues out? > >>>>>>>> > >>>>>>>> I think historically the single outstanding blocking issue for a > >>>>>> stable 3.5 > >>>>>>>> release is the reconfig feature and security concerns around it > >>>>>> (somehow > >>>>>>>> addressed in ZOOKEEPER-2014), and the alpha and beta releases were > >>>>>> created > >>>>>>>> to stabilize that feature. > >>>>>>>> > >>>>>>>> > >>>>>> > >>> > https://urldefense.proofpoint.com/v2/url?u=http-3A__zookeeper-2Duser.578899.n2.nabble.com_Zookeeper-2Dwith-2D&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=_tGtL3nMWtuPrXKXDx27AIWOzyyT7W-CjIVLDFZwT0E&e= > >>>>>>>> SSL-release-date-tt7581744.html > >>>>>>>> > >>>>>>>> So it looks like we are in good shape to release. Something might > >>>>>> worth > >>>>>>>> doing to claim the quality of 3.5 is on par with 3.4 > >>>>>>>> > >>>>>>>> * Run Jepsen on 3.5 - 3.4 passed the test for the record > >>>>>>>> > >>>>>> > >>> > https://urldefense.proofpoint.com/v2/url?u=https-3A__aphyr.com_posts_291-2Djepsen-2Dzookeeper&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=VjORkX5s7hrJyl8mW9Q4cfeSWF4qfTdyRjcuAiBt0y4&e= > >>>>>>>> * Fix all flaky tests on 3.5 - 3.4 has little or no flaky tests at > >>>>>> all. > >>>>>>>> > >>>>>>>> > >>>>>>>> On Tue, Sep 4, 2018 at 1:48 AM, Andor Molnar > >>>>>> <an...@cloudera.com.invalid> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Thanks Maoling! That would be huge help, I appreciate it. > >>>>>>>>> > >>>>>>>>> Andor > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>> > >>> > >>> > >