Very well said, thank you Ted!

>> I would still opt for quorum outside rather than quorum as a library.

One observation on outside quorum vs. library: for Raft, CockroachDB and
TiDB both chose the library approach instead of depending on etcd, though
they both share etcd's Raft implementation. ZooKeeper could be used in a
similar way if we can abstract ZAB and provide a nice SMR interface on top
of it.

On Fri, Aug 2, 2019 at 12:44 PM Ted Dunning <ted.dunn...@gmail.com> wrote:

> The core issue in these situations in my experience is that having the
> quorum as a separate service can be a pain point. This misunderstanding
> about how watches work and why they don't provide the data is just a
> symptom of this. Having an integrated quorum is very attractive from the
> point of view of management and tighter integration with the record of
> state.
>
> If I had it all to do over again, though, I think I would still opt for
> quorum outside rather than quorum as a library. There are management
> burdens, but many of those management burdens are implicit in the fact that
> managing the state of the system is different from managing the system or
> doing the stuff the system does. Pulling the quorum system into the
> do-stuff system doesn't actually make life all that much easier even if it
> does simplify the installer.
>
> The countervailing risk that you are likely to get a quorum system wrong is
> really significant. Having a battle-tested (some might say battle-scarred)
> system like ZK is quite a virtue since you can have a different level of
> confidence in it than something you whipped up last week.
>
> On Fri, Aug 2, 2019 at 11:49 AM Patrick Hunt <ph...@apache.org> wrote:
>
> > Michael I think you are describing subscribe - this?
> > https://issues.apache.org/jira/browse/ZOOKEEPER-153
> > wasn't there some work done to keep tlogs around for a while? Or am I
> > misremembering? (fb folks?)
> >
> > I'll also add that we haven't done any benchmarking in quite some time.
> > It would be interesting to collect a few of these use cases from the
> > community, esp downstreams, and evaluate performance, see if we can
> > address.
> >
> > Patrick
> >
> > On Fri, Aug 2, 2019 at 11:03 AM Michael Han <h...@apache.org> wrote:
> >
> > > Folks,
> > >
> > > Some of you might already see this. Comments?
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
> > >
> > > What caught my eyes are:
> > >
> > > *Worse still, although ZooKeeper is the store of record, the state in
> > > ZooKeeper often doesn't match the state that is held in memory in the
> > > controller. For example, when a partition leader changes its ISR in ZK,
> > > the controller will typically not learn about these changes for many
> > > seconds. There is no generic way for the controller to follow the
> > > ZooKeeper event log. Although the controller can set one-shot watches,
> > > the number of watches is limited for performance reasons. When a watch
> > > triggers, it doesn't tell the controller the current state-- only that
> > > the state has changed. By the time the controller re-reads the znode
> > > and sets up a new watch, the state may have changed from what it was
> > > when the watch originally fired. If there is no watch set, the
> > > controller may not learn about the change at all. In some cases,
> > > restarting the controller is the only way to resolve the discrepancy.*
> > >
> > > I've seen some similar zookeeper use cases that ended up like what's
> > > described here. How can ZooKeeper solve this? It seems to me that the
> > > only solution is to provide linearizable read on watched operations.
> > > Thoughts?
> > >
> > > Michael.
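
P.S. To make the one-shot watch behavior from the quoted KIP-500 text
concrete, here is a toy sketch in Python. It is not the ZooKeeper client
API: ToyStore, Controller, and the "/isr" path are hypothetical names made
up for illustration, and the model covers only the notification semantics
(a watch fires once, carries no data, and must be re-armed by a new read).
Under those assumptions, a controller that re-reads after a notification
sees the latest value but never the intermediate one.

```python
from typing import Callable, Dict, List, Optional


class ToyStore:
    """In-memory stand-in for a znode store (hypothetical, not ZooKeeper's API)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watches: Dict[str, List[Callable[[str], None]]] = {}

    def get(self, path: str,
            watcher: Optional[Callable[[str], None]] = None) -> Optional[str]:
        # A read may register a watch; the watch is armed for one change only.
        if watcher is not None:
            self._watches.setdefault(path, []).append(watcher)
        return self._data.get(path)

    def set(self, path: str, value: str) -> None:
        self._data[path] = value
        # One-shot: watchers are removed before they fire, and they are told
        # only which path changed, not the new value.
        for watcher in self._watches.pop(path, []):
            watcher(path)


class Controller:
    """Keeps a log of every value it actually observed for one path."""

    def __init__(self, store: ToyStore, path: str) -> None:
        self.store = store
        self.path = path
        self.pending = False
        self.observed: List[Optional[str]] = []
        self._read_and_rewatch()

    def _on_change(self, path: str) -> None:
        # The notification carries no data, so just remember to re-read.
        self.pending = True

    def _read_and_rewatch(self) -> None:
        value = self.store.get(self.path, watcher=self._on_change)
        self.observed.append(value)

    def poll(self) -> None:
        # Models the delay between the watch firing and the re-read.
        if self.pending:
            self.pending = False
            self._read_and_rewatch()


def demo() -> List[Optional[str]]:
    store = ToyStore()
    store.set("/isr", "v1")
    controller = Controller(store, "/isr")  # reads v1 and arms a watch
    store.set("/isr", "v2")  # fires the watch; no watch is armed afterwards
    store.set("/isr", "v3")  # nobody is notified of this change
    controller.poll()        # re-read returns v3; v2 was never observed
    return controller.observed


if __name__ == "__main__":
    print(demo())  # ['v1', 'v3']
```

In this toy model the controller ends up consistent with the latest state
once it re-reads, but it cannot reconstruct the sequence of changes, which
is the gap a subscribe-style or log-following API would have to close.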