Sijie, Penghui,

Thank you. Can we get the PIP to be a more detailed write-up? I would
like this PIP to be more comprehensive.

>>Hence we need to reach an agreement on understanding
>>WHAT actually guarantees correctness in the current Pulsar design. We
>>can then move forward with a conclusion about how to do it.

That would be great. I have been thinking recently about whether we can
formalize the system through a TLA+ model. That would be ideal, but it would
also require BK to provide one. Whether we use TLA+ or not, we should have
an understanding of WHAT actually guarantees correctness in the current
Pulsar design _written down_. At least then we will have a model against
which we can align changes and fixes.

Joe

On Sun, Feb 23, 2020 at 11:44 PM Sijie Guo <guosi...@gmail.com> wrote:

> Sorry for the late reply.
>
> Joe - There are two things I would like to clarify first.
>
> 1) I think you have a misunderstanding about the zookeeper lock ("ephemeral
> znode") and bookkeeper/ML fencing. Let's step back to understand the current
> Pulsar behavior first.
>
> - A zookeeper lock doesn't prevent a dual-writers situation from happening.
> A simple case - Node A creates a lock and Node B stands by. If the lock (the
> ephemeral znode) expires at zookeeper, Node B acquires the lock and becomes
> the owner. But Node A might NOT receive the session-expiry notification
> because of a network partition, so A still thinks it is the owner. So there
> is a window during which both A and B think they are the owner.
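A toy model of that timeline makes the window concrete. Everything here is illustrative (these are not Pulsar or ZooKeeper APIs); it only shows that the lock alone cannot stop two nodes from both believing they are the owner:

```python
# Toy timeline of the dual-owner window: A holds the lock, its session
# expires while A is partitioned, B acquires, and both now believe they own
# the resource. Illustrative only; not Pulsar/ZooKeeper client code.

class Broker:
    def __init__(self, name):
        self.name = name
        self.believes_owner = False

    def on_acquire(self):
        self.believes_owner = True

    def on_session_expired_notification(self):
        # Only runs if the notification actually reaches the broker.
        self.believes_owner = False

a, b = Broker("A"), Broker("B")

a.on_acquire()        # t0: A holds the lock (the ephemeral znode)
# t1: A's session expires at ZooKeeper and the znode is deleted, but A is
# network-partitioned, so the expiry notification never arrives at A.
b.on_acquire()        # t2: B acquires the lock and becomes the owner

dual_owners = a.believes_owner and b.believes_owner
print(dual_owners)    # True: the lock alone did not prevent dual ownership
```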
>
> - The correctness of using a zookeeper lock should be gated by the
> exclusiveness of a resource. In ML, the exclusiveness is provided by
> single-writer semantics offered by bookkeeper and CAS operations offered by
> zookeeper.
>
> So a zookeeper lock (or an external locking mechanism) only ensures
> "stable" ownership of a resource over a long duration, but it doesn't
> prevent dual ownership. The resource itself should provide a mechanism to
> ensure exclusiveness. BookKeeper/ML does that via fencing. HBase uses ACLs
> to achieve "fencing" for regions.
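A minimal fencing sketch, in the spirit of the blog post linked below: the resource itself rejects writes from a stale owner, no matter what the lock says. This is illustrative only - BookKeeper fences at the ledger level rather than with explicit integer tokens like this:

```python
# Fencing-token sketch: the resource tracks the highest token it has seen
# and rejects any write carrying a lower (stale) token. Illustrative only.

class FencedResource:
    def __init__(self):
        self.highest_token = -1
        self.writes = []

    def write(self, token, data):
        # Reject tokens lower than the highest one already accepted.
        if token < self.highest_token:
            raise PermissionError(f"fenced: stale token {token}")
        self.highest_token = token
        self.writes.append(data)

resource = FencedResource()
resource.write(token=1, data="from A")   # A writes while it holds the lock
resource.write(token=2, data="from B")   # B takes over with a newer token

try:
    resource.write(token=1, data="late write from A")   # A's stale write
except PermissionError as err:
    print(err)                           # fenced: stale token 1
```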
>
> Martin Kleppmann wrote a blog post about this. It is well worth checking
> out:
>
> https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
>
> 2) Increasing the session timeout doesn't prevent session expiry. A session
> can still expire when the zookeeper leader crashes, the client pauses, or
> there is any network hiccup. Increasing the session timeout also has a
> side-effect: the broker fail-over time increases as well, which means topics
> are unavailable for longer when a broker goes down.
>
>
> ---
>
> Penghui, Joe,
>
> I think there are multiple things coupled in this discussion related to
> zookeeper.
>
> ZooKeeper is mainly used in two places.
>
> 1) "ownership" for bundles and service discovery for "brokers". It uses
> ephemeral znodes. They are session related and will be impacted by session
> expiries.
> 2) metadata management for both policies and ML. It uses persistent znodes.
> They are not session related. But unfortunately, they are also impacted by
> session expiries. Because zookeeper ties session management to connection
> management.
>
> For 2), it is safe to retry creating a zookeeper client to establish a new
> session when the session expires. We can set the session expiry time as high
> as we like, since it doesn't impact failover time.
>
> So the main discussion should be about 1) - whether it is safe to
> re-establish a zookeeper session and re-acquire bundles after the previous
> session has expired. Hence we need to reach an agreement on understanding
> WHAT actually guarantees correctness in the current Pulsar design. We
> can then move forward with a conclusion about how to do it.
>
> - Sijie
>
> On Sat, Feb 22, 2020 at 11:40 PM Joe F <joefranc...@gmail.com> wrote:
>
> > On Sat, Feb 22, 2020 at 6:28 PM PengHui Li <codelipeng...@gmail.com>
> > wrote:
> >
> > > Hi, joe
> > >
> > > The fundamental correctness is guaranteed by the fencing mechanism
> > > provided by Apache BookKeeper and the CAS operation provided by the
> > > metadata storage. Both fencing and CAS operations will prevent two
> owners
> > > updating data or metadata at the same time.
> >
> >
> > This may be. As I said, "It may be possible that underlying lower-level
> > locks may prevent catastrophe, but that does not validate this violation
> > of basic principles." I am familiar enough with how BK works, and how
> > CAS works for ML metadata storage, to know where all the bodies are
> > buried.
> >
> > > This default shutdown behavior isn't changed. We just introduce an
> > > alternative way to improve stability when Zookeeper is doing leader
> > > election or becomes unavailable.
> >
> >
> > This is just a claim. I would argue that it does the opposite.
> >
> >
> > >
> > >
> > > According to the following rules, I think this will not break the
> > > current system principles.
> > >
> > >
> > > 1. If the znode for the bundle is deleted, this is consistent with the
> > > current behavior. The broker that acquires the lock first will become
> > > the owner.
> > > 2. If the znode for the bundle is not deleted, other brokers are also
> > > unable to acquire the lock. Both the broker that re-creates the session
> > > and the other brokers need to wait for the znode deletion. Then the
> > > broker that acquires the lock first will become the owner.
> > > 3. If the bundle ownership changed, the broker that re-creates the
> > > session is unable to acquire the lock, so the broker should unload the
> > > bundle. This is also consistent with the current ownership-change
> > > behavior.
> > > 4. Also, if unexpected exceptions are thrown during the re-own process,
> > > the broker needs to shut itself down.
> > >
> > I don't think you address the issue I have raised. Say 30 secs is the
> > timeout. Let us say broker B1 loses its connection at t. Then B1 loses
> > the session at t+30 secs. With your logic, B1 continues to service the
> > topic as if it still owns it. Meanwhile B2 acquires the topic at t+31 and
> > loses its connection at t+32 (and loses its session at t+62). At t+62 B3
> > acquires it, and loses its connection at t+63. Now B4 acquires it, and
> > then B4 crashes. Now the original broker B1 re-acquires the session and
> > goes on, merrily operating as if nothing occurred in the meantime (and so
> > could B2 and B3).
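Tracing that sequence under a naive "reconnect and carry on" policy shows the problem: a partitioned broker never hears that its session expired, so its stale belief survives. A toy model only, not Pulsar code:

```python
# Trace of the B1..B4 timeline: "partitioned" means the broker lost its
# session but never received the expiry notification, so its belief that it
# owns the topic is never cleared. Illustrative only.

history = [
    ("B1", "acquire"),       # t: B1 owns the topic
    ("B1", "partitioned"),   # t+30: session gone, but B1 never learns it
    ("B2", "acquire"),       # t+31
    ("B2", "partitioned"),   # t+32 (session gone at t+62)
    ("B3", "acquire"),       # t+62
    ("B3", "partitioned"),   # t+63
    ("B4", "acquire"),
    ("B4", "crash"),
    ("B1", "reconnect"),     # B1 resumes as if nothing happened
]

believes_it_owns = set()
for broker, event in history:
    if event in ("acquire", "reconnect"):
        believes_it_owns.add(broker)
    elif event == "crash":
        believes_it_owns.discard(broker)
    # "partitioned": the expiry notification never arrives, belief unchanged

print(sorted(believes_it_owns))   # ['B1', 'B2', 'B3']
```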
> >
> > All fine, as you say, because lower-level locks in BK and ML prevent
> > catastrophe...
> >
> > If you want to make the case that bundle ownership does not guarantee
> > underlying topic ownership, and that topic ownership is arbitrated by
> > BK/ML (metadata), then explicitly make that case. Then we can debate the
> > merits of it, and see if the code and design allow for it. Because as it
> > stands, that is not how Pulsar is designed: today, topic ownership is
> > arbitrated by the bundle lock. This is not a change that should casually
> > be slipped into the system.
> >
> > And my original question still stands - if this session loss is such an
> > issue for some use cases, why not raise the session timeout? The broker
> > can safely keep the session for longer. That's far preferable to running
> > the risk of doing this.
> >
> >
> > >
> > > Thanks,
> > > Penghui
> > > On Feb 22, 2020, 12:27 +0800, PengHui Li <codelipeng...@gmail.com>,
> > wrote:
> > > > Hi all,
> > > >
> > > > I have drafted a proposal for improving the broker's Zookeeper session
> > > > timeout handling. You can find it at
> > > > https://github.com/apache/pulsar/wiki/PIP-57%3A-Improve-Broker%27s-Zookeeper-Session-Timeout-Handling
> > > >
> > > > Also, I have copied it into the email thread for easier viewing. Any
> > > > suggestions or ideas are welcome; please join the discussion.
> > > >
> > > >
> > > > PIP 57: Improve Broker's Zookeeper Session Timeout Handling
> > > > Motivation
> > > > In Pulsar, brokers use Zookeeper for configuration storage and for
> > > > maintaining broker metadata. We can also call these the Global
> > > > Zookeeper and the Local Zookeeper.
> > > > The Global Zookeeper maintains the namespace policies, cluster
> > > > metadata, and partitioned topic metadata. To reduce read operations on
> > > > Zookeeper, each broker has a cache of the global Zookeeper. The Global
> > > > Zookeeper cache is updated on znode changes. Currently, when a session
> > > > timeout happens on the global Zookeeper, a new session starts. The
> > > > broker does not create any EPHEMERAL znodes on the global Zookeeper.
> > > > The Local Zookeeper maintains the local cluster metadata, such as
> > > > broker load data, topic ownership data, managed ledger metadata, and
> > > > Bookie rack information. Both broker load data and topic ownership
> > > > data are stored as EPHEMERAL znodes on the Local Zookeeper. Currently,
> > > > when a session timeout happens on the Local Zookeeper, the broker
> > > > shuts itself down.
> > > > Shutting down a broker results in an ownership change for the topics
> > > > that the broker owned. However, we have encountered lots of problems
> > > > related to the current session timeout handling, such as a broker with
> > > > a long JVM GC pause, or the Local Zookeeper being under a high
> > > > workload. The latter in particular may cause all brokers to shut down.
> > > > So, the purpose of this proposal is to improve the session timeout
> > > > handling on the Local Zookeeper to avoid unnecessary broker shutdowns.
> > > > Approach
> > > > In the same way as the Global Zookeeper session timeout handling and
> > > > the Zookeeper session timeout handling in BookKeeper, a new session
> > > > should start when the current session times out.
> > > > If a new session fails to start, the broker retries several times.
> > > > The number of retries depends on the broker's configuration. After
> > > > that number of retries, if a session still cannot be started
> > > > successfully, the broker still needs to be shut down, since there may
> > > > be a problem with the Zookeeper cluster. The user needs to restart the
> > > > broker after the zookeeper cluster returns to normal.
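The retry-then-shutdown policy above can be sketched as follows. `start_new_session` and `shutdown_broker` are hypothetical stand-ins for the real broker hooks, and the linear backoff is an assumption, not part of the PIP:

```python
# Sketch of the proposed policy: on session expiry, try to start a new
# session up to a configured number of times; shut the broker down only if
# every attempt fails. Hypothetical hook names; not the Pulsar broker code.

import time

def handle_session_expired(start_new_session, shutdown_broker,
                           max_retries=5, backoff_seconds=1.0):
    for attempt in range(1, max_retries + 1):
        try:
            return start_new_session()   # success: caller can re-own bundles
        except ConnectionError:
            time.sleep(backoff_seconds * attempt)   # simple linear backoff
    # All retries failed: likely a ZooKeeper-cluster problem, so fail fast
    # and let the operator restart the broker once ZooKeeper recovers.
    shutdown_broker()
    return None
```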
> > > > If a new session starts successfully, the situation is slightly more
> > > > complicated, so I will introduce each scenario separately.
> > > > Topic ownership data handling
> > > > The topic ownership data records all namespace bundles owned by the
> > > > broker. In Zookeeper, an EPHEMERAL znode is created for each namespace
> > > > bundle. When a session timeout happens on the local Zookeeper, all of
> > > > the EPHEMERAL znodes maintained by this broker are deleted
> > > > automatically. We need some mechanism to avoid unnecessary ownership
> > > > transfer of the bundles. Since the broker caches the owned bundles in
> > > > memory, it can use the cache to re-own the bundles.
> > > > Firstly, when the broker re-owns a bundle, if the znode of the bundle
> > > > exists at Zookeeper and the owner is this broker, it may be that
> > > > Zookeeper has not yet deleted the znode. The broker should check
> > > > whether the ephemeral owner is the current session ID. If not, the
> > > > broker should wait for the znode deletion.
> > > > Then the broker tries to own the bundle. If the broker owns the
> > > > bundle successfully, the bundle is not owned by any other broker, and
> > > > the broker should check whether to preload the topics under the
> > > > bundle. If the broker fails to own the bundle, the bundle is owned by
> > > > another broker, and the broker should unload it.
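The two paragraphs above can be sketched end to end. The `ToyZk` class is an in-memory stand-in, not the real ZooKeeper client, and all names are hypothetical; it only shows the ephemeral-owner check followed by the acquire/unload outcomes:

```python
# Toy re-own flow for one bundle: detect a stale znode left by the expired
# session (ephemeral owner != current session ID), then try to acquire.
# In the real broker the stale znode is waited on until ZooKeeper deletes
# it; this toy deletes it directly. Illustrative only.

from types import SimpleNamespace

class ToyZk:
    def __init__(self):
        self.znodes = {}   # path -> (ephemeral_owner_session, owner_url)

    def get_stat(self, path):
        if path not in self.znodes:
            return None
        owner, url = self.znodes[path]
        return SimpleNamespace(ephemeral_owner=owner, owner_url=url)

    def delete(self, path):
        self.znodes.pop(path, None)

    def try_acquire(self, path, session_id, url):
        # CAS-like create: succeeds only if the znode does not exist.
        if path in self.znodes:
            return False
        self.znodes[path] = (session_id, url)
        return True

def reown_bundle(zk, path, my_session, my_url):
    stat = zk.get_stat(path)
    if stat and stat.owner_url == my_url and stat.ephemeral_owner != my_session:
        zk.delete(path)            # toy stand-in for "wait for deletion"
    if zk.try_acquire(path, my_session, my_url):
        return "owned"             # then check whether to preload topics
    return "unload"                # another broker owns the bundle

zk = ToyZk()
zk.znodes["/bundles/b1"] = (100, "broker-1")   # stale znode, old session 100
print(reown_bundle(zk, "/bundles/b1", 200, "broker-1"))   # owned

zk.znodes["/bundles/b2"] = (300, "broker-2")   # already taken over
print(reown_bundle(zk, "/bundles/b2", 200, "broker-1"))   # unload
```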
> > > > Theoretically, this mechanism guarantees that the ownership of most
> > > > bundles will not change during the session timeout.
> > > > Broker load data handling
> > > > The load data is used for namespace bundle load balancing, so the
> > > > handling does not need to be overly complicated. The only effect is
> > > > that stale data may interfere with the choice of broker when finding a
> > > > candidate broker for a namespace bundle. Even without selecting the
> > > > optimal broker, the system will continue to relocate the namespace
> > > > bundles.
> > > > So for broker load data handling, we only need to guarantee that the
> > > > broker's load data can be reported successfully.
> > > > Other scenario handling
> > > > There are also some other usage scenarios for the local Zookeeper: the
> > > > BookKeeper client, managed ledger metadata, bookie rack information,
> > > > and schema metadata. None of these scenarios creates any EPHEMERAL
> > > > znodes on the Zookeeper. Pulsar introduces the Zookeeper cache for the
> > > > local Zookeeper; the cache is invalidated when a session timeout
> > > > occurs.
> > > > Configurations
> > > > A new configuration parameter, zookeeperSessionExpiredPolicy, is added
> > > > to broker.conf to control the zookeeper session-expired policy. There
> > > > are two options: shutdown and reconnect.
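Concretely, the new setting might look like this in broker.conf. The parameter name and the two values come from this PIP; the default shown here is only an example, not a decided default:

```properties
# Policy applied when the local ZooKeeper session expires:
#   shutdown  - current behavior: the broker shuts itself down
#   reconnect - proposed: start a new session and try to re-own bundles
zookeeperSessionExpiredPolicy=reconnect
```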
> > > >
> > > >
> > > > Thanks,
> > > > Penghui
> > >
> >
>
