My concerns are listed in the PR comments.

A broker is allowed to operate on a (resource) bundle under a lock. When a
broker loses its session, the lock ownership COULD be lost. The right thing
to do at this point is to give up the resource and re-acquire it. (In fact,
shutdown is just a shortcut to doing exactly this.) A broker that continues
to operate, ASSUMING that it owns the bundle, violates the axiom that the
resource is protected by the lock. It breaks fundamental distributed
system principles for two nodes to own an exclusive resource concurrently.

It does not matter even if no other broker grabbed the resource in the
meantime and the original broker successfully re-acquires the lock after
session loss. There is no way for the original broker to ascertain this
a priori, so it cannot justify operating on the resource AS IF it never
lost the lock.
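
To make the point concrete, here is a minimal Python sketch of a broker that
treats session loss as loss of ownership. It is not Pulsar code: LockService
and Broker are hypothetical in-memory stand-ins, and nothing here is the
ZooKeeper API. The only assumption is that a lock is an ephemeral entry tied
to a session and disappears when that session expires.

```python
class LockService:
    """Hypothetical in-memory stand-in for ephemeral lock znodes."""

    def __init__(self):
        self._owners = {}        # bundle -> session id holding the lock
        self._next_session = 0

    def new_session(self):
        self._next_session += 1
        return self._next_session

    def expire(self, session):
        # Expiry removes every ephemeral lock created by that session.
        self._owners = {b: s for b, s in self._owners.items() if s != session}

    def try_acquire(self, bundle, session):
        # Create-if-absent: the first session to ask wins the lock.
        return self._owners.setdefault(bundle, session) == session


class Broker:
    def __init__(self, name, locks):
        self.name = name
        self.locks = locks
        self.session = locks.new_session()
        self.owned = set()       # bundles this broker is allowed to serve

    def acquire(self, bundle):
        if self.locks.try_acquire(bundle, self.session):
            self.owned.add(bundle)
            return True
        return False

    def on_session_lost(self):
        # Ownership can no longer be proven, so stop serving everything first.
        previously_owned = sorted(self.owned)
        self.owned.clear()
        # Then open a new session and go through the normal acquire path.
        self.session = self.locks.new_session()
        for bundle in previously_owned:
            self.acquire(bundle)


locks = LockService()
a, b = Broker("a", locks), Broker("b", locks)
a.acquire("ns/bundle-1")
a.acquire("ns/bundle-2")

locks.expire(a.session)      # broker a's session times out
b.acquire("ns/bundle-1")     # another broker takes a bundle in the window
a.on_session_lost()          # a gives everything up, then re-acquires
```

In this run broker a ends up with only ns/bundle-2 and broker b holds
ns/bundle-1. Had a kept serving from its in-memory set, both brokers would
have been operating on ns/bundle-1 at the same time.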

It may be possible that underlying lower-level locks prevent
catastrophe, but that does not validate this violation of basic
principles. Not only will it make it incredibly difficult to assert the
correctness of the system, it will also make the system more complex and
difficult to maintain going forward.

The Global ZK and BK uses of ZK are not comparable to this situation. Doing
something like this would be incorrect in any distributed system. The only
way something like this could even be attempted is if the broker could
freeze for the window of time between losing the session and reacquiring
it.

Joe



On Fri, Feb 21, 2020 at 8:27 PM PengHui Li <codelipeng...@gmail.com> wrote:

> Hi all,
>
> I have drafted a proposal for improving the broker's Zookeeper session
> timeout handling. You can find it at
> https://github.com/apache/pulsar/wiki/PIP-57%3A-Improve-Broker%27s-Zookeeper-Session-Timeout-Handling
>
> I have also copied it into this email thread for easier viewing. Any
> suggestions or ideas are welcome; please join the discussion.
>
>
> PIP 57: Improve Broker's Zookeeper Session Timeout Handling
> Motivation
> In Pulsar, brokers use Zookeeper as the configuration store and for
> maintaining broker metadata. We call these the Global Zookeeper and the
> Local Zookeeper, respectively.
> The Global Zookeeper maintains the namespace policies, cluster metadata,
> and partitioned topic metadata. To reduce read operations on Zookeeper,
> each broker has a cache for the Global Zookeeper. The Global Zookeeper
> cache is updated when a znode changes. Currently, when a session timeout
> happens on the Global Zookeeper, a new session starts. The broker does not
> create any EPHEMERAL znodes on the Global Zookeeper.
> The Local Zookeeper maintains the local cluster metadata, such as broker
> load data, topic ownership data, managed ledger metadata, and Bookie rack
> information. Both the broker load data and the topic ownership data are
> stored as EPHEMERAL znodes on the Local Zookeeper. Currently, when a
> session timeout happens on the Local Zookeeper, the broker shuts itself
> down.
> Shutting down the broker results in an ownership change for the topics
> that the broker owned. However, we have encountered many problems with the
> current session timeout handling, such as a broker with a long JVM GC
> pause or a Local Zookeeper under high workload. The latter in particular
> may cause all brokers to shut down.
> So, the purpose of this proposal is to improve session timeout handling on
> the Local Zookeeper to avoid unnecessary broker shutdowns.
> Approach
> As with the Global Zookeeper session timeout handling and the Zookeeper
> session timeout handling in BookKeeper, a new session should start when
> the present session times out.
> If a new session fails to start, the broker retries several times. The
> number of retries depends on the broker configuration. If the session
> still cannot be started after that many retries, the broker still needs to
> be shut down, since this may indicate a problem with the Zookeeper
> cluster. The user needs to restart the broker after the Zookeeper cluster
> returns to normal.
> If a new session starts successfully, the issue is slightly more
> complicated, so I will describe each scenario separately.
> Topic ownership data handling
> The topic ownership data records all namespace bundles owned by the
> broker. In Zookeeper, an EPHEMERAL znode is created for each namespace
> bundle. When a session timeout happens on the Local Zookeeper, all of the
> EPHEMERAL znodes maintained by this broker are deleted automatically. We
> need some mechanism to avoid unnecessary ownership transfer of the
> bundles. Since the broker caches the owned bundles in memory, it can use
> the cache to re-own the bundles.
> First, when the broker tries to re-own a bundle, if the znode of the
> bundle exists in Zookeeper and the owner is this broker, it may be that
> Zookeeper has not yet deleted the znode. The broker should check whether
> the ephemeral owner is the current session ID. If not, the broker should
> wait for the znode to be deleted.
> Then the broker tries to own the bundle. If it succeeds, the bundle is not
> owned by another broker, and the broker should check whether to preload
> the topics under the bundle. If it fails, the bundle is owned by another
> broker, and the broker should unload the bundle.
> Theoretically, the mechanism can guarantee that the ownership of most
> bundles will not change during the session timeout.
> Broker load data handling
> The load data is used for namespace bundle load balancing, so its handling
> does not need to be overly complicated. The only effect is that it
> interferes with the choice of broker when finding a candidate broker for
> a namespace bundle. Even if the optimal broker is not selected, namespace
> bundles will continue to be relocated.
> So for broker load data handling, we need to guarantee that the broker's
> load data can be reported successfully.
> Other scenario handling
> There are also other uses of the Local Zookeeper: the BookKeeper client,
> managed ledger metadata, bookie rack information, and schema metadata.
> None of these scenarios create any EPHEMERAL znodes on Zookeeper. Pulsar
> introduces a Zookeeper cache for the Local Zookeeper. The cache is
> invalidated when a session timeout occurs.
> Configurations
> A new configuration parameter, zookeeperSessionExpiredPolicy, is added to
> broker.conf to control the Zookeeper session expired policy. There are two
> options: shutdown and reconnect.
>
>
> Thanks,
> Penghui
>
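
For reference, the re-own steps described under "Topic ownership data
handling" in the quoted proposal can be sketched roughly as follows. This is
only an illustration under stated assumptions: FakeStore, reown_bundles and
the wait_for_deletion callback are hypothetical names, the store is an
in-memory dictionary rather than ZooKeeper, and preloading topics is left
out.

```python
class FakeStore:
    """Hypothetical in-memory stand-in for the ownership znodes."""

    def __init__(self):
        self.nodes = {}  # bundle -> (broker name, ephemeral owner session id)

    def get(self, bundle):
        return self.nodes.get(bundle)

    def create_ephemeral(self, bundle, broker, session):
        # Fails if any node already exists for the bundle.
        if bundle in self.nodes:
            return False
        self.nodes[bundle] = (broker, session)
        return True

    def expire_session(self, session):
        # Drop every ephemeral node that belongs to the given session.
        self.nodes = {b: n for b, n in self.nodes.items() if n[1] != session}


def reown_bundles(store, broker, session, cached_bundles, wait_for_deletion):
    """Return (kept, unloaded) bundle lists after a new session has started."""
    kept, unloaded = [], []
    for bundle in cached_bundles:
        node = store.get(bundle)
        if node == (broker, session):
            kept.append(bundle)          # already owned by the current session
            continue
        if node is not None and node[0] == broker:
            # Leftover znode from the expired session: wait until it is gone.
            wait_for_deletion(bundle)
        if store.create_ephemeral(bundle, broker, session):
            kept.append(bundle)          # no other broker took the bundle
        else:
            unloaded.append(bundle)      # another broker owns it now
    return kept, unloaded


store = FakeStore()
store.nodes = {"b1": ("broker-a", 1), "b2": ("broker-b", 7)}

# Simulate the wait by expiring the old session (id 1) in the fake store.
kept, unloaded = reown_bundles(
    store, "broker-a", 2, ["b1", "b2", "b3"],
    wait_for_deletion=lambda bundle: store.expire_session(1),
)
```

With this input, kept is ["b1", "b3"] and unloaded is ["b2"]. The sketch
covers only the reacquisition step; it says nothing about what the broker
does with its cached bundles between losing the session and reacquiring
them.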
