Hi, Lari
Thanks for starting this discussion.

> 1) Metadata consistency from user's point of view
>   - Summarized well in this great analysis and comment [1] by Zac Bentley
>    "Ideally, the resolution of all of these issues would be the same: a
> management API operation--any operation--should not return successfully
> until all observable side effects of that operation across a Pulsar cluster
> (including brokers, proxies, bookies, and ZK) were completed." (see [1] for
> the full analysis and comment)


I think the key problem is not the async operations themselves; the
important thing is that we start an async operation but do not wait for it
to finish before doing the follow-up work. Some contributors probably cause
this by not knowing how to use `CompletableFuture` correctly, or by using
the wrong API, e.g., using `thenAccept` to handle the async operation and
discarding the returned future.
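
For example, a minimal sketch of the pattern I mean (hypothetical method
names, not actual Pulsar code):

    import java.util.concurrent.CompletableFuture;

    class AsyncChainingExample {
        // Anti-pattern: the chained stage is dropped, so this method's
        // future completes before the follow-up work has actually run.
        static CompletableFuture<Void> deleteBroken(
                CompletableFuture<Void> storeDelete, Runnable invalidateCache) {
            storeDelete.thenAccept(ignored -> invalidateCache.run()); // discarded!
            return CompletableFuture.completedFuture(null); // "done" too early
        }

        // Fix: return the composed chain, so that completion of the
        // returned future implies all side effects have been applied.
        static CompletableFuture<Void> deleteFixed(
                CompletableFuture<Void> storeDelete, Runnable invalidateCache) {
            return storeDelete.thenRun(invalidateCache);
        }
    }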

From this point, I want to share another concern about Pulsar's async
operation usage: because we chose `CompletableFuture` and do not manage all
threads explicitly, we don't know which thread the next task will run on. I
think it is not good to have so many unexpected behaviors in software.
e.g., an IO thread calls a metadata operation, and the metadata store uses
a metadata thread for the follow-up work. After this work is done, a new
`CompletionStage` is returned, and the subsequent stages keep running on
the metadata thread.
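
To illustrate (a runnable sketch, not actual Pulsar code):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class ThreadHoppingExample {
        public static void main(String[] args) {
            ExecutorService metadataThread = Executors.newSingleThreadExecutor(
                    r -> new Thread(r, "metadata-thread"));
            ExecutorService ioThread = Executors.newSingleThreadExecutor(
                    r -> new Thread(r, "io-thread"));

            CompletableFuture<String> metadataRead = new CompletableFuture<>();

            // Non-Async stage: runs on whichever thread completes the
            // future (here the metadata thread); the caller has no control.
            CompletableFuture<Void> hopping = metadataRead.thenAccept(v ->
                    System.out.println("thenAccept on: "
                            + Thread.currentThread().getName()));

            // Async stage with an explicit executor: the follow-up work is
            // pinned to the intended (IO) thread pool.
            CompletableFuture<Void> pinned = metadataRead.thenAcceptAsync(v ->
                    System.out.println("thenAcceptAsync on: "
                            + Thread.currentThread().getName()), ioThread);

            metadataThread.execute(() -> metadataRead.complete("value"));
            CompletableFuture.allOf(hopping, pinned).join();
            metadataThread.shutdown();
            ioThread.shutdown();
        }
    }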

After reading this email, I need some time to think it through.
I will keep following this thread and will leave some questions in the
future.

Best,
Mattison


On Tue, 16 Aug 2022 at 19:16, Qiang Huang <qiang.huang1...@gmail.com> wrote:

> It is a huge milestone, but a challenge for implementing pluggable metadata
> storage. Will the plan eventually go from providing pluggable metadata
> storage to internalizing the distributed coordination functionality within
> Pulsar itself?
>
> On Tue, 16 Aug 2022 at 11:17, Lari Hotari <lhot...@apache.org> wrote:
>
> > Bumping up this thread.
> >
> > -Lari
> >
> > On Fri, 20 May 2022 at 1.57, Lari Hotari <lhot...@apache.org> wrote:
> >
> > > Hi all,
> > >
> > > I started writing this email as feedback to "PIP-157: Bucketing topic
> > > metadata to allow more topics per namespace" [3].
> > > This email expanded to cover some analysis of "PIP-45: Pluggable
> > > metadata interface" [4] design. (A good introduction to PIP-45 is the
> > > StreamNative blog post "Moving Toward a ZooKeeper-Less Apache Pulsar"
> > > [5]).
> > >
> > > The intention is to start discussions for Pulsar 3.0 and beyond,
> > > bouncing ideas and challenging the existing design with good
> > > intentions and for the benefit of all.
> > >
> > > I'll share some thoughts that have come up in discussions together
> > > with my colleague Michael Marshall. We have been bouncing some ideas
> > > together, and that has been very helpful in starting to build an
> > > understanding of the existing challenges and a possible direction for
> > > solving them. I hope that we can have broader conversations in the
> > > Pulsar community for improving Pulsar's metadata management and load
> > > balancing designs in the long term.
> > >
> > > There are a few areas where there are challenges with the current
> > > Metadata Store / PIP-45 solution:
> > >
> > > 1) Metadata consistency from user's point of view
> > >   - Summarized well in this great analysis and comment [1] by Zac
> > > Bentley:
> > >    "Ideally, the resolution of all of these issues would be the same: a
> > > management API operation--any operation--should not return successfully
> > > until all observable side effects of that operation across a Pulsar
> > > cluster (including brokers, proxies, bookies, and ZK) were completed."
> > > (see [1] for the full analysis and comment)
> > >
> > > 2) Metadata consistency issues within Pulsar
> > >   - There are issues where a single broker gets left in a bad state
> > > as a result of consistency and concurrency issues with metadata
> > > handling and caching.
> > >     Possible example: https://github.com/apache/pulsar/issues/13946
> > >
> > > 3) Scalability issue: all metadata changes are broadcast to all
> > > brokers - the model doesn't scale out
> > >    - This is due to the change made in
> > > https://github.com/apache/pulsar/pull/11198 , "Use ZK persistent
> > > watches".
> > >    - The global broadcasting design of metadata changes doesn't follow
> > > typical scalable design principles such as the "Scale Cube". This will
> > > pose limits on Pulsar clusters with a large number of brokers. The
> > > current metadata change notification solution doesn't support scaling
> > > out when it's based on a design that broadcasts all notifications to
> > > every participant.
> > >
> > > When doing some initial analysis and brainstorming on the above areas,
> > > there have been thoughts that the PIP-45 Metadata Store API [2]
> > > abstractions are somewhat suboptimal.
> > >
> > > A lot of the functionality that is provided in the PIP-45 Metadata
> > > Store API interface [2] could be solved more efficiently in a way
> > > where Pulsar itself would be a key part of the metadata storage
> > > solution.
> > >
> > > For example, listing topics in a namespace could be a "scatter-gather"
> > > query to all "metadata shards" that hold the namespace's topics.
> > > There's not necessarily a need to have a centralized external Metadata
> > > Store API interface [2] that replies to all queries. Pulsar metadata
> > > handling could move towards a distributed database type of design
> > > where consistent hashing plays a key role. Since metadata handling is
> > > an internal concern, the interface doesn't need to provide services
> > > directly to external users of Pulsar. The Pulsar Admin API should also
> > > be improved to scale for queries and listing of namespaces with
> > > millions of topics, and should have pagination to limit results. This
> > > implementation can internally handle possible "scatter-gather" queries
> > > when the metadata handling backend is not centralized. The point is
> > > that the Metadata Store API [2] abstraction doesn't necessarily need
> > > to provide a service for this, since it could be a different concern.
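> > >
> > > To make the "scatter-gather" idea concrete, here is a minimal sketch
> > > (hypothetical interfaces, not the actual Pulsar API), assuming each
> > > shard can list its own topics:
> > >
> > >     import java.util.List;
> > >     import java.util.concurrent.CompletableFuture;
> > >     import java.util.stream.Collectors;
> > >
> > >     interface MetadataShard {
> > >         CompletableFuture<List<String>> listTopics(String namespace);
> > >     }
> > >
> > >     class ScatterGatherTopicLister {
> > >         private final List<MetadataShard> shards;
> > >
> > >         ScatterGatherTopicLister(List<MetadataShard> shards) {
> > >             this.shards = shards;
> > >         }
> > >
> > >         // Scatter the query to every shard, then gather and merge the
> > >         // partial results once all shards have answered.
> > >         CompletableFuture<List<String>> listTopics(String namespace) {
> > >             List<CompletableFuture<List<String>>> partials =
> > >                     shards.stream()
> > >                             .map(s -> s.listTopics(namespace))
> > >                             .collect(Collectors.toList());
> > >             return CompletableFuture
> > >                     .allOf(partials.toArray(new CompletableFuture[0]))
> > >                     .thenApply(ignored -> partials.stream()
> > >                             .flatMap(f -> f.join().stream())
> > >                             .sorted()
> > >                             .collect(Collectors.toList()));
> > >         }
> > >     }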
> > >
> > > Most of the complexity in the current PIP-45 Metadata Store comes from
> > > data consistency challenges. The solution is heavily based on caches
> > > and on having ways to handle cache expiration and keep data
> > > consistent. There are gaps in the caching solution since there are
> > > metadata consistency problems, as described in 1) and 2) above. A lot
> > > of the problems go away in a model where most processing and data
> > > access is local, similar to how the broker handles topics: a topic is
> > > owned by a single broker at a time. The approach could be extended to
> > > cover metadata changes and queries.
> > >
> > > What is interesting here regarding PIP-157 is that brainstorming led
> > > to a sharding (aka "bucketing") solution, where there are metadata
> > > shards in the system:
> > >
> > > metadata shard
> > >            |
> > > namespace bundle  (existing)
> > >            |
> > > namespace  (existing)
> > >
> > > Instead of having a specific solution in mind for dealing with the
> > > storage of the metadata, the main idea is that each metadata shard is
> > > independent and would be able to perform operations without
> > > coordination with other metadata shards. This does impact the storage
> > > of metadata so that operations to the storage system can be isolated
> > > (for example, it is necessary to be able to list the topics for a
> > > bundle without listing everything; PIP-157 provides one type of
> > > solution for this). We didn't let the existing solution limit our
> > > brainstorming.
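> > >
> > > As a toy illustration of the bucketing idea (made-up constants and
> > > path layout, not the PIP-157 design itself): a topic's key is hashed
> > > into one of N buckets, so listing a bucket only touches that bucket's
> > > children instead of the whole namespace:
> > >
> > >     import java.nio.charset.StandardCharsets;
> > >     import java.util.zip.CRC32;
> > >
> > >     class MetadataBuckets {
> > >         static final int NUM_BUCKETS = 128; // made-up shard count
> > >
> > >         // Stable hash of the topic key into a bucket id
> > >         static int bucketOf(String topicKey) {
> > >             CRC32 crc = new CRC32();
> > >             crc.update(topicKey.getBytes(StandardCharsets.UTF_8));
> > >             return (int) (crc.getValue() % NUM_BUCKETS);
> > >         }
> > >
> > >         // e.g. /metadata/<tenant>/<ns>/<bucket>/<topic> instead of one
> > >         // flat children list under the namespace node
> > >         static String pathOf(String tenant, String ns, String topic) {
> > >             String key = tenant + "/" + ns + "/" + topic;
> > >             return String.format("/metadata/%s/%s/%03d/%s",
> > >                     tenant, ns, bucketOf(key), topic);
> > >         }
> > >     }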
> > >
> > > Since there is metadata that needs to be available in multiple
> > > locations in the system, such as tenant / namespace level policies, it
> > > would be easier to handle the consistency aspects with a model that is
> > > not based on CRUD-type operations, but is instead event sourced, where
> > > the state can be rebuilt from events (with the possibility to have
> > > state snapshots). There could be an internal metadata replication
> > > protocol which ensures consistency (some type of acknowledgement when
> > > followers have caught up with changes from the leader) when that is
> > > needed:
> > >
> > > metadata shard leader
> > >               |
> > > metadata shard follower  (namespace bundle, for example)
> > >
> > > The core principle is that all write operations will always be
> > > redirected to the leader, which is a single writer for a shard. The
> > > followers would get events for changes, and the followers could also
> > > notify the leader each time they have caught up with changes. This
> > > would be one way to solve "1) Metadata consistency from user's point
> > > of view" without having a complex metadata cache invalidation
> > > solution. It would also solve "2) Metadata consistency issues within
> > > Pulsar". In event sourcing, events are the truth, and there are better
> > > ways to ensure "cache consistency" in a leader-follower model based on
> > > event sourcing.
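> > >
> > > A rough sketch of that principle (hypothetical types, not a real
> > > protocol): the leader appends an event to its log and acknowledges the
> > > write only after every follower reports it has applied that offset,
> > > which is what makes the change observable on return:
> > >
> > >     import java.util.List;
> > >     import java.util.concurrent.CompletableFuture;
> > >
> > >     interface MetadataEventLog {
> > >         CompletableFuture<Long> append(byte[] event); // returns offset
> > >     }
> > >
> > >     interface ShardFollower {
> > >         // Completes once the follower has applied events up to offset
> > >         CompletableFuture<Void> awaitAppliedUpTo(long offset);
> > >     }
> > >
> > >     class ShardLeader {
> > >         private final MetadataEventLog log;
> > >         private final List<ShardFollower> followers;
> > >
> > >         ShardLeader(MetadataEventLog log, List<ShardFollower> followers) {
> > >             this.log = log;
> > >             this.followers = followers;
> > >         }
> > >
> > >         // Single writer: all writes go through the leader, and the
> > >         // returned future completes only when all followers caught up.
> > >         CompletableFuture<Void> write(byte[] event) {
> > >             return log.append(event).thenCompose(offset ->
> > >                     CompletableFuture.allOf(followers.stream()
> > >                             .map(f -> f.awaitAppliedUpTo(offset))
> > >                             .toArray(CompletableFuture[]::new)));
> > >         }
> > >     }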
> > >
> > > Everything above is just initial brainstorming, but it seems to be
> > > going in a different direction than where PIP-45 currently is.
> > > Abstractions for coordination such as leader election and distributed
> > > locks will be necessary, and some external metadata would have to be
> > > managed in a centralized fashion. In general, the model would be
> > > somewhat different compared to what PIP-45 has. Since the core idea
> > > would be to use an event sourced model, it would be optimal to use
> > > BookKeeper ledgers (Pulsar managed ledgers) for storing the events.
> > > With the nature of event sourcing, it would be possible to create
> > > point-in-time backup and restore solutions for Pulsar metadata. Even
> > > today, it is very rare that Pulsar users would go directly to
> > > ZooKeeper for observing the state of the metadata. In an event sourced
> > > system, this state could be stored to flat files on disk if that is
> > > needed for debugging and observability purposes besides backup and
> > > restore. Metadata events could possibly also be exposed externally for
> > > building efficient management tooling for Pulsar.
> > >
> > > The metadata handling also expands to Pulsar load balancing, and that
> > > should also be considered when revisiting the design of PIP-45 to
> > > address the current challenges. There are also aspects of metadata
> > > where changes aren't immediate. For example, deleting a topic requires
> > > deleting the underlying data stored in BookKeeper. If the operation
> > > fails, there should be ways to keep on retrying. A similar approach
> > > applies to creation. Some operations might be asynchronous, and having
> > > support for a state machine for creation and deletion could be
> > > helpful. This is to bring up the point that it's not optimal to model
> > > a topic deletion as an atomic operation. The state change should be
> > > atomic, but the deletion from the metadata storage should not happen
> > > until all asynchronous operations have been completed. The metadata
> > > admin interface caller should be able to proceed after the topic is
> > > marked deleted, but the system should keep on managing the deletions
> > > in the background. Similarly, the creation of topics could have more
> > > states to deal with efficient creation of a large number of topics.
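> > >
> > > As a sketch of what such a state machine could look like (made-up
> > > states, not an existing Pulsar enum): the admin call can return once
> > > the topic is durably marked deleted, while a background task retries
> > > the ledger cleanup and only then removes the metadata entry:
> > >
> > >     enum TopicLifecycleState {
> > >         ACTIVE,
> > >         MARKED_DELETED,  // atomic state change; admin call may return
> > >         DELETING_DATA,   // background: delete BookKeeper ledgers,
> > >                          // retried until it succeeds
> > >         DELETED          // metadata entry finally removed
> > >     }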
> > >
> > > This was a long email covering a subject that we haven't dealt with
> > > before in the Pulsar community. Usually, we have discussions about
> > > solutions that are very targeted. It isn't common to transparently
> > > discuss existing design challenges or problems and find ways to solve
> > > them together. Sharing observations about problems would be valuable.
> > > High-level problems don't get reported in the GitHub issue tracker
> > > since they aren't individual bugs. We should find ways to also address
> > > these types of challenges in the community.
> > >
> > > I hope we can change this and also take the opportunity to meet at
> > > Pulsar Community meetings and have more of these in-depth discussions
> > > that will help us improve Pulsar for the benefit of us all in the
> > > Apache Pulsar community.
> > >
> > > Since PIP-157 [3] is proceeding, I see that as an opportunity to start
> > > taking the design of Pulsar metadata handling in a direction where we
> > > could address the challenges that currently exist in Pulsar with
> > > metadata handling and load balancing. We must decide together what
> > > that direction is. I hope this email opens some new aspects to the
> > > basis of these decisions. I'm hoping that you, the reader of this
> > > email, will participate, share your views, and help develop this
> > > direction.
> > >
> > > PIP-157 [3] assumes that "Pulsar is able to manage millions of topics
> > > but the number of topics within a single namespace is limited by
> > > metadata storage." Does this assumption hold?
> > >
> > > For example, "3) Scalability issue: all metadata changes are broadcast
> > > to all brokers" will become a challenge in a large system with a high
> > > number of brokers. Together with the other metadata consistency
> > > challenges (1 and 2 above), I suspect that after PIP-157 is
> > > implemented, the bottlenecks will move to these areas. In that sense,
> > > it might be a band-aid that won't address the root cause of the Pulsar
> > > metadata handling scalability challenges.
> > >
> > > Let's discuss and address the challenges together!
> > >
> > > Regards,
> > >
> > > -Lari
> > >
> > > [1] - analysis about metadata consistency from the user's point of view -
> > > https://github.com/apache/pulsar/issues/12555#issuecomment-955748744
> > > [2] - MetadataStore interface -
> > > https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java
> > > [3] - PIP-157: Bucketing topic metadata to allow more topics per namespace -
> > > https://github.com/apache/pulsar/issues/15254
> > > [4] - PIP-45: Pluggable metadata interface -
> > > https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface
> > > [5] - StreamNative's blog "Moving Toward a ZooKeeper-Less Apache Pulsar" -
> > > https://streamnative.io/blog/release/2022-01-25-moving-toward-a-zookeeperless-apache-pulsar/
> >
>
>
> --
> BR,
> Qiang Huang
>
