Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Rajan Dhabalia Sat, 23 Jan 2021 19:47:21 -0800

Hi,

>> Delayed messages or certain user logic can introduce a lot of
>> message-holes. We have seen this issue in quite a lot of customers'
>> production environment.

I agree that we need a larger buffer to store and recover individually
deleted messages, and the existing 150K limit might not be enough for many
use cases. However, I would also like to highlight issues that happened in
the past due to keeping a large number of deleted messages in the broker's
memory; one of them was high GC pauses. Therefore, we introduced
ConcurrentOpenLongPairRangeSet
<https://github.com/apache/pulsar/pull/3818> to manage deleted messages
without actually storing range objects in memory. OpenRangeSet uses a bitset
to store ranges in memory, and we can also use it to persist them to disk
for recovery. This approach has several advantages: a simple implementation,
a large enough range for recovery, and no intermediate conversion from
unacked messages to the bitset in OpenRangeSet, which saves CPU during
recovery.

I implemented a simple prototype
<https://github.com/rdhabalia/pulsar/commit/1f8e5e745e9f1d1429697b5dee1da70545385653>
to store deleted messages using the bitset in OpenRangeSet, and we can
persist 10M ranges with 5MB data size, which I guess is large enough for any
use case. So, we can use this approach to solve the problem without
introducing unnecessary complexity in managed-cursor.

Thanks,
Rajan
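
To make the size argument concrete, here is a minimal Java sketch that tracks individually acknowledged entries with one `java.util.BitSet` per ledger and compares its persisted size with a list of ranges at 4 longs each (the per-range size quoted later in this thread). It is not the ConcurrentOpenLongPairRangeSet implementation or the linked prototype: the class `AckBitSetSketch` and all of its methods are hypothetical names used only for illustration, and entry ids are assumed to fit in an `int`.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is NOT the ConcurrentOpenLongPairRangeSet API.
final class AckBitSetSketch {
    // ledgerId -> bitset indexed by entryId; a set bit means "individually acknowledged".
    private final Map<Long, BitSet> ackedByLedger = new TreeMap<>();

    // Assumption: entry ids fit in an int, because BitSet is indexed by int.
    void ack(long ledgerId, int entryId) {
        ackedByLedger.computeIfAbsent(ledgerId, id -> new BitSet()).set(entryId);
    }

    boolean isAcked(long ledgerId, int entryId) {
        BitSet bits = ackedByLedger.get(ledgerId);
        return bits != null && bits.get(entryId);
    }

    // Bytes needed to persist the bitsets themselves (ledger ids and framing excluded).
    long serializedBytes() {
        long total = 0;
        for (BitSet bits : ackedByLedger.values()) {
            total += bits.toByteArray().length;
        }
        return total;
    }

    // Bytes for the same information stored as ranges of 4 longs each
    // (lower and upper endpoint, each a ledgerId plus an entryId).
    static long rangeListBytes(long rangeCount) {
        return rangeCount * 4 * Long.BYTES;
    }

    public static void main(String[] args) {
        AckBitSetSketch sketch = new AckBitSetSketch();
        // Scenario from the thread: 300K messages, every other one acked, so 150K ranges.
        for (int entryId = 0; entryId < 300_000; entryId += 2) {
            sketch.ack(1L, entryId);
        }
        System.out.println("bitset bytes:     " + sketch.serializedBytes());
        System.out.println("range-list bytes: " + rangeListBytes(150_000));
    }
}
```

With one bit per entry, the bitset size depends on the span of entry ids rather than on the number of holes, which is why alternating acks (the worst case for a range list) cost no more than contiguous ones in this sketch.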


On Fri, Jan 22, 2021 at 7:52 PM Sijie Guo <[email protected]> wrote:

> Joe - Delayed messages or certain user logic can introduce a lot of message
> holes. We have seen this issue in quite a lot of customers' production
> environments. Hence we need to find a solution to these problems.
> If you are skeptical of an implementation like that, how about making the
> cursor implementation pluggable? We can implement this proposal as one
> plugin, so it will not impact any existing logic while allowing people to
> use a plugin to solve this problem.
>
> Thanks,
> Sijie
>
> On Fri, Jan 22, 2021 at 5:00 PM Joe Francis <[email protected]> wrote:
>
> > Let me take a step back and explain how I am looking at this from a
> > high-level design viewpoint.
> >
> > Bookkeeper (BK) is like an LSM implementation of a KV store. Writes to
> > all keys are appended to a single file; deletes are logical. Compaction
> > reclaims space. An index is used to locate entries, track logical
> > deletes, and reclaim space.
> >
> > The index in BK is another LSM. Again, writes are appended, deletes are
> > logical, and an index is used to locate entries, account for deletes,
> > and drive compaction to reclaim space (the implementation within rocksdb
> > is far more complex, with bloom filters and memtables, but you get the
> > idea). BK just uses a sophisticated index (rocksdb) which is tiny and
> > cacheable, and rocksdb has within it a sophisticated index which is
> > small and cacheable.
> >
> > So when I look at this proposal, what I see is the same: another attempt
> > to build an LSM with a sophisticated index/cache mechanism using log
> > structured storage. So I am quite skeptical that this needs to be solved
> > this way, within Pulsar.
> >
> >
> >
> >
> > Joe
> >
> > On Wed, Jan 20, 2021 at 12:30 AM linlin <[email protected]> wrote:
> >
> > > We can look at ManagedCursorImpl.buildIndividualDeletedMessageRanges
> > >
> > > What is saved in the entry is not a bitSet but individual messageRange
> > > records, each containing information such as ledgerId and entryId. The
> > > BitSet exists only in memory and is used to quickly determine whether
> > > a position already exists. In addition, the position of each ack is
> > > stored in the individualDeletedMessages queue. When persisting to the
> > > entry, the queue is traversed and the position of each ack generates a
> > > messageRange.
> > > A messageRange contains lowerEndpoint (ledgerId+entryId) and
> > > upperEndpoint (ledgerId+entryId): 4 longs, about 256 bits.
> > >
> > > Assume a more extreme scenario: 300K messages where every other
> > > message is unacknowledged, so 150K positions are stored in
> > > individualDeletedMessages. 150K * 256/8/1024/1024 ≈ 4.6MB.
> > > Of course, there are also scenarios where the customer's acks span
> > > several ledgers.
> > >
> > >
> > > On 2021/01/20 00:38:47, Joe F <[email protected]> wrote:
> > > > I have a simpler question. Just storing the message-ids raw will
> > > > fit ~300K entries in one ledger entry. With the bitmap changes, we
> > > > can store a couple of million within one 5MB ledger entry. So can
> > > > you tell us what numbers of unacked messages are creating a problem?
> > > > What exactly are the issues you face, and at what numbers of unacked
> > > > messages/memory use etc?
> > > >
> > > > I have my own concerns about this proposal, but I would like to
> > > > understand the problem first.
> > > >
> > > > Joe
> > > >
> > > > On Sun, Jan 17, 2021 at 10:16 PM Sijie Guo <[email protected]> wrote:
> > > >
> > > > > Hi Lin,
> > > > >
> > > > > Thank you and Penghui for drafting this! We have seen a lot of
> > > > > pain points with `managedLedgerMaxUnackedRangesToPersist` when
> > > > > enabling delayed messages. Glad that you and Penghui are spending
> > > > > time on resolving this!
> > > > >
> > > > > Overall the proposal looks good. But I have a couple of questions
> > > > > about the proposal.
> > > > >
> > > > > 1. What happens if the broker fails to write the entry marker? For
> > > > > example, at t0, the broker flushes dirty pages and successfully
> > > > > writes an entry marker. At t1, the broker tries to flush dirty
> > > > > pages but fails to write the new entry marker. How can you recover
> > > > > the entry marker?
> > > > >
> > > > > 2. When a broker crashes and recovers the managed ledger, the
> > > > > cursor ledger is not writable anymore. Are you going to create a
> > > > > new cursor ledger and copy all the entries from the old cursor
> > > > > ledger to the new one?
> > > > >
> > > > > It would be good if you can clarify these two questions.
> > > > >
> > > > > - Sijie
> > > > >>
> > > > > On Sun, Jan 17, 2021 at 9:48 PM linlin <[email protected]> wrote:
> > > > >
> > > > > > Hi, community:
> > > > > >     Recently we encountered some problems when using individual
> > > > > > acknowledgments, such as: when the number of acknowledgments is
> > > > > > large, entry writing fails; a large amount of cache causes OOM,
> > > > > > etc.
> > > > > > So I drafted a PIP at
> > > > > > https://docs.google.com/document/d/1uQtyb8t6X04v2vrSrdGWLFkuCkBcGYZbqK8XsVJ4qkU/edit?usp=sharing
> > > > > > Any feedback is welcome.
