Hi guys,
    Thanks for Micheal's great job, and he already pushed one PR [1]
to fix this bug on the BookKeeper side. This comment [2] shows how
this bug happens.

I will cut BookKeeper 4.16.1 release soon.

Thanks,
Hang


[1] https://github.com/apache/bookkeeper/pull/3919
[2] https://github.com/apache/bookkeeper/pull/3919#issuecomment-1508220615

Michael Marshall <mmarsh...@apache.org> 于2023年4月14日周五 07:32写道:
>
> I ended up digging a little longer. I have some partial results to share.
>
> It seems to me the problem was likely introduced by this Bookkeeper PR
> https://github.com/apache/bookkeeper/pull/3783.
>
> These are likely the problematic lines:
> https://github.com/apache/bookkeeper/blob/234b817cdb4e054887ffd5e42eaed25dc02daf63/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L161-L167
>
> My current theory is that we're using a CompositeByteBuf when the
> brokerEntryMetadataInterceptors is in use. Because of that, the call
> in OpAddEntry [0] does not produce a deep copy of the readerIndex and
> the writerIndex. When pulsar passes the "duplicateBuffer" to the
> bookkeeper client, the internal buffer's readerIndex and writerIndex
> are both mutable state. Therefore, when we call the `writeBytes`
> method in the DigestManager, the buffer's readerIndex is moved
> forward. Interestingly, when the buffer is big enough, we rely on
> `ByteBufList` to coalesce the buffers and it appears the readerIndex
> is maintained in that case.
>
> That likely explains why we're seeing the readerIndex equal to the
> buffer's capacity.
>
> It'd be good to get a definitive test to confirm this theory. (Sorry,
> I would have tested myself, but I need to sign off now.)
>
> Thanks,
> Michael
>
> [0] 
> https://github.com/apache/pulsar/blob/bafecb2a9c0e73942de6a38df72dd5888d5afd66/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpAddEntry.java#L126
>
> On Thu, Apr 13, 2023 at 5:31 PM Michael Marshall <mmarsh...@apache.org> wrote:
> >
> > By my testing using the above steps to reproduce the issue, I do not
> > see the bug on commit [0] but I do see it on the subsequent commit
> > [1].
> >
> > That indicates [1], which is the bookkeeper client upgrade to 4.16.0
> > commit, introduced the problem. I need to sign off soon. I appreciate
> > any help you can provide in looking into this issue!
> >
> > Thanks,
> > Michael
> >
> > [0] 
> > https://github.com/apache/pulsar/commit/25af808970fc02281a0b9345c5de205654d7b789
> > [1] 
> > https://github.com/apache/pulsar/commit/a7a605f13565edc380b8ae01808832c426f16d1f
> >
> > On Thu, Apr 13, 2023 at 4:51 PM Michael Marshall <mmarsh...@apache.org> 
> > wrote:
> > >
> > > Here's an update on my progress. I reproduced Dave's observations with
> > > the following steps:
> > >
> > > 1. Download and unpack 3.0.0 RC 1.
> > >
> > > 2. Add these two settings to the conf/standalone.conf:
> > > exposingBrokerEntryMetadataToClientEnabled=true
> > > brokerEntryMetadataInterceptors=org.apache.pulsar.common.intercept.AppendIndexMetadataInterceptor,org.apache.pulsar.common.intercept.AppendBrokerTimestampMetadataInterceptor
> > >
> > > 3. Run bin/pulsar standalone
> > >
> > > 4. Run curl http://localhost:8080/admin/v2/brokers/health
> > >
> > > 5. View the following log:
> > >
> > > 2023-04-13T16:48:39,256-0500
> > > [broker-topic-workers-OrderedExecutor-7-0] ERROR
> > > org.apache.pulsar.common.protocol.Commands -
> > > [PersistentSubscription{topic=persistent://pulsar/standalone/localhost:8080/healthcheck,
> > > name=healthCheck-012d2e05-371f-4ddc-a666-daf9fa243d87}] [-1] Failed to
> > > parse message metadata
> > > java.lang.IndexOutOfBoundsException: readerIndex(100) + length(2)
> > > exceeds writerIndex(100): UnpooledDuplicatedByteBuf(ridx: 100, widx:
> > > 100, cap: 100, unwrapped: CompositeByteBuf(ridx: 100, widx: 100, cap:
> > > 100, components=2))
> > >     at 
> > > io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1442)
> > > ~[io.netty-netty-buffer-4.1.89.Final.jar:4.1.89.Final]
> > >     at io.netty.buffer.AbstractByteBuf.readShort(AbstractByteBuf.java:749)
> > > ~[io.netty-netty-buffer-4.1.89.Final.jar:4.1.89.Final]
> > >     at 
> > > org.apache.pulsar.common.protocol.Commands.skipBrokerEntryMetadataIfExist(Commands.java:1692)
> > > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> > >     at 
> > > org.apache.pulsar.common.protocol.Commands.parseMessageMetadata(Commands.java:452)
> > > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> > >     at 
> > > org.apache.pulsar.common.protocol.Commands.parseMessageMetadata(Commands.java:445)
> > > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> > >     at 
> > > org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1899)
> > > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> > >     at 
> > > org.apache.pulsar.common.protocol.Commands.peekAndCopyMessageMetadata(Commands.java:1918)
> > > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> > >     at 
> > > org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:142)
> > > ~[org.apache.pulsar-pulsar-broker-3.0.0.jar:3.0.0]
> > >     at 
> > > org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:100)
> > > ~[org.apache.pulsar-pulsar-broker-3.0.0.jar:3.0.0]
> > >     at 
> > > org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.internalReadEntriesComplete(PersistentDispatcherSingleActiveConsumer.java:210)
> > > ~[org.apache.pulsar-pulsar-broker-3.0.0.jar:3.0.0]
> > >     at 
> > > org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.lambda$readEntriesComplete$1(PersistentDispatcherSingleActiveConsumer.java:151)
> > > ~[org.apache.pulsar-pulsar-broker-3.0.0.jar:3.0.0]
> > >     at 
> > > org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
> > > ~[org.apache.bookkeeper-bookkeeper-common-4.16.0.jar:4.16.0]
> > >     at 
> > > org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:113)
> > > ~[org.apache.bookkeeper-bookkeeper-common-4.16.0.jar:4.16.0]
> > >     at 
> > > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> > > ~[io.netty-netty-common-4.1.89.Final.jar:4.1.89.Final]
> > >     at java.lang.Thread.run(Thread.java:833) ~[?:?]
> > >
> > > Now that we have a consistent way to reproduce it, I'll work on
> > > understanding the root cause. I believe we run with these two settings
> > > with other versions, so this feels like a regression. I cannot say for
> > > sure yet.
> > >
> > > Thanks,
> > > Michael
> > >
> > > On Thu, Apr 13, 2023 at 1:47 AM Enrico Olivelli <eolive...@gmail.com> 
> > > wrote:
> > > >
> > > > Dave,
> > > > IIUC (from offline discussion) those tests started to fail only 
> > > > recently,
> > > > as we run those tests against the master branch of apache/pulsar repo.
> > > >
> > > > Does the beginning of failures match the upgrade to BK 4.16 ?
> > > > The problem may be related to the handling of ByteBufs, IIRC there are a
> > > > few changes in the latest
> > > > BK release.
> > > >
> > > > Enrico
> > > >
> > > > Il giorno gio 13 apr 2023 alle ore 05:02 Hang Chen 
> > > > <chenh...@apache.org> ha
> > > > scritto:
> > > >
> > > > > Hi Dave,
> > > > >     Thanks for verifying this release candidate. I have setup the 
> > > > > Pulsar
> > > > > cluster with 3.0 on my laptop and test the produce and consume, it 
> > > > > works
> > > > > fine. We also deploy the it on k8s, the produce, consume and health 
> > > > > check
> > > > > also works fine. Would you please share your broker configurations?
> > > > >
> > > > > Thanks,
> > > > > Hang
> > > > >
> > > > > Dave Fisher <wave4d...@comcast.net> 于2023年4月13日周四 06:49写道:
> > > > >
> > > > >> Hi -
> > > > >>
> > > > >> We do End to end testing of Pulsar versions on GCP and with Pulsar 
> > > > >> 3.0
> > > > >> today we have the following problem.
> > > > >>
> > > > >> This image shows the broker deployments not coming up. This is a 
> > > > >> non-TLS
> > > > >> setup.
> > > > >>
> > > > >> [image: Screen Shot 2023-04-12 at 1.09.59 PM.png]
> > > > >>
> > > > >> The Broker Pods have errors like this one:
> > > > >>
> > > > >> 2023-04-12T20:10:35,995+0000 
> > > > >> [broker-topic-workers-OrderedExecutor-3-0] ERROR 
> > > > >> org.apache.pulsar.common.protocol.Commands - 
> > > > >> [PersistentSubscription{topic=persistent://pulsar/pulsar/10.236.0.4:8080/healthcheck,
> > > > >>  name=healthCheck-04a8e22e-3e0b-4aa3-9141-d8970c1b0712}] [-1] Failed 
> > > > >> to parse message metadata
> > > > >> java.lang.IndexOutOfBoundsException: readerIndex(96) + length(2) 
> > > > >> exceeds writerIndex(96): UnpooledDuplicatedByteBuf(ridx: 96, widx: 
> > > > >> 96, cap: 96, unwrapped: CompositeByteBuf(ridx: 96, widx: 96, cap: 
> > > > >> 96, components=2))
> > > > >>      at 
> > > > >> io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1442)
> > > > >>  ~[io.netty-netty-buffer-4.1.89.Final.jar:4.1.89.Final]
> > > > >>      at 
> > > > >> io.netty.buffer.AbstractByteBuf.readShort(AbstractByteBuf.java:749) 
> > > > >> ~[io.netty-netty-buffer-4.1.89.Final.jar:4.1.89.Final]
> > > > >>      at 
> > > > >> org.apache.pulsar.common.protocol.Commands.skipBrokerEntryMetadataIfExist(Commands.java:1692)
> > > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > > >>      at 
> > > > >> org.apache.pulsar.common.protocol.Commands.parseMessageMetadata(Commands.java:452)
> > > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > > >>      at 
> > > > >> org.apache.pulsar.common.protocol.Commands.parseMessageMetadata(Commands.java:445)
> > > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > > >>      at 
> > > > >> org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1899)
> > > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > > >>      at 
> > > > >> org.apache.pulsar.common.protocol.Commands.peekAndCopyMessageMetadata(Commands.java:1918)
> > > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > > >>      at 
> > > > >> org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:142)
> > > > >>  ~[org.apache.pulsar-pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > > >>      at 
> > > > >> org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:100)
> > > > >>  ~[org.apache.pulsar-pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > > >>      at 
> > > > >> org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.internalReadEntriesComplete(PersistentDispatcherSingleActiveConsumer.java:210)
> > > > >>  ~[org.apache.pulsar-pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > > >>      at 
> > > > >> org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.lambda$readEntriesComplete$1(PersistentDispatcherSingleActiveConsumer.java:151)
> > > > >>  ~[org.apache.pulsar-pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > > >>      at 
> > > > >> org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
> > > > >>  ~[org.apache.bookkeeper-bookkeeper-common-4.16.0.jar:4.16.0]
> > > > >>      at 
> > > > >> org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:113)
> > > > >>  ~[org.apache.bookkeeper-bookkeeper-common-4.16.0.jar:4.16.0]
> > > > >>      at 
> > > > >> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> > > > >>  ~[io.netty-netty-common-4.1.89.Final.jar:4.1.89.Final]
> > > > >>      at java.lang.Thread.run(Thread.java:833) ~[?:?]
> > > > >>
> > > > >> We looked into this but aren’t sure what would cause this to 
> > > > >> suddenly be
> > > > >> an issue. The candidate would be differences in the new version of 
> > > > >> BK …
> > > > >>
> > > > >> Best,
> > > > >> Dave
> > > > >>
> > > > >

Reply via email to