I ended up digging a little longer. I have some partial results to share.

It seems to me the problem was likely introduced by this Bookkeeper PR
https://github.com/apache/bookkeeper/pull/3783.

These are likely the problematic lines:
https://github.com/apache/bookkeeper/blob/234b817cdb4e054887ffd5e42eaed25dc02daf63/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L161-L167

My current theory is that we're using a CompositeByteBuf when the
brokerEntryMetadataInterceptors is in use. Because of that, the call
in OpAddEntry [0] does not produce a deep copy of the readerIndex and
the writerIndex. When pulsar passes the "duplicateBuffer" to the
bookkeeper client, the internal buffer's readerIndex and writerIndex
are both mutable state. Therefore, when we call the `writeBytes`
method in the DigestManager, the buffer's readerIndex is moved
forward. Interestingly, when the buffer is big enough, we rely on
`ByteBufList` to coalesce the buffers and it appears the readerIndex
is maintained in that case.

That likely explains why we're seeing the readerIndex equal to the
buffer's capacity.

It'd be good to get a definitive test to confirm this theory. (Sorry,
I would have tested myself, but I need to sign off now.)

Thanks,
Michael

[0] 
https://github.com/apache/pulsar/blob/bafecb2a9c0e73942de6a38df72dd5888d5afd66/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpAddEntry.java#L126

On Thu, Apr 13, 2023 at 5:31 PM Michael Marshall <mmarsh...@apache.org> wrote:
>
> By my testing using the above steps to reproduce the issue, I do not
> see the bug on commit [0] but I do see it on the subsequent commit
> [1].
>
> That indicates [1], which is the bookkeeper client upgrade to 4.16.0
> commit, introduced the problem. I need to sign off soon. I appreciate
> any help you can provide in looking into this issue!
>
> Thanks,
> Michael
>
> [0] 
> https://github.com/apache/pulsar/commit/25af808970fc02281a0b9345c5de205654d7b789
> [1] 
> https://github.com/apache/pulsar/commit/a7a605f13565edc380b8ae01808832c426f16d1f
>
> On Thu, Apr 13, 2023 at 4:51 PM Michael Marshall <mmarsh...@apache.org> wrote:
> >
> > Here's an update on my progress. I reproduced Dave's observations with
> > the following steps:
> >
> > 1. Download and unpack 3.0.0 RC 1.
> >
> > 2. Add these two settings to the conf/standalone.conf:
> > exposingBrokerEntryMetadataToClientEnabled=true
> > brokerEntryMetadataInterceptors=org.apache.pulsar.common.intercept.AppendIndexMetadataInterceptor,org.apache.pulsar.common.intercept.AppendBrokerTimestampMetadataInterceptor
> >
> > 3. Run bin/pulsar standalone
> >
> > 4. Run curl http://localhost:8080/admin/v2/brokers/health
> >
> > 5. View the following log:
> >
> > 2023-04-13T16:48:39,256-0500
> > [broker-topic-workers-OrderedExecutor-7-0] ERROR
> > org.apache.pulsar.common.protocol.Commands -
> > [PersistentSubscription{topic=persistent://pulsar/standalone/localhost:8080/healthcheck,
> > name=healthCheck-012d2e05-371f-4ddc-a666-daf9fa243d87}] [-1] Failed to
> > parse message metadata
> > java.lang.IndexOutOfBoundsException: readerIndex(100) + length(2)
> > exceeds writerIndex(100): UnpooledDuplicatedByteBuf(ridx: 100, widx:
> > 100, cap: 100, unwrapped: CompositeByteBuf(ridx: 100, widx: 100, cap:
> > 100, components=2))
> >     at 
> > io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1442)
> > ~[io.netty-netty-buffer-4.1.89.Final.jar:4.1.89.Final]
> >     at io.netty.buffer.AbstractByteBuf.readShort(AbstractByteBuf.java:749)
> > ~[io.netty-netty-buffer-4.1.89.Final.jar:4.1.89.Final]
> >     at 
> > org.apache.pulsar.common.protocol.Commands.skipBrokerEntryMetadataIfExist(Commands.java:1692)
> > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> >     at 
> > org.apache.pulsar.common.protocol.Commands.parseMessageMetadata(Commands.java:452)
> > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> >     at 
> > org.apache.pulsar.common.protocol.Commands.parseMessageMetadata(Commands.java:445)
> > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> >     at 
> > org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1899)
> > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> >     at 
> > org.apache.pulsar.common.protocol.Commands.peekAndCopyMessageMetadata(Commands.java:1918)
> > ~[org.apache.pulsar-pulsar-common-3.0.0.jar:3.0.0]
> >     at 
> > org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:142)
> > ~[org.apache.pulsar-pulsar-broker-3.0.0.jar:3.0.0]
> >     at 
> > org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:100)
> > ~[org.apache.pulsar-pulsar-broker-3.0.0.jar:3.0.0]
> >     at 
> > org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.internalReadEntriesComplete(PersistentDispatcherSingleActiveConsumer.java:210)
> > ~[org.apache.pulsar-pulsar-broker-3.0.0.jar:3.0.0]
> >     at 
> > org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.lambda$readEntriesComplete$1(PersistentDispatcherSingleActiveConsumer.java:151)
> > ~[org.apache.pulsar-pulsar-broker-3.0.0.jar:3.0.0]
> >     at 
> > org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
> > ~[org.apache.bookkeeper-bookkeeper-common-4.16.0.jar:4.16.0]
> >     at 
> > org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:113)
> > ~[org.apache.bookkeeper-bookkeeper-common-4.16.0.jar:4.16.0]
> >     at 
> > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> > ~[io.netty-netty-common-4.1.89.Final.jar:4.1.89.Final]
> >     at java.lang.Thread.run(Thread.java:833) ~[?:?]
> >
> > Now that we have a consistent way to reproduce it, I'll work on
> > understanding the root cause. I believe we run with these two settings
> > with other versions, so this feels like a regression. I cannot say for
> > sure yet.
> >
> > Thanks,
> > Michael
> >
> > On Thu, Apr 13, 2023 at 1:47 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> > >
> > > Dave,
> > > IIUC (from offline discussion) those tests started to fail only recently,
> > > as we run those tests against the master branch of apache/pulsar repo.
> > >
> > > Does the beginning of failures match the upgrade to BK 4.16 ?
> > > The problem may be related to the handling of ByteBufs, IIRC there are a
> > > few changes in the latest
> > > BK release.
> > >
> > > Enrico
> > >
> > > Il giorno gio 13 apr 2023 alle ore 05:02 Hang Chen <chenh...@apache.org> 
> > > ha
> > > scritto:
> > >
> > > > Hi Dave,
> > > >     Thanks for verifying this release candidate. I have setup the Pulsar
> > > > cluster with 3.0 on my laptop and test the produce and consume, it works
> > > > fine. We also deploy the it on k8s, the produce, consume and health 
> > > > check
> > > > also works fine. Would you please share your broker configurations?
> > > >
> > > > Thanks,
> > > > Hang
> > > >
> > > > Dave Fisher <wave4d...@comcast.net> 于2023年4月13日周四 06:49写道:
> > > >
> > > >> Hi -
> > > >>
> > > >> We do End to end testing of Pulsar versions on GCP and with Pulsar 3.0
> > > >> today we have the following problem.
> > > >>
> > > >> This image shows the broker deployments not coming up. This is a 
> > > >> non-TLS
> > > >> setup.
> > > >>
> > > >> [image: Screen Shot 2023-04-12 at 1.09.59 PM.png]
> > > >>
> > > >> The Broker Pods have errors like this one:
> > > >>
> > > >> 2023-04-12T20:10:35,995+0000 
> > > >> [broker-topic-workers-OrderedExecutor-3-0] ERROR 
> > > >> org.apache.pulsar.common.protocol.Commands - 
> > > >> [PersistentSubscription{topic=persistent://pulsar/pulsar/10.236.0.4:8080/healthcheck,
> > > >>  name=healthCheck-04a8e22e-3e0b-4aa3-9141-d8970c1b0712}] [-1] Failed 
> > > >> to parse message metadata
> > > >> java.lang.IndexOutOfBoundsException: readerIndex(96) + length(2) 
> > > >> exceeds writerIndex(96): UnpooledDuplicatedByteBuf(ridx: 96, widx: 96, 
> > > >> cap: 96, unwrapped: CompositeByteBuf(ridx: 96, widx: 96, cap: 96, 
> > > >> components=2))
> > > >>      at 
> > > >> io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1442)
> > > >>  ~[io.netty-netty-buffer-4.1.89.Final.jar:4.1.89.Final]
> > > >>      at 
> > > >> io.netty.buffer.AbstractByteBuf.readShort(AbstractByteBuf.java:749) 
> > > >> ~[io.netty-netty-buffer-4.1.89.Final.jar:4.1.89.Final]
> > > >>      at 
> > > >> org.apache.pulsar.common.protocol.Commands.skipBrokerEntryMetadataIfExist(Commands.java:1692)
> > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > >>      at 
> > > >> org.apache.pulsar.common.protocol.Commands.parseMessageMetadata(Commands.java:452)
> > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > >>      at 
> > > >> org.apache.pulsar.common.protocol.Commands.parseMessageMetadata(Commands.java:445)
> > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > >>      at 
> > > >> org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1899)
> > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > >>      at 
> > > >> org.apache.pulsar.common.protocol.Commands.peekAndCopyMessageMetadata(Commands.java:1918)
> > > >>  ~[org.apache.pulsar-pulsar-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > >>      at 
> > > >> org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:142)
> > > >>  ~[org.apache.pulsar-pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > >>      at 
> > > >> org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:100)
> > > >>  ~[org.apache.pulsar-pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > >>      at 
> > > >> org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.internalReadEntriesComplete(PersistentDispatcherSingleActiveConsumer.java:210)
> > > >>  ~[org.apache.pulsar-pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > >>      at 
> > > >> org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.lambda$readEntriesComplete$1(PersistentDispatcherSingleActiveConsumer.java:151)
> > > >>  ~[org.apache.pulsar-pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
> > > >>      at 
> > > >> org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
> > > >>  ~[org.apache.bookkeeper-bookkeeper-common-4.16.0.jar:4.16.0]
> > > >>      at 
> > > >> org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:113)
> > > >>  ~[org.apache.bookkeeper-bookkeeper-common-4.16.0.jar:4.16.0]
> > > >>      at 
> > > >> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> > > >>  ~[io.netty-netty-common-4.1.89.Final.jar:4.1.89.Final]
> > > >>      at java.lang.Thread.run(Thread.java:833) ~[?:?]
> > > >>
> > > >> We looked into this but aren’t sure what would cause this to suddenly 
> > > >> be
> > > >> an issue. The candidate would be differences in the new version of BK …
> > > >>
> > > >> Best,
> > > >> Dave
> > > >>
> > > >

Reply via email to