Hi Lari

> The correct way to handle this would be to configure
> ChannelOption.WRITE_BUFFER_WATER_MARK and perform the desired action in
> the channelWritabilityChanged callback method. Why would you create a
> custom solution for controlling the size of the write buffer when Netty
> already has a standard solution for controlling the write buffer size?
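To confirm I am reading this correctly, here is a minimal sketch of the standard Netty mechanism you are describing. The watermark values are arbitrary placeholders, and the direct auto-read toggle is only illustrative, given your note further down about going through ServerCnxThrottleTracker instead:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

public class WriteBufferBackpressureSketch {

    // Configure per-channel watermarks: the channel becomes unwritable once more
    // than 128 KiB of responses are queued, and writable again below 64 KiB.
    // (Values are illustrative; in practice they would come from configuration.)
    static void configure(ServerBootstrap bootstrap) {
        bootstrap.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
                new WriteBufferWaterMark(64 * 1024, 128 * 1024));
    }

    // React to writability changes instead of tracking the buffered bytes manually.
    static class BackpressureHandler extends ChannelDuplexHandler {
        @Override
        public void channelWritabilityChanged(ChannelHandlerContext ctx) {
            // In Pulsar this is where the ServerCnxThrottleTracker call would go
            // (see below); a plain Netty server would simply toggle auto-read so
            // that no new requests are read while the outbound buffer is full.
            ctx.channel().config().setAutoRead(ctx.channel().isWritable());
            ctx.fireChannelWritabilityChanged();
        }
    }
}
```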
I understand what you suggested now:

- Use `WRITE_BUFFER_WATER_MARK` instead of the new configuration.
- The event that changes the channel state to unwritable is equivalent to
  `!channel.isWritable() && the pending bytes exceed the limit`.

I think it is a great solution, thanks; I will change the implementation later.

> In this case, the correct action would be to call
> org.apache.pulsar.broker.service.ServerCnxThrottleTracker's
> incrementThrottleCount/decrementThrottleCount instead of toggling
> autoread directly.

I reviewed `ServerCnxThrottleTracker`, which was introduced in release `3.2.0`.
I think we should adopt that utility class in a separate, second PR, so that
the first PR can be cherry-picked into branch-3.0, not only branch-3.3 and
branch-4.0, while the second PR can be cherry-picked into branch-3.3 and
branch-4.0 (a rough sketch of what I have in mind for the second PR is at the
end of this mail). BTW, `ServerCnxThrottleTracker` has a bug: the rate limiter
and the max pending publish bytes mechanism affect each other. I will fix that
issue in the second PR as well.

On Tue, Jul 8, 2025 at 2:07 PM Yubiao Feng <yubiao.f...@streamnative.io> wrote:

> Hi all
>
> I want to start a discussion related to PR #24423: Handling Overloaded
> Netty Channels in Apache Pulsar.
>
> Problem Statement
> We've encountered a critical issue in our Apache Pulsar clusters where
> brokers experience Out-Of-Memory (OOM) errors and continuous restarts
> under specific load patterns. This occurs when Netty channel write
> buffers become full, leading to a buildup of unacknowledged responses in
> the broker's memory.
>
> Background
> Our clusters are configured with numerous namespaces, each containing
> approximately 8,000 to 10,000 topics. Our consumer applications are quite
> large, with each consumer using a regular expression (regex) pattern to
> subscribe to all topics within a namespace.
>
> The problem manifests particularly during consumer application restarts.
> When a consumer restarts, it issues a getTopicsOfNamespace request. Due
> to the sheer number of topics, the response size is extremely large. This
> massive response overwhelms the socket output buffer, causing it to fill
> up rapidly. Consequently, the broker's responses get backlogged in
> memory, eventually leading to the broker's OOM and a subsequent restart
> loop.
>
> Why "Returning an Error" Is Not a Solution
> A common approach to handling overload is to simply return an error when
> the broker cannot process a request. However, in this specific scenario,
> that solution is ineffective. If a consumer application fails to start
> due to an error, it triggers a user pod restart, which then leads to the
> same getTopicsOfNamespace request being reissued, resulting in a
> continuous loop of errors and restarts. This creates an unrecoverable
> state for the consumer application and puts immense pressure on the
> brokers.
>
> Proposed Solution and Justification
> We believe the solution proposed in
> https://github.com/apache/pulsar/pull/24423 is highly suitable for
> addressing this issue. The core mechanism introduced in this PR, pausing
> acceptance of new requests when a channel cannot handle more output, is
> exceptionally reasonable and addresses the root cause of the memory
> pressure.
>
> This approach prevents the broker from accepting new requests when its
> write buffers are full, effectively backpressuring the client and
> preventing the memory buildup that leads to OOMs.
> Furthermore, we anticipate that this mechanism will not significantly
> increase future maintenance costs, as it elegantly handles overload
> scenarios at a fundamental network layer.
>
> I invite the community to discuss this solution and its potential
> benefits for the overall stability and resilience of Apache Pulsar.
>
> Thanks
> Yubiao Feng
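PS: regarding the `ServerCnxThrottleTracker` suggestion, this is roughly the shape I have in mind for the second PR. Only incrementThrottleCount/decrementThrottleCount come from your mail; the handler structure, field names, and the way the tracker is passed in are illustrative assumptions, not the actual Pulsar wiring:

```java
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import org.apache.pulsar.broker.service.ServerCnxThrottleTracker;

// Hypothetical handler sketch: delegates writability-based backpressure to the
// connection's throttle tracker instead of toggling auto-read directly.
public class WritabilityThrottleHandler extends ChannelDuplexHandler {

    private final ServerCnxThrottleTracker throttleTracker;
    // Remembers whether this handler currently holds a throttle, so the
    // increment/decrement calls stay balanced across repeated writability flips.
    private boolean throttledByWritability;

    public WritabilityThrottleHandler(ServerCnxThrottleTracker throttleTracker) {
        this.throttleTracker = throttleTracker;
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        if (!ctx.channel().isWritable() && !throttledByWritability) {
            // Outbound buffer exceeded the high watermark: ask the tracker to
            // pause reading new requests on this connection.
            throttledByWritability = true;
            throttleTracker.incrementThrottleCount();
        } else if (ctx.channel().isWritable() && throttledByWritability) {
            // Buffer drained below the low watermark: release our throttle.
            throttledByWritability = false;
            throttleTracker.decrementThrottleCount();
        }
        ctx.fireChannelWritabilityChanged();
    }
}
```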