RE: Re: Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect

2023-12-03 Thread Nikolai

Hi Rajan,


Thanks for sharing the link! I will follow the PR.


Best Regards,

Nikolai


On 2023/12/02 01:21:04 Rajan Dhabalia wrote:
> Hi Nikolai,
>
> I completely understand your concern and pain. But let's understand
> ordering semantics for exclusive/fail-over subscription. This 
subscription

> makes sure to dispatch new messages which the broker is reading using
> readPosition must be in order but that doesn't mean the broker will never
> handle redelivery of unack message and broker will not stop 
dispatching by

> considering the end of the world. However, key-Shared sub thinks it's the
> end of the world. key-shared sub extends the shared sub behavior with
> additional speciality of consumer dispatch affinity based on message key.
> Broker tries its best to perform this responsibility but it's still not
> guaranteed as it will be really costly to give such guarantee and if one
> requires it then we can use a failover sub. In shared or even in other
> subscriptions, brokers don't restrict or stop dispatching of redelivery
> messages.
> However, in key-shared sub, broker stops forever dispatching new and
> redelivery messages in certain conditions and additionally broker 
goes into

> an infinite iterative sequence of reading duplicate cold messages and
> discarding them which causes IO and CPU impact both on broker and bookie.
> Pulsar must not have such broken and incorrect semantic behavior in the
> system. I have created an issue:
> https://github.com/apache/pulsar/issues/21656 which has a simple unit 
test

> case to reproduce this behavior and that's what is happening in multiple
> production Pulsar systems and users are not happy with it.
>
> Therefore, we should not stop dispatching and restricting redeliver
> messages which will automatically address almost all stuck issues and the
> broker can still follow the ordering guarantee for new messages which it
> promises for other subs as well.
>
> Thanks,
> Rajan
>
> On Fri, Dec 1, 2023 at 12:03 AM Nikolai  wrote:
>
> > Hi Rajan,
> >
> > Thanks for the review!
> >
> > It looks like the root cause of "stuck" keyword is the nature of
> > Key_Shared subscription which must guarantee messages ordering. I
> > understand it is really hard to achieve such guarantees without having
> > separate physical partitions but this feature is very essential for 
many

> > use cases. I also understand that the solution I've provided is more a
> > hack than a solid solution. I was trying to touch as little code as I
> > can and, unfortunately, I see no other options to avoid stuck 
dispatches

> > in the current implementation approach.
> >
> > I also want to highlight that the current implementation leads entire
> > system to hang due to a single message corruption in case consumers
> > disconnects. This breaks one of a big Pulsar advantage: independent 
keys

> > processing.
> >
> > Do you have plans for Key_Shared dispatcher rework? I would be happy to
> > contribute to such a project if I can help somehow.
> >
> >
> > Best Regards,
> >
> > Nikolai
> >
> >
> >
> > On 2023/12/01 06:34:59 Rajan Dhabalia wrote:
> > > Hi Nikolai,
> > >
> > > Thanks for bringing this to the attention. I have seen multiple PIP
> > and bug
> > > fixes for the Key-Shared subscription and I feel Key_Shared 
subscription

> > > dispatcher is developed with multiple fundamental issues which
> > required to
> > > introduce scheduled-task to unblock stuck reads, introducing stuck
> > delivery
> > > flag, and now we are seeing at least 2-3 PIP to add few more hacks to
> > > control such stuck dispatch. After reviewing code, it seems
> > >
> >
> > 
PersistentStickyKeyDispatcherMultipleConsumers::getRestrictedMaxEntriesForConsumer(..)

> > > function can still return 0 dispatch messages with multiple different
> > edge
> > > cases and fundamental dispatch is not clean at all. This is the only
> > > dispatcher introduced after open sourcing Pulsar and unfortunately it
> > was
> > > not developed with a cleaner approach. I understand the pain of such
> > stuck
> > > dispatch and even the current release is having a bugs where 
dispatch is
> > > getting stuck and many production systems are struggling with it 
and you

> > > must be suggesting this PIP to get work around for this stuck issue.
> > But I
> > > don't think this is the right approach to follow.
> > > I think the correct approach is to get rid of this "stuck" keyword
> > from the
> > > dispatcher because having "stuck" concept in dispatcher means it 
has a

> > > known flow and bug which we are not solving but we are adding more
> > debt to
> > > perform the work around. We never had any such concept in other
> > dispatchers
> > > which was introduced when Pulsar was introduced originally and it 
just

> > > works with clear semantics and permit flow mechanism , and we should
> > follow
> > > the same practice for this subscription.
> > >
> > > Thanks,
> > > Rajan
> > >
> > > On Thu, Nov 30, 2023 at 9:18 AM Nikolai  wrote:
> > >
> 

Re: Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect

2023-12-01 Thread Rajan Dhabalia
Hi Nikolai,

I completely understand your concern and pain. But let's understand
ordering semantics for exclusive/fail-over subscription. This subscription
makes sure to dispatch new messages which the broker is reading using
readPosition must be in order but that doesn't mean the broker will never
handle redelivery of unack message and broker will not stop dispatching by
considering the end of the world. However, key-Shared sub thinks it's the
end of the world. key-shared sub extends the shared sub behavior with
additional speciality of consumer dispatch affinity based on message key.
Broker tries its best to perform this responsibility but it's still not
guaranteed as it will be really costly to give such guarantee and if one
requires it then we can use a failover sub. In shared or even in other
subscriptions, brokers don't restrict or stop dispatching of redelivery
messages.
However, in key-shared sub, broker stops forever dispatching new and
redelivery messages in certain conditions and additionally broker goes into
an infinite iterative sequence of reading duplicate cold messages and
discarding them which causes IO and CPU impact both on broker and bookie.
Pulsar must not have such broken and incorrect semantic behavior in the
system. I have created an issue:
https://github.com/apache/pulsar/issues/21656 which has a simple unit test
case to reproduce this behavior and that's what is happening in multiple
production Pulsar systems and users are not happy with it.

Therefore, we should not stop dispatching and restricting redeliver
messages which will automatically address almost all stuck issues and the
broker can still follow the ordering guarantee for new messages which it
promises for other subs as well.

Thanks,
Rajan

On Fri, Dec 1, 2023 at 12:03 AM Nikolai  wrote:

> Hi Rajan,
>
> Thanks for the review!
>
> It looks like the root cause of "stuck" keyword is the nature of
> Key_Shared subscription which must guarantee messages ordering. I
> understand it is really hard to achieve such guarantees without having
> separate physical partitions but this feature is very essential for many
> use cases. I also understand that the solution I've provided is more a
> hack than a solid solution. I was trying to touch as little code as I
> can and, unfortunately, I see no other options to avoid stuck dispatches
> in the current implementation approach.
>
> I also want to highlight that the current implementation leads entire
> system to hang due to a single message corruption in case consumers
> disconnects. This breaks one of a big Pulsar advantage: independent keys
> processing.
>
> Do you have plans for Key_Shared dispatcher rework? I would be happy to
> contribute to such a project if I can help somehow.
>
>
> Best Regards,
>
> Nikolai
>
>
>
> On 2023/12/01 06:34:59 Rajan Dhabalia wrote:
>  > Hi Nikolai,
>  >
>  > Thanks for bringing this to the attention. I have seen multiple PIP
> and bug
>  > fixes for the Key-Shared subscription and I feel Key_Shared subscription
>  > dispatcher is developed with multiple fundamental issues which
> required to
>  > introduce scheduled-task to unblock stuck reads, introducing stuck
> delivery
>  > flag, and now we are seeing at least 2-3 PIP to add few more hacks to
>  > control such stuck dispatch. After reviewing code, it seems
>  >
>
> PersistentStickyKeyDispatcherMultipleConsumers::getRestrictedMaxEntriesForConsumer(..)
>  > function can still return 0 dispatch messages with multiple different
> edge
>  > cases and fundamental dispatch is not clean at all. This is the only
>  > dispatcher introduced after open sourcing Pulsar and unfortunately it
> was
>  > not developed with a cleaner approach. I understand the pain of such
> stuck
>  > dispatch and even the current release is having a bugs where dispatch is
>  > getting stuck and many production systems are struggling with it and you
>  > must be suggesting this PIP to get work around for this stuck issue.
> But I
>  > don't think this is the right approach to follow.
>  > I think the correct approach is to get rid of this "stuck" keyword
> from the
>  > dispatcher because having "stuck" concept in dispatcher means it has a
>  > known flow and bug which we are not solving but we are adding more
> debt to
>  > perform the work around. We never had any such concept in other
> dispatchers
>  > which was introduced when Pulsar was introduced originally and it just
>  > works with clear semantics and permit flow mechanism , and we should
> follow
>  > the same practice for this subscription.
>  >
>  > Thanks,
>  > Rajan
>  >
>  > On Thu, Nov 30, 2023 at 9:18 AM Nikolai  wrote:
>  >
>  > > Hello!
>  > >
>  > > I submitted a new PIP to add a configuration option which allows to
> skip
>  > > blocking recently joined consumers for Key_Shared subscriptions
>  > >
>  > > It introduces additional memory consumption, so it will be disabled by
>  > > default. It would fix issues like
>  > > 

RE: Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect

2023-12-01 Thread Nikolai

Hi Rajan,

Thanks for the review!

It looks like the root cause of "stuck" keyword is the nature of 
Key_Shared subscription which must guarantee messages ordering. I 
understand it is really hard to achieve such guarantees without having 
separate physical partitions but this feature is very essential for many 
use cases. I also understand that the solution I've provided is more a 
hack than a solid solution. I was trying to touch as little code as I 
can and, unfortunately, I see no other options to avoid stuck dispatches 
in the current implementation approach.


I also want to highlight that the current implementation leads entire 
system to hang due to a single message corruption in case consumers 
disconnects. This breaks one of a big Pulsar advantage: independent keys 
processing.


Do you have plans for Key_Shared dispatcher rework? I would be happy to 
contribute to such a project if I can help somehow.



Best Regards,

Nikolai



On 2023/12/01 06:34:59 Rajan Dhabalia wrote:
> Hi Nikolai,
>
> Thanks for bringing this to the attention. I have seen multiple PIP 
and bug

> fixes for the Key-Shared subscription and I feel Key_Shared subscription
> dispatcher is developed with multiple fundamental issues which 
required to
> introduce scheduled-task to unblock stuck reads, introducing stuck 
delivery

> flag, and now we are seeing at least 2-3 PIP to add few more hacks to
> control such stuck dispatch. After reviewing code, it seems
> 
PersistentStickyKeyDispatcherMultipleConsumers::getRestrictedMaxEntriesForConsumer(..)
> function can still return 0 dispatch messages with multiple different 
edge

> cases and fundamental dispatch is not clean at all. This is the only
> dispatcher introduced after open sourcing Pulsar and unfortunately it was
> not developed with a cleaner approach. I understand the pain of such 
stuck

> dispatch and even the current release is having a bugs where dispatch is
> getting stuck and many production systems are struggling with it and you
> must be suggesting this PIP to get work around for this stuck issue. 
But I

> don't think this is the right approach to follow.
> I think the correct approach is to get rid of this "stuck" keyword 
from the

> dispatcher because having "stuck" concept in dispatcher means it has a
> known flow and bug which we are not solving but we are adding more 
debt to
> perform the work around. We never had any such concept in other 
dispatchers

> which was introduced when Pulsar was introduced originally and it just
> works with clear semantics and permit flow mechanism , and we should 
follow

> the same practice for this subscription.
>
> Thanks,
> Rajan
>
> On Thu, Nov 30, 2023 at 9:18 AM Nikolai  wrote:
>
> > Hello!
> >
> > I submitted a new PIP to add a configuration option which allows to 
skip

> > blocking recently joined consumers for Key_Shared subscriptions
> >
> > It introduces additional memory consumption, so it will be disabled by
> > default. It would fix issues like
> > https://github.com/apache/pulsar/issues/21199
> >
> > Link to the PIP: https://github.com/apache/pulsar/pull/21615
> >
> > PR with the PIP implementation:
> > https://github.com/apache/pulsar/pull/21579
> >
> >
> > Best Regards,
> >
> > Nikolai
> >
> >
>


Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect

2023-11-30 Thread Rajan Dhabalia
Hi Nikolai,

Thanks for bringing this to the attention. I have seen multiple PIP and bug
fixes for the Key-Shared subscription and I feel Key_Shared subscription
dispatcher is developed with multiple fundamental issues which required to
introduce scheduled-task to unblock stuck reads, introducing stuck delivery
flag, and now we are seeing at least 2-3 PIP to add few more hacks to
control such stuck dispatch. After reviewing code, it seems
PersistentStickyKeyDispatcherMultipleConsumers::getRestrictedMaxEntriesForConsumer(..)
function can still return 0 dispatch messages with multiple different edge
cases and fundamental dispatch is not clean at all. This is the only
dispatcher introduced after open sourcing Pulsar and unfortunately it was
not developed with a cleaner approach. I understand the pain of such stuck
dispatch and even the current release is having a bugs where dispatch is
getting stuck and many production systems are struggling with it and you
must be suggesting this PIP to get work around for this stuck issue. But I
don't think this is the right approach to follow.
I think the correct approach is to get rid of this "stuck" keyword from the
dispatcher because having "stuck" concept in dispatcher means it has a
known flow and bug which we are not solving but we are adding more debt to
perform the work around. We never had any such concept in other dispatchers
which was introduced when Pulsar was introduced originally and it just
works with clear semantics and permit flow mechanism , and we should follow
the same practice for this subscription.

Thanks,
Rajan

On Thu, Nov 30, 2023 at 9:18 AM Nikolai  wrote:

> Hello!
>
> I submitted a new PIP to add a configuration option which allows to skip
> blocking recently joined consumers for Key_Shared subscriptions
>
> It introduces additional memory consumption, so it will be disabled by
> default. It would fix issues like
> https://github.com/apache/pulsar/issues/21199
>
> Link to the PIP: https://github.com/apache/pulsar/pull/21615
>
> PR with the PIP implementation:
> https://github.com/apache/pulsar/pull/21579
>
>
> Best Regards,
>
> Nikolai
>
>


Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect

2023-11-30 Thread Dave Fisher
Hi Nikolai - Your email was stuck in moderation. Please subscribe to the dev 
list by sending an email to dev-subscr...@pulsar.apache.org and respond to the 
CONFIRM email.

Welcome to the Pulsar community.

Best,
Dave

> On Nov 28, 2023, at 10:30 PM, Nikolai  wrote:
> 
> Hello!
> 
> I submitted a new PIP to add a configuration option which allows to skip 
> blocking recently joined consumers for Key_Shared subscriptions
> 
> It introduces additional memory consumption, so it will be disabled by 
> default. It would fix issues like 
> https://github.com/apache/pulsar/issues/21199
> 
> Link to the PIP: https://github.com/apache/pulsar/pull/21615
> 
> PR with the PIP implementation: https://github.com/apache/pulsar/pull/21579
> 
> 
> Best Regards,
> 
> Nikolai
> 



[Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect

2023-11-30 Thread Nikolai

Hello!

I submitted a new PIP to add a configuration option which allows to skip 
blocking recently joined consumers for Key_Shared subscriptions


It introduces additional memory consumption, so it will be disabled by 
default. It would fix issues like 
https://github.com/apache/pulsar/issues/21199


Link to the PIP: https://github.com/apache/pulsar/pull/21615

PR with the PIP implementation: https://github.com/apache/pulsar/pull/21579


Best Regards,

Nikolai