RE: Re: Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect
Hi Rajan, Thanks for sharing the link! I will follow the PR. Best Regards, Nikolai On 2023/12/02 01:21:04 Rajan Dhabalia wrote: > Hi Nikolai, > > I completely understand your concern and pain. But let's understand > ordering semantics for exclusive/fail-over subscription. This subscription > makes sure to dispatch new messages which the broker is reading using > readPosition must be in order but that doesn't mean the broker will never > handle redelivery of unack message and broker will not stop dispatching by > considering the end of the world. However, key-Shared sub thinks it's the > end of the world. key-shared sub extends the shared sub behavior with > additional speciality of consumer dispatch affinity based on message key. > Broker tries its best to perform this responsibility but it's still not > guaranteed as it will be really costly to give such guarantee and if one > requires it then we can use a failover sub. In shared or even in other > subscriptions, brokers don't restrict or stop dispatching of redelivery > messages. > However, in key-shared sub, broker stops forever dispatching new and > redelivery messages in certain conditions and additionally broker goes into > an infinite iterative sequence of reading duplicate cold messages and > discarding them which causes IO and CPU impact both on broker and bookie. > Pulsar must not have such broken and incorrect semantic behavior in the > system. I have created an issue: > https://github.com/apache/pulsar/issues/21656 which has a simple unit test > case to reproduce this behavior and that's what is happening in multiple > production Pulsar systems and users are not happy with it. > > Therefore, we should not stop dispatching and restricting redeliver > messages which will automatically address almost all stuck issues and the > broker can still follow the ordering guarantee for new messages which it > promises for other subs as well. > > Thanks, > Rajan > > On Fri, Dec 1, 2023 at 12:03 AM Nikolai wrote: > > > Hi Rajan, > > > > Thanks for the review! > > > > It looks like the root cause of "stuck" keyword is the nature of > > Key_Shared subscription which must guarantee messages ordering. I > > understand it is really hard to achieve such guarantees without having > > separate physical partitions but this feature is very essential for many > > use cases. I also understand that the solution I've provided is more a > > hack than a solid solution. I was trying to touch as little code as I > > can and, unfortunately, I see no other options to avoid stuck dispatches > > in the current implementation approach. > > > > I also want to highlight that the current implementation leads entire > > system to hang due to a single message corruption in case consumers > > disconnects. This breaks one of a big Pulsar advantage: independent keys > > processing. > > > > Do you have plans for Key_Shared dispatcher rework? I would be happy to > > contribute to such a project if I can help somehow. > > > > > > Best Regards, > > > > Nikolai > > > > > > > > On 2023/12/01 06:34:59 Rajan Dhabalia wrote: > > > Hi Nikolai, > > > > > > Thanks for bringing this to the attention. I have seen multiple PIP > > and bug > > > fixes for the Key-Shared subscription and I feel Key_Shared subscription > > > dispatcher is developed with multiple fundamental issues which > > required to > > > introduce scheduled-task to unblock stuck reads, introducing stuck > > delivery > > > flag, and now we are seeing at least 2-3 PIP to add few more hacks to > > > control such stuck dispatch. After reviewing code, it seems > > > > > > > PersistentStickyKeyDispatcherMultipleConsumers::getRestrictedMaxEntriesForConsumer(..) > > > function can still return 0 dispatch messages with multiple different > > edge > > > cases and fundamental dispatch is not clean at all. This is the only > > > dispatcher introduced after open sourcing Pulsar and unfortunately it > > was > > > not developed with a cleaner approach. I understand the pain of such > > stuck > > > dispatch and even the current release is having a bugs where dispatch is > > > getting stuck and many production systems are struggling with it and you > > > must be suggesting this PIP to get work around for this stuck issue. > > But I > > > don't think this is the right approach to follow. > > > I think the correct approach is to get rid of this "stuck" keyword > > from the > > > dispatcher because having "stuck" concept in dispatcher means it has a > > > known flow and bug which we are not solving but we are adding more > > debt to > > > perform the work around. We never had any such concept in other > > dispatchers > > > which was introduced when Pulsar was introduced originally and it just > > > works with clear semantics and permit flow mechanism , and we should > > follow > > > the same practice for this subscription. > > > > > > Thanks, > > > Rajan > > > > > > On Thu, Nov 30, 2023 at 9:18 AM Nikolai wrote: > > > >
Re: Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect
Hi Nikolai, I completely understand your concern and pain. But let's understand ordering semantics for exclusive/fail-over subscription. This subscription makes sure to dispatch new messages which the broker is reading using readPosition must be in order but that doesn't mean the broker will never handle redelivery of unack message and broker will not stop dispatching by considering the end of the world. However, key-Shared sub thinks it's the end of the world. key-shared sub extends the shared sub behavior with additional speciality of consumer dispatch affinity based on message key. Broker tries its best to perform this responsibility but it's still not guaranteed as it will be really costly to give such guarantee and if one requires it then we can use a failover sub. In shared or even in other subscriptions, brokers don't restrict or stop dispatching of redelivery messages. However, in key-shared sub, broker stops forever dispatching new and redelivery messages in certain conditions and additionally broker goes into an infinite iterative sequence of reading duplicate cold messages and discarding them which causes IO and CPU impact both on broker and bookie. Pulsar must not have such broken and incorrect semantic behavior in the system. I have created an issue: https://github.com/apache/pulsar/issues/21656 which has a simple unit test case to reproduce this behavior and that's what is happening in multiple production Pulsar systems and users are not happy with it. Therefore, we should not stop dispatching and restricting redeliver messages which will automatically address almost all stuck issues and the broker can still follow the ordering guarantee for new messages which it promises for other subs as well. Thanks, Rajan On Fri, Dec 1, 2023 at 12:03 AM Nikolai wrote: > Hi Rajan, > > Thanks for the review! > > It looks like the root cause of "stuck" keyword is the nature of > Key_Shared subscription which must guarantee messages ordering. I > understand it is really hard to achieve such guarantees without having > separate physical partitions but this feature is very essential for many > use cases. I also understand that the solution I've provided is more a > hack than a solid solution. I was trying to touch as little code as I > can and, unfortunately, I see no other options to avoid stuck dispatches > in the current implementation approach. > > I also want to highlight that the current implementation leads entire > system to hang due to a single message corruption in case consumers > disconnects. This breaks one of a big Pulsar advantage: independent keys > processing. > > Do you have plans for Key_Shared dispatcher rework? I would be happy to > contribute to such a project if I can help somehow. > > > Best Regards, > > Nikolai > > > > On 2023/12/01 06:34:59 Rajan Dhabalia wrote: > > Hi Nikolai, > > > > Thanks for bringing this to the attention. I have seen multiple PIP > and bug > > fixes for the Key-Shared subscription and I feel Key_Shared subscription > > dispatcher is developed with multiple fundamental issues which > required to > > introduce scheduled-task to unblock stuck reads, introducing stuck > delivery > > flag, and now we are seeing at least 2-3 PIP to add few more hacks to > > control such stuck dispatch. After reviewing code, it seems > > > > PersistentStickyKeyDispatcherMultipleConsumers::getRestrictedMaxEntriesForConsumer(..) > > function can still return 0 dispatch messages with multiple different > edge > > cases and fundamental dispatch is not clean at all. This is the only > > dispatcher introduced after open sourcing Pulsar and unfortunately it > was > > not developed with a cleaner approach. I understand the pain of such > stuck > > dispatch and even the current release is having a bugs where dispatch is > > getting stuck and many production systems are struggling with it and you > > must be suggesting this PIP to get work around for this stuck issue. > But I > > don't think this is the right approach to follow. > > I think the correct approach is to get rid of this "stuck" keyword > from the > > dispatcher because having "stuck" concept in dispatcher means it has a > > known flow and bug which we are not solving but we are adding more > debt to > > perform the work around. We never had any such concept in other > dispatchers > > which was introduced when Pulsar was introduced originally and it just > > works with clear semantics and permit flow mechanism , and we should > follow > > the same practice for this subscription. > > > > Thanks, > > Rajan > > > > On Thu, Nov 30, 2023 at 9:18 AM Nikolai wrote: > > > > > Hello! > > > > > > I submitted a new PIP to add a configuration option which allows to > skip > > > blocking recently joined consumers for Key_Shared subscriptions > > > > > > It introduces additional memory consumption, so it will be disabled by > > > default. It would fix issues like > > > https://github.com/apache/pulsar/is
RE: Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect
Hi Rajan, Thanks for the review! It looks like the root cause of "stuck" keyword is the nature of Key_Shared subscription which must guarantee messages ordering. I understand it is really hard to achieve such guarantees without having separate physical partitions but this feature is very essential for many use cases. I also understand that the solution I've provided is more a hack than a solid solution. I was trying to touch as little code as I can and, unfortunately, I see no other options to avoid stuck dispatches in the current implementation approach. I also want to highlight that the current implementation leads entire system to hang due to a single message corruption in case consumers disconnects. This breaks one of a big Pulsar advantage: independent keys processing. Do you have plans for Key_Shared dispatcher rework? I would be happy to contribute to such a project if I can help somehow. Best Regards, Nikolai On 2023/12/01 06:34:59 Rajan Dhabalia wrote: > Hi Nikolai, > > Thanks for bringing this to the attention. I have seen multiple PIP and bug > fixes for the Key-Shared subscription and I feel Key_Shared subscription > dispatcher is developed with multiple fundamental issues which required to > introduce scheduled-task to unblock stuck reads, introducing stuck delivery > flag, and now we are seeing at least 2-3 PIP to add few more hacks to > control such stuck dispatch. After reviewing code, it seems > PersistentStickyKeyDispatcherMultipleConsumers::getRestrictedMaxEntriesForConsumer(..) > function can still return 0 dispatch messages with multiple different edge > cases and fundamental dispatch is not clean at all. This is the only > dispatcher introduced after open sourcing Pulsar and unfortunately it was > not developed with a cleaner approach. I understand the pain of such stuck > dispatch and even the current release is having a bugs where dispatch is > getting stuck and many production systems are struggling with it and you > must be suggesting this PIP to get work around for this stuck issue. But I > don't think this is the right approach to follow. > I think the correct approach is to get rid of this "stuck" keyword from the > dispatcher because having "stuck" concept in dispatcher means it has a > known flow and bug which we are not solving but we are adding more debt to > perform the work around. We never had any such concept in other dispatchers > which was introduced when Pulsar was introduced originally and it just > works with clear semantics and permit flow mechanism , and we should follow > the same practice for this subscription. > > Thanks, > Rajan > > On Thu, Nov 30, 2023 at 9:18 AM Nikolai wrote: > > > Hello! > > > > I submitted a new PIP to add a configuration option which allows to skip > > blocking recently joined consumers for Key_Shared subscriptions > > > > It introduces additional memory consumption, so it will be disabled by > > default. It would fix issues like > > https://github.com/apache/pulsar/issues/21199 > > > > Link to the PIP: https://github.com/apache/pulsar/pull/21615 > > > > PR with the PIP implementation: > > https://github.com/apache/pulsar/pull/21579 > > > > > > Best Regards, > > > > Nikolai > > > > >
Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect
Hi Nikolai, Thanks for bringing this to the attention. I have seen multiple PIP and bug fixes for the Key-Shared subscription and I feel Key_Shared subscription dispatcher is developed with multiple fundamental issues which required to introduce scheduled-task to unblock stuck reads, introducing stuck delivery flag, and now we are seeing at least 2-3 PIP to add few more hacks to control such stuck dispatch. After reviewing code, it seems PersistentStickyKeyDispatcherMultipleConsumers::getRestrictedMaxEntriesForConsumer(..) function can still return 0 dispatch messages with multiple different edge cases and fundamental dispatch is not clean at all. This is the only dispatcher introduced after open sourcing Pulsar and unfortunately it was not developed with a cleaner approach. I understand the pain of such stuck dispatch and even the current release is having a bugs where dispatch is getting stuck and many production systems are struggling with it and you must be suggesting this PIP to get work around for this stuck issue. But I don't think this is the right approach to follow. I think the correct approach is to get rid of this "stuck" keyword from the dispatcher because having "stuck" concept in dispatcher means it has a known flow and bug which we are not solving but we are adding more debt to perform the work around. We never had any such concept in other dispatchers which was introduced when Pulsar was introduced originally and it just works with clear semantics and permit flow mechanism , and we should follow the same practice for this subscription. Thanks, Rajan On Thu, Nov 30, 2023 at 9:18 AM Nikolai wrote: > Hello! > > I submitted a new PIP to add a configuration option which allows to skip > blocking recently joined consumers for Key_Shared subscriptions > > It introduces additional memory consumption, so it will be disabled by > default. It would fix issues like > https://github.com/apache/pulsar/issues/21199 > > Link to the PIP: https://github.com/apache/pulsar/pull/21615 > > PR with the PIP implementation: > https://github.com/apache/pulsar/pull/21579 > > > Best Regards, > > Nikolai > >
Re: [Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect
Hi Nikolai - Your email was stuck in moderation. Please subscribe to the dev list by sending an email to dev-subscr...@pulsar.apache.org and respond to the CONFIRM email. Welcome to the Pulsar community. Best, Dave > On Nov 28, 2023, at 10:30 PM, Nikolai wrote: > > Hello! > > I submitted a new PIP to add a configuration option which allows to skip > blocking recently joined consumers for Key_Shared subscriptions > > It introduces additional memory consumption, so it will be disabled by > default. It would fix issues like > https://github.com/apache/pulsar/issues/21199 > > Link to the PIP: https://github.com/apache/pulsar/pull/21615 > > PR with the PIP implementation: https://github.com/apache/pulsar/pull/21579 > > > Best Regards, > > Nikolai >
[Discuss] PIP-319 Unblock stuck Key_Shared subscription after consumer reconnect
Hello! I submitted a new PIP to add a configuration option which allows to skip blocking recently joined consumers for Key_Shared subscriptions It introduces additional memory consumption, so it will be disabled by default. It would fix issues like https://github.com/apache/pulsar/issues/21199 Link to the PIP: https://github.com/apache/pulsar/pull/21615 PR with the PIP implementation: https://github.com/apache/pulsar/pull/21579 Best Regards, Nikolai