[jira] [Commented] (ARTEMIS-4276) Message Group does not replicate properly during failover
[ https://issues.apache.org/jira/browse/ARTEMIS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721596#comment-17721596 ]

Justin Bertram commented on ARTEMIS-4276:
-----------------------------------------

The goal of grouping messages is to ensure that the messages in the group are processed *in order* while still allowing some level of concurrent message consumption. Typically, if you wanted to consume messages in order, you'd have to completely eliminate concurrent message consumption and restrict the number of consumers to 1 (e.g. using an [exclusive queue|https://activemq.apache.org/components/artemis/documentation/latest/exclusive-queues.html], or [{{max-consumers}}|https://activemq.apache.org/components/artemis/documentation/latest/address-model.html#shared-durable-subscription-queue-using-max-consumers]). Using message groups allows you to have as many consumers as groups: the messages within each group are consumed in order, while different groups are consumed concurrently.

To be clear, it doesn't specifically matter that the *same* consumer gets all the messages, only that *just one consumer at a time* gets the messages. There is no "right consumer," per se, in this context. To this end, Artemis doesn't guarantee that upon fail-over the same consumer will be chosen to handle the group.

> Message Group does not replicate properly during failover
> ---------------------------------------------------------
>
>                 Key: ARTEMIS-4276
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4276
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.28.0
>            Reporter: Liviu Citu
>            Priority: Major
>
> Hi,
> We are currently migrating our software from Classic to Artemis and we plan
> to use failover functionality.
> We were using message group functionality by setting *JMSXGroupID* and this
> was working as expected. However, after a failover switch I noticed that
> messages are sent to the wrong consumers.
> Our gateway/interface application is actually a collection of servers:
> * gateway adapter server: receives messages from external systems and
> puts them on a specific/virtual topic
> * gateway loader server (can be balanced): picks up the messages from the
> topic and does the processing
> * gateway fail queue: monitors all messages that failed processing and has
> functionality for resubmitting the message (users will correct the processing
> errors and then resubmit the transaction)
> *JMSXGroupID* is used to ensure that during message resubmit the same
> consumer/loader processes the message as originally processed it.
> However, if the message resubmit happens during a failover switch we have
> noticed that the message is not sent to the right consumer as it should be.
> Basically the first available consumer is used, which is not what we want.
> I have searched for configuration changes but couldn't find any relevant
> information.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
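The dispatch rule described in the comment above (one consumer at a time per group, with no promise about which consumer after fail-over) can be illustrated with a toy model. This is only a sketch under stated assumptions, not Artemis's implementation: the class `GroupDispatchSketch`, its methods, and the round-robin choice of the first owner are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of group-to-consumer pinning. NOT Artemis code; all names are hypothetical. */
final class GroupDispatchSketch {
    private final List<String> consumers = new ArrayList<>();
    // In-memory pins only: this is the state that is not carried across fail-over.
    private final Map<String, String> groupOwner = new HashMap<>();
    private int next = 0;

    GroupDispatchSketch(List<String> consumerIds) {
        consumers.addAll(consumerIds);
    }

    /** Returns the consumer that receives a message belonging to the given group. */
    String dispatch(String groupId) {
        // The first message of a group pins it to a consumer (round-robin is an
        // assumption of this sketch); later messages of that group follow the pin.
        return groupOwner.computeIfAbsent(groupId, g -> consumers.get(next++ % consumers.size()));
    }

    /** Models a fail-over: pins are forgotten and consumers may reconnect in another order. */
    void failOver(List<String> reconnectedConsumers) {
        groupOwner.clear();
        consumers.clear();
        consumers.addAll(reconnectedConsumers);
        next = 0;
    }

    public static void main(String[] args) {
        GroupDispatchSketch broker = new GroupDispatchSketch(List.of("LDR1", "LDR2"));
        System.out.println(broker.dispatch("EXT_SWAP_ID1")); // first group is pinned to one loader
        System.out.println(broker.dispatch("EXT_BOND_ID1")); // second group goes to the other loader
        System.out.println(broker.dispatch("EXT_SWAP_ID1")); // same loader as the first call

        broker.failOver(List.of("LDR2", "LDR1"));
        System.out.println(broker.dispatch("EXT_SWAP_ID1")); // ordering still holds, the owner may differ
    }
}
```

In both phases each group has exactly one owner at a time, which is all that ordered processing needs; only the identity of the owner changes.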
[jira] [Commented] (ARTEMIS-4276) Message Group does not replicate properly during failover
[ https://issues.apache.org/jira/browse/ARTEMIS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723419#comment-17723419 ]

Liviu Citu commented on ARTEMIS-4276:
-------------------------------------

The question is more about dealing with message duplication when grouping is being used. Let me provide some more details to explain the business case.

Our software uses the ActiveMQ JMS broker for message distribution across all of its clients and servers. We have some gateway interfaces with external systems to import transactions into the database. Such an interface consists of two main components:
* *gateway adapter server (producer)*: receives messages from the external systems using some APIs and *puts* them on a specific JMS topic
* *gateway loader server (consumer)*: consumes messages from the adapter JMS topic, does some processing and saves the transaction into the database

As the processing is time-consuming and the message volume is very high, we have to *balance the gateway loader server* (two or more loader servers/consumers can be configured to listen to the same producer). We can have multiple consumers of the same topic by using *virtual topics*.

These external transactions are versioned, so we need to ensure that they are processed in a specific order (actually in the order they are received). To ensure that, we use *JMSXGroupID*, which identifies the transaction without its version. By using grouping we ensure that the same consumer will process all versions of the same transaction. An external transaction is identified by *ExternalSystem+ExternalType+ExternalID*. The gateway adapter sets *JMSXGroupID* to this value in the JMS message before sending it to the topic. If a new version of the same transaction is received from the external system, the same *JMSXGroupID* is set in the message.

Practical example:
*EXT_SWAP_ID1* with version 1 will have *JMSXGroupID=EXT_SWAP_ID1*
*EXT_SWAP_ID1* with version 2 will have *JMSXGroupID=EXT_SWAP_ID1*
*EXT_BOND_ID1* with version 1 will have *JMSXGroupID=EXT_BOND_ID1*
*EXT_BOND_ID1* with version 2 will have *JMSXGroupID=EXT_BOND_ID1*
*EXT_BOND_ID1* with version 3 will have *JMSXGroupID=EXT_BOND_ID1*

Let's assume we have two loaders (consumers): *LDR1* and *LDR2*. Prior to failover we know that:
*LDR1* has processed all messages having *JMSXGroupID=EXT_SWAP_ID1*
*LDR2* has processed all messages having *JMSXGroupID=EXT_BOND_ID1*

During the failover switch we received two transactions:
*EXT_SWAP_ID1* with version 3 (*JMSXGroupID=EXT_SWAP_ID1*)
*EXT_BOND_ID1* with version 4 (*JMSXGroupID=EXT_BOND_ID1*)

*LDR1* and *LDR2* were able to process the transactions, meaning:
*LDR1* has processed *EXT_SWAP_ID1* with version 3
*LDR2* has processed *EXT_BOND_ID1* with version 4

However, when they sent the message acknowledgements to the broker, the broker was not able to receive them due to the network interruption (failover switch). Once the broker is back online it sends the two messages to its consumers again. To handle message duplication, all our consumer listeners use an LRU (least recently used) cache of the already processed messages, so if the same message is received again it is skipped. Therefore:
if *LDR1* receives *EXT_SWAP_ID1* with version 3 again, it will skip it.
if *LDR2* receives *EXT_BOND_ID1* with version 4 again, it will skip it.

However, the problem is that after the failover switch:
*LDR1* received *EXT_BOND_ID1* with version 4
*LDR2* received *EXT_SWAP_ID1* with version 3

These messages are considered new by the loaders because they are not in their LRU caches, and hence they will try to process the transactions. This leads to the same transaction being imported into the database and causing issues from a financial point of view. The re-import of these transactions might also fail, and in some cases it will cause both *LDR1* and *LDR2* to stop processing.

Is there any setup to circumvent this?
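The duplicate scenario above can be reproduced with a small stand-alone model of a loader that keeps a local, bounded LRU cache of processed messages. This is a sketch under assumptions, not the reporter's code: `LruDedupLoader`, its `onMessage` method, and the `groupId#version` cache key are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy loader with a local LRU cache of processed message keys (hypothetical names). */
final class LruDedupLoader {
    private final Map<String, Boolean> seen;

    LruDedupLoader(int capacity) {
        // accessOrder=true plus removeEldestEntry turns LinkedHashMap into an LRU cache.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true if the message is processed, false if it is skipped as a known duplicate. */
    boolean onMessage(String groupId, int version) {
        String key = groupId + "#" + version;
        if (seen.get(key) != null) { // get() also refreshes the entry's recency
            return false;
        }
        seen.put(key, Boolean.TRUE);
        return true;
    }

    public static void main(String[] args) {
        LruDedupLoader ldr1 = new LruDedupLoader(100);
        LruDedupLoader ldr2 = new LruDedupLoader(100);

        // Before the acknowledgements are lost: each loader processes its own group.
        System.out.println(ldr1.onMessage("EXT_SWAP_ID1", 3)); // processed by LDR1
        System.out.println(ldr2.onMessage("EXT_BOND_ID1", 4)); // processed by LDR2

        // Redelivery to the same loader is caught by its local cache ...
        System.out.println(ldr1.onMessage("EXT_SWAP_ID1", 3)); // skipped
        // ... but redelivery to the other loader is not, because the caches are not shared.
        System.out.println(ldr1.onMessage("EXT_BOND_ID1", 4)); // processed a second time
    }
}
```

The cache only protects against redelivery to the loader that already holds the entry, so it stops helping as soon as a group is handed to a different loader or an entry is evicted.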
[jira] [Commented] (ARTEMIS-4276) Message Group does not replicate properly during failover
[ https://issues.apache.org/jira/browse/ARTEMIS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723532#comment-17723532 ]

Justin Bertram commented on ARTEMIS-4276:
-----------------------------------------

bq. We are using virtual topics for that.

Now that you're on ActiveMQ Artemis you can use JMS 2's [shared topic consumer|https://docs.oracle.com/javaee/7/api/javax/jms/Session.html#createSharedConsumer-javax.jms.Topic-java.lang.String-].

bq. By using grouping we ensure that the same consumer will process all versions of the same transaction.

As noted previously, grouping *doesn't* ensure that the *same* consumer will process all the messages in the group. It only guarantees that _one consumer at a time_ will process the messages in the group and therefore the messages will be processed in order.

bq. To handle a message duplication all our consumer's listeners are using a LRU (least recently used) cache of the already processed messages.

A local, volatile LRU cache is not enough to mitigate duplicate messages. Keep in mind that even _if_ the broker maintained the consumer-group relationship during broker failover, the consumer itself can still fail at any point (e.g. JVM crash, hardware failure, network glitch, etc.). At that time a new consumer will be chosen for the group, which may lead to processing duplicate messages since the _new_ consumer won't have the already-processed messages in its LRU cache. In short, guaranteeing that the same consumer gets the same group on broker failover does not adequately deal with the threat of duplicate messages.

Generally speaking, distributing state like this (i.e. in the consumer's LRU cache) is not a good idea because it typically leads to consistency issues. State should be concentrated in the non-distributed components (i.e. message broker & database).

bq. Is the grouping cached used by the broker distributed or persisted during the failover switch?

No. The consumer-group relationship is not designed to survive fail-over for the reasons I outlined previously.

bq. Is there any setup to circumvent this?

Yes. Simply put, your consumers need to be [_idempotent_|https://en.wikipedia.org/wiki/Idempotence]. In your situation I can think of a few ways to do this.

Often when folks need to keep data in sync between two resources like a message broker and a database they use an [XA transaction|https://en.wikipedia.org/wiki/X/Open_XA]. In Java this is implemented via [JTA|https://github.com/jakartaee/transactions]. This is very common in Java, especially when an application is running in a Java EE application server, because MDBs are transactional by default and any other XA resource used in the course of processing a JMS message in an MDB is automatically enlisted into the transaction, meaning that all the work is done _atomically_ (i.e. either it all succeeds or it all fails). By using a JTA transaction between the JMS and JDBC resources you ensure that if the JDBC insert succeeds but the JMS message acknowledgement fails then everything will be rolled back, so that neither is the JMS message consumed nor is the data actually inserted into the JDBC database. When the message is consumed again later there will be no duplicate entries in the database.

Another way to deal with this would be to set up a primary key on the table (or tables) where you're inserting data. This would prevent duplicate records from being inserted into the database when consumers receive duplicate messages. The primary key could be a combination of the {{JMSXGroupID}} and the version (e.g. {{EXT_BOND_ID1_4}}). Therefore, in the scenario you outlined in your comment, when *LDR1* receives *EXT_BOND_ID1* with version *4* it will process it, but when it tries to insert it into the database the insert will be rejected as a primary key violation instead of creating a duplicate record.
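The primary-key approach in the comment above can be sketched without a real database by letting an atomic map stand in for a table whose primary key is the combination of {{JMSXGroupID}} and version. This is a minimal illustration under assumptions: `IdempotentStoreSketch`, `insert`, and the `groupId_version` key layout are hypothetical, and a real system would rely on the database's own constraint.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for a table keyed by (JMSXGroupID, version); hypothetical names. */
final class IdempotentStoreSketch {
    enum Result { INSERTED, DUPLICATE }

    // The de-duplication state lives in one shared place (the "database"),
    // not in each consumer's local cache.
    private final Map<String, String> table = new ConcurrentHashMap<>();

    /** Inserts the row unless its key already exists; never overwrites an existing row. */
    Result insert(String groupId, int version, String payload) {
        String primaryKey = groupId + "_" + version; // e.g. EXT_BOND_ID1_4
        // putIfAbsent is atomic, so exactly one caller wins per key, like a unique constraint.
        String previous = table.putIfAbsent(primaryKey, payload);
        return previous == null ? Result.INSERTED : Result.DUPLICATE;
    }

    int rowCount() {
        return table.size();
    }

    public static void main(String[] args) {
        IdempotentStoreSketch db = new IdempotentStoreSketch();
        System.out.println(db.insert("EXT_BOND_ID1", 4, "imported by LDR2")); // INSERTED
        System.out.println(db.insert("EXT_BOND_ID1", 4, "imported by LDR1")); // DUPLICATE
        System.out.println(db.rowCount());                                    // 1
    }
}
```

Because the check and the write are a single atomic step on shared state, it does not matter which loader receives the redelivered message.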
[jira] [Commented] (ARTEMIS-4276) Message Group does not replicate properly during failover
[ https://issues.apache.org/jira/browse/ARTEMIS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723810#comment-17723810 ]

Liviu Citu commented on ARTEMIS-4276:
-------------------------------------

*Virtual Topics vs Shared Topic Consumers*

Our plan during the migration from *Classic ActiveMQ* to *Artemis* is to modify the source code as little as possible to reduce the regression impact. Our software is C++-based and we are using the *CMS API* (*ActiveMQ CPP*) as a client. I am unable to find a CMS API to create a shared topic consumer, so I am not sure if it exists. At the same time, I am not sure that the behavior of load balancing with a shared subscription is what we want in our Gateway Loader Servers. We do not want to process the same message in more than one group (please correct me if I am wrong):

http://jmesnil.net/weblog/2013/06/27/jms-20-shared-subscription/

Nonetheless, we were using virtual topics with Classic ActiveMQ and they work as expected with Artemis too (the setup changes are trivial).

*Idempotent consumer using local, volatile LRU cache*

*ActiveMQ CPP* does not support idempotent consumers. Nonetheless, in our software we have a wrapper over the CMS consumer and a wrapper over the CMS consumer listener. The LRU cache is part of our listener. Indeed, the *CMS consumer* gets restored during failover, but the object is not recreated, so our wrapper is still valid and the cache still stands in this context. This might not be the best option for handling duplicated messages, but when there is no load balancing it works fine. The problem arises when more than one consumer is involved for the same topic.

*XA transaction*

The synchronization problem between the database and the JMS broker is not necessarily related to failover or Artemis usage. We have this with Classic ActiveMQ as well (for instance if there is a network glitch, or when *ActiveMQ* goes down after the message has reached the database). We explored the usage of XA transactions, but the code changes needed to implement them in existing software are huge and practically impossible. However, at the database level we have protection with primary keys, and indeed the same transaction cannot be processed twice.

As I explained in the description of this issue, we also have a Gateway Fail Queue Monitor where users can find all messages that failed during processing (including duplicates that failed during insertion). We just wanted to explore whether there is a way of removing these "fake" failures caused by failover, or somehow of distinguishing them from real business failures. These are technical failures (caused by failover in this case), and users looking at the Fail Queue Monitor might get confused when seeing such duplicated messages without understanding what went wrong. I suppose they will have to deal with this as a system limitation.
[jira] [Commented] (ARTEMIS-4276) Message Group does not replicate properly during failover
[ https://issues.apache.org/jira/browse/ARTEMIS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723979#comment-17723979 ]

Justin Bertram commented on ARTEMIS-4276:
-----------------------------------------

I think you've misunderstood much of what I wrote. Here are some additional comments and clarifications...

bq. Our plan during migration from Classic ActiveMQ to Artemis is to modify as little as possible the source code to reduce the regression impact.

Fair enough.

bq. Our software is C++ code based and we are using CMS API (ActiveMQ CPP) as a client.

The CMS API was originally based on JMS 1.1 and I don't believe it has been updated since JMS 2 was released 10 years ago now. Therefore I wouldn't expect it to have the methods for creating a shared subscription.

bq. We do not want to process the same message in more than one group (please correct me if I am wrong)...

The whole point of sharing a subscription between multiple consumers is to ensure that the same message is not processed more than once. I recommended the move to JMS 2 shared subscriptions assuming you were using a JMS client. This would make your code more portable and easier to understand. However, since you're using CMS that's obviously out of the question.

bq. ActiveMQ CPP does not have idempotent consumers.

Idempotency is something you, as the application developer, must implement. It is not something inherent to the client implementation which you use to communicate with the broker (i.e. ActiveMQ CPP).

bq. Indeed the CMS consumer gets restored during failover but the object is not recreated so our wrapper is still valid and the cache still stands in this context.

The scenario where the primary broker fails and the client switches to the backup broker (i.e. "failover") is _not_ what I was describing. The problem I was trying to describe is what happens when some kind of failure renders the cache invalid. This could happen for any number of reasons, some of which I outlined in my previous comment. This is a weakness in the application design which will lead to the same problems with duplicate messages as you have when a broker failure causes the consumer-group relationship to change.

bq. The synchronization problem between database and JMS Broker is not necessary related to failover or Artemis usage.

Yes, of course. This is a general problem in computing, which is why XA transactions were invented in the first place. Their use is certainly not restricted to databases and message brokers or even to Java. They are used across the industry with many different kinds of resources in many different programming languages. Typically the need for consistency between resources is identified before implementation and is part of the fundamental application design. XA is not simple and care is needed when fitting all the pieces together.

bq. At the database level we have a protection with primary keys and indeed the same transaction cannot be inserted twice.

This seems to flatly contradict what you said in your previous comment, "This leads to the same transaction being imported in the database twice..." Please clarify.

bq. We just wanted to explore the possibility to have a way of removing these "fake" failures caused by failover or somehow to distinguish them from those which are real business failures.

The "fake" failures are the result of your application design (i.e. the consumers are not idempotent). To be clear, even _if_ the broker maintained the consumer-group relationship during failover you'd still have the risk of these kinds of "fake" failures in other scenarios.

That said, the client knows when a failover has occurred, so it knows that, at least for a little while, there is a fair chance of duplicate messages and therefore primary key violations on the database. It could either add this context to the failure notification to help whoever reads it, or it could simply ignore the primary key violations for a time.
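The last suggestion above (use the client's knowledge of a recent failover to label or ignore primary key violations) might look roughly like the following. Everything here is an assumption made for illustration: the class `FailoverAwareFailureClassifier`, the `onFailover` hook, and the idea of a fixed grace window are hypothetical, and how the client library actually reports a failover is not shown.

```java
import java.time.Duration;
import java.time.Instant;

/** Labels primary key violations that occur shortly after a failover (hypothetical sketch). */
final class FailoverAwareFailureClassifier {
    enum Kind { BUSINESS_FAILURE, PROBABLE_FAILOVER_DUPLICATE }

    private final Duration graceWindow; // how long after a failover duplicates are expected (assumed)
    private Instant lastFailover;       // null until the first failover is reported

    FailoverAwareFailureClassifier(Duration graceWindow) {
        this.graceWindow = graceWindow;
    }

    /** To be called from whatever failover notification the client library exposes. */
    void onFailover(Instant when) {
        lastFailover = when;
    }

    /** Classifies a processing failure so a fail-queue monitor can show or hide it. */
    Kind classify(boolean primaryKeyViolation, Instant failureTime) {
        boolean insideWindow = lastFailover != null
                && !failureTime.isBefore(lastFailover)
                && Duration.between(lastFailover, failureTime).compareTo(graceWindow) <= 0;
        return primaryKeyViolation && insideWindow
                ? Kind.PROBABLE_FAILOVER_DUPLICATE
                : Kind.BUSINESS_FAILURE;
    }

    public static void main(String[] args) {
        FailoverAwareFailureClassifier classifier =
                new FailoverAwareFailureClassifier(Duration.ofMinutes(5)); // window length is arbitrary
        Instant failover = Instant.parse("2023-05-18T10:00:00Z");
        classifier.onFailover(failover);

        System.out.println(classifier.classify(true, failover.plusSeconds(60)));  // inside the window
        System.out.println(classifier.classify(true, failover.plusSeconds(600))); // outside the window
        System.out.println(classifier.classify(false, failover.plusSeconds(60))); // not a key violation
    }
}
```

A time window is only a heuristic: a genuine business failure that happens to be a key violation inside the window would be mislabelled, so annotating the failure is safer than silently dropping it.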
[jira] [Commented] (ARTEMIS-4276) Message Group does not replicate properly during failover
[ https://issues.apache.org/jira/browse/ARTEMIS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723982#comment-17723982 ]

Liviu Citu commented on ARTEMIS-4276:
-------------------------------------

Actually I think I understood what you meant :)

Regarding:
> This leads to the same transaction being imported in the database twice...

What I meant is that *_it will try to import_* the record into the database. Of course, in our database I/O meta layer we have a mechanism in place to avoid the same transaction being imported twice (the database records have an audit trail which includes the transaction version). This is needed because the same database tables can also be affected by other applications that are part of our software (UI, batch utilities, etc.), so it is not only the gateway interface that imports data into the system. I just wanted to pinpoint a potential issue that could arise in applications.

Regarding:
> Idempotency is something you, as the application developer, must implement.

There are some third parties that have this out of the box. For instance, Kafka has idempotent consumers.
[jira] [Commented] (ARTEMIS-4276) Message Group does not replicate properly during failover
[ https://issues.apache.org/jira/browse/ARTEMIS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723995#comment-17723995 ]

Justin Bertram commented on ARTEMIS-4276:
-----------------------------------------

bq. I am not seeing it as a weakness rather than an incomplete solution...

Fair enough. I'm not looking to debate semantics here. :)

bq. I think it is still better to have a local cache than nothing.

Assuming the cache was simple to implement and doesn't incur a meaningful runtime cost (e.g. in CPU or memory) then I would agree. It is better than nothing.

bq. There are some third parties that have this out-of-the box. For instance, I have seen Kafka having idempotent consumers.

The idempotency that Kafka may provide is not what I'm talking about in this context. There are definitely measures that client libraries can take to help make consuming and producing messages idempotent. However, those measures only apply to the actual _messaging_ operations. Once you add another kind of resource like a database or even another message broker there's nothing that the client library can do to make the consumer idempotent _overall_. As noted, the application developer must implement this kind of idempotency. Technologies like XA were invented to deal with this kind of use-case. It's worth noting that Kafka does not, in fact, support XA.