[jira] [Commented] (ARTEMIS-2941) Improve JDBC HA connection resiliency
[ https://issues.apache.org/jira/browse/ARTEMIS-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223760#comment-17223760 ]

ASF subversion and git services commented on ARTEMIS-2941:
----------------------------------------------------------

Commit e4a2a20c228c5c46838494b5325274ff79ff0145 in activemq-artemis's branch refs/heads/master from franz1981
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=e4a2a20 ]

ARTEMIS-2941 Fixing query timeout value

> Improve JDBC HA connection resiliency
> -------------------------------------
>
>                 Key: ARTEMIS-2941
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2941
>             Project: ActiveMQ Artemis
>          Issue Type: Improvement
>          Components: Broker
>    Affects Versions: 2.15.0
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Major
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> This aims to replace the restart enhancement feature of
> https://issues.apache.org/jira/browse/ARTEMIS-2918, because the latter is
> too dangerous: allowing a server in production to restart while keeping the
> Java process around exposes it to numerous potential leaks.
> Currently, JDBC HA uses an expiration time on locks that marks the time until
> which a server instance is allowed to keep a specific role, depending on the
> lock it owns (live or backup).
> Right now, the first failed attempt to renew this expiration time forces a
> broker to shut down immediately, while it could be more "relaxed" and just
> keep retrying until the very end, i.e. when the expiration time is about to
> elapse.
>
> The only concern with this feature is the relation between the broker
> wall-clock time and the DBMS one, which is used to set the expiration time:
> the two should agree within certain margins.
> For this last part, I'm aware that classic ActiveMQ lease locks use some
> configuration parameter to set the magnitude of the allowed difference (and
> to compute some base offset too).
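A minimal sketch of the "relaxed" renewal described above, assuming a single renew attempt can be modelled as a boolean call that may also throw. The names LeaseRenewalSketch, renewUntilExpiration and safetyMarginMillis are hypothetical and are not taken from the Artemis code base; the safety margin only stands in for the allowed difference between the broker clock and the DBMS clock, and how that difference is really configured is not stated here.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Illustrative only: retry a lease renewal until the local expiration time approaches. */
public final class LeaseRenewalSketch {

    /** Result of one renewal round. */
    public enum Outcome { RENEWED, EXPIRED }

    private LeaseRenewalSketch() {
    }

    /**
     * Retries tryRenew until it succeeds or the local clock reaches
     * (expirationMillis - safetyMarginMillis).
     *
     * @param tryRenew           one renew attempt (hypothetical stand-in for the JDBC update)
     * @param clockMillis        local wall-clock source, injectable so it can be faked in tests
     * @param expirationMillis   local time at which the currently held lease expires
     * @param safetyMarginMillis assumed tolerance for broker/DBMS clock difference
     * @param retryPauseMillis   pause between two failed attempts
     */
    public static Outcome renewUntilExpiration(BooleanSupplier tryRenew,
                                               LongSupplier clockMillis,
                                               long expirationMillis,
                                               long safetyMarginMillis,
                                               long retryPauseMillis) {
        // Stop early by the safety margin: past this point the lease can no longer be trusted.
        final long deadline = expirationMillis - safetyMarginMillis;
        while (clockMillis.getAsLong() < deadline) {
            boolean renewed;
            try {
                renewed = tryRenew.getAsBoolean();
            } catch (RuntimeException e) {
                // A failed attempt (e.g. lost connectivity) is not fatal while time remains.
                renewed = false;
            }
            if (renewed) {
                return Outcome.RENEWED;
            }
            try {
                Thread.sleep(retryPauseMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and conservatively give up the lease.
                Thread.currentThread().interrupt();
                return Outcome.EXPIRED;
            }
        }
        // The deadline was reached without a successful renewal: the role must be given up.
        return Outcome.EXPIRED;
    }
}
```

With this shape, a single failed attempt no longer decides the outcome: the caller only has to act (for example by shutting the broker down) when the method returns EXPIRED.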
>
> Right now this feature seems more risk-free and appealing than
> https://issues.apache.org/jira/browse/ARTEMIS-2918, given that it narrows
> the scope to the very core issue, i.e. more resilient behaviour on lost JDBC
> connectivity.
>
> To understand the implications of such a change, consider a shared store HA
> pair configured with a 60-second expiration time:
> # DBMS goes down
> # an in-flight persistent operation on the live data store causes the live
> broker to kill itself immediately, because no reliable storage is connected
> # backup is unable to renew its backup lease lock
> # DBMS comes back up in time, before the backup lock's local expiration time
> has elapsed
> # backup is able to renew its backup lease lock and retrieve the very last
> state of live (that was live) and, if no script is configured to restart the
> live, to fail over and take its role
> # backup is now live and able to serve clients
>
>
> There are 2 legitimate questions about potential improvements on this:
> # why can't the live keep retrying I/O (on the journal, paging or large
> messages) until its local expiration time ends?
> # why doesn't the live just return an I/O error to the clients?
>
> The former is complex: the main problem I see is from the resource
> utilization point of view; keeping an accumulating backlog of pending
> requests, blocked awaiting the last one for an arbitrarily long time, will
> probably cause the broker memory to blow up, not to mention that clients
> will time out too.
> The latter seems more appealing, because it would allow clients to fail
> fast, but it would affect the current semantics we use on the broker storage
> operations and I need more investigation to understand how to implement it.
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARTEMIS-2941) Improve JDBC HA connection resiliency
[ https://issues.apache.org/jira/browse/ARTEMIS-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222125#comment-17222125 ]

ASF subversion and git services commented on ARTEMIS-2941:
----------------------------------------------------------

Commit 647151b0aff8f1245735bfbc6e8d22d1cdee0afb in activemq-artemis's branch refs/heads/master from gtully
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=647151b ]

ARTEMIS-2941 - renew tasks are nearly always a little late, make this test more tolerant of that

> Improve JDBC HA connection resiliency
> -------------------------------------
>
>                 Key: ARTEMIS-2941
>          Time Spent: 2h
[jira] [Commented] (ARTEMIS-2941) Improve JDBC HA connection resiliency
[ https://issues.apache.org/jira/browse/ARTEMIS-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217946#comment-17217946 ]

ASF subversion and git services commented on ARTEMIS-2941:
----------------------------------------------------------

Commit 4545749969a15052ba52edd4700f29421a8e746d in activemq-artemis's branch refs/heads/master from franz1981
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=4545749 ]

ARTEMIS-2941 Improve JDBC HA connection resiliency

> Improve JDBC HA connection resiliency
> -------------------------------------
>
>                 Key: ARTEMIS-2941
>          Time Spent: 1h 50m