[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849708#comment-15849708 ]

Julian Reschke commented on OAK-5446:
-------------------------------------

No, it hasn't...

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase, OAK-5446.testcase.v3
>
>
> {color:red}
> moved over to OAK-5528 due to internal Jira issues, please do not delete this
> ticket while the problem is being investigated
> {color}
> Fighting with cluster nodes losing their lease and shutting down oak-core in
> a cloud environment. For reasons unknown at this point in time, the whole
> process seems to skip about two minutes of real time.
> This is a situation from which oak currently does not recover. Code analysis
> shows that {{ClusterNodeInfo}} is handed the
> {{LeaseCheckDocumentStoreWrapper}} instance to use as store. This is fatal
> since any action the {{renewLease()}} tries to do will first invoke the
> {{performLeaseCheck()}}. The lease check will, when the {{FailureMargin}} is
> reached, _stall the renewLease() thread_ for 5 retry attempts and then
> declare the lease to be lost.
> The {{ClusterNodeInfo}} should instead be using the "real" {{DocumentStore}},
> not the wrapped one, IMO.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
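The failure mode described in the issue can be illustrated with a minimal Java sketch. It is only an illustration under stated assumptions: `Store`, `PlainStore`, `LeaseCheckingStore` and the `renewLease` helper are hypothetical stand-ins, not Oak's `DocumentStore`, `LeaseCheckDocumentStoreWrapper` or `ClusterNodeInfo`, and the check here fails at once instead of stalling for 5 retry attempts. It shows why a renewal routed through a lease-checking wrapper cannot recover after a time jump, while a renewal through the underlying store can.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch; none of these types are Oak's real classes. */
class LeaseCheckSketch {

    /** Minimal stand-in for a store holding one cluster node's lease end time. */
    interface Store {
        long readLeaseEnd();
        void writeLeaseEnd(long leaseEnd);
    }

    /** The "real" store: plain reads and writes, no lease check. */
    static class PlainStore implements Store {
        private long leaseEnd;
        public synchronized long readLeaseEnd() { return leaseEnd; }
        public synchronized void writeLeaseEnd(long leaseEnd) { this.leaseEnd = leaseEnd; }
    }

    /** Wrapper that refuses every call once the lease has expired. */
    static class LeaseCheckingStore implements Store {
        private final Store delegate;
        private final LongSupplier clock;

        LeaseCheckingStore(Store delegate, LongSupplier clock) {
            this.delegate = delegate;
            this.clock = clock;
        }

        private void checkLease() {
            // Simplification (assumption): fail immediately rather than
            // sleeping through a number of retries first.
            if (clock.getAsLong() >= delegate.readLeaseEnd()) {
                throw new IllegalStateException("lease expired");
            }
        }

        public long readLeaseEnd() { checkLease(); return delegate.readLeaseEnd(); }
        public void writeLeaseEnd(long leaseEnd) { checkLease(); delegate.writeLeaseEnd(leaseEnd); }
    }

    /** Renews the lease through whichever store it was handed. */
    static boolean renewLease(Store store, long now, long leaseMillis) {
        try {
            store.writeLeaseEnd(now + leaseMillis);
            return true;
        } catch (IllegalStateException e) {
            return false; // the renewal itself was rejected by the lease check
        }
    }

    public static void main(String[] args) {
        long[] now = {0L};                      // hand-rolled test clock
        PlainStore real = new PlainStore();
        Store wrapped = new LeaseCheckingStore(real, () -> now[0]);

        real.writeLeaseEnd(120_000L);           // lease valid until t = 120 s
        now[0] = 130_000L;                      // the process "skips" past the lease end

        System.out.println("via wrapper: " + renewLease(wrapped, now[0], 120_000L));  // false
        System.out.println("via real store: " + renewLease(real, now[0], 120_000L));  // true
    }
}
```

In this sketch the only difference between the two renewal attempts is which store instance the renewal code is handed, which is the point the description makes about using the "real" store.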
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849705#comment-15849705 ]

Gavin commented on OAK-5446:
----------------------------

[~reschke] ok can you check if this ticket has its normal buttons back?

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase, OAK-5446.testcase.v3
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840020#comment-15840020 ]

Julian Reschke commented on OAK-5446:
-------------------------------------

Note I had to clone the issue to work around a JIRA problem.

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4
>             Fix For: 1.6
>
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase, OAK-5446.testcase.v3
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840006#comment-15840006 ]

Julian Reschke commented on OAK-5446:
-------------------------------------

See [r1780424|http://svn.apache.org/r1780424] - for some reason I currently cannot resolve this ticket though.

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4
>             Fix For: 1.6
>
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase, OAK-5446.testcase.v3
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839904#comment-15839904 ]

Stefan Egli commented on OAK-5446:
----------------------------------

oops, yes, that would not have helped then..

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4, candidate_oak_1_6
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase, OAK-5446.testcase.v3
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839886#comment-15839886 ]

Julian Reschke commented on OAK-5446:
-------------------------------------

...guess what: in the proposed patch I *copied* (instead of *moved*) the cluster update check... :-)

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4, candidate_oak_1_6
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase, OAK-5446.testcase.v3
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838142#comment-15838142 ]

Stefan Egli commented on OAK-5446:
----------------------------------

just one comment: maybe we should have two flavours of the test: one with the delay and one without - as both cases seem useful.

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4, candidate_oak_1_6
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838143#comment-15838143 ]

Julian Reschke commented on OAK-5446:
-------------------------------------

While the test currently proves that there is a problem, I'm not totally sure that the Thread.sleep is correct here -- shouldn't I use the VirtualClock?

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4, candidate_oak_1_6
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase
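The trade-off raised in the comment above (sleeping in real time versus advancing a virtual clock) can be sketched as follows. This is a hedged illustration only: `ManualClock` and `Lease` are hypothetical names invented for the sketch and are not the clock class or lease code used by Oak's tests. The idea is that when the code under test reads time only from an injected clock, a test can move time past the lease end instantly instead of blocking the thread.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; not Oak's own clock or lease implementation. */
class VirtualClockSketch {

    /** Manually advanced clock: time only moves when the test says so. */
    static class ManualClock {
        private final AtomicLong millis = new AtomicLong();

        long now() { return millis.get(); }

        /** Moves time forward instantly instead of blocking the thread. */
        void advance(long deltaMillis) { millis.addAndGet(deltaMillis); }
    }

    /** Lease holder that consults only the injected clock. */
    static class Lease {
        private final ManualClock clock;
        private final long durationMillis;
        private long end;

        Lease(ManualClock clock, long durationMillis) {
            this.clock = clock;
            this.durationMillis = durationMillis;
            renew();
        }

        void renew() { end = clock.now() + durationMillis; }

        boolean isExpired() { return clock.now() >= end; }
    }

    public static void main(String[] args) {
        ManualClock clock = new ManualClock();
        Lease lease = new Lease(clock, 120_000L);
        System.out.println("expired at start: " + lease.isExpired());     // false

        clock.advance(121_000L);   // simulate a two-minute jump without sleeping
        System.out.println("expired after jump: " + lease.isExpired());   // true

        lease.renew();
        System.out.println("expired after renew: " + lease.isExpired());  // false
    }
}
```

A test written this way runs in milliseconds and is deterministic, whereas a `Thread.sleep`-based test depends on wall-clock time and scheduler behaviour.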
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838126#comment-15838126 ]

Stefan Egli commented on OAK-5446:
----------------------------------

bq. Modified test that delays the read from clusterNodes and indeed reproduces the issue.
even better :) - can I leave the patch with you, [~reschke]?

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4, candidate_oak_1_6
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838115#comment-15838115 ]

Stefan Egli commented on OAK-5446:
----------------------------------

ack, I'll add the test case to OAK-3399 (in trunk and 1.4 branch) and will look into how we can simulate a 'VM freeze' during lease update..

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4, candidate_oak_1_6
>         Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838066#comment-15838066 ]

Julian Reschke commented on OAK-5446:
-------------------------------------

Verified that the test tests the right thing by setting MAX_RETRY_SLEEPS_BEFORE_LEASE_FAILURE = 0 in ClusterNodeInfo (in which case the test fails).
Proposal: add the test as part of OAK-3399 (after removing trailing ws :-).
We still don't have a test that simulates the issue described in *this* ticket, though.

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4, candidate_oak_1_6
>         Attachments: OAK-5446.diff, OAK-5446.testcase
[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck
[ https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837629#comment-15837629 ]

Stefan Egli commented on OAK-5446:
----------------------------------

bq. we are ok with this to be done before we branch 1.6
it is a change in a quite central part, so the question is if this is indeed a blocker for 1.6 or if it can't wait... We'd have to thoroughly test the fix.

> leaseUpdateThread might be blocked by leaseUpdateCheck
> ------------------------------------------------------
>
>                 Key: OAK-5446
>                 URL: https://issues.apache.org/jira/browse/OAK-5446
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.4, 1.5.14
>            Reporter: Stefan Eissing
>            Assignee: Julian Reschke
>              Labels: candidate_oak_1_4, candidate_oak_1_6
>         Attachments: OAK-5446.diff