[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-02-02 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849708#comment-15849708
 ] 

Julian Reschke commented on OAK-5446:
-

No, it hasn't...

> leaseUpdateThread might be blocked by leaseUpdateCheck
> --
>
> Key: OAK-5446
> URL: https://issues.apache.org/jira/browse/OAK-5446
> Project: Jackrabbit Oak
> Issue Type: Bug
> Affects Versions: 1.4, 1.5.14
> Reporter: Stefan Eissing
> Assignee: Julian Reschke
> Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase, OAK-5446.testcase.v3
>
>
> {color:red}
> moved over to OAK-5528 due to internal Jira issues, please do not delete this 
> ticket while the problem is being investigated
> {color}
> Fighting with cluster nodes losing their lease and shutting down oak-core in 
> a cloud environment. For reasons unknown at this point in time, the whole 
> process seems to skip about two minutes of real time.
> This is a situation from which oak currently does not recover. Code analysis 
> shows that {{ClusterNodeInfo}} is handed the 
> {{LeaseCheckDocumentStoreWrapper}} instance to use as store. This is fatal 
> since any action the {{renewLease()}} tries to do will first invoke the 
> {{performLeaseCheck()}}. The lease check will, when the {{FailureMargin}} is 
> reached, _stall the renewLease() thread_ for 5 retry attempts and then 
> declare the lease to be lost.
> The {{ClusterNodeInfo}} should instead be using the "real" {{DocumentStore}}, 
> not the wrapped one, IMO.
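
The failure mode described above can be sketched in isolation. The following is a minimal illustration and not Oak code: `Store`, `RealStore`, `LeaseCheckingStore` and `tryRenew` are hypothetical stand-ins, and the lease check is simplified to fail immediately instead of stalling for several retries. It only shows why a renewal routed through a lease-checking wrapper cannot recover once the lease has already expired, while the same renewal against the underlying store can.

```java
public class LeaseWrapperSketch {

    /** Hypothetical, minimal store interface (not Oak's DocumentStore). */
    interface Store {
        void update(String key, long value);
    }

    /** The "real" store: writes the new lease end without any checks. */
    static final class RealStore implements Store {
        long leaseEnd;

        RealStore(long leaseEnd) {
            this.leaseEnd = leaseEnd;
        }

        @Override
        public void update(String key, long value) {
            leaseEnd = value;
        }
    }

    /** Wrapper that checks the lease before every operation (simplified: no retries). */
    static final class LeaseCheckingStore implements Store {
        private final RealStore delegate;
        private final long now;

        LeaseCheckingStore(RealStore delegate, long now) {
            this.delegate = delegate;
            this.now = now;
        }

        @Override
        public void update(String key, long value) {
            if (now > delegate.leaseEnd) {
                throw new IllegalStateException("lease expired");
            }
            delegate.update(key, value);
        }
    }

    /** Tries to renew the lease through the given store; returns whether it succeeded. */
    static boolean tryRenew(Store store, long now, long leaseMillis) {
        try {
            store.update("leaseEnd", now + leaseMillis);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long now = 125_000; // the process resumed about two minutes "late"

        RealStore wrapped = new RealStore(5_000);
        boolean viaWrapper = tryRenew(new LeaseCheckingStore(wrapped, now), now, 120_000);

        RealStore direct = new RealStore(5_000);
        boolean viaReal = tryRenew(direct, now, 120_000);

        // prints "via wrapper: false, via real store: true"
        System.out.println("via wrapper: " + viaWrapper + ", via real store: " + viaReal);
    }
}
```

In this sketch the wrapped renewal is rejected by the very check it is trying to satisfy, which is the argument for handing the renewal code the unwrapped store.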



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-02-02 Thread Gavin (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849705#comment-15849705
 ] 

Gavin commented on OAK-5446:


[~reschke] ok can you check if this ticket has its normal buttons back?



[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-26 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840020#comment-15840020
 ] 

Julian Reschke commented on OAK-5446:
-

Note I had to clone the issue to work around a JIRA problem.

> leaseUpdateThread might be blocked by leaseUpdateCheck
> --
>
> Key: OAK-5446
> URL: https://issues.apache.org/jira/browse/OAK-5446
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: core
> Affects Versions: 1.4, 1.5.14
> Reporter: Stefan Eissing
> Assignee: Julian Reschke
> Labels: candidate_oak_1_4
> Fix For: 1.6
> Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase, OAK-5446.testcase.v3


[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-26 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840006#comment-15840006
 ] 

Julian Reschke commented on OAK-5446:
-

See [r1780424|http://svn.apache.org/r1780424] - for some reason I currently 
cannot resolve this ticket, though.



[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-26 Thread Stefan Egli (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839904#comment-15839904
 ] 

Stefan Egli commented on OAK-5446:
--

oops, yes, that would not have helped then...

> leaseUpdateThread might be blocked by leaseUpdateCheck
> --
>
> Key: OAK-5446
> URL: https://issues.apache.org/jira/browse/OAK-5446
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: core
> Affects Versions: 1.4, 1.5.14
> Reporter: Stefan Eissing
> Assignee: Julian Reschke
> Labels: candidate_oak_1_4, candidate_oak_1_6
> Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase, OAK-5446.testcase.v3


[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-26 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839886#comment-15839886
 ] 

Julian Reschke commented on OAK-5446:
-

...guess what: in the proposed patch I *copied* (instead of *moved*) the 
cluster update check... :-)



[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-25 Thread Stefan Egli (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838142#comment-15838142
 ] 

Stefan Egli commented on OAK-5446:
--

just one comment: maybe we should have two flavours of the test: one with the 
delay and one without - as both cases seem useful.

> leaseUpdateThread might be blocked by leaseUpdateCheck
> --
>
> Key: OAK-5446
> URL: https://issues.apache.org/jira/browse/OAK-5446
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: core
> Affects Versions: 1.4, 1.5.14
> Reporter: Stefan Eissing
> Assignee: Julian Reschke
> Labels: candidate_oak_1_4, candidate_oak_1_6
> Attachments: OAK-5446.diff, OAK-5446-jr.diff, OAK-5446.testcase


[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-25 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838143#comment-15838143
 ] 

Julian Reschke commented on OAK-5446:
-

While the test currently proves that there is a problem, I'm not totally sure 
that the Thread.sleep is correct here -- shouldn't I use the VirtualClock?
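
As an illustration of the trade-off raised here: a test that uses Thread.sleep has to wait for real time to pass, whereas a manually advanced clock lets the test "skip" time instantly and deterministically. The sketch below is a generic, hypothetical `ManualClock`, not Oak's own clock class, and `leaseExpired` is an assumed helper used only for the demonstration.

```java
import java.util.concurrent.atomic.AtomicLong;

public class VirtualClockSketch {

    /** Hypothetical manually advanced clock; time only moves when the test says so. */
    static final class ManualClock {
        private final AtomicLong millis = new AtomicLong();

        long now() {
            return millis.get();
        }

        void advance(long deltaMillis) {
            millis.addAndGet(deltaMillis);
        }
    }

    /** Assumed helper: a lease counts as expired once the clock has passed its end. */
    static boolean leaseExpired(ManualClock clock, long leaseEnd) {
        return clock.now() > leaseEnd;
    }

    public static void main(String[] args) {
        ManualClock clock = new ManualClock();
        long leaseEnd = clock.now() + 120_000;

        System.out.println(leaseExpired(clock, leaseEnd)); // prints "false"

        // Simulate the process "skipping" about two minutes without actually waiting.
        clock.advance(121_000);

        System.out.println(leaseExpired(clock, leaseEnd)); // prints "true"
    }
}
```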



[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-25 Thread Stefan Egli (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838126#comment-15838126
 ] 

Stefan Egli commented on OAK-5446:
--

bq. Modified test that delays the read from clusterNodes and indeed reproduces 
the issue.
even better :) - can I leave the patch with you, [~reschke]?



[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-25 Thread Stefan Egli (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838115#comment-15838115
 ] 

Stefan Egli commented on OAK-5446:
--

ack, I'll add the test case to OAK-3399 (in trunk and 1.4 branch) and will look 
into how we can simulate a 'VM freeze' during lease update..



[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-25 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838066#comment-15838066
 ] 

Julian Reschke commented on OAK-5446:
-

Verified that the test tests the right thing by setting 
MAX_RETRY_SLEEPS_BEFORE_LEASE_FAILURE = 0 in ClusterNodeInfo (in which case the 
test fails).
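
To illustrate why a retry budget of zero would make such a test fail, here is a rough sketch of a bounded retry loop. The semantics are an assumption, not the actual ClusterNodeInfo logic: `leaseCheck` and `simulate` are hypothetical names, and the loop simply gives the lease up to `maxRetrySleeps` sleeps to become valid again before reporting failure.

```java
import java.util.function.BooleanSupplier;

public class RetrySketch {

    /**
     * Assumed semantics: re-check the lease up to maxRetrySleeps times,
     * sleeping between attempts, then do one final check.
     */
    static boolean leaseCheck(BooleanSupplier leaseValid, int maxRetrySleeps, Runnable sleep) {
        for (int i = 0; i < maxRetrySleeps; i++) {
            if (leaseValid.getAsBoolean()) {
                return true;
            }
            sleep.run();
        }
        return leaseValid.getAsBoolean();
    }

    /** Simulates a lease that becomes valid again after the given number of sleeps. */
    static boolean simulate(int maxRetrySleeps, int sleepsUntilRenewed) {
        int[] sleeps = {0};
        return leaseCheck(() -> sleeps[0] >= sleepsUntilRenewed,
                maxRetrySleeps,
                () -> sleeps[0]++);
    }

    public static void main(String[] args) {
        System.out.println(simulate(5, 2)); // prints "true": renewal arrives within the retry budget
        System.out.println(simulate(0, 2)); // prints "false": no retries, the check fails at once
    }
}
```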

Proposal: add the test as part of OAK-3399 (after removing trailing ws :-).

We still don't have a test that simulates the issue described in *this* ticket, 
though.

> leaseUpdateThread might be blocked by leaseUpdateCheck
> --
>
> Key: OAK-5446
> URL: https://issues.apache.org/jira/browse/OAK-5446
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: core
> Affects Versions: 1.4, 1.5.14
> Reporter: Stefan Eissing
> Assignee: Julian Reschke
> Labels: candidate_oak_1_4, candidate_oak_1_6
> Attachments: OAK-5446.diff, OAK-5446.testcase


[jira] [Commented] (OAK-5446) leaseUpdateThread might be blocked by leaseUpdateCheck

2017-01-25 Thread Stefan Egli (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837629#comment-15837629
 ] 

Stefan Egli commented on OAK-5446:
--

bq. we are ok with this to be done before we branch 1.6
it is a change in quite a central part, so the question is whether this is indeed a 
blocker for 1.6 or whether it can wait... We'd have to thoroughly test the fix.

> leaseUpdateThread might be blocked by leaseUpdateCheck
> --
>
> Key: OAK-5446
> URL: https://issues.apache.org/jira/browse/OAK-5446
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: core
> Affects Versions: 1.4, 1.5.14
> Reporter: Stefan Eissing
> Assignee: Julian Reschke
> Labels: candidate_oak_1_4, candidate_oak_1_6
> Attachments: OAK-5446.diff