[ https://issues.apache.org/jira/browse/OAK-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julian Reschke updated OAK-3424:
--------------------------------
    Affects Version/s: 1.2.9
                       1.0.25
                       1.3.12

> ClusterNodeInfo does not pick an existing entry on startup
> ----------------------------------------------------------
>
>                 Key: OAK-3424
>                 URL: https://issues.apache.org/jira/browse/OAK-3424
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core, mongomk, rdbmk
>    Affects Versions: 1.2.9, 1.0.25, 1.3.12
>            Reporter: Julian Reschke
>
> When the {{DocumentNodeStore}} starts up, it attempts to find an existing 
> entry that matches the current instance (identified by information derived 
> from the network interface address and the current working directory).
> However, there is an additional check: if the entry's lease end time hasn't 
> been reached, the entry is skipped (on the assumption that it belongs to a 
> different instance), and the scan continues. When no other entry is found, a 
> new one is created.
> So why would we *ever* consider instances with matching instance information 
> to be different? As far as I can tell the answer is: for unit testing.
> But...
> With the current assignment very weird things can happen, and I believe I see 
> exactly this happening in a customer problem I'm investigating. The sequence 
> is:
> 1) First system startup, cluster node id 1 is assigned
> 2) System crashes or is killed
> 3) System restarts within the lease time (120s?), a new cluster node id is 
> assigned
> 4) System shuts down, and gets restarted after a longer interval: cluster id 
> 1 is used again, and system starts {{MissingLastRevRecovery}}, despite the 
> previous shutdown having been clean
> So what we see is that the system starts up with varying cluster node ids, 
> and recovery processes may run with no correlation to what happened before.
> Proposal:
> a) Make {{ClusterNodeInfo.createInstance()}} much more verbose, so that the 
> default system log contains sufficient information to understand why a 
> certain cluster node id was picked.
> b) Drop the logic that skips entries with non-expired leases, so that we get 
> a one-to-one relation between instance ids and cluster node ids. For the unit 
> tests that currently rely on this logic, switch to APIs where the test setup 
> picks the cluster node id.
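
The sequence 1) to 4) and the effect of proposal b) can be replayed with a small toy model. This is only a sketch and not Oak's actual {{ClusterNodeInfo}} code: the class {{ClusterIdSketch}}, its {{Entry}} type, the instance string, the timestamps and the 120s lease are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of cluster node id assignment; NOT Oak's ClusterNodeInfo. */
final class ClusterIdSketch {

    /** One row of a hypothetical cluster node table. */
    static final class Entry {
        final int id;
        final String instanceInfo;
        long leaseEnd;

        Entry(int id, String instanceInfo, long leaseEnd) {
            this.id = id;
            this.instanceInfo = instanceInfo;
            this.leaseEnd = leaseEnd;
        }
    }

    // Assumed lease duration; the report itself only says "120s?".
    static final long LEASE_MILLIS = 120_000;

    private final List<Entry> entries = new ArrayList<>();
    private final boolean skipActiveLeases;

    ClusterIdSketch(boolean skipActiveLeases) {
        this.skipActiveLeases = skipActiveLeases;
    }

    /** Returns the cluster node id picked for an instance starting at time 'now'. */
    int start(String instanceInfo, long now) {
        for (Entry e : entries) {
            if (!e.instanceInfo.equals(instanceInfo)) {
                continue; // belongs to some other instance
            }
            if (skipActiveLeases && e.leaseEnd > now) {
                continue; // lease not expired: treated as a different running instance
            }
            e.leaseEnd = now + LEASE_MILLIS;
            return e.id;
        }
        // No usable entry found: create a new one with the next free id.
        Entry created = new Entry(entries.size() + 1, instanceInfo, now + LEASE_MILLIS);
        entries.add(created);
        return created.id;
    }

    /** Clean shutdown: the lease of the given entry ends immediately. */
    void shutdown(int id, long now) {
        for (Entry e : entries) {
            if (e.id == id) {
                e.leaseEnd = now;
            }
        }
    }

    /** Replays steps 1) to 4) and returns the ids picked at the three startups. */
    static int[] replay(boolean skipActiveLeases) {
        ClusterIdSketch store = new ClusterIdSketch(skipActiveLeases);
        String me = "mac-address//working/dir"; // made-up instance information
        int first = store.start(me, 0);         // 1) first startup
        // 2) crash: no shutdown() call, so the first lease stays valid until t=120s
        int second = store.start(me, 30_000);   // 3) restart within the lease time
        store.shutdown(second, 60_000);         // 4) clean shutdown ...
        int third = store.start(me, 600_000);   //    ... and restart much later
        return new int[] {first, second, third};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(replay(true)));  // prints [1, 2, 1]
        System.out.println(Arrays.toString(replay(false))); // prints [1, 1, 1]
    }
}
```

With the skip logic enabled, the third startup falls back to id 1, whose entry was never shut down cleanly. That is the situation in which the report sees {{MissingLastRevRecovery}} run even though the most recent shutdown (of id 2) was clean. Without the skip logic, the same instance information always maps to the same id in this model.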



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
