[jira] [Comment Edited] (SOLR-6465) CDCR: fall back to whole-index replication when tlogs are insufficient

Shalin Shekhar Mangar (JIRA) Mon, 04 Apr 2016 13:31:27 -0700

    [ https://issues.apache.org/jira/browse/SOLR-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224971#comment-15224971 ]


Shalin Shekhar Mangar edited comment on SOLR-6465 at 4/4/16 8:29 PM:
---------------------------------------------------------------------

This is the first cut for this feature.

The CdcrRequestHandler supports a new asynchronous command called BOOTSTRAP, 
which triggers a full index replication from a given master URL. A 
corresponding BOOTSTRAP_STATUS command reports whether a bootstrap operation 
is still running, has completed successfully, or has failed.
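
To illustrate the BOOTSTRAP / BOOTSTRAP_STATUS pairing, here is a minimal 
sketch of an asynchronous command with a pollable status. It is not the code 
from the patch: the class BootstrapTracker, its Status enum and both method 
names are hypothetical, and the replication work is passed in as a plain 
Runnable.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; not the CdcrRequestHandler implementation.
class BootstrapTracker {
    enum Status { NOT_STARTED, RUNNING, COMPLETED, FAILED }

    private final AtomicReference<Status> status =
            new AtomicReference<>(Status.NOT_STARTED);

    /**
     * Starts the given replication task in the background unless a
     * bootstrap is already running. Returns a future that completes once
     * the status has been updated, or null if the request was rejected.
     */
    CompletableFuture<Void> bootstrap(Runnable fullIndexReplication) {
        Status prev = status.get();
        if (prev == Status.RUNNING || !status.compareAndSet(prev, Status.RUNNING)) {
            return null; // another bootstrap is in flight
        }
        return CompletableFuture.runAsync(fullIndexReplication)
                .<Void>handle((ok, err) -> {
                    // Record the outcome so that a status poll can report it.
                    status.set(err == null ? Status.COMPLETED : Status.FAILED);
                    return null;
                });
    }

    /** Non-blocking status check, analogous to a BOOTSTRAP_STATUS call. */
    Status bootstrapStatus() {
        return status.get();
    }
}
```

A caller would invoke bootstrap(...) once and then poll bootstrapStatus() 
until it returns COMPLETED or FAILED.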

The "shardcheckpoint" command has been modified to return the max version 
across the index and update log using the same updateVersionToHighest logic 
used to initialize version buckets from tlog+index during startup/reload.
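
As a rough illustration of "max version across the index and update log", 
the sketch below takes the highest version from both sources. It is only a 
stand-in for the updateVersionToHighest logic mentioned above: the class and 
method names are hypothetical, the versions are passed in as plain arrays, 
and it assumes that deletes are recorded with negative version numbers, so 
absolute values are compared.

```java
import java.util.stream.LongStream;

// Hypothetical sketch; a stand-in for the real checkpoint computation.
class CheckpointSketch {
    /**
     * Returns the highest version seen in either the index or the update
     * log. Absolute values are compared on the assumption that deletes
     * carry negative version numbers.
     */
    static long maxVersion(long[] indexVersions, long[] tlogVersions) {
        return LongStream.concat(LongStream.of(indexVersions),
                                 LongStream.of(tlogVersions))
                .map(Math::abs)
                .max()
                .orElse(0L); // assumption: 0 stands for "nothing indexed yet"
    }
}
```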

The CdcrReplicatorManager calls collectioncheckpoint to read the max version 
indexed on the target. If it finds a gap in its own tlog relative to that 
version, it asks the target cluster to bootstrap itself from the current shard 
leader on the source. While the bootstrap runs, a flag set in 
CdcrReplicatorState keeps the CdcrReplicatorScheduler from sending any updates 
to the target cluster. Once the bootstrap is complete, collectioncheckpoint is 
called again and the returned version is used to open a regular tlog reader, 
from which the normal CDCR replication mechanism takes over.
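
The flow described above can be sketched roughly as follows. This is a 
simplified illustration, not the CdcrReplicatorManager code: the Target 
interface, the hasGap check and all names are hypothetical. The bootstrap 
call is shown as blocking for brevity, and the source tlog is modelled as a 
sorted set of versions that "has a gap" when its oldest retained version is 
newer than the target's checkpoint.

```java
import java.util.NavigableSet;

// Hypothetical sketch of the gap check and bootstrap hand-off.
class ReplicatorSketch {
    /** Assumed view of the target cluster; not a real Solr interface. */
    interface Target {
        long collectionCheckpoint();
        void bootstrapFrom(String sourceLeaderUrl);
    }

    // Plays the role of the flag kept in the replicator state.
    private volatile boolean bootstrapping;

    /** Returns the version from which the regular tlog reader should start. */
    long start(Target target, NavigableSet<Long> sourceTlogVersions,
               String sourceLeaderUrl) {
        long checkpoint = target.collectionCheckpoint();
        if (hasGap(sourceTlogVersions, checkpoint)) {
            bootstrapping = true; // scheduler must not send updates now
            try {
                target.bootstrapFrom(sourceLeaderUrl);
                // Re-read the checkpoint once the full index copy is done.
                checkpoint = target.collectionCheckpoint();
            } finally {
                bootstrapping = false;
            }
        }
        return checkpoint;
    }

    /** The scheduler would consult this before forwarding updates. */
    boolean maySendUpdates() {
        return !bootstrapping;
    }

    /** Simplified gap test: the tlog no longer reaches back to the checkpoint. */
    static boolean hasGap(NavigableSet<Long> tlogVersions, long checkpoint) {
        return tlogVersions.isEmpty() || tlogVersions.first() > checkpoint;
    }
}
```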

A new test called CdcrBootstrapTest is added for this feature. There is also 
some additional code in CdcrUpdateLog which allows one to convert an existing 
cluster with data into a CDCR source.

This patch contains plenty of nocommits and debug logging, which I will 
resolve or remove in the next patches. I also found a few bugs for which I'll 
open separate issues.

Open items/todos:
# Now that we can bootstrap target clusters using the index files, we have no 
need to keep update logs around for a long time. Therefore, we can get rid of 
CdcrUpdateLog itself and make CDCR work with regular UpdateLog.
# In the same vein, there is no need for replicating tlog files from leader to 
replicas on the source cluster, so "lastprocessedversion", CdcrLogSynchronizer 
and the tlog replication code can be purged.
# This patch currently stops regular CDCR updates from being sent to target 
leaders during bootstrap, but that is not necessary as we can buffer them and 
apply them after bootstrap completes.
# Hardening is required against the bootstrap process racing with recovery. 
Normally this won't happen because bootstrap only happens on target shard 
leaders but if/when the leadership changes, I suspect bootstrap can continue to 
run for a while and race with core recovery. I haven't been able to trigger 
this yet in a test case but I'll continue to work on it.
# In this patch, the bootstrap trigger thread is initiated on state change, but 
if it exits due to an unhandled condition, the replication state stays in 
bootstrapping mode forever and there is no corrective action. Although care has 
been taken to handle most failures, after implementing this I feel that it is 
unnecessarily fragile and that we are better off adding some logic to the 
scheduled replicator component than trying to do bootstrap only once on init.
# The existing CDCR tests which test aspects related to tlog replication do not 
pass currently. Once we yank that code, this would be a non-issue.
# Tests and more tests!
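
For the item above about moving bootstrap into the scheduled replicator 
component, one possible shape is a small state machine that is advanced on 
every scheduler run, so a failed bootstrap is retried instead of leaving the 
state stuck. This is only a sketch of the idea: the class, enums and the 
tick method are hypothetical, and the bootstrap trigger and status poll are 
passed in as callbacks.

```java
import java.util.function.Supplier;

// Hypothetical sketch; not part of the attached patch.
class ScheduledBootstrapSketch {
    enum State { NEEDS_BOOTSTRAP, BOOTSTRAPPING, REPLICATING }
    enum BootstrapStatus { RUNNING, COMPLETED, FAILED }

    private State state = State.NEEDS_BOOTSTRAP;

    /** Called on every scheduler run; returns the state after this run. */
    synchronized State tick(Runnable triggerBootstrap,
                            Supplier<BootstrapStatus> pollStatus) {
        switch (state) {
            case NEEDS_BOOTSTRAP: {
                // If the trigger throws, the state is unchanged and the
                // next run simply tries again.
                triggerBootstrap.run();
                state = State.BOOTSTRAPPING;
                break;
            }
            case BOOTSTRAPPING: {
                BootstrapStatus status = pollStatus.get();
                if (status == BootstrapStatus.COMPLETED) {
                    state = State.REPLICATING;
                } else if (status == BootstrapStatus.FAILED) {
                    state = State.NEEDS_BOOTSTRAP; // retry on the next run
                }
                break;
            }
            case REPLICATING:
                // Regular tlog-based forwarding would happen here.
                break;
        }
        return state;
    }
}
```

With this shape there is no one-shot trigger thread whose exit can strand the 
replicator; every run either makes progress or schedules a retry.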



> CDCR: fall back to whole-index replication when tlogs are insufficient
> ----------------------------------------------------------------------
>
>                 Key: SOLR-6465
>                 URL: https://issues.apache.org/jira/browse/SOLR-6465
>             Project: Solr
>          Issue Type: Sub-task
>            Reporter: Yonik Seeley
>         Attachments: SOLR-6465.patch
>
>
> When the peer-shard doesn't have transaction logs to forward all the needed 
> updates to bring a peer up to date, we need to fall back to normal 
> replication.


