[
https://issues.apache.org/jira/browse/SOLR-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224971#comment-15224971
]
Shalin Shekhar Mangar edited comment on SOLR-6465 at 4/4/16 8:29 PM:
---------------------------------------------------------------------
This is the first cut for this feature.
The CdcrRequestHandler supports a new asynchronous command called BOOTSTRAP
which triggers a full index replication from a given master URL. There is a
corresponding BOOTSTRAP_STATUS command which reports whether a bootstrap
operation is running, has finished successfully, or has failed.
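As a rough illustration of the asynchronous command/status pairing described above, here is a minimal sketch in Java. It is not the code in the patch: the class BootstrapTracker, its Status enum and the injected Executor are hypothetical names chosen for this example, and the actual index fetch is stood in for by a plain Runnable.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical tracker for one asynchronous bootstrap; not taken from the patch. */
class BootstrapTracker {
    enum Status { NOT_STARTED, RUNNING, COMPLETED, FAILED }

    private final AtomicReference<Status> status =
            new AtomicReference<>(Status.NOT_STARTED);
    private final Executor executor;

    BootstrapTracker(Executor executor) {
        this.executor = executor;
    }

    /**
     * Analogue of a BOOTSTRAP request: schedules the fetch and returns at once.
     * Returns false if a bootstrap is already running.
     */
    boolean start(Runnable fullIndexFetch) {
        Status previous = status.get();
        if (previous == Status.RUNNING
                || !status.compareAndSet(previous, Status.RUNNING)) {
            return false;
        }
        executor.execute(() -> {
            try {
                fullIndexFetch.run();
                status.set(Status.COMPLETED);
            } catch (RuntimeException e) {
                // Record the failure so a status poll can observe it.
                status.set(Status.FAILED);
            }
        });
        return true;
    }

    /** Analogue of a BOOTSTRAP_STATUS request: a cheap, non-blocking read. */
    Status status() {
        return status.get();
    }
}
```

The caller polls status() until it sees COMPLETED or FAILED; passing the Executor in keeps the sketch easy to drive from a test.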
The "shardcheckpoint" command has been modified to return the max version
across the index and update log using the same updateVersionToHighest logic
used to initialize version buckets from tlog+index during startup/reload.
The CdcrReplicatorManager calls collectioncheckpoint to read the max version
indexed on the target and, if it finds a gap in its own tlog, asks the target
cluster to bootstrap itself from the current shard leader on the source. While
the bootstrap runs, a flag is set in CdcrReplicatorState so that the
CdcrReplicatorScheduler does not send any updates to the target cluster. Once
the bootstrap is complete, collectioncheckpoint is called again and the
returned version is used to open a regular tlog reader, from which the normal
CDCR replication mechanism takes over.
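The control flow in the paragraph above can be sketched roughly as follows. This is only an illustration under stated assumptions: ReplicatorGate and all of its methods are hypothetical names, versions are treated as plain non-negative longs, and a "gap" is assumed to mean that the oldest version still present in the source tlog is newer than the target's checkpoint. The real check in the patch may differ.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the per-target replicator state; not from the patch. */
class ReplicatorGate {
    private final AtomicBoolean bootstrapInProgress = new AtomicBoolean(false);

    /**
     * Assumed gap test: the source tlog can only replay from its oldest entry,
     * so anything between the target checkpoint and that entry may be missing.
     */
    static boolean hasGap(long targetMaxVersion, long oldestSourceTlogVersion) {
        return oldestSourceTlogVersion > targetMaxVersion;
    }

    /** Returns true if the caller should ask the target to bootstrap. */
    boolean onCheckpoint(long targetMaxVersion, long oldestSourceTlogVersion) {
        if (hasGap(targetMaxVersion, oldestSourceTlogVersion)) {
            bootstrapInProgress.set(true);
            return true;
        }
        return false;
    }

    /** The scheduler consults this before forwarding a batch of updates. */
    boolean shouldForwardUpdates() {
        return !bootstrapInProgress.get();
    }

    /**
     * Called with the checkpoint read after the bootstrap finished; returns
     * the version at which the regular tlog reader should be opened.
     */
    long onBootstrapComplete(long newTargetCheckpoint) {
        bootstrapInProgress.set(false);
        return newTargetCheckpoint;
    }
}
```

Treating an uncertain case as a gap is the conservative choice here: an unnecessary bootstrap is costly, but silently skipping updates would leave the target inconsistent.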
A new test called CdcrBootstrapTest is added for this feature. There is some
additional code in CdcrUpdateLog which allows one to convert an existing
cluster with data to be a cdcr source.
There are plenty of nocommits and debug logging in this patch which I will work
to resolve/remove in the next patches. I also found a few bugs for which I'll
open separate issues.
Open items/todos:
# Now that we can bootstrap target clusters using the index files, we have no
need to keep update logs around for a long time. Therefore, we can get rid of
CdcrUpdateLog itself and make CDCR work with regular UpdateLog.
# In the same vein, there is no need to replicate tlog files from leader to
replicas on the source cluster, so "lastprocessedversion", CdcrLogSynchronizer
and the tlog replication code can be purged.
# This patch currently stops regular CDCR updates from being sent to target
leaders during bootstrap but that is not necessary as we can buffer them and
apply after bootstrap completes.
# Hardening is required against the bootstrap process racing with recovery.
Normally this won't happen because bootstrap only happens on target shard
leaders but if/when the leadership changes, I suspect bootstrap can continue to
run for a while and race with core recovery. I haven't been able to trigger
this yet in a test case but I'll continue to work on it.
# In this patch, the bootstrap trigger thread is initiated on state change, but
if it exits due to an unhandled condition then the replication state stays in
bootstrapping mode forever and there is no corrective action. Care has been
taken to handle most failures, but after implementing this I feel that it is
unnecessarily fragile and that we are better off adding some logic to the
scheduled replicator component than trying to do bootstrap only once on init.
# The existing CDCR tests which test aspects related to tlog replication do not
pass currently. Once we yank that code, this would be a non-issue.
# Tests and more tests!
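To make the idea of moving the bootstrap decision into the scheduled replicator more concrete, here is a minimal sketch, assuming a scheduler that calls tick() on every run. TickDrivenBootstrap, Phase and the two suppliers are hypothetical names for this example, and the bootstrap is run synchronously only to keep the sketch short; a real implementation would presumably start it asynchronously and poll its status.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical per-target state machine driven by the scheduler; not from the patch. */
class TickDrivenBootstrap {
    enum Phase { REPLICATING, BOOTSTRAPPING }

    private Phase phase = Phase.REPLICATING;
    private final BooleanSupplier gapDetected;   // e.g. checkpoint vs. tlog comparison
    private final BooleanSupplier runBootstrap;  // returns true when bootstrap succeeded

    TickDrivenBootstrap(BooleanSupplier gapDetected, BooleanSupplier runBootstrap) {
        this.gapDetected = gapDetected;
        this.runBootstrap = runBootstrap;
    }

    /** Called on every scheduler run; returns the phase after this tick. */
    Phase tick() {
        if (phase == Phase.REPLICATING && gapDetected.getAsBoolean()) {
            phase = Phase.BOOTSTRAPPING;
        }
        if (phase == Phase.BOOTSTRAPPING) {
            boolean succeeded;
            try {
                succeeded = runBootstrap.getAsBoolean();
            } catch (RuntimeException e) {
                // An unexpected failure does not wedge the state:
                // the next tick simply tries again.
                succeeded = false;
            }
            if (succeeded) {
                phase = Phase.REPLICATING;
            }
        }
        return phase;
    }
}
```

Because the decision is re-evaluated on every tick, a failed attempt is retried on the next run instead of leaving the state stuck in bootstrapping mode.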
> CDCR: fall back to whole-index replication when tlogs are insufficient
> ----------------------------------------------------------------------
>
> Key: SOLR-6465
> URL: https://issues.apache.org/jira/browse/SOLR-6465
> Project: Solr
> Issue Type: Sub-task
> Reporter: Yonik Seeley
> Attachments: SOLR-6465.patch
>
>
> When the peer-shard doesn't have transaction logs to forward all the needed
> updates to bring a peer up to date, we need to fall back to normal
> replication.