[ https://issues.apache.org/jira/browse/HBASE-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-27109:
------------------------------
    Release Note: 
We introduced a table based replication queue storage in this issue. The queue 
data is now stored in the hbase:replication table. This was the last piece of 
persistent data on zookeeper, so after this change it is safe to clean up all 
the data on zookeeper: it is all transient now, and a cluster restart can 
restore everything.
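
For illustration, here is a minimal sketch of inspecting the new storage with 
the standard client API. The row and column layout of hbase:replication is an 
HBase internal and not documented here, so this just dumps raw cells:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;

    public class DumpReplicationQueueTable {
      public static void main(String[] args) throws IOException {
        // Scan the hbase:replication system table; the row/column layout is an
        // HBase internal, so we only print the raw cells for inspection.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
            Table table = conn.getTable(TableName.valueOf("hbase", "replication"));
            ResultScanner scanner = table.getScanner(new Scan())) {
          for (Result result : scanner) {
            System.out.println(result);
          }
        }
      }
    }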

The data structure has also changed a bit: we now keep only an offset per WAL 
group instead of storing all the WAL files for a WAL group. Please see the 
replication internals section in our ref guide for more details.
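
As a rough sketch of the new shape of the queue data (the names below are 
illustrative, not the actual HBase classes), each WAL group now carries a 
single replication offset rather than a file list:

    // Illustrative only: one (wal, offset) pair per WAL group replaces the old
    // per-group list of WAL files. Everything before the offset is replicated.
    public final class WalGroupOffset {
      private final String walGroup; // WAL group id, e.g. derived from the WAL file name prefix
      private final String wal;      // the WAL file the offset points into
      private final long offset;     // replication progress within that file, in bytes

      public WalGroupOffset(String walGroup, String wal, long offset) {
        this.walGroup = walGroup;
        this.wal = wal;
        this.offset = offset;
      }

      public String getWalGroup() { return walGroup; }
      public String getWal() { return wal; }
      public long getOffset() { return offset; }
    }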

There was a cyclic dependency to break: creating a new WAL writer requires 
writing to the replication queue storage first, but with table based 
replication queue storage you need a WAL writer before you can update the 
table. So we no longer record a queue when creating a new WAL writer instance. 
The downside of this change is that the logic for claiming queues and for the 
WAL cleaner is much more complicated. See AssignReplicationQueuesProcedure and 
ReplicationLogCleaner for more details if you are interested.
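
Here is a minimal sketch of the ordering change, using stand-in interfaces 
rather than the real HBase internals:

    // Stand-in types, not the real HBase internals.
    interface QueueStorage {
      // With table based storage this write itself needs a live WAL writer.
      void recordNewWal(String queueId, String wal);
    }

    interface WalWriterFactory {
      AutoCloseable createWriter(String wal) throws Exception;
    }

    final class WalRoller {
      private final WalWriterFactory factory;

      WalRoller(WalWriterFactory factory) {
        this.factory = factory;
      }

      // Old flow: record the new WAL in queue storage while rolling. With table
      // based storage this is circular, since the record is a put that itself
      // needs a WAL writer.
      AutoCloseable rollOld(QueueStorage storage, String queueId, String wal) throws Exception {
        storage.recordNewWal(queueId, wal);
        return factory.createWriter(wal);
      }

      // New flow: just create the writer; queue claiming and the WAL cleaner
      // must later cope with WALs that have no queue record yet.
      AutoCloseable rollNew(String wal) throws Exception {
        return factory.createWriter(wal);
      }
    }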

Notice that we use a separate WAL provider for the hbase:replication table, so 
you will see a new WAL file on the region server which holds the 
hbase:replication table. If we did not do this, updates to the 
hbase:replication table would also generate WAL edits in a WAL file that 
replication tracks, which would then lead to yet more updates to the 
hbase:replication table as the replication offset advances. We would generate a 
lot of garbage in our WAL files even if nothing is written to the cluster, so a 
separate WAL provider which is not tracked by replication is necessary here.
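
A hedged sketch of the routing idea follows. WALProvider and RegionInfo are 
real HBase interfaces, but this class and the provider wiring are illustrative, 
not the actual implementation:

    import java.io.IOException;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.RegionInfo;
    import org.apache.hadoop.hbase.wal.WAL;
    import org.apache.hadoop.hbase.wal.WALProvider;

    // Illustrative: route hbase:replication regions to a dedicated WAL provider
    // whose files replication never tracks, breaking the feedback loop where
    // every offset update would itself create new edits to replicate.
    final class ReplicationAwareWalRouting {
      private static final TableName REPLICATION_TABLE = TableName.valueOf("hbase", "replication");

      private final WALProvider defaultProvider;     // WALs tracked by replication
      private final WALProvider replicationProvider; // WALs invisible to replication

      ReplicationAwareWalRouting(WALProvider defaultProvider, WALProvider replicationProvider) {
        this.defaultProvider = defaultProvider;
        this.replicationProvider = replicationProvider;
      }

      WAL walFor(RegionInfo region) throws IOException {
        return REPLICATION_TABLE.equals(region.getTable())
            ? replicationProvider.getWAL(region)
            : defaultProvider.getWAL(region);
      }
    }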

The data migration is done automatically during a rolling upgrade. Migration 
via a full cluster restart is also supported, but please make sure you restart 
the master with the new code first. The replication peers are disabled during 
the migration, and no queue claiming is scheduled at the same time, so you may 
see a lot of unfinished SCPs during the migration. Do not worry: this does not 
block normal failover, and all regions will still be assigned. The replication 
peers are enabled again once the migration is done; no manual operations are 
needed.
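
To verify the end state, you can list the peers with the public Admin API once 
the migration completes; all of them should report enabled again. A sketch, 
assuming default client configuration:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.replication.ReplicationPeerDescription;

    public class CheckPeersAfterMigration {
      public static void main(String[] args) throws IOException {
        // After the automatic migration, every replication peer should be
        // enabled again without any manual intervention.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
            Admin admin = conn.getAdmin()) {
          for (ReplicationPeerDescription peer : admin.listReplicationPeers()) {
            System.out.println(peer.getPeerId() + " enabled=" + peer.isEnabled());
          }
        }
      }
    }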

The ReplicationSyncUp tool is also affected. The goal of this tool is to 
replicate data to the peer cluster while the source cluster is down, but with 
the replication queue data stored in an HBase table, it is impossible to read 
the newest data while the source cluster is down. So the tool now reads the 
region directory directly to load all the replication queue data into memory 
before doing the sync-up work. We may miss the newest offsets, which means we 
may replicate more data than strictly necessary, but correctness is not 
affected.
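
For reference, the tool can still be launched the usual way. A sketch, assuming 
ReplicationSyncUp remains a Hadoop Tool as in previous releases:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.replication.regionserver.ReplicationSyncUp;
    import org.apache.hadoop.util.ToolRunner;

    public class RunSyncUp {
      public static void main(String[] args) throws Exception {
        // Equivalent to: hbase org.apache.hadoop.hbase.replication.regionserver.ReplicationSyncUp
        // With table based queue storage the tool loads the queue data straight
        // from the hbase:replication region directory on the filesystem, since
        // the source cluster (and thus the table) may be down.
        System.exit(ToolRunner.run(HBaseConfiguration.create(), new ReplicationSyncUp(), args));
      }
    }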

> Move replication queue storage from zookeeper to a separated HBase table
> ------------------------------------------------------------------------
>
>                 Key: HBASE-27109
>                 URL: https://issues.apache.org/jira/browse/HBASE-27109
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>
> This is a more specific issue based on the work already done in 
> HBASE-15867.


