Re: Review Request 19862: Design document for review on cross-cluster replication

Mike Drob Wed, 02 Apr 2014 10:02:26 -0700


> On April 2, 2014, 3:36 p.m., Mike Drob wrote:
> >
> 
> Mike Drob wrote:
>     Huh, RB proxy error ate my comment.
>     
>     I was speaking to some of the HBase team about this yesterday, and they 
> mentioned that they do not support replicated bulk import. Their recommended 
> solution is just to externally copy files and run bulk import on the slave. 
> Since this is something that is possible for users to configure themselves, 
> I'd like to make sure we focus on the difficult case of live ingest.
>     
>     Is the assumption that replication is an all-or-nothing deal? Either you 
> replicate all of the tables on a system, or you replicate none of them, but 
> just a defined set is not allowed? I believe the WAL groups mutations by 
> table IDs, so care would need to be taken to make sure those do not get out 
> of sync.
>     
>     What happens when I clone a table, for example when running an offline MR 
> job? Does the clone need to be replicated? I assume no. If the slave is a 
> read-only implementation, can I make clones there to run MR? Maybe another 
> thing that will come out of this is 'transient clones' that have IDs in a 
> reserved high range that can be reused after they are deleted.
>     
>
> 
> Josh Elser wrote:
>     I believe I already said elsewhere that replication is on a per-table 
> basis. Replication for tables would (likely) have to be turned on, at which 
> point the offline-MR case isn't a worry.
> 
> kturner wrote:
>     Why not support replicating bulk imports?  Seems like it makes things 
> easier on users.
> 
> Mike Drob wrote:
>     Then the ID mapping is a worry.
> 
> Josh Elser wrote:
>     When configuring the replication, we would just track the source tableID 
> and the destination cluster and the destination tableID. Am I missing 
> something?

If we're shipping WALs around, then the slave has to know the mapping from 
source table ID to destination table ID. Then you need to have an extra code 
path that checks for a mapping before performing "recovery."
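A rough sketch of that extra check, assuming the slave keeps a simple map from source table ID to destination table ID. All names and IDs here are hypothetical and are not Accumulo APIs:

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical replication config held by the slave:
# source table ID -> destination table ID.
TABLE_ID_MAP: Dict[str, str] = {"1a": "7", "2b": "9"}

# A WAL entry is modeled as (source table ID, mutation payload).
WalEntry = Tuple[str, str]


def resolve_destination(source_table_id: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the local table ID for a source table ID, or None if unmapped."""
    return mapping.get(source_table_id)


def replay(entries: List[WalEntry], mapping: Dict[str, str]) -> List[WalEntry]:
    """Rewrite table IDs before 'recovery'; skip entries with no mapping."""
    applied: List[WalEntry] = []
    for source_table_id, mutation in entries:
        dest = resolve_destination(source_table_id, mapping)
        if dest is None:
            continue  # table is not configured for replication on this slave
        applied.append((dest, mutation))
    return applied
```

The point is only that the lookup has to happen before any mutation is applied, which is a step normal recovery does not need.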

If we have cyclic replication, then you have to know which WAL you are 
shipping, because that could imply a different mapping: master table x maps to 
slave table y, which maps to another slave's table z. If we have master-master, 
then both sides need to know the mapping, so I guess the table needs to exist 
on both clusters before replication can be configured (so that we have a table 
ID to use in the configuration).
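To illustrate why the origin of the WAL matters, here is a minimal sketch that assumes each cluster keys its mapping on the pair (cluster the WAL came from, table ID on that cluster). The cluster names and table IDs are made up:

```python
from typing import Dict, Optional, Tuple

# Hypothetical key: (cluster the WAL came from, table ID on that cluster).
MappingKey = Tuple[str, str]

# master table x -> slave1 table y -> slave2 table z
SLAVE1_MAP: Dict[MappingKey, str] = {("master", "x"): "y"}
# On slave2 the same table ID "y" means different things depending on the sender.
SLAVE2_MAP: Dict[MappingKey, str] = {("slave1", "y"): "z", ("master", "y"): "q"}


def local_table_id(
    mapping: Dict[MappingKey, str], source_cluster: str, source_table_id: str
) -> Optional[str]:
    """Look up the local table ID for a WAL entry, given where the WAL originated."""
    return mapping.get((source_cluster, source_table_id))
```

With a table ID alone as the key, slave2 could not tell the two "y" entries apart.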

Also, if we're shipping WALs around, then it is possible that you have 99 
mutations for a table that isn't replicated and 1 mutation that is replicated. 
Sending offsets and chunks can help minimize the bandwidth, but...
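For a sense of scale, a small sketch of the 99-to-1 case, assuming the sender could filter a WAL down to replicated tables before shipping. This is one possible approach, not something the design doc specifies, and the names are hypothetical:

```python
from typing import List, Set, Tuple

# A WAL entry is modeled as (table ID, mutation payload).
WalEntry = Tuple[str, str]


def entries_to_ship(wal: List[WalEntry], replicated_tables: Set[str]) -> List[WalEntry]:
    """Keep only the entries whose table is configured for replication."""
    return [entry for entry in wal if entry[0] in replicated_tables]


# 99 mutations for an unreplicated table "t1", 1 mutation for replicated table "t2".
wal: List[WalEntry] = [("t1", f"m{i}") for i in range(99)] + [("t2", "m99")]
shipped = entries_to_ship(wal, {"t2"})
```

Shipping the whole file moves 100 entries to deliver one; filtering moves one, at the cost of reading and rewriting the WAL on the sender.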

- Mike

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19862/#review39264
-----------------------------------------------------------

On April 1, 2014, 1:58 a.m., Josh Elser wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19862/
> -----------------------------------------------------------
> 
> (Updated April 1, 2014, 1:58 a.m.)
> 
> 
> Review request for accumulo.
> 
> 
> Bugs: ACCUMULO-378
>     https://issues.apache.org/jira/browse/ACCUMULO-378
> 
> 
> Repository: accumulo
> 
> 
> Description
> -------
> 
> Re-posting a version of the design doc that I own. Contains grammatical fixes 
> from round one, with a few extra clarifications. New content should be posted 
> here, but I'll maintain the old review as discussion progresses.
> 
> 
> Diffs
> -----
> 
>   docs/src/main/resources/design/ACCUMULO-378-design.mdtext PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19862/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Josh Elser
> 
>
