> On April 2, 2014, 3:36 p.m., Mike Drob wrote:
> 
> Mike Drob wrote:
>     Huh, RB proxy error ate my comment.
>     
>     I was speaking to some of the HBase team about this yesterday, and they mentioned that they do not support replicated bulk import. Their recommended solution is just to externally copy files and run bulk import on the slave. Since this is something that users can configure themselves, I'd like to make sure we focus on the difficult case of live ingest.
>     
>     Is the assumption that replication is an all-or-nothing deal? Either you replicate all of the tables on a system or you replicate none of them, but just a defined set is not allowed? I believe the WAL groups mutations by table ID, so care would need to be taken to make sure those do not get out of sync.
>     
>     What happens when I clone a table, for example when running an offline MR job? Does the clone need to be replicated? I assume no. If the slave is a read-only implementation, can I make clones there to run MR? Maybe another thing that will come out of this is 'transient clones' that have IDs in a reserved high range that can be reused after they are deleted.
> 
> Josh Elser wrote:
>     I believe I already said elsewhere that replication is on a per-table basis. Replication for tables would (likely) have to be turned on, at which point the offline-MR case isn't a worry.
> 
> kturner wrote:
>     Why not support replicating bulk imports? It seems like it makes things easier on users.
> 
> Mike Drob wrote:
>     Then the ID mapping is a worry.
> 
> Josh Elser wrote:
>     When configuring the replication, we would just track the source table ID, the destination cluster, and the destination table ID. Am I missing something?
> 
> Mike Drob wrote:
>     If we're shipping WALs around, then the slave has to know the mapping from source table ID to destination table ID. Then you need an extra code path that checks for a mapping before performing "recovery."
>     
>     If we have cyclic replication, then you have to know which WAL you are shipping, because that could imply a different mapping. Master table x maps to slave table y, which maps to other slave table z. If we have master-master, then both sides need to know the mapping, so I guess the table needs to exist on both clusters before replication can be configured (so that we have a table ID to use in the configuration).
>     
>     Also, if we're shipping WALs around, it is possible that you have 99 mutations for a table that isn't replicated and 1 mutation that is replicated. Sending offsets and chunks can help minimize the bandwidth, but...
> 
> Mike Drob wrote:
>     Actually, another thing that would be really cool is self-replication, where I clone a table and then replicate future writes to it.
re: table mapping

The destination table ID could be included in the message from the source. Then the slave would just have to do some validation that it has a table with such an ID. I wasn't initially considering the slave having the replication configuration of the master. A couple of security concerns arise again here (although I think they're general to the problem).

re: cyclic replication

To handle cycles, both sides need to have a replication configuration, yes, but they don't need to know each other's. Cluster 1 knows to send to cluster 2, and cluster 2 knows to send to cluster 1. To prevent re-replication, cluster 1 just needs to know not to send data to cluster 2 that originated from cluster 2. This will be of interest in state-keeping (~repl records).

re: new/cloned tables

We had touched on remote table configuration propagation before, and while I still think that scope is outside the original plan, I'd agree that auto-replication of new tables should be added to the list of future work. For this go-around, we definitely need to firm up how a new table is added to replication. Assuming the slave is in some "read-only" mode, are table operations still permitted?


- Josh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19862/#review39264
-----------------------------------------------------------


On April 1, 2014, 1:58 a.m., Josh Elser wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19862/
> -----------------------------------------------------------
> 
> (Updated April 1, 2014, 1:58 a.m.)
> 
> 
> Review request for accumulo.
> 
> 
> Bugs: ACCUMULO-378
>     https://issues.apache.org/jira/browse/ACCUMULO-378
> 
> 
> Repository: accumulo
> 
> 
> Description
> -------
> 
> Re-posting a version of the design doc that I own. Contains grammatical fixes from round one, with a few extra clarifications. New content should be posted here, but I'll maintain the old review as discussion progresses.
> 
> 
> Diffs
> -----
> 
>   docs/src/main/resources/design/ACCUMULO-378-design.mdtext PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/19862/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Josh Elser
> 
> 
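As a rough illustration of the cycle-prevention point in the reply above: the idea is just that each shipped mutation carries the name of the cluster it originated on, and a master never ships data back to its origin. The Java below is a hypothetical sketch only; ReplicationTarget, shouldReplicate, and the field names are made up for illustration and are not part of the design or the Accumulo API.

    // Hypothetical sketch only -- class and method names are illustrative,
    // not part of ACCUMULO-378 or the Accumulo API.
    public class ReplicationTarget {
        private final String peerName;      // e.g. "cluster2"
        private final String remoteTableId; // table ID on the peer, recorded when replication is configured

        public ReplicationTarget(String peerName, String remoteTableId) {
            this.peerName = peerName;
            this.remoteTableId = remoteTableId;
        }

        public String getPeerName() { return peerName; }
        public String getRemoteTableId() { return remoteTableId; }
    }

    class ReplicationFilter {
        // A mutation carries the name of the cluster it originated on.
        // Never ship it back to that cluster; this breaks the
        // cluster1 -> cluster2 -> cluster1 cycle without either side
        // needing the other's full replication configuration.
        static boolean shouldReplicate(String originCluster, ReplicationTarget target) {
            return !target.getPeerName().equals(originCluster);
        }
    }

With a check along these lines, cluster 1 would still forward data that originated locally to cluster 2, but would drop anything whose recorded origin is cluster 2.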
