On Thu, Aug 3, 2017 at 10:31 AM, Yiming Zang <yz...@twitter.com.invalid> wrote:
> That make sense, I didn't consider the case when ensemble size > write > quorum size. So I think as you mentioned before, we can add an option to > enforce strong rack diversity. > > Considering the case of (4,2,2), I think we can't let num of racks = min > (num of racks, write quorum size). Because in this case, num of racks will > become 2, this will still potentially break the rack-diversity. > > So if we want to add such an option in the future, it should be min(num of > racks, ensemble size)? > if we are adding the enforcement option, it will be min(num of racks, ensemble size). > > On Thu, Aug 3, 2017 at 12:53 AM, Sijie Guo <guosi...@gmail.com> wrote: > > > On Wed, Aug 2, 2017 at 11:34 PM, Yiming Zang <yz...@twitter.com.invalid> > > wrote: > > > > > Hi @Sijie, it make sense to me to choose bookies at the same rack as > many > > > as possible and have a least one replica at a remote rack for > > > availability, so that we can have lower latency within the same rack > and > > > availability when one rack is down. > > > > > > However it seems the current implementation of RackAwarePlacementPolicy > > > doesn't do exactly as that goal is expected to do. It just start with > the > > > same rack for the first bookie, and then for each next bookie, it pick > > the > > > rack that is different from the previous one. So it's not really > choosing > > > "as many as possible" bookies from same rack as the client. > > > > > > > It is a variant of that, because the algorithm here need to ensure the > > coverage across all the write quorums. > > > > I think it is easy to explain it just using two examples. > > > > 1) ensemble size == write quorum size: [4, 4, 2] > > 2) ensemble size > write quorum size: [4, 2, 2] > > > > > > if we are going to support only 1) case, it totally makes sense to pick > up > > as many bookies from the same rack of client. because any write quorums > > will meet 2-racks requirements. > > > > However, if we apply this algorithm to second case 2), it would fail. > lets > > say we have two racks `R1` and `R2`. the client is running on a machine > > under `R1`. so if we pickup as many bookies from local rack, then an > > ensemble would end up like [R1, R2, R1, R1]. In this case, the last two > > bookies are ended at same rack `R1`, this will break the rack-diversity > for > > entries written to the quorums formed by those two bookies. > > > > Because we are allowing the existence of `ensmeble size` > `write quorum > > size`, we take a simple approach to make sure any adjacent bookies in an > > ensemble will belong to two different racks. we can guarantee rack > > diversity (2 racks), no matter what is the write quorum size. > > > > Hope this explains. > > > > - Sijie > > > > > > > > > > > > > > > > Or am I missing anything? > > > > > > On Wed, Aug 2, 2017 at 7:29 PM, Sijie Guo <guosi...@gmail.com> wrote: > > > > > > > On Aug 2, 2017 6:54 PM, "Yiming Zang" <yz...@twitter.com.invalid> > > wrote: > > > > > > > > I was looking into RackawareEnsemblePlacementPolicyImpl.java, and it > > > turns > > > > out to me that the current implementation of > > > > RackawareEnsemblePlacementPolicy only enforces at least 2 different > > > racks, > > > > it doesn't care about ensemble size, write quorum size or ack quorum > > > size. > > > > > > > > For example, imagine we have the follow rack diversity (r1 means > > rack1): > > > > > > > > =========== > > > > BK1: r1 > > > > BK2: r2 > > > > BK3: r3 > > > > BK4: r3 > > > > BK5: r3 > > > > =========== > > > > > > > > Now we're creating a ledger with (3,3,2), in which ensemble size and > > > write > > > > quorum size are both 2. My expected behavior is the ensemble must > > > contains > > > > BK1 and BK2, and would choose the third bookie from BK3~BK5. However, > > in > > > > fact, the ensemble violate that rule, which can be BK2, BK3, BK4, > where > > > two > > > > of the bookies share the same rack. > > > > > > > > I've also added a test case to validate this theory. > > > > > > > > Question: Is this behavior expected? Shall we use write quorum size > as > > > rack > > > > diversity so that we can spread the ledger across more racks? > > > > > > > > > > > > I think that was the expected behavior. The idea was choosing bookies > > at > > > > the same rack as many as possible and have replicas at a remote rack > > for > > > > availability. Because typically inter-rack latency will be higher > than > > > > intra-rack. So the rack devesity enforcement isn't aligned with write > > > > quorum size. > > > > > > > > I think we can introduce a setting in rack aware placement policy > like > > > > 'rack.diversity.enforced'. if it is true, the number of racks will be > > min > > > > (num of total racks, write quorum size). > > > > > > > > Sijie > > > > > > > > > > > > There core logic is here, where it use "~" to represent a different > > rack > > > > than the previous one: > > > > > > > > // pick nodes by racks, to ensure there is at least two racks per > write > > > > quorum. > > > > for (int i = 0; i < ensembleSize; i++) { > > > > String curRack; > > > > if (null == prevNode) { > > > > if ((null == localNode) || > > > > > > > > localNode.getNetworkLocation().equals(NetworkTopology.DEFAULT_RACK)) > { > > > > curRack = NodeBase.ROOT; > > > > } else { > > > > curRack = localNode.getNetworkLocation(); > > > > } > > > > } else { > > > > curRack = "~" + prevNode.getNetworkLocation(); > > > > } > > > > prevNode = selectFromNetworkLocation(curRack, excludeNodes, > > > > ensemble, ensemble); > > > > } > > > > ArrayList<BookieSocketAddress> bookieList = ensemble.toList(); > > > > if (ensembleSize != bookieList.size()) { > > > > LOG.error("Not enough {} bookies are available to form an > ensemble > > : > > > > {}.", > > > > ensembleSize, bookieList); > > > > throw new BKNotEnoughBookiesException(); > > > > } > > > > return bookieList; > > > > > > > > > >