On Thu, Aug 3, 2017 at 10:31 AM, Yiming Zang <yz...@twitter.com.invalid>
wrote:

> That make sense, I didn't consider the case when ensemble size > write
> quorum size. So I think as you mentioned before, we can add an option to
> enforce strong rack diversity.
>
> Considering the case of (4,2,2), I think we can't let num of racks = min
> (num of racks, write quorum size). Because in this case, num of racks will
> become 2, this will still potentially break the rack-diversity.
>
> So if we want to add  such an option in the future, it should be min(num of
> racks, ensemble size)?
>

if we are adding the enforcement option, it will be min(num of racks,
ensemble size).


>
> On Thu, Aug 3, 2017 at 12:53 AM, Sijie Guo <guosi...@gmail.com> wrote:
>
> > On Wed, Aug 2, 2017 at 11:34 PM, Yiming Zang <yz...@twitter.com.invalid>
> > wrote:
> >
> > > Hi @Sijie, it make sense to me to choose bookies at the same rack as
> many
> > > as possible and have a least one replica at a remote rack for
> > > availability, so that we can have lower latency within the same rack
> and
> > > availability when one rack is down.
> > >
> > > However it seems the current implementation of RackAwarePlacementPolicy
> > > doesn't do exactly as that goal is expected to do. It just start with
> the
> > > same rack for the first bookie, and then for each next bookie, it pick
> > the
> > > rack that is different from the previous one. So it's not really
> choosing
> > > "as many as possible" bookies from same rack as the client.
> > >
> >
> > It is a variant of that, because the algorithm here need to ensure the
> > coverage across all the write quorums.
> >
> > I think it is easy to explain it just using two examples.
> >
> > 1) ensemble size == write quorum size: [4, 4, 2]
> > 2) ensemble size > write quorum size: [4, 2, 2]
> >
> >
> > if we are going to support only 1) case, it totally makes sense to pick
> up
> > as many bookies from the same rack of client. because any write quorums
> > will meet 2-racks requirements.
> >
> > However, if we apply this algorithm to second case 2), it would fail.
> lets
> > say we have two racks `R1` and `R2`. the client is running on a machine
> > under `R1`. so if we pickup as many bookies from local rack, then an
> > ensemble would end up like [R1, R2, R1, R1]. In this case, the last two
> > bookies are ended at same rack `R1`, this will break the rack-diversity
> for
> > entries written to the quorums formed by those two bookies.
> >
> > Because we are allowing the existence of `ensmeble size` > `write quorum
> > size`, we take a simple approach to make sure any adjacent bookies in an
> > ensemble will belong to two different racks. we can guarantee rack
> > diversity (2 racks), no matter what is the write quorum size.
> >
> > Hope this explains.
> >
> > - Sijie
> >
> >
> >
> >
> >
> > >
> > > Or am I missing anything?
> > >
> > > On Wed, Aug 2, 2017 at 7:29 PM, Sijie Guo <guosi...@gmail.com> wrote:
> > >
> > > > On Aug 2, 2017 6:54 PM, "Yiming Zang" <yz...@twitter.com.invalid>
> > wrote:
> > > >
> > > > I was looking into RackawareEnsemblePlacementPolicyImpl.java, and it
> > > turns
> > > > out to me that the current implementation of
> > > > RackawareEnsemblePlacementPolicy only enforces at least 2 different
> > > racks,
> > > > it doesn't care about ensemble size, write quorum size or ack quorum
> > > size.
> > > >
> > > > For example, imagine we have the follow rack diversity (r1 means
> > rack1):
> > > >
> > > > ===========
> > > > BK1: r1
> > > > BK2: r2
> > > > BK3: r3
> > > > BK4: r3
> > > > BK5: r3
> > > > ===========
> > > >
> > > > Now we're creating a ledger with (3,3,2), in which ensemble size and
> > > write
> > > > quorum size are both 2. My expected behavior is the ensemble must
> > > contains
> > > > BK1 and BK2, and would choose the third bookie from BK3~BK5. However,
> > in
> > > > fact, the ensemble violate that rule, which can be BK2, BK3, BK4,
> where
> > > two
> > > > of the bookies share the same rack.
> > > >
> > > > I've also added a test case to validate this theory.
> > > >
> > > > Question: Is this behavior expected? Shall we use write quorum size
> as
> > > rack
> > > > diversity so that we can spread the ledger across more racks?
> > > >
> > > >
> > > > I think that was the expected behavior. The idea was choosing bookies
> > at
> > > > the same rack as many as possible and have replicas at a remote rack
> > for
> > > > availability. Because typically inter-rack latency will be higher
> than
> > > > intra-rack. So the rack devesity enforcement isn't aligned with write
> > > > quorum size.
> > > >
> > > > I think we can introduce a setting in rack aware placement policy
> like
> > > > 'rack.diversity.enforced'. if it is true, the number of racks will be
> > min
> > > > (num of total racks, write quorum size).
> > > >
> > > > Sijie
> > > >
> > > >
> > > > There core logic is here, where it use "~" to represent a different
> > rack
> > > > than the previous one:
> > > >
> > > > // pick nodes by racks, to ensure there is at least two racks per
> write
> > > > quorum.
> > > > for (int i = 0; i < ensembleSize; i++) {
> > > >     String curRack;
> > > >     if (null == prevNode) {
> > > >         if ((null == localNode) ||
> > > >
> > > > localNode.getNetworkLocation().equals(NetworkTopology.DEFAULT_RACK))
> {
> > > >             curRack = NodeBase.ROOT;
> > > >         } else {
> > > >             curRack = localNode.getNetworkLocation();
> > > >         }
> > > >     } else {
> > > >         curRack = "~" + prevNode.getNetworkLocation();
> > > >     }
> > > >     prevNode = selectFromNetworkLocation(curRack, excludeNodes,
> > > > ensemble, ensemble);
> > > > }
> > > > ArrayList<BookieSocketAddress> bookieList = ensemble.toList();
> > > > if (ensembleSize != bookieList.size()) {
> > > >     LOG.error("Not enough {} bookies are available to form an
> ensemble
> > :
> > > > {}.",
> > > >               ensembleSize, bookieList);
> > > >     throw new BKNotEnoughBookiesException();
> > > > }
> > > > return bookieList;
> > > >
> > >
> >
>

Reply via email to