Hi Maxime,

Thanks for the feedback!

The proposed approach is definitely simplistic. The "Discussion" section
of the design doc describes some of the rationale for starting with a
very simple scheme. Basically:

(a) we want to assign clear semantics to the levels of the hierarchy
(regions are far away from each other and inter-region network links
have high latency; racks are close together and inter-rack network links
have low latency);

(b) we don't want to make life too difficult for framework authors;

(c) most server software (e.g., HDFS, Kafka, Cassandra) only understands
a simple hierarchy -- in many cases, just a single level ("racks"), or
occasionally two levels ("racks" and "DCs").

Can you elaborate on the use-cases that you see for a more complex
hierarchy of fault domains? I'd be happy to chat off-list if you'd
prefer.

Thanks!
Neil

On Tue, Apr 18, 2017 at 1:33 AM, Maxime Brugidou <maxime.brugi...@gmail.com> wrote:
> Hi Neil,
>
> I really like the idea of incorporating the concept of fault domains in
> Mesos, however I feel like the implementation proposed is a bit narrow to be
> actually useful for most users.
>
> I feel like we could make the fault domains definition more generic. As an
> example in our setup we would like to have something like
> Region > Building > Cage > Pod > Rack. Failure domains would be
> hierarchically arranged (meaning one domain in a lower level can only be
> included in one domain above).
>
> As a concrete example, we could have the mesos masters be aware of the fault
> domain hierarchy (with a config map for example), and slaves would just need
> to declare their lowest-level domain (for example their rack id). Then
> frameworks could use this domain hierarchy at will. If they need to "spread"
> their tasks for a very highly available setup, they could first spread using
> the highest fault domain (like the region), then if they have enough tasks
> to launch they could spread within each sub-domain recursively until they
> run out of tasks to spread. We do not need to artificially limit the number
> of levels of fault domains and the name of the fault domains. Schedulers do
> not need to know the names either, just the hierarchy.
>
> Then, to provide the other feature of "remote" slaves that you describe, we
> could configure the mesos master to only send offers from a "default" local
> fault domain, and frameworks would need to advertise a certain capability to
> receive offers for other remote fault domains.
>
> I feel we could implement this by identifying a fault domain with a simple
> list of ids like ["US-WEST-1", "Building 2", "Cage 3", "POD 12", "Rack 3"]
> or ["US-EAST-2", "Building 1"]. Slaves would advertise their lowest-level
> fault domains and schedulers could use this arbitrarily as a hierarchical
> list.
>
> Thanks,
> Maxime
>
> On Mon, Apr 17, 2017 at 6:45 PM Neil Conway <neil.con...@gmail.com> wrote:
>>
>> Folks,
>>
>> I'd like to enhance Mesos to support a first-class notion of "fault
>> domains" -- i.e., identifying the "rack" and "region" (DC) where a
>> Mesos agent or master is located. The goal is to enable two main
>> features:
>>
>> (1) To make it easier to write "rack-aware" Mesos frameworks that are
>> portable to different Mesos clusters.
>>
>> (2) To improve the experience of configuring Mesos with a set of
>> masters and agents in one DC, and another pool of "remote" agents in a
>> different DC.
>>
>> For more information, please see the design doc:
>>
>> https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8
>>
>> I'd love any feedback, either directly on the Google doc or via email.
>>
>> Thanks,
>> Neil
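
For readers following the thread, the recursive "spread" idea in Maxime's
message can be sketched roughly as below. This is only an illustration and
not part of the design doc or of any Mesos API: the `spread` function, the
`DomainPath` type, and the agent ids are all hypothetical, and the sketch
assumes each agent is already associated with its full fault-domain path
(highest level first) rather than only its lowest-level domain.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical representation: the fault-domain path of an agent, highest
# level first, e.g. ("US-WEST-1", "Building 2", "Rack 3").
DomainPath = Tuple[str, ...]


def spread(agents: Dict[str, DomainPath], num_tasks: int,
           level: int = 0) -> Dict[str, int]:
    """Return {agent_id: task_count}, spreading tasks one level at a time."""
    assignment: Dict[str, int] = {}
    if num_tasks <= 0 or not agents:
        return assignment

    # Group agents by their domain id at this level. An agent whose path
    # has no entry at this level becomes a group of its own (a leaf).
    groups = defaultdict(dict)
    for agent_id, path in agents.items():
        if level < len(path):
            key = (0, path[level])   # a fault domain at this level
        else:
            key = (1, agent_id)      # a single agent (leaf)
        groups[key][agent_id] = path

    # Split the tasks as evenly as possible across the groups; the first
    # `extra` groups (in sorted order) receive one additional task.
    keys = sorted(groups)
    base, extra = divmod(num_tasks, len(keys))
    for i, key in enumerate(keys):
        share = base + (1 if i < extra else 0)
        if share == 0:
            continue
        if key[0] == 1:
            # Leaf: give the whole share to this agent.
            assignment[key[1]] = share
        else:
            # Recurse into the sub-domains of this fault domain.
            assignment.update(spread(groups[key], share, level + 1))
    return assignment


agents = {
    "a1": ("us-west", "rack-1"),
    "a2": ("us-west", "rack-2"),
    "a3": ("us-east", "rack-1"),
}
print(spread(agents, 4))
```

Note that the scheduler code never inspects the names of the levels, only
the position of each id in the path, which matches the point that
schedulers need the hierarchy but not the level names. A real framework
would also have to account for resource offers and agent capacity, which
this sketch ignores.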