On 2016/6/6 23:48, Jörg Schad wrote:
Hi,
thanks for your idea and design doc!
Just a few thoughts:
a) The scheduling part would be implemented in a framework scheduler and
not in the Mesos core, right?
I'm not sure which level of scheduling you are referring to.
If you mean the "Future" section of the proposal, that is Mesos allocation logic.
How to use the rack information to implement advanced features (fault
tolerance, data locality) is up to the framework scheduler.
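To illustrate the framework side of that split, here is a minimal sketch of a scheduler spreading replicas across racks for fault tolerance. It assumes the framework already has a mapping from agent to rack ID; the function name `spread_across_racks` and the dict shape are hypothetical and not part of any Mesos API.

```python
from collections import defaultdict
from itertools import zip_longest


def spread_across_racks(agent_racks, replicas):
    """Pick up to `replicas` agents, touching as many distinct racks as possible.

    agent_racks: dict of agent ID -> rack ID (hypothetical shape).
    """
    # Group agents by rack, in a deterministic order.
    by_rack = defaultdict(list)
    for agent, rack in sorted(agent_racks.items()):
        by_rack[rack].append(agent)

    # Round-robin over the racks: take one agent from every rack
    # before taking a second agent from any rack.
    rounds = zip_longest(*(by_rack[rack] for rack in sorted(by_rack)))
    ordered = [agent for rnd in rounds for agent in rnd if agent is not None]
    return ordered[:replicas]
```

With three replicas and agents on three racks, each replica lands on a different rack, so losing one rack takes out at most one replica.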
b) As mentioned by James, this needs to be very flexible (and not
necessarily based on network structure),
The proposed network topology detection is modular, so it can fit Ethernet,
InfiniBand, or other network implementations. And yes, users can statically
configure /etc/mesos/rack_id to manipulate the logical network topology
easily.
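A minimal sketch of how a static override could take precedence over automatic detection. It assumes /etc/mesos/rack_id holds a single line with the rack identifier; that file format, the function name `resolve_rack_id`, and the `detect` callback are assumptions for illustration, not the actual implementation.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_rack_id(
    config_path: str = "/etc/mesos/rack_id",
    detect: Optional[Callable[[], Optional[str]]] = None,
) -> Optional[str]:
    """Return the operator-configured rack ID, else the detected one."""
    try:
        # Assumed format: the rack ID on a single line.
        configured = Path(config_path).read_text(encoding="utf-8").strip()
    except OSError:
        # Missing or unreadable file: no static override.
        configured = ""

    if configured:
        return configured
    # Fall back to a pluggable detector (e.g. one per network type).
    return detect() if detect is not None else None
```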
afaik people are using labels
on the agents to identify different fault domains, which can then be
interpreted by the framework scheduler. Maybe it would make sense (instead
of identifying the network structure) to come up with a common label
naming scheme which can be understood by all/different frameworks.
I'm not convinced why we should still use labels.
What information would the agents be labeled from? IMO, the cluster operator
still needs something like LLDP to find out the network topology.
Every cluster operator would need to do that on their own, so it's better
to abstract the logic inside Mesos and provide a common interface to
frameworks.
Honestly speaking, I don't follow the argument for labels here.
The proposal is designed to do this *automatically* to reduce maintenance
effort.
Looking forward to your thoughts on this!
On Mon, Jun 6, 2016 at 3:27 PM, james <gar...@verizon.net> wrote:
Hello,
@Stephen:: I guess Stephen is bringing up the 'security' aspect of
who gets access to the information, particularly cluster/cloud
devops, customers or interlopers...?
@Fan:: As a consultant, most of my customers either have or are
planning hybrid installations, where some codes run on a local
cluster or use 'the cloud' for dynamic load requirements. I would
think your proposed scheme needs to be very flexible, both in
application to a campus or Metropolitan Area Network, if not
massively distributed around the globe. What about different resource
types (racks of arm64, GPU-centric hardware, DSPs, FPGAs, etc.)?
Hardware diversity brings many benefits to the cluster/cloud capabilities.
This also begs the question of hardware management (boot/config/online)
of the various hardware, such as is built into CoreOS. Are several
applications going to be supported? Standards track? Just Mesos DC/OS
centric?
TIMING DATA:: This is the main issue I see. Once you start 'vectoring
in resources' you need to add timing (latency) data to encourage robust
and diversified use of this data. For HPC, this could be very
valuable for RDMA abusive algorithms, where memory constrained
workloads need not only the knowledge of additional nearby memory
resources, but also the approximated (based on previously collected data)
latency and bandwidth constraints for using those additional resources.
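One way such timing data might be kept, as a rough sketch only: an exponentially weighted moving average of observed latency per rack pair, used to pick the nearest rack with known measurements. The class `LinkStats`, its methods, and the microsecond unit are hypothetical and not part of the proposal.

```python
class LinkStats:
    """Smoothed latency estimates between racks (hypothetical helper)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha      # weight given to the newest sample
        self.latency_us = {}    # (src_rack, dst_rack) -> smoothed latency

    def record(self, src, dst, sample_us):
        """Fold one latency sample into the running estimate."""
        key = (src, dst)
        prev = self.latency_us.get(key)
        if prev is None:
            self.latency_us[key] = sample_us
        else:
            self.latency_us[key] = (1 - self.alpha) * prev + self.alpha * sample_us

    def nearest(self, src, candidates):
        """Return the candidate rack with the lowest known latency from src."""
        known = [c for c in candidates if (src, c) in self.latency_us]
        return min(known, key=lambda c: self.latency_us[(src, c)], default=None)
```

Racks without any recorded sample are skipped rather than guessed at; the same structure could hold bandwidth estimates alongside latency.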
Great idea. I do like it very much.
hth,
James
On 06/06/2016 05:06 AM, Stephen Gran wrote:
Hi,
This looks potentially interesting. How does it work in a public cloud
deployment scenario? I assume you would just have to disable this
feature, or not enable it?
Cheers,
On 06/06/16 10:17, Du, Fan wrote:
Hi, Mesos folks
I’ve been thinking about Mesos rack awareness support for a while.
It’s a common interest for lots of data center applications to provide
data locality, fault tolerance, and better task placement. I created
MESOS-5545 to track the story, and here is the initial design doc [1]
to support rack awareness in Mesos.
Looking forward to hearing any comments from end users and other
developers.
Thanks!
[1]:
https://docs.google.com/document/d/1rql_LZSwtQzBPALnk0qCLsmxcT3-zB7X7aJp-H3xxyE/edit?usp=sharing