+1

Likewise, I think it's awesome and would love to be involved.

*Marco Massenzio*

*Distributed Systems Engineer* <http://codetrips.com>

On Mon, Oct 5, 2015 at 10:50 AM, Neil Conway <[email protected]> wrote:

> On Sun, Oct 4, 2015 at 6:14 PM, Maged Michael <[email protected]>
> wrote:
> > I'd appreciate feedback on a proposal for a simulation tool for debugging
> > and testing the Mesos master and allocator.
>
> Overall, this is awesome! I'd love to see Mesos improve in this area,
> and I'd be happy to help out where I can.
>
> > Simulations would--randomly but deterministically--explore the state space
> > of cloud configurations and check for invariant violations and collect
> > stats--in addition to those already in the Mesos master code.
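
To make "randomly but deterministically" concrete, here is a minimal sketch in Python, assuming a toy single-resource model rather than anything from the Mesos code base. The names `explore`, `total_cpus` and the event tuples are hypothetical. The idea is only that a seeded private RNG makes every random run reproducible, and that an invariant (no oversubscription) is checked after each event.

```python
import random

def explore(seed, steps=200, total_cpus=16):
    """Run one randomized but reproducible exploration of a toy cluster.

    Returns (events, violation), where violation is None if the
    invariant held for the whole run.
    """
    rng = random.Random(seed)  # private RNG: same seed -> same event sequence
    allocated = {}             # hypothetical state: task id -> cpus held
    events = []
    for step in range(steps):
        free = total_cpus - sum(allocated.values())
        if allocated and rng.random() < 0.4:
            # Random event: a running task finishes and releases its cpus.
            task = rng.choice(sorted(allocated))
            del allocated[task]
            events.append(("finish", task))
        elif free > 0:
            # Random event: launch a task that fits in the free capacity.
            cpus = rng.randint(1, free)
            allocated[step] = cpus
            events.append(("launch", step, cpus))
        # Invariant checked after every event: never oversubscribe.
        if sum(allocated.values()) > total_cpus:
            return events, f"oversubscribed at step {step}"
    return events, None

events, violation = explore(seed=42)
```

A failing seed can then be re-run as often as needed, since the same seed always yields the same event sequence.
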
>
> It would be useful to be able to (a) record a "trace" from a running
> (production) Mesos instance and (b) replay that trace under the simulator,
> e.g., to explore the impact of changes to Mesos. For example, see
> Section 3.1 of the Borg paper [1].
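
A rough sketch of the record/replay idea, in Python for brevity. Everything here is an assumption for illustration: the JSON-lines trace format, the event fields, and the toy `make_allocator` are hypothetical and not part of Mesos. It shows one recorded trace being replayed against a baseline and against a changed policy so the decisions can be compared.

```python
import json

def record(events):
    """Serialize events (plain dicts) as one JSON object per line."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events)

def replay(trace, handler):
    """Feed each recorded event to `handler` in order and collect its outputs."""
    return [handler(json.loads(line)) for line in trace.splitlines() if line]

def make_allocator(total_cpus, reserve=0):
    """Hypothetical allocator: accept a request if it fits, keeping `reserve` cpus free."""
    state = {"used": 0}

    def handle(event):
        if event["type"] == "request":
            if state["used"] + event["cpus"] <= total_cpus - reserve:
                state["used"] += event["cpus"]
                return True   # request accepted
            return False      # request rejected
        if event["type"] == "release":
            state["used"] -= event["cpus"]
        return None           # releases produce no decision

    return handle

trace = record([
    {"type": "request", "cpus": 6},
    {"type": "request", "cpus": 4},
    {"type": "release", "cpus": 6},
    {"type": "request", "cpus": 5},
])

baseline = replay(trace, make_allocator(total_cpus=10))
changed = replay(trace, make_allocator(total_cpus=10, reserve=2))
```

Comparing `baseline` and `changed` shows which recorded requests would have been decided differently under the modified policy. The sketch is naive about releases of requests that were never accepted; a real replayer would need to track that.
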
>
> > * Automated transformation of Mesos source code for integration into the
> > simulator, to allow the simulator to use simulated time instead of real
> > time and to intercept libprocess-based inter-thread and inter-node
> > communication.
>
> Can you elaborate on how you see the source code transformation working?
>
> Because of the way in which Mesos uses processes and message passing,
> you can already control timeouts and inter-process communication in a
> fairly sophisticated way -- for example, see Clock::advance(),
> Clock::settle(), FUTURE_MESSAGE(), DROP_MESSAGE(), etc. Do you think
> it would be possible to implement the simulator in a way that
> leverages (and improves!) the existing facilities in libprocess,
> rather than building new functionality? For example, to control the
> way in which processes and events are interleaved, would it be
> possible to do this by hooking into the libprocess message dispatch
> logic, rather than doing a source code transformation?
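
For readers who have not used those facilities, the underlying idea of simulated time can be sketched as follows. This is a toy Python model, not the libprocess API: `SimClock`, `delay` and `advance` are hypothetical names chosen for illustration. Timers fire only when the test moves the virtual clock forward, and they fire in a deterministic order.

```python
import heapq

class SimClock:
    """Virtual time: timers fire only when advance() moves the clock."""

    def __init__(self):
        self.now = 0.0
        self._timers = []  # heap of (deadline, sequence, callback)
        self._seq = 0      # tie-breaker: same-deadline timers fire in FIFO order

    def delay(self, duration, callback):
        """Schedule `callback` to run `duration` virtual seconds from now."""
        heapq.heappush(self._timers, (self.now + duration, self._seq, callback))
        self._seq += 1

    def advance(self, duration):
        """Move virtual time forward, firing every timer that becomes due."""
        target = self.now + duration
        while self._timers and self._timers[0][0] <= target:
            deadline, _, callback = heapq.heappop(self._timers)
            self.now = deadline  # callbacks observe the time they were due
            callback()
        self.now = target

clock = SimClock()
fired = []
clock.delay(30.0, lambda: fired.append("offer timeout"))
clock.delay(5.0, lambda: fired.append("health check"))
clock.advance(10.0)  # only the 5-second timer is due
clock.advance(25.0)  # clock reaches t=35, so the 30-second timer fires
```

No real time passes in either `advance` call, which is what lets a simulator cover long timeouts quickly and reproducibly.
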
>
> > Examples of problems to be detected:
> > * Liveness problems such as deadlock, livelock, starvation
> > * Safety problems such as oversubscription of resources, permanent loss of
> > resources or tasks, data corruption in general.
> > * Fairness problems such as sustained imbalance in allocation of resources
> > to frameworks.
> > * Performance problems such as high response time, low resource utilization.
>
> Validating that the system behaves correctly in the presence of
> network partitions would also be great.
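
One way a simulator could model partitions is to route every message through a layer that drops traffic between partitioned nodes. The Python sketch below is a toy under that assumption; `SimNetwork` and the node and message names are hypothetical and do not correspond to Mesos or libprocess code.

```python
class SimNetwork:
    """Toy message layer that drops traffic between partitioned nodes."""

    def __init__(self):
        self.blocked = set()  # unordered node pairs that cannot communicate
        self.delivered = []   # log of (src, dst, message) that got through

    def partition(self, a, b):
        self.blocked.add(frozenset((a, b)))

    def heal(self, a, b):
        self.blocked.discard(frozenset((a, b)))

    def send(self, src, dst, message):
        """Deliver the message unless src and dst are partitioned."""
        if frozenset((src, dst)) in self.blocked:
            return False  # silently dropped, as in a real partition
        self.delivered.append((src, dst, message))
        return True

net = SimNetwork()
net.partition("master", "agent-1")
dropped = net.send("agent-1", "master", "status update")      # lost in the partition
ok = net.send("agent-2", "master", "status update")           # unaffected link
net.heal("master", "agent-1")
retried = net.send("agent-1", "master", "status update")      # delivered after healing
```

A simulator could then check after healing that no task or resource was permanently lost.
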
>
> To clarify, it seems like you are primarily focused on finding
> bugs/problems in core Mesos, rather than in Mesos framework
> implementations. The latter would also be a very interesting project
> (e.g., as a framework author, you'd get a tool that would push
> your scheduler/executor implementation through the entire state space
> of situations the framework would need to handle).
>
> Neil
>
> [1]
> https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
>
