Re: Questions about framework development - (HA and reconciling state)

Jeff Schroeder Sat, 25 Jul 2015 21:37:30 -0700

Not sure how much more difficult it would be, but Apache Aurora uses the
native Mesos replicated log construct for data persistence (where you store
data in memory). It requires one manual setup step to deploy the framework,
but it seems worth it for what you get out of it. Here is how I recently
tested it; I was impressed with how bulletproof it is.


I ran a semi chaos monkey test with Aurora + aurproxy with an nginx load
balancer. At random intervals of under 200 seconds, it restarted one of the
5 Aurora schedulers, in a loop. Meanwhile, with clients hitting the webapp
at ~50-60 rps, I was cycling an aurora job update between 5 and 15
instances in a loop to see how the clients handled scheduler failover and
instances being killed.

Never had a single issue from the schedulers and only a single 502 error
after about 2 million requests, which can be mitigated with a bit more
tuning.
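
The restart loop described above could be sketched roughly like this in
Python. This is only an illustration, not the script used in the test: the
chaos_loop function, its parameters, and the restart callback are all
hypothetical, and how a scheduler actually gets restarted (ssh, init
system, etc.) is left to the caller.

```python
import random
import time
from typing import Callable, List, Optional, Sequence, Tuple


def chaos_loop(
    schedulers: Sequence[str],
    restart: Callable[[str], None],
    iterations: int,
    max_delay_s: float = 200.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, float]]:
    """Wait a random time, restart one random scheduler, repeat.

    `restart` is a caller-supplied (hypothetical) hook that performs the
    actual restart of the named scheduler.
    """
    rng = rng or random.Random()
    history = []
    for _ in range(iterations):
        # Random wait of up to max_delay_s seconds between restarts.
        delay = rng.uniform(0, max_delay_s)
        sleep(delay)
        # Pick one of the schedulers at random and restart it.
        victim = rng.choice(schedulers)
        restart(victim)
        history.append((victim, delay))
    return history
```

Passing in sleep and rng makes it possible to dry-run the loop without
waiting, e.g. chaos_loop(names, print, 10, sleep=lambda s: None).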

On Saturday, July 25, 2015, Ankur Chauhan <an...@malloc64.com> wrote:

> Hi all,
>
>
> I am working on creating an integration between Apache Flink (
> http://flink.apache.org) and mesos which would be similar to the way the
> current hadoop-mesos integration works using the java mesos client.
> My current idea is that the scheduler will also run a JobManager process
> (similar to the jobTracker) which will start off a bunch of taskManager
> (similar to the TaskTracker) tasks using a custom executor.
>
> I want to get some feedback and information on the following questions I
> have:
>
> 0. How do I go about the issue of HA at the scheduler level?
>     I was thinking of using zookeeper based leader election by directly
> maintaining a zookeeper connection myself. Is there a better way to do this
> (something which does not require me to use a self managed zookeeper
> connection)?
>
> 1. How do I deal with restarts and reconciling the tasks?
>     In case the scheduler restarts (it currently maintains an in-memory map
> of currently running tasks), how do I go about rediscovering tasks and
> reconciling state?
>     I was thinking of using DiscoveryInfo but I can't find any reference to
> figure out how to "query" mesos for tasks matching the service discovery
> information. Any suggestions on how to do this?
>
> 3. How does one go about testing frameworks? Any suggestions / pointers.
>
> My work in progress version is at
> https://github.com/ankurcha/flink/tree/flink-mesos/flink-mesos
>
> Any help would be much appreciated.
>
>
> Thanks!
> Ankur
>


-- 
Text by Jeff, typos by iPhone
