Re: Master slow to process status updates after massive killing of tasks?

2016-06-19 Thread Thomas Petr
…status update acknowledgements explicitly? Or does it leave this up to the old scheduler driver (default)? - When does Singularity use the reconcileTasks call? That is the source of the "Performing explicit task state reconciliation..." master log line. This might be c…
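A minimal sketch of the two driver calls under discussion, in Java against the old scheduler driver API. Everything named here (the framework name, the master address, the task ID) is a placeholder, and the scheduler argument stands in for the framework's own org.apache.mesos.Scheduler implementation:

    import java.util.Collections;

    import org.apache.mesos.MesosSchedulerDriver;
    import org.apache.mesos.Protos.FrameworkInfo;
    import org.apache.mesos.Protos.TaskID;
    import org.apache.mesos.Protos.TaskState;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.Scheduler;

    public final class ReconcileExample {
        static MesosSchedulerDriver start(Scheduler scheduler) {
            FrameworkInfo framework = FrameworkInfo.newBuilder()
                    .setUser("") // empty string: let Mesos fill in the current user
                    .setName("reconcile-example")
                    .build();

            // The trailing `false` disables implicit acknowledgements: the
            // driver no longer acks status updates on its own, and the
            // scheduler must call driver.acknowledgeStatusUpdate(status)
            // from its statusUpdate() callback once the update is recorded.
            MesosSchedulerDriver driver = new MesosSchedulerDriver(
                    scheduler, framework, "zk://localhost:2181/mesos", false);
            driver.start();

            // Explicit reconciliation: ask the master to re-send the latest
            // state for specific tasks. This is the call behind the
            // "Performing explicit task state reconciliation..." master log
            // line; an empty collection would request implicit
            // reconciliation for all of the framework's tasks instead.
            TaskStatus status = TaskStatus.newBuilder()
                    .setTaskId(TaskID.newBuilder().setValue("some-task-id"))
                    .setState(TaskState.TASK_RUNNING) // required field; master replies with the real state
                    .build();
            driver.reconcileTasks(Collections.singletonList(status));
            return driver;
        }
    }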

Master slow to process status updates after massive killing of tasks?

2016-06-17 Thread Thomas Petr
Hey folks, We got our Mesos cluster (0.28.1) into an interesting state during some chaos monkey testing. I killed off 5 of our 16 agents to simulate an AZ outage, and then accidentally killed off almost all running tasks (a little more than 1,000 of our ~1,300 tasks -- not intentional, but an inter…

Re: Reconnected slaves not sending resource offers?

2016-04-25 Thread Thomas Petr
Ah, thanks for the clarification. I can't find any logs from the framework indicating that we got the initial offer, so it looks like it could have been dropped. We haven't set --offer-timeout on our masters, so your explanation makes sense. Thanks! On Mon, Apr 25, 2016 at 4:17 PM, Vinod Kone wro…
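(For anyone finding this later: --offer-timeout is the master flag that bounds how long a framework can sit on an offer before the master rescinds it, e.g. --offer-timeout=5mins; if I remember right it is unset by default, which matches the behavior above where a dropped offer never came back on its own.)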

Re: Reconnected slaves not sending resource offers?

2016-04-25 Thread Thomas Petr
…all the slave's resources were allocated) but I think it's saying that all the resources on the slave are available as revocable resources. Am I understanding that correctly? Thanks, Tom On Mon, Apr 25, 2016 at 3:06 PM, Vinod Kone wrote: On Mon, Apr 25, 2016 at 8:40 AM, Th…

Reconnected slaves not sending resource offers?

2016-04-25 Thread Thomas Petr
Hi there, Some of our Mesos slaves (running version 0.23) got into a strange state last week. A networking blip from ~20:59 to ~21:03 in AWS caused a number of slaves to lose connectivity to the Mesos master: I0421 21:00:46.351019 85618 slave.cpp:3077] No pings from master received within 75secs…
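(As far as I know, the 75secs in that log line is the slave's master-ping timeout, derived from the default ping interval and the allowed number of missed pings -- 15 seconds times 5 -- rather than a number configured on our end.)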

Boston MUG

2015-04-16 Thread Thomas Petr
Hey folks, Just a heads up for those on the East coast -- we've started a Mesos users group for Boston, MA: http://www.meetup.com/Boston-Mesos-User-Group/. Thanks, Tom

Re: Updating FrameworkInfo settings

2015-02-24 Thread Thomas Petr
t;>>>> Mesos checkpoints the FrameworkInfo into disk, and recovers it on >>>>> relaunch. >>>>> >>>>> I don't think we expose any API to remove the framework manually >>>>> though if you really want to keep the FrameworkID.

Updating FrameworkInfo settings

2015-02-24 Thread Thomas Petr
Hey folks, Is there a best practice for rolling out FrameworkInfo changes? We need to set checkpoint to true, so I redeployed our framework with the new settings (with tasks still running), but when I hit a slave's stats.json endpoint, it appears that the old FrameworkInfo data is still there (whi…
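For reference, a minimal sketch of the relevant change in Java protobuf-builder form; the framework name and failover timeout here are illustrative, not our real values:

    import org.apache.mesos.Protos.FrameworkInfo;

    public final class CheckpointedFrameworkInfo {
        // FrameworkInfo with checkpointing enabled: slaves persist status
        // updates and task state for this framework to disk, so a restarted
        // slave can recover its executors instead of losing them.
        static FrameworkInfo build() {
            return FrameworkInfo.newBuilder()
                    .setUser("") // empty: let Mesos use the current user
                    .setName("my-framework") // placeholder
                    .setCheckpoint(true)
                    .setFailoverTimeout(7 * 24 * 3600) // seconds; illustrative
                    .build();
        }
    }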

Re: Accounting for Mesos resource usage

2014-12-18 Thread Thomas Petr
We (HubSpot) currently have a cron job that enumerates all tasks running on the slaves and pushes resource usage data into OpenTSDB. We then use lead.js to query / visualize this data. The cron job isn't open source, but I could look into releasing it if anyone is interested…
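The general shape of such a job is easy to sketch. The following is a guess at the approach rather than the actual HubSpot code: it assumes per-executor usage comes from a slave's /monitor/statistics.json endpoint (parsing elided) and writes a single data point to OpenTSDB's /api/put HTTP endpoint; the metric name and tag are made up for illustration:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public final class UsagePusher {
        // Write one data point to OpenTSDB (default HTTP port 4242).
        static void pushPoint(String tsdbHost, String metric, long epochSecs,
                              double value, String taskId) throws Exception {
            String body = String.format(
                    "{\"metric\":\"%s\",\"timestamp\":%d,\"value\":%s,"
                    + "\"tags\":{\"task_id\":\"%s\"}}",
                    metric, epochSecs, value, taskId);
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://" + tsdbHost + ":4242/api/put").openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            if (conn.getResponseCode() / 100 != 2) {
                throw new RuntimeException("OpenTSDB put failed: HTTP "
                        + conn.getResponseCode());
            }
        }

        public static void main(String[] args) throws Exception {
            // e.g. a value scraped from mem_rss_bytes in /monitor/statistics.json
            pushPoint("localhost", "mesos.task.mem_rss_bytes",
                    System.currentTimeMillis() / 1000, 123456789.0, "my-task-id");
        }
    }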

Re: Frontend loadbalancer configuration for long running tasks

2014-09-11 Thread Thomas Petr
…am still learning, so I'll use that. -- Ankur On Thu, Sep 11, 2014 at 6:49 AM, Thomas Petr wrote: Hey Ankur! When was the last time you looked at Singularity? We spent some time last month improving our docs <https…

Re: Frontend loadbalancer configuration for long running tasks

2014-09-11 Thread Thomas Petr
Hey Ankur! When was the last time you looked at Singularity? We spent some time last month improving our docs in anticipation of MesosCon, and we're always open to feedback on what could be improved. We're using a combination of Sing…

Re: MesosCon attendee introduction thread

2014-08-14 Thread Thomas Petr
Hey folks, My name is Tom Petr and I'm a tech lead on the PaaS team at HubSpot, an internet marketing company in Cambridge, MA (USA). We've been using Mesos for about a year to run web services, background workers, cron jobs, and one-off tasks via our homegrown Singularit…

Re: Exposing executor container

2014-08-12 Thread Thomas Petr
…in your scenario? Having the executor change cgroup limits behind the scenes, opaquely to Mesos, seems like a recipe for problems in the future to me, since it could lead to temporary over-commit of resources and affect isolation across containers. On Tu…

Re: Exposing executor container

2014-08-12 Thread Thomas Petr
Hey Vinod, We're not using mesos-fetcher to download the executor -- we ensure our executor exists on the slaves beforehand (during machine provisioning, to be exact). The issue that Whitney is talking about is OOMing while fetching artifacts necessary for task execution (like the JAR for a web service…

Re: cgroups memory isolation

2014-06-18 Thread Thomas Petr
Eric pointed out that I had a typo in the instance type -- it's a c3.8xlarge (containing SSDs, which could make a difference here). On Wed, Jun 18, 2014 at 10:36 AM, Thomas Petr wrote: Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32 kernel.

Re: cgroups memory isolation

2014-06-18 Thread Thomas Petr
…Hope this helps! Please do let us know if you're seeing this on a kernel >= 3.10; otherwise it's likely this is a kernel issue rather than something with Mesos. Thanks, Ian On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr wrote: Hello,…

cgroups memory isolation

2014-06-17 Thread Thomas Petr
Hello, We're running Mesos 0.18.0 with cgroups isolation, and have run into situations where lots of file I/O causes tasks to be killed due to exceeding memory limits. Here's an example: https://gist.github.com/tpetr/ce5d80a0de9f713765f0 We were under the impression that if cache was using a lot…
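One way to check whether page cache is what's filling the limit is to read the container cgroup's memory.stat directly. A minimal sketch, assuming cgroup v1 with the memory hierarchy mounted at /sys/fs/cgroup/memory and Mesos's default "mesos" cgroup root; the container ID argument is a placeholder:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public final class CgroupMemStat {
        public static void main(String[] args) throws IOException {
            String container = args.length > 0 ? args[0] : "CONTAINER_ID";
            // "cache" is page cache charged to the cgroup, "rss" is anonymous
            // memory; both count against the limit in the v1 memory controller.
            for (String line : Files.readAllLines(Paths.get(
                    "/sys/fs/cgroup/memory/mesos/" + container + "/memory.stat"))) {
                if (line.startsWith("cache ") || line.startsWith("rss ")
                        || line.startsWith("hierarchical_memory_limit ")) {
                    System.out.println(line);
                }
            }
        }
    }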

Re: Mesos slaves disconnecting because of Zookeeper?

2014-04-09 Thread Thomas Petr
Hey Ted, Could you check your zk connection string and ensure that all the hostnames resolve correctly? When I've hit that error in the past, it was due to ZooKeeper failing to resolve a hostname (in my case, for an EC2 instance that was deleted). Thanks, Tom On Wed, Apr 9, 2014 at 7:09 PM, Ted Y…
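(For anyone searching later: the connection string in question looks like zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/mesos -- the hosts here are placeholders -- and every hostname in that list has to resolve on the machine running the process, or the ZooKeeper client appears to error out before Mesos ever connects.)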

Re: Mesos slave GC clarification

2013-12-27 Thread Thomas Petr
…sandbox, the slave gc's it after 16.8 hours (= (1 - GC_DISK_HEADROOM - 0.8) * 7 days). GC_DISK_HEADROOM is currently set to 0.1. However, it might happen that executors are getting launched (and sandboxes created) at a very high rate. In this case…
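Plugging in the quoted constants makes the 16.8-hour figure concrete: (1 - 0.1 - 0.8) * 7 days = 0.1 * 7 days = 0.7 days = 16.8 hours.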

Re: Mesos slave GC clarification

2013-12-26 Thread Thomas Petr
…very high rate. In this case the slave might not be able to react quickly enough to gc sandboxes. You could grep for "Current usage" in the slave log to see how the disk utilization varies over time. HTH, On Thu, Dec 26, 2013 at 10:56 AM, Thomas Petr…

Mesos slave GC clarification

2013-12-26 Thread Thomas Petr
Hi, We're running Mesos 0.14.0-rc4 on CentOS from the Mesosphere repository. Last week we had an issue where the mesos-slave process died due to running out of disk space. [1] The mesos-slave usage docs mention the "[GC] delay may be shorter depending on the available disk usage." Does anyone have a…