Re: Setting ulimits on mesos-slave

2016-04-25 Thread haosdent
According to my test, it works on my side.

* Before adding it to /etc/init/mesos-slave.conf

```
cat /proc/16550/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             63655                63655                processes
Max open files            8192                 8192                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63655                63655                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
```

* After adding it to /etc/init/mesos-slave.conf

```
description "mesos slave"

# I didn't use ulimit because I have already set ulimit globally.
limit fsize 20001 20001
limit nofile 2 2
```

```
cat /proc/16602/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             20001                20001                bytes      # <- Have changed
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             63655                63655                processes
Max open files            8192                 8192                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63655                63655                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
```
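
To compare the before and after output without reading the columns by eye, the text of /proc/<pid>/limits can be parsed into a dictionary. This is only a minimal Python sketch, assuming the usual Linux layout in which columns are separated by two or more spaces; `parse_limits` and `read_limits` are hypothetical helper names, not part of Mesos or upstart.

```python
import re


def parse_limits(text):
    """Parse the text of /proc/<pid>/limits into {name: (soft, hard, units)}.

    Assumes columns are separated by two or more spaces and that every
    limit name starts with "Max ", as in the output shown above.
    """
    limits = {}
    for line in text.splitlines():
        if not line.startswith("Max "):
            continue  # skips the header row and any blank lines
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue
        # The Units column is empty for some rows (e.g. nice priority).
        units = fields[3] if len(fields) > 3 else ""
        limits[fields[0]] = (fields[1], fields[2], units)
    return limits


def read_limits(pid="self"):
    """Read and parse the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_limits(handle.read())
```

Calling `read_limits(16602)["Max file size"]` for the PID above would then return the soft limit, hard limit, and units as a tuple of strings.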

On Tue, Apr 26, 2016 at 3:51 AM, June Taylor  wrote:

> Hello. We are running it as root, and it is able to specify the ulimit for
> open files, as noted in our config file, but it is not setting the File
> Size limit, which is remaining at 8MB.
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Mon, Apr 25, 2016 at 2:50 PM, Dick Davies 
> wrote:
>
>> Hi June
>>
>> are you running Mesos as root, or a non-privileged user? Non-root
>> won't be able to up their own ulimit too high
>> (sorry, not an upstart expert as RHELs is laughably incomplete).
>>
>> On 25 April 2016 at 19:15, June Taylor  wrote:
>> > What I'm saying is even putting them within the upstart script, per the
>> > Mesos documentation, isn't working for the file block limit. We're still
>> > getting 8MB useable, and as a result executors fail when attempting to
>> write
>> > larger files.
>> >
>> >
>> > Thanks,
>> > June Taylor
>> > System Administrator, Minnesota Population Center
>> > University of Minnesota
>> >
>> > On Mon, Apr 25, 2016 at 11:53 AM, haosdent  wrote:
>> >>
>> >> If you set in your upstart script, it isn't system wide and only
>> effective
>> >> in that session. I think need change /etc/security/limits.conf and
>> >> /etc/sysctl.conf to make your ulimit work globally.
>> >>
>> >> On Tue, Apr 26, 2016 at 12:43 AM, June Taylor  wrote:
>> >>>
>> >>> Somewhere an 8MB maximum file size is being applied on just one of our
>> >>> slaves, for example.
>> >>>
>> >>>
>> >>> Thanks,
>> >>> June Taylor
>> >>> System Administrator, Minnesota Population Center
>> >>> University of Minnesota
>> >>>
>> >>> On Mon, Apr 25, 2016 at 11:42 AM, June Taylor  wrote:
>> >>>>
>> >>>> We are operating a 6-node cluster running on Ubuntu, and have noticed
>> >>>> that the ulimit settings within the slave context are difficult to
>> >>>> set and predict.
>> >>>>
>> >>>> The documentation is a bit unclear on this point, as well.
>> >>>>
>> >>>> We have had some luck adding a configuration line to
>> >>>> /etc/init/mesos-slave.conf as follows:
>> >>>> limit nofile 2 2
>> >>>> limit fsize unlimited unlimited
>> >>>>
>> >>>> The nofile limit seems to be respected, however the

Re: stable remote branches

2016-04-25 Thread Benjamin Mahler
+user as an FYI

Going forward we'll push directly to these branches as backport decisions
are made. Since 0.28.x, 0.27.x, and 0.26.x have just been created, here is
what was already marked for these versions, that we'll have to cherry-pick:

The following need to be cherry-picked for 0.28.2:
https://issues.apache.org/jira/browse/MESOS-4705 - Linux perf fix [bmahler]
https://issues.apache.org/jira/browse/MESOS-5253 - Isolator cleanup fix
[jie]
https://issues.apache.org/jira/browse/MESOS-5282 - CHECK failure in test
[jie]
https://issues.apache.org/jira/browse/MESOS-5238 - Race in mesos
containerizer [jie] (cause of MESOS-5282)

The following need to be cherry-picked for 0.27.3:
https://issues.apache.org/jira/browse/MESOS-4705 - Linux perf fix [bmahler]
https://issues.apache.org/jira/browse/MESOS-4869 - Health checker leaks
memory [bmahler]
https://issues.apache.org/jira/browse/MESOS-5021 - process::Subprocess
memory leak [bmahler] (cause of MESOS-4869)
https://issues.apache.org/jira/browse/MESOS-4662 - PortMapping network
isolator should not assume BIND_MOUNT_ROOT is a realpath. [jie]
https://issues.apache.org/jira/browse/MESOS-4979 - os::rmdir does not
handle special files [jie]
https://issues.apache.org/jira/browse/MESOS-5018 - FrameworkInfo Capability
enum does not support upgrades. [bmahler]

The following need to be cherry-picked for 0.26.2:
https://issues.apache.org/jira/browse/MESOS-4705 - Linux perf fix [bmahler]

Looks like it's just on jie and me. I've put these together in a doc here to
capture the progress:
https://docs.google.com/document/d/1DKCn05oFNirXRvX3A-i_h3nLn-0oyJL3vmakjqdarNw/edit?usp=sharing

As always, if you see things that you believe should be backported, let us
know.

Ben

On Mon, Apr 25, 2016 at 2:47 PM, Vinod Kone  wrote:

> Hi guys,
>
> Per the latest guidelines on doing mesos releases and backports, we've
> created remote branches for releases that are still supported (0.26.x,
> 0.27.x, 0.28.x).
>
> Going forward, any issues that need to be backported or fixed should land
> in these branches. For backports and *CHANGELOG* updates, make sure to
> *land
> them first on the master* before cherry-picking them onto the remote
> branches.
>
> Please let me know if you have any questions/concerns,
>
> Thanks,
> Vinod
>
> P.S: Sorry for the commit noise. We won't have this noise from 0.29.0
> onward since we will push 0.29.x at the same time as 0.29.0.
>


Re: Reconnected slaves not sending resource offers?

2016-04-25 Thread Thomas Petr
Ah, thanks for the clarification. I can't find any logs from the framework
indicating that we got the initial offer, so it looks like it could have
been dropped. We haven't set --offer-timeout on our masters, so your
explanation makes sense. Thanks!

On Mon, Apr 25, 2016 at 4:17 PM, Vinod Kone  wrote:

>
> I0421 21:03:32.014999 17071 master.cpp:4290] Sending 1 offers to
>> framework sy3x4 (sy3x4) at
>> scheduler-6bb2bcf0-d060-4072-a25b-917d8007fb1c@172.16.13.243:56861
>>
>
> This shows that the slaves resources were sent to a framework. Looks like
> the framework is holding on to the offer for a long time?
>
>
>> I0421 21:03:32.019800 17076 hierarchical.hpp:588] Slave
>> 20151116-203437-35000492-5050-17068-S70 (lively-rice) updated with
>> oversubscribed resources  (total: mem(*):217609; cpus(*):210;
>> ports(*):[2048-3048]; disk(*):639829, allocated: mem(*):217609;
>> cpus(*):210; ports(*):[2048-3048]; disk(*):639829)
>>
>
> This says that from the view point of master/allocator, all the resources
> are allocated. This is because the framework hasn't replied to the offer.
> Did the framework receive the offer or was it dropped by the network due to
> the networking issues?
>
>


Re: Reconnected slaves not sending resource offers?

2016-04-25 Thread Vinod Kone
> I0421 21:03:32.014999 17071 master.cpp:4290] Sending 1 offers to
> framework sy3x4 (sy3x4) at
> scheduler-6bb2bcf0-d060-4072-a25b-917d8007fb1c@172.16.13.243:56861
>

This shows that the slave's resources were sent to a framework. Looks like
the framework is holding on to the offer for a long time?


> I0421 21:03:32.019800 17076 hierarchical.hpp:588] Slave
> 20151116-203437-35000492-5050-17068-S70 (lively-rice) updated with
> oversubscribed resources  (total: mem(*):217609; cpus(*):210;
> ports(*):[2048-3048]; disk(*):639829, allocated: mem(*):217609;
> cpus(*):210; ports(*):[2048-3048]; disk(*):639829)
>

This says that, from the viewpoint of the master/allocator, all the resources
are allocated. This is because the framework hasn't replied to the offer.
Did the framework receive the offer, or was it dropped by the network due to
the networking issues?


Re: Setting ulimits on mesos-slave

2016-04-25 Thread June Taylor
Hello. We are running it as root, and it is able to specify the ulimit for
open files, as noted in our config file, but it is not setting the File
Size limit, which is remaining at 8MB.


Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Mon, Apr 25, 2016 at 2:50 PM, Dick Davies  wrote:

> Hi June
>
> are you running Mesos as root, or a non-privileged user? Non-root
> won't be able to up their own ulimit too high
> (sorry, not an upstart expert as RHELs is laughably incomplete).
>
> On 25 April 2016 at 19:15, June Taylor  wrote:
> > What I'm saying is even putting them within the upstart script, per the
> > Mesos documentation, isn't working for the file block limit. We're still
> > getting 8MB useable, and as a result executors fail when attempting to
> write
> > larger files.
> >
> >
> > Thanks,
> > June Taylor
> > System Administrator, Minnesota Population Center
> > University of Minnesota
> >
> > On Mon, Apr 25, 2016 at 11:53 AM, haosdent  wrote:
> >>
> >> If you set in your upstart script, it isn't system wide and only
> effective
> >> in that session. I think need change /etc/security/limits.conf and
> >> /etc/sysctl.conf to make your ulimit work globally.
> >>
> >> On Tue, Apr 26, 2016 at 12:43 AM, June Taylor  wrote:
> >>>
> >>> Somewhere an 8MB maximum file size is being applied on just one of our
> >>> slaves, for example.
> >>>
> >>>
> >>> Thanks,
> >>> June Taylor
> >>> System Administrator, Minnesota Population Center
> >>> University of Minnesota
> >>>
> >>> On Mon, Apr 25, 2016 at 11:42 AM, June Taylor  wrote:
> >>>>
> >>>> We are operating a 6-node cluster running on Ubuntu, and have noticed
> >>>> that the ulimit settings within the slave context are difficult to
> >>>> set and predict.
> >>>>
> >>>> The documentation is a bit unclear on this point, as well.
> >>>>
> >>>> We have had some luck adding a configuration line to
> >>>> /etc/init/mesos-slave.conf as follows:
> >>>> limit nofile 2 2
> >>>> limit fsize unlimited unlimited
> >>>>
> >>>> The nofile limit seems to be respected, however the fsize limit does
> >>>> not.
> >>>>
> >>>> It is also mysterious that the system-wide limits are not inherited by
> >>>> the slave process. We would prefer to set all of these system-wide
> >>>> and have mesos-slave observe them.
> >>>>
> >>>> Can you please advise where you are setting your ulimits for the
> >>>> mesos-slave if it is working for you?
> >>>>
> >>>> Thanks,
> >>>> June Taylor
> >>>> System Administrator, Minnesota Population Center
> >>>> University of Minnesota
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Best Regards,
> >> Haosdent Huang
> >
> >
>


Re: Setting ulimits on mesos-slave

2016-04-25 Thread Dick Davies
Hi June

Are you running Mesos as root, or as a non-privileged user? A non-root user
won't be able to raise its own ulimit too high
(sorry, I'm not an upstart expert, as RHEL's is laughably incomplete).

On 25 April 2016 at 19:15, June Taylor  wrote:
> What I'm saying is even putting them within the upstart script, per the
> Mesos documentation, isn't working for the file block limit. We're still
> getting 8MB useable, and as a result executors fail when attempting to write
> larger files.
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Mon, Apr 25, 2016 at 11:53 AM, haosdent  wrote:
>>
>> If you set in your upstart script, it isn't system wide and only effective
>> in that session. I think need change /etc/security/limits.conf and
>> /etc/sysctl.conf to make your ulimit work globally.
>>
>> On Tue, Apr 26, 2016 at 12:43 AM, June Taylor  wrote:
>>>
>>> Somewhere an 8MB maximum file size is being applied on just one of our
>>> slaves, for example.
>>>
>>>
>>> Thanks,
>>> June Taylor
>>> System Administrator, Minnesota Population Center
>>> University of Minnesota
>>>
>>> On Mon, Apr 25, 2016 at 11:42 AM, June Taylor  wrote:

>>>> We are operating a 6-node cluster running on Ubuntu, and have noticed
>>>> that the ulimit settings within the slave context are difficult to set and
>>>> predict.
>>>>
>>>> The documentation is a bit unclear on this point, as well.
>>>>
>>>> We have had some luck adding a configuration line to
>>>> /etc/init/mesos-slave.conf as follows:
>>>> limit nofile 2 2
>>>> limit fsize unlimited unlimited
>>>>
>>>> The nofile limit seems to be respected, however the fsize limit does
>>>> not.
>>>>
>>>> It is also mysterious that the system-wide limits are not inherited by
>>>> the slave process. We would prefer to set all of these system-wide and have
>>>> mesos-slave observe them.
>>>>
>>>> Can you please advise where you are setting your ulimits for the
>>>> mesos-slave if it is working for you?
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
>>>
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>
>


Re: Reconnected slaves not sending resource offers?

2016-04-25 Thread Thomas Petr
I0421 21:03:32.014533 17073 hierarchical.hpp:528] Added slave
20151116-203437-35000492-5050-17068-S70 (lively-rice) with mem(*):217609;
cpus(*):210; ports(*):[2048-3048]; disk(*):639829 (allocated: )
I0421 21:03:32.014529 17072 master.cpp:3395] Registered slave
20151116-203437-35000492-5050-17068-S70 at slave(1)@172.16.3.103:5051
(lively-rice) with mem(*):217609; cpus(*):210; ports(*):[2048-3048];
disk(*):639829
I0421 21:03:32.014673 17076 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 4102
I0421 21:03:32.014945 17069 replica.cpp:511] Replica received write request
for position 4102
I0421 21:03:32.014999 17071 master.cpp:4290] Sending 1 offers to framework
sy3x4 (sy3x4) at
scheduler-6bb2bcf0-d060-4072-a25b-917d8007fb1c@172.16.13.243:56861
I0421 21:03:32.015379 17069 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 345429ns
I0421 21:03:32.015403 17069 replica.cpp:679] Persisted action at 4102
I0421 21:03:32.017308 17073 replica.cpp:658] Replica received learned
notice for position 4102
I0421 21:03:32.017627 17073 leveldb.cpp:343] Persisting action (20 bytes)
to leveldb took 292089ns
I0421 21:03:32.017665 17073 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 14004ns
I0421 21:03:32.017681 17073 replica.cpp:679] Persisted action at 4102
I0421 21:03:32.017693 17073 replica.cpp:664] Replica learned TRUNCATE
action at position 4102
I0421 21:03:32.019726 17076 master.cpp:3687] Received update of slave
20151116-203437-35000492-5050-17068-S70 at slave(1)@172.16.3.103:5051
(lively-rice) with total oversubscribed resources
I0421 21:03:32.019800 17076 hierarchical.hpp:588] Slave
20151116-203437-35000492-5050-17068-S70 (lively-rice) updated with
oversubscribed resources  (total: mem(*):217609; cpus(*):210;
ports(*):[2048-3048]; disk(*):639829, allocated: mem(*):217609;
cpus(*):210; ports(*):[2048-3048]; disk(*):639829)

(no other mentions of lively-rice or the slave ID for 10 minutes until we
bounce our scheduler at 21:13...)

I0421 21:13:13.806171 17072 hierarchical.hpp:761] Recovered mem(*):217609;
cpus(*):210; ports(*):[2048-3048]; disk(*):639829 (total: mem(*):217609;
cpus(*):210; ports(*):[2048-3048]; disk(*):639829, allocated: ) on slave
20151116-203437-35000492-5050-17068-S70 from framework sy3x4
I0421 21:13:15.749594 17075 hierarchical.hpp:761] Recovered mem(*):217609;
cpus(*):210; ports(*):[2048-3048]; disk(*):639829 (total: mem(*):217609;
cpus(*):210; ports(*):[2048-3048]; disk(*):639829, allocated: ) on slave
20151116-203437-35000492-5050-17068-S70 from framework sy3x4
I0421 21:14:52.761143 17075 master.cpp:2505] Processing ACCEPT call for
offers: [ 20151116-203437-35000492-5050-17068-O116800466 ] on slave
20151116-203437-35000492-5050-17068-S70 at slave(1)@172.16.3.103:5051 (
lively-rice.iad02.hubspot-networks.net) for framework sy3x4 (sy3x4) at
scheduler-7dda7817-66f1-4b8e-a5dd-9744aea52cba@172.16.40.17:53645

We were originally concerned about the log line at 21:03:32.019800 (where
it says that all the slave's resources were allocated) but I think it's
saying that all the resources on the slave are available as revocable
resources. Am I understanding that correctly?

Thanks,
Tom

On Mon, Apr 25, 2016 at 3:06 PM, Vinod Kone  wrote:

>
> On Mon, Apr 25, 2016 at 8:40 AM, Thomas Petr  wrote:
>
>> The only thing that ended up fixing the situation was bouncing our
>> scheduler (~10 minutes after the restarted slaves joined the cluster) --
>> the act of failing over the framework appeared to "recover" the missing
>> resources:
>>
>
> What do the master logs say when the slave is registered with a new id?
>
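
For anyone measuring how long an offer stayed outstanding in logs like the ones above, the glog-style prefix on each master log line can be turned into a timestamp and subtracted. This is a rough Python sketch, not an official Mesos tool: `glog_time` and `seconds_between` are hypothetical helper names, the year is an assumption (the log prefix does not carry one), and wrapped log lines have to be re-joined before parsing.

```python
import re
from datetime import datetime

# Matches prefixes such as "I0421 21:03:32.014999 " (severity, MMDD, time).
GLOG_PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) ")


def glog_time(line, year=2016):
    """Return the timestamp of a glog-style line, or None if it has none.

    The year is assumed, because the log prefix only records month and day.
    """
    match = GLOG_PREFIX.match(line)
    if match is None:
        return None
    stamp = f"{year}{match.group(1)} {match.group(2)}"
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


def seconds_between(first_line, second_line, year=2016):
    """Seconds elapsed between two log lines, or None if either is unparsable."""
    start = glog_time(first_line, year)
    end = glog_time(second_line, year)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


if __name__ == "__main__":
    sent = "I0421 21:03:32.014999 17071 master.cpp:4290] Sending 1 offers to framework sy3x4"
    recovered = "I0421 21:13:13.806171 17072 hierarchical.hpp:761] Recovered mem(*):217609"
    print(seconds_between(sent, recovered))
```

For the "Sending 1 offers" and first "Recovered" lines quoted above, this gives a gap of just under ten minutes, which matches the time until the scheduler was bounced.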


Re: Reconnected slaves not sending resource offers?

2016-04-25 Thread Vinod Kone
On Mon, Apr 25, 2016 at 8:40 AM, Thomas Petr  wrote:

> The only thing that ended up fixing the situation was bouncing our
> scheduler (~10 minutes after the restarted slaves joined the cluster) --
> the act of failing over the framework appeared to "recover" the missing
> resources:
>

What do the master logs say when the slave is registered with a new id?


Re: Setting ulimits on mesos-slave

2016-04-25 Thread June Taylor
What I'm saying is even putting them within the upstart script, per the
Mesos documentation, isn't working for the file block limit. We're still
getting 8MB useable, and as a result executors fail when attempting to
write larger files.


Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Mon, Apr 25, 2016 at 11:53 AM, haosdent  wrote:

> If you set in your upstart script, it isn't system wide and only effective
> in that session. I think need change /etc/security/limits.conf
> and /etc/sysctl.conf to make your ulimit work globally.
>
> On Tue, Apr 26, 2016 at 12:43 AM, June Taylor  wrote:
>
>> Somewhere an 8MB maximum file size is being applied on just one of our
>> slaves, for example.
>>
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Mon, Apr 25, 2016 at 11:42 AM, June Taylor  wrote:
>>
>>> We are operating a 6-node cluster running on Ubuntu, and have noticed
>>> that the ulimit settings within the slave context are difficult to set and
>>> predict.
>>>
>>> The documentation is a bit unclear on this point, as well.
>>>
>>> We have had some luck adding a configuration line to
>>> /etc/init/mesos-slave.conf as follows:
>>> limit nofile 2 2
>>> limit fsize unlimited unlimited
>>>
>>> The nofile limit seems to be respected, however the fsize limit does not.
>>>
>>> It is also mysterious that the system-wide limits are not inherited by
>>> the slave process. We would prefer to set all of these system-wide and have
>>> mesos-slave observe them.
>>>
>>> Can you please advise where you are setting your ulimits for the
>>> mesos-slave if it is working for you?
>>>
>>> Thanks,
>>> June Taylor
>>> System Administrator, Minnesota Population Center
>>> University of Minnesota
>>>
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>


Re: Run Mesos without master being able to open connections to slaves

2016-04-25 Thread Vinod Kone
On Mon, Apr 25, 2016 at 8:35 AM, Elouan Keryell-Even <
elouan.kery...@gmail.com> wrote:

> So I'd be glad to have some insight from you guys about if it is possible,
> in one way or another, to make Mesos work without the Master being able to
> initiate connections to slaves. I just need to be 100% sure there isn't any
> workaround before going back to my boss :)
>
>
The master and agent/slave are still required to be able to open connections
to each other. There is no workaround that I'm aware of.

We had a similar restriction with scheduler (driver based) to master
communication. The new scheduler HTTP API no longer has this restriction
for master to scheduler communication.

For master to agent communication, the plan is to come up with a new HTTP
API similar to the scheduler HTTP API. Neither the design nor the
implementation has started yet.


Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-04-25 Thread Sharma Podila
This is for an internal hackday project, not for a production setup.


On Mon, Apr 25, 2016 at 1:05 AM, Aaron Carey  wrote:

> Out of curiosity... is this for fun or production workloads? I'd be
> curious to hear about raspis being used in production!
>
> --
>
> Aaron Carey
> Production Engineer - Cloud Pipeline
> Industrial Light & Magic
> London
> 020 3751 9150
>
> --
> *From:* Sharma Podila [spod...@netflix.com]
> *Sent:* 22 April 2016 17:53
> *To:* user@mesos.apache.org; dev
> *Subject:* Running Mesos agent on ARM (Raspberry Pi)?
>
> We are working on a hack to run Mesos agents on Raspberry Pi and are
> wondering if anyone here has done that before. From the Google search
> results we looked at so far, it seems like it has been compiled, but we
> haven't seen an indication that anyone has run it and launched tasks on
> them. And does it sound right that it might take 4 hours or so to compile?
>
> We are looking to run just the agents. The master will be on a regular
> Ubuntu laptop or a server.
>
> Appreciate any pointers.
>
>
>


Re: Setting ulimits on mesos-slave

2016-04-25 Thread haosdent
If you set it in your upstart script, it isn't system-wide and is only
effective in that session. I think you need to change /etc/security/limits.conf
and /etc/sysctl.conf to make your ulimit work globally.

On Tue, Apr 26, 2016 at 12:43 AM, June Taylor  wrote:

> Somewhere an 8MB maximum file size is being applied on just one of our
> slaves, for example.
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Mon, Apr 25, 2016 at 11:42 AM, June Taylor  wrote:
>
>> We are operating a 6-node cluster running on Ubuntu, and have noticed
>> that the ulimit settings within the slave context are difficult to set and
>> predict.
>>
>> The documentation is a bit unclear on this point, as well.
>>
>> We have had some luck adding a configuration line to
>> /etc/init/mesos-slave.conf as follows:
>> limit nofile 2 2
>> limit fsize unlimited unlimited
>>
>> The nofile limit seems to be respected, however the fsize limit does not.
>>
>> It is also mysterious that the system-wide limits are not inherited by
>> the slave process. We would prefer to set all of these system-wide and have
>> mesos-slave observe them.
>>
>> Can you please advise where you are setting your ulimits for the
>> mesos-slave if it is working for you?
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>
>


-- 
Best Regards,
Haosdent Huang


Re: Setting ulimits on mesos-slave

2016-04-25 Thread June Taylor
Somewhere an 8MB maximum file size is being applied on just one of our
slaves, for example.


Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Mon, Apr 25, 2016 at 11:42 AM, June Taylor  wrote:

> We are operating a 6-node cluster running on Ubuntu, and have noticed that
> the ulimit settings within the slave context are difficult to set and
> predict.
>
> The documentation is a bit unclear on this point, as well.
>
> We have had some luck adding a configuration line to
> /etc/init/mesos-slave.conf as follows:
> limit nofile 2 2
> limit fsize unlimited unlimited
>
> The nofile limit seems to be respected, however the fsize limit does not.
>
> It is also mysterious that the system-wide limits are not inherited by the
> slave process. We would prefer to set all of these system-wide and have
> mesos-slave observe them.
>
> Can you please advise where you are setting your ulimits for the
> mesos-slave if it is working for you?
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>


Setting ulimits on mesos-slave

2016-04-25 Thread June Taylor
We are operating a 6-node cluster running on Ubuntu, and have noticed that
the ulimit settings within the slave context are difficult to set and
predict.

The documentation is a bit unclear on this point, as well.

We have had some luck adding a configuration line to
/etc/init/mesos-slave.conf as follows:
limit nofile 2 2
limit fsize unlimited unlimited

The nofile limit seems to be respected, however the fsize limit does not.

It is also mysterious that the system-wide limits are not inherited by the
slave process. We would prefer to set all of these system-wide and have
mesos-slave observe them.

Can you please advise where you are setting your ulimits for the
mesos-slave if it is working for you?

Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota
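
One way to see which limits a process actually inherited is to ask from inside that process. The following is a minimal Python sketch using the standard `resource` module; `describe` and `current_limits` are hypothetical helper names. Run from a task launched by the slave, it would report the fsize limit (in bytes) and the nofile limit that the task really received, whatever the upstart or system-wide settings say.

```python
import resource


def describe(value):
    """Render a raw rlimit value the way /proc/<pid>/limits does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the calling process.

    Only the two limits discussed in this thread are reported.
    """
    wanted = {
        "fsize": resource.RLIMIT_FSIZE,    # maximum file size, in bytes
        "nofile": resource.RLIMIT_NOFILE,  # maximum number of open files
    }
    return {
        name: tuple(describe(v) for v in resource.getrlimit(res))
        for name, res in wanted.items()
    }


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

Because child processes inherit limits from their parent, comparing this output in a login shell with the output inside an executor shows where the two environments diverge.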


Reconnected slaves not sending resource offers?

2016-04-25 Thread Thomas Petr
Hi there,

Some of our Mesos slaves (running version 0.23) got into a strange state
last week. A networking blip from ~20:59 to ~21:03 in AWS caused a number
of slaves to lose connectivity to the Mesos master:

I0421 21:00:46.351019 85618 slave.cpp:3077] No pings from master
received within 75secs
I0421 21:00:46.355203 85594 status_update_manager.cpp:176] Pausing
sending status updates
I0421 21:00:46.355406 85622 slave.cpp:673] Re-detecting master
I0421 21:00:46.355630 85622 slave.cpp:720] Detecting new master
I0421 21:00:46.356101 85603 status_update_manager.cpp:176] Pausing
sending status updates
I0421 21:00:46.356115 85622 slave.cpp:684] New master detected at
master@172.16.22.2:5050
I0421 21:00:46.357239 85622 slave.cpp:709] No credentials provided.
Attempting to register without authentication
I0421 21:00:46.357364 85622 slave.cpp:720] Detecting new master

These slaves were shut down and removed by the master, and their
corresponding tasks were all marked as TASK_LOST:

I0421 21:01:01.355435 17076 master.cpp:241] Shutting down slave
20151116-203245-4077719724-5050-17017-S208 due to health check timeout
W0421 21:01:01.36 17076 master.cpp:3913] Shutting down slave
20151116-203245-4077719724-5050-17017-S208 at
slave(1)@172.16.3.103:5051 (lively-rice) with message 'health check
timed out'
I0421 21:01:01.355660 17076 master.cpp:4974] Removing slave
20151116-203245-4077719724-5050-17017-S208 at
slave(1)@172.16.3.103:5051 (lively-rice): health check timed out
...snip...
I0421 21:01:01.498541 17073 master.cpp:5079] Removed slave
20151116-203245-4077719724-5050-17017-S208 (lively-rice): health check
timed out
I0421 21:01:01.501723 17073 master.cpp:5102] Notifying framework sy3x4
(sy3x4) at scheduler-6a46b6f2-ccf8-416b-b8ba-7bef99576197@172.16.40.17:38483
of lost slave 20151116-203245-4077719724-5050-17017-S208 (lively-rice)
after recovering

The networking issues eventually clear up. The slaves attempt to
re-register with the master, but are shut down due to the master having
removed them:

I0421 21:03:13.789948 85612 slave.cpp:606] Slave asked to shut down by
master@172.16.22.2:5050 because 'health check timed out'
I0421 21:03:13.791801 85612 slave.cpp:1946] Asked to shut down
framework sy3x4 by master@172.16.22.2:5050
I0421 21:03:13.791960 85612 slave.cpp:1971] Shutting down framework sy3x4
I0421 21:03:13.793388 85612 slave.cpp:3667] Shutting down executor
'4ki18' of framework sy3x4
I0421 21:03:13.793678 85612 slave.cpp:3667] Shutting down executor
'8cjp8' of framework sy3x4
I0421 21:03:13.793822 85612 slave.cpp:3667] Shutting down executor
't4ila' of framework sy3x4
I0421 21:03:13.794312 85612 slave.cpp:3667] Shutting down executor
'1al5a' of framework sy3x4
I0421 21:03:13.794628 85612 slave.cpp:3667] Shutting down executor
'i4qp9' of framework sy3x4
...snip...
I0421 21:03:13.820853 85612 slave.cpp:606] Slave asked to shut down by
master@172.16.22.2:5050 because 'Slave attempted to re-register after
removal'
I0421 21:03:13.821146 85612 slave.cpp:1946] Asked to shut down
framework sy3x4 by master@172.16.22.2:5050
W0421 21:03:13.821462 85612 slave.cpp:1967] Ignoring shutdown
framework sy3x4 because it is terminating
...snip...
I0421 21:03:19.281539 85617 slave.cpp:606] Slave asked to shut down by
master@172.16.22.2:5050 because 'Executor exited message from unknown
slave'
I0421 21:03:19.281738 85617 slave.cpp:1946] Asked to shut down
framework sy3x4 by master@172.16.22.2:5050
W0421 21:03:19.281782 85617 slave.cpp:1967] Ignoring shutdown
framework sy3x4 because it is terminating
...snip...
I0421 21:03:23.154587 85619 slave.cpp:564] Slave terminating

Monit starts up the mesos-slave process again, and the affected slaves
successfully register with the master with new slave IDs:

2016-04-21 21:03:31,210:53195(0x7ff160eb8700):ZOO_INFO@check_events@1750:
session establishment complete on server [172.16.5.8:2181],
sessionId=0x1751d0ed4b004cbb, negotiated timeout=1
I0421 21:03:31.210914 53209 group.cpp:313] Group process
(group(1)@172.16.3.103:5051) connected to ZooKeeper
I0421 21:03:31.210963 53209 group.cpp:787] Syncing group operations:
queue size (joins, cancels, datas) = (0, 0, 0)
I0421 21:03:31.210979 53209 group.cpp:385] Trying to create path
'/mesos/mesos_prod_3x4' in ZooKeeper
I0421 21:03:31.212005 53198 state.cpp:36] Recovering state from
'/usr/share/hubspot/mesos/meta'
I0421 21:03:31.213176 53198 state.cpp:79] Failed to find the latest
slave from '/usr/share/hubspot/mesos/meta'
I0421 21:03:31.213376 53198 status_update_manager.cpp:202] Recovering
status update manager
...snip...
I0421 21:03:31.228364 53229 status_update_manager.cpp:176] Pausing
sending status updates
I0421 21:03:31.228436 53218 slave.cpp:684] New master detected at
master@172.16.22.2:5050
I0421 21:03:31.228768 53218 slave.cpp:709] No credentials provided.
Attempting to register without authentication
I0421 21:03:31.228844 53218 slave.cpp:720] Detecting new master
I0421 21:03:31.228996 53218 slave.cpp:4193] Received oversubscribab

Run Mesos without master being able to open connections to slaves

2016-04-25 Thread Elouan Keryell-Even
Hi all,

I work on an R&D project where I need to make two computing clusters
collaborate (an industrial datacenter & a cluster running in the
cloud). One of the two clusters acts as the "main" cluster, the other as
the "secondary" one. The idea is to test "bursting", i.e. when the main
cluster is full it will send jobs to the secondary cluster, so that it can
overcome the current load peak.

Now this is where it gets complex: as it is a small R&D project which
interacts with big industrial infrastructures, we face some strict network
restrictions (for security reasons). We were authorized to open an outgoing
SSH tunnel (from the industrial data center to the cloud), but not an
incoming one. And, of course, we are not authorized to work around this
restriction by using the outgoing tunnel in reverse mode.

I guess this could be overcome if there was a way to make it work with
1-way communications, i.e. only one of the two sides (master or slave)
would take care of initiating all the connections. I think the slave
absolutely needs to be able to open a connection to the master, since it is
the slave that initiates its own registration with the cluster. But on the
other hand, I was not 100% sure whether the master also needed to initiate
connections (maybe to cancel tasks).

So I tried this configuration:

Datacenter ====================> Cloud
    ^                ^             ^
  slaves     1-way SSH tunnel   master

The slave can reach the master in the beginning, but it doesn't work for
very long, because from the master's point of view the slave keeps
switching between connected and disconnected states. I think what happens is
that the slave successfully reaches the master for registration, but then
when the master tries to check if it is still alive (which happens
periodically I guess), it can't reach it (unable to open connection) and
thinks it is disconnected. Then the slave registers again, and so on.
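
One way to confirm which direction is blocked is a plain TCP reachability check run from each side. The following is only a minimal sketch: the host names in the comments are placeholders, and 5050 and 5051 are the Mesos default master and slave ports, which a given deployment may have changed. It tests raw TCP connectivity only, not the Mesos protocol itself.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage (hypothetical host names):
#   on a slave:    can_connect("master.example.com", 5050)
#   on the master: can_connect("slave.example.com", 5051)
```

If the first check succeeds and the second fails, the setup matches the symptom described above: the slave can register, but the master cannot open a connection back to it.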

I tried to look for a similar problem on the web, and I think I found
evidence that the master needs the ability to open connections back to the
slaves:

https://mail-archives.apache.org/mod_mbox/mesos-user/201412.mbox/%3cca+8rcorxmr2nk-sa9ipyk_uvuyr8k7xeh_abl69r0jnb3ul...@mail.gmail.com%3E

http://stackoverflow.com/a/32275220/3037171

http://stackoverflow.com/a/24559617/3037171


However, the most recent link dates back to ~September 2015, and I personally
use Mesos 0.22.1 which dates back to ~May 2015. So I was wondering if this
particular network behavior could be overcome with the latest versions, but
I quickly read through the changelogs and didn't notice anything related
to that. I also dug into the code for several hours, but I found it hard to
understand precisely how the communication architecture of Mesos works.


So I'd be glad to have some insight from you guys on whether it is possible,
in one way or another, to make Mesos work without the master being able to
initiate connections to slaves. I just need to be 100% sure there isn't any
workaround before going back to my boss :)


Thank you very much for your attention!

Elouan


Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-04-25 Thread Jan Stabenow
Hey Sam,

What's your experience with running only a slave on a Pi (no master/ZooKeeper)?
This could be a nice addition for, e.g., small development scenarios, low-power
processes, and experiments on ARM.

Regards,
Jan


> On 25.04.2016 at 12:33, Sam wrote:
> 
> Guys,
> I don't understand why deploying Mesos master and slave on Raspberrypi right 
> now. Most of scenarios is using raspberry pi as edge server of IoT since 
> Raspberrypi low configuration and performance .
> Regards of the possibility of deployment, we have experimented before , and 
> it works. You have to install Debian on Raspberry pi first , then deploying 
> Mesos Master and Slave as docker images. The performance is too low.
> Hope to see what's you guys scenarios .
> 
> 
> Regards,
> Sam
> 
> Sent from my iPhone
> 
> On Apr 25, 2016, at 4:10 PM, tommy xiao wrote:
> 
>> let it go. it give us alternative solution.
>> 
>> 2016-04-25 16:05 GMT+08:00 Aaron Carey:
>> Out of curiosity... is this for fun or production workloads? I'd be curious 
>> to hear about raspis being used in production!
>> 
>>  --
>> 
>> Aaron Carey
>> Production Engineer - Cloud Pipeline
>> Industrial Light & Magic
>> London
>> 020 3751 9150
>> From: Sharma Podila [spod...@netflix.com]
>> Sent: 22 April 2016 17:53
>> To: user@mesos.apache.org; dev
>> Subject: Running Mesos agent on ARM (Raspberry Pi)?
>> 
>> We are working on a hack to run Mesos agents on Raspberry Pi and are 
>> wondering if anyone here has done that before. From the Google search 
>> results we looked at so far, it seems like it has been compiled, but we 
>> haven't seen an indication that anyone has run it and launched tasks on 
>> them. And does it sound right that it might take 4 hours or so to compile?
>> 
>> We are looking to run just the agents. The master will be on a regular 
>> Ubuntu laptop or a server.
>> 
>> Appreciate any pointers.
>> 
>> 
>> 
>> 
>> 
>> --
>> Deshi Xiao
>> Twitter: xds2000
>> E-mail: xiaods(AT)gmail.com 




Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-04-25 Thread Sam
Guys,
I don't understand why you would deploy a Mesos master and slave on a
Raspberry Pi right now. In most scenarios the Raspberry Pi is used as an IoT
edge server, because of its low specs and performance.
Regarding the possibility of deployment: we have experimented with it before,
and it works. You have to install Debian on the Raspberry Pi first, then deploy
the Mesos master and slave as Docker images. The performance is too low.
I would like to hear what your scenarios are.


Regards,
Sam

Sent from my iPhone

> On Apr 25, 2016, at 4:10 PM, tommy xiao  wrote:
> 
> let it go. it give us alternative solution.
> 
> 2016-04-25 16:05 GMT+08:00 Aaron Carey :
>> Out of curiosity... is this for fun or production workloads? I'd be curious 
>> to hear about raspis being used in production!
>> 
>>  --
>> 
>> Aaron Carey
>> Production Engineer - Cloud Pipeline
>> Industrial Light & Magic
>> London
>> 020 3751 9150
>> From: Sharma Podila [spod...@netflix.com]
>> Sent: 22 April 2016 17:53
>> To: user@mesos.apache.org; dev
>> Subject: Running Mesos agent on ARM (Raspberry Pi)?
>> 
>> We are working on a hack to run Mesos agents on Raspberry Pi and are 
>> wondering if anyone here has done that before. From the Google search 
>> results we looked at so far, it seems like it has been compiled, but we 
>> haven't seen an indication that anyone has run it and launched tasks on 
>> them. And does it sound right that it might take 4 hours or so to compile?
>> 
>> We are looking to run just the agents. The master will be on a regular 
>> Ubuntu laptop or a server. 
>> 
>> Appreciate any pointers.
> 
> 
> 
> -- 
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com


Re: setting roles in mesos 0.28

2016-04-25 Thread Adam Bordelon
Seems like you're trying to start Marathon with multiple Mesos roles
"spark;sparkr;ms;qa", but Marathon may be interpreting this as a single
role that happens to include semi-colons. Mesos does not yet support
multiple roles in a single framework. See
https://issues.apache.org/jira/browse/MESOS-1763
Note that the acceptedResourceRoles feature in Marathon currently only
applies to the "*" (unreserved) role vs. the value of --mesos_role
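
To illustrate why the offers would never match in that case, here is a simplified model of single-role matching. It is a sketch for illustration only, not Marathon's or Mesos's actual code, and the function names are made up: a framework registered with one role can use unreserved ("*") resources and resources reserved for exactly that role string.

```python
UNRESERVED = "*"


def usable_roles(framework_role):
    """Roles whose resources a single-role framework can use (simplified model)."""
    return {UNRESERVED, framework_role}


def can_use(resource_role, framework_role):
    """True if resources reserved for resource_role are usable by the framework."""
    return resource_role in usable_roles(framework_role)


# The whole string is treated as one role name, so it never equals "sparkr".
print(can_use("sparkr", "spark;sparkr;ms;qa"))  # False
print(can_use("sparkr", "sparkr"))              # True
```

Under this model, starting Marathon with --mesos_role=sparkr (a single role) is what lets it use the agent's "sparkr" resources.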

On Wed, Apr 20, 2016 at 5:19 AM, Rodrick Brown 
wrote:

> On Apr 20 2016, at 1:36 am, Jian Qiu  wrote:
>
>> It is not necessary to configure --role on master. Actually it should
>> work if you configure --default_role='sparkr' on agent and start marathon
>> with --mesos_role=sparkr. Which version of mesos are you using? And could
>> you attach the master log?
>>
>
> This is Marathon 0.15 and Mesos 0.28.1
>
> on my masters I have the following attribute set
> $ cat /etc/marathon/conf/mesos_role
> spark;sparkr;ms;qa
>
> On the slave I have the following set in the agent
> $ cat /etc/mesos-slave/default_role
> sparkr
>
> $ cat /etc/mesos-slave/resources
> cpus:10;mem:10
>
> $ cat attributes
> rack:sparkr
>
> I'm trying to launch a simple task from marathon on this agent with
> following configs
>
> $ cat rstudio-mesos-shuffle-server.marathon.json
> {
>"id": "/mesos/rstudio-shuffle-service",
>"cmd": ". /opt/spark-1.6.1/conf/spark-env.sh .
> /opt/spark-1.6.1/sbin/spark-config.sh && .
> /opt/spark-1.6.1/bin/load-spark-env.sh && env &&
> /opt/spark-1.6.1/bin/spark-class
> org.apache.spark.deploy.mesos.MesosExternalShuffleService 1",
>"cpus": 0.5,
>"mem": 1524,
>"disk": 100,
>"user": "mesos",
>"instances": 1,
>"requirePorts": true,
>"acceptedResourceRoles": ["sparkr"],
>"ports":
>[
>  31338
>],
>"constraints": [
>  [
>"hostname","UNIQUE"
>  ],
>  [
>"rack", "LIKE", "sparkr"
>  ]
>],
>"env": {
>"SPARK_HOME": "/opt/spark-1.6.1",
>"SPARK_SCALA_VERSION": "2.11"
>},
>"healthChecks": [
>  {
>"gracePeriodSeconds": 5,
>"intervalSeconds": 10,
>"maxConsecutiveFailures": 3,
>"portIndex": 0,
>"protocol": "TCP",
>"timeoutSeconds": 5
>  }
>],
>"maxLaunchDelaySeconds": 3,
>"backoffFactor": 1.20,
> "upgradeStrategy": {
>  "minimumHealthCapacity": 0.5,
>  "maximumOverCapacity": 0.5
>}
> }
>
> In the marathon logs this is what I see
>
> 20 12:11:42 prod-mesos-m-3.aws.xxx.com marathon[29617]: [2016-04-20
> 12:11:42,807] INFO Offer ID:
> [50ceafa4-f3c1-4738-a9eb-c5d3bf0ff742-O13166461]. Considered resources with
> roles: [sparkr]. Not all basic resources satisfied: cpu not in offer, disk
> not in offer, mem not in offer
> (mesosphere.mesos.ResourceMatcher$:marathon-akka.actor.default-dispatcher-9)
>
> Thanks.
>
>
>
>> On Wed, Apr 20, 2016 at 11:11 AM, Rodrick Brown 
>> wrote:
>>
>> I'm confused do roles need to be configured on masters and slaves or just
>> slaves?
>> The docs says --roles has been deprecated on mesos-master but doesn't
>> state an alternate method.
>>
>>
>> on my slaves i'm using default_role='sparkr' and in marathon I've added
>> --mesos_role=sparkr however I'm not able to get any tasks to run on this
>> server do I need to set it on the masters also ?
>>
>> Please advise thanks.
>>
>> --RB
>>
>>
>>
>> --
>>
>> *Rodrick Brown* / Systems Engineer
>>
>> +1 917 445 6839 / rodr...@orchardplatform.com
>> 
>>
>> *Orchard Platform*
>>
>> 101 5th Avenue, 4th Floor, New York, NY 10003
>>
>> http://www.orchardplatform.com
>>

Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-04-25 Thread tommy xiao
Let it go. It gives us an alternative solution.

2016-04-25 16:05 GMT+08:00 Aaron Carey :

> Out of curiosity... is this for fun or production workloads? I'd be
> curious to hear about raspis being used in production!
>
> --
>
> Aaron Carey
> Production Engineer - Cloud Pipeline
> Industrial Light & Magic
> London
> 020 3751 9150
>
> --
> *From:* Sharma Podila [spod...@netflix.com]
> *Sent:* 22 April 2016 17:53
> *To:* user@mesos.apache.org; dev
> *Subject:* Running Mesos agent on ARM (Raspberry Pi)?
>
> We are working on a hack to run Mesos agents on Raspberry Pi and are
> wondering if anyone here has done that before. From the Google search
> results we looked at so far, it seems like it has been compiled, but we
> haven't seen an indication that anyone has run it and launched tasks on
> them. And does it sound right that it might take 4 hours or so to compile?
>
> We are looking to run just the agents. The master will be on a regular
> Ubuntu laptop or a server.
>
> Appreciate any pointers.
>
>
>


-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com


RE: Running Mesos agent on ARM (Raspberry Pi)?

2016-04-25 Thread Aaron Carey
Out of curiosity... is this for fun or production workloads? I'd be curious to 
hear about raspis being used in production!


--

Aaron Carey
Production Engineer - Cloud Pipeline
Industrial Light & Magic
London
020 3751 9150


From: Sharma Podila [spod...@netflix.com]
Sent: 22 April 2016 17:53
To: user@mesos.apache.org; dev
Subject: Running Mesos agent on ARM (Raspberry Pi)?

We are working on a hack to run Mesos agents on Raspberry Pi and are wondering 
if anyone here has done that before. From the Google search results we looked 
at so far, it seems like it has been compiled, but we haven't seen an 
indication that anyone has run it and launched tasks on them. And does it sound 
right that it might take 4 hours or so to compile?

We are looking to run just the agents. The master will be on a regular Ubuntu 
laptop or a server.

Appreciate any pointers.




mesos docker vs native container

2016-04-25 Thread vincent gromakowski
I am very interested in getting feedback from people who have moved from
native containers to Docker, especially from a network performance
perspective.
DCOS has been open sourced and I like all the automation it brings with
frameworks, but it seems everything is running in Docker?
I am looking at the SMACK stack, for which network performance is important.
Thanks