[jira] [Commented] (MESOS-5408) Delete the /observe HTTP endpoint

2016-05-20 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294541#comment-15294541
 ] 

Vinod Kone commented on MESOS-5408:
---

Thanks [~gyliu]!

commit aafe80911c9885de3a97253ff6c553f43bab083a
Author: Guangya Liu 
Date:   Fri May 20 16:40:11 2016 -0700

Removed documentation about /observe endpoint.

Review: https://reviews.apache.org/r/47635/


> Delete the /observe HTTP endpoint
> -
>
> Key: MESOS-5408
> URL: https://issues.apache.org/jira/browse/MESOS-5408
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Qian Zhang
> Fix For: 0.29.0
>
>
> The "/observe" endpoint was introduced a long time ago for supporting 
> functionality that was never implemented. We should just kill this endpoint 
> and associated code to avoid tech debt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5427) Mesos master locks up after slave fails to authenticate

2016-05-20 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294461#comment-15294461
 ] 

Joseph Wu commented on MESOS-5427:
--

Are you running on Ubuntu 10?  (Typo?)  I'm not sure if Mesos builds on that.

Could you try the same setup/configuration with a more recent version of Mesos? 
 The SASL-based authentication code has not changed much.  (It was moved around, 
and is now called the CRAM MD5 authenticator/ee.)

> Mesos master locks up after slave fails to authenticate
> ---
>
> Key: MESOS-5427
> URL: https://issues.apache.org/jira/browse/MESOS-5427
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.20.1
> Environment: Linux XX-X 3.13.0-49-generic #81-Ubuntu SMP 
> Tue Mar 24 19:29:48 UTC 2015 x86_64 GNU/Linux
> Ubuntu 10.04.1 LTS
> AWS/8cores/16GB
>Reporter: analogue
>Priority: Minor
>
> In a mesos master cluster with one leader and two backups, a single slave 
> attempting to authenticate with the leader locked up the master and resulted 
> in 2 CPU cores pegged at 100% CPU usage until restarted.
> master
> {noformat}
> I0516 02:55:39.945566 32126 master.cpp:3612] Authenticating 
> slave(1)@10.85.20.76:5051
> I0516 02:55:39.945757 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.945802 32123 authenticator.hpp:156] Creating new server SASL 
> connection
> I0516 02:55:39.945991 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946030 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946063 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946095 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946126 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946158 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946189 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946221 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946252 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946285 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946316 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946347 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946379 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> ...
> W0516 02:55:44.945811 32124 master.cpp:3670] Authentication timed out
> I0516 02:55:49.290623 32121 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> (last long line repeats until mesos-master restarted)
> {noformat}
> slave
> {noformat}
> Log file created at: 2016/05/16 02:37:52
> Running on machine: 10-85-20-76-uswest2btestopia
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> I0516 02:37:52.112509 10198 logging.cpp:142] INFO level logging started!
> I0516 02:37:52.112761 10198 main.cpp:126] Build: 2014-12-12 00:52:32 by
> I0516 02:37:52.112772 10198 main.cpp:128] Version: 0.20.1
> I0516 02:37:52.112778 10198 main.cpp:131] Git tag: 0.20.1
> I0516 02:37:52.112783 10198 main.cpp:135] Git SHA: 
> fe0a39112f3304283f970f1b08b322b1e970829d
> I0516 02:37:52.112793 10198 containerizer.cpp:89] Using isolation: 
> cgroups/cpu,cgroups/mem
> I0516 02:37:52.125773 10198 linux_launcher.cpp:78] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0516 02:37:52.126652 10198 main.cpp:149] Starting Mesos slave
> I0516 02:37:52.128687 10246 slave.cpp:167] Slave started on 
> 

[jira] [Created] (MESOS-5428) Update the mechanism to define flags in FlagsBase derived classes

2016-05-20 Thread Daniel Pravat (JIRA)
Daniel Pravat created MESOS-5428:


 Summary: Update the mechanism to define flags in FlagsBase derived 
classes
 Key: MESOS-5428
 URL: https://issues.apache.org/jira/browse/MESOS-5428
 Project: Mesos
  Issue Type: Bug
Reporter: Daniel Pravat


If a program exposes flags, the Mesos recommendation has been to derive a class 
from FlagsBase and add the new flags in its constructor.
As a benefit, the new `Flags` class inherits all the flags of the classes it 
derives from.
Each derived class calls the `add` method implemented in `FlagsBase`, which uses 
`dynamic_cast` to set the default value, among other things.

Because the constructor has not finished at that point, the object is not fully 
constructed (in Visual Studio the vtable is not yet correct at that time), so 
the code does not work on Windows.
We should call a separate method on Windows instead.





[jira] [Created] (MESOS-5427) Mesos master locks up after slave fails to authenticate

2016-05-20 Thread analogue (JIRA)
analogue created MESOS-5427:
---

 Summary: Mesos master locks up after slave fails to authenticate
 Key: MESOS-5427
 URL: https://issues.apache.org/jira/browse/MESOS-5427
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.20.1
 Environment: Linux XX-X 3.13.0-49-generic #81-Ubuntu SMP 
Tue Mar 24 19:29:48 UTC 2015 x86_64 GNU/Linux
Ubuntu 10.04.1 LTS
AWS/8cores/16GB
Reporter: analogue


In a mesos master cluster with one leader and two backups, a single slave 
attempting to authenticate with the leader locked up the master and resulted in 
2 CPU cores pegged at 100% CPU usage until restarted.

master
{noformat}
I0516 02:55:39.945566 32126 master.cpp:3612] Authenticating 
slave(1)@10.85.20.76:5051
I0516 02:55:39.945757 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.945802 32123 authenticator.hpp:156] Creating new server SASL 
connection
I0516 02:55:39.945991 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946030 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946063 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946095 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946126 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946158 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946189 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946221 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946252 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946285 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946316 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946347 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946379 32126 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
...
W0516 02:55:44.945811 32124 master.cpp:3670] Authentication timed out
I0516 02:55:49.290623 32121 master.cpp:3598] Queuing up authentication request 
from slave(1)@10.85.20.76:5051 because authentication is still in progress
(last long line repeats until mesos-master restarted)
{noformat}
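
To quantify how often the master repeats a message, the log above can be tallied per source location. The following is only a minimal sketch: it assumes the glog-style prefix seen in these lines (severity letter, mmdd, time, thread id, file:line]), and the names `GLOG_LINE`, `parse_line` and `count_by_source` are hypothetical helpers, not part of Mesos. Wrapped continuation lines, as they appear in this email, do not match the prefix and are skipped.

```python
import re
from collections import Counter

# Assumed glog-style prefix: severity, mmdd, hh:mm:ss.fraction, thread id, file:line], message.
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>\d+) "
    r"(?P<source>[^\s:]+:\d+)\] "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = GLOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_by_source(lines):
    """Count log entries per (severity, file:line) pair, skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields is not None:
            counts[(fields["severity"], fields["source"])] += 1
    return counts
```

Calling `count_by_source(open(path))` on a saved master log (the path is up to the reader) and printing `most_common(5)` would show whether a single call site such as `master.cpp:3598` dominates the output.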

slave
{noformat}
Log file created at: 2016/05/16 02:37:52
Running on machine: 10-85-20-76-uswest2btestopia
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
I0516 02:37:52.112509 10198 logging.cpp:142] INFO level logging started!
I0516 02:37:52.112761 10198 main.cpp:126] Build: 2014-12-12 00:52:32 by
I0516 02:37:52.112772 10198 main.cpp:128] Version: 0.20.1
I0516 02:37:52.112778 10198 main.cpp:131] Git tag: 0.20.1
I0516 02:37:52.112783 10198 main.cpp:135] Git SHA: 
fe0a39112f3304283f970f1b08b322b1e970829d
I0516 02:37:52.112793 10198 containerizer.cpp:89] Using isolation: 
cgroups/cpu,cgroups/mem
I0516 02:37:52.125773 10198 linux_launcher.cpp:78] Using /sys/fs/cgroup/freezer 
as the freezer hierarchy for the Linux launcher
I0516 02:37:52.126652 10198 main.cpp:149] Starting Mesos slave
I0516 02:37:52.128687 10246 slave.cpp:167] Slave started on 1)@10.85.20.76:5051
I0516 02:37:52.128708 10246 credentials.hpp:84] Loading credential for 
authentication from '/etc/seagull_mesos_creds'
W0516 02:37:52.128865 10246 credentials.hpp:99] Permissions on credential file 
'/etc/seagull_mesos_creds' are too open. It is recommended that your credential 
file is NOT accessible by others.
I0516 02:37:52.128968 10246 slave.cpp:265] Slave using credential for: 
seagull_slave
I0516 02:37:52.129612 10246 slave.cpp:278] Slave resources: cpus(*):31; 
disk(*):14; mem(*):59382; ports(*):[31000-32000]
I0516 02:37:52.132064 10255 group.cpp:313] Group process 
(group(1)@10.85.20.76:5051) connected to ZooKeeper
I0516 02:37:52.132086 10255 group.cpp:787] Syncing 

[jira] [Updated] (MESOS-5427) Mesos master locks up after slave fails to authenticate

2016-05-20 Thread analogue (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

analogue updated MESOS-5427:

Priority: Minor  (was: Major)

> Mesos master locks up after slave fails to authenticate
> ---
>
> Key: MESOS-5427
> URL: https://issues.apache.org/jira/browse/MESOS-5427
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.20.1
> Environment: Linux XX-X 3.13.0-49-generic #81-Ubuntu SMP 
> Tue Mar 24 19:29:48 UTC 2015 x86_64 GNU/Linux
> Ubuntu 10.04.1 LTS
> AWS/8cores/16GB
>Reporter: analogue
>Priority: Minor
>
> In a mesos master cluster with one leader and two backups, a single slave 
> attempting to authenticate with the leader locked up the master and resulted 
> in 2 CPU cores pegged at 100% CPU usage until restarted.
> master
> {noformat}
> I0516 02:55:39.945566 32126 master.cpp:3612] Authenticating 
> slave(1)@10.85.20.76:5051
> I0516 02:55:39.945757 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.945802 32123 authenticator.hpp:156] Creating new server SASL 
> connection
> I0516 02:55:39.945991 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946030 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946063 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946095 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946126 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946158 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946189 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946221 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946252 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946285 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946316 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946347 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946379 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> ...
> W0516 02:55:44.945811 32124 master.cpp:3670] Authentication timed out
> I0516 02:55:49.290623 32121 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> (last long line repeats until mesos-master restarted)
> {noformat}
> slave
> {noformat}
> Log file created at: 2016/05/16 02:37:52
> Running on machine: 10-85-20-76-uswest2btestopia
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> I0516 02:37:52.112509 10198 logging.cpp:142] INFO level logging started!
> I0516 02:37:52.112761 10198 main.cpp:126] Build: 2014-12-12 00:52:32 by
> I0516 02:37:52.112772 10198 main.cpp:128] Version: 0.20.1
> I0516 02:37:52.112778 10198 main.cpp:131] Git tag: 0.20.1
> I0516 02:37:52.112783 10198 main.cpp:135] Git SHA: 
> fe0a39112f3304283f970f1b08b322b1e970829d
> I0516 02:37:52.112793 10198 containerizer.cpp:89] Using isolation: 
> cgroups/cpu,cgroups/mem
> I0516 02:37:52.125773 10198 linux_launcher.cpp:78] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0516 02:37:52.126652 10198 main.cpp:149] Starting Mesos slave
> I0516 02:37:52.128687 10246 slave.cpp:167] Slave started on 
> 1)@10.85.20.76:5051
> I0516 02:37:52.128708 10246 credentials.hpp:84] Loading credential for 
> authentication from '/etc/seagull_mesos_creds'
> W0516 02:37:52.128865 10246 credentials.hpp:99] Permissions on credential 
> file '/etc/seagull_mesos_creds' are too open. It is recommended that your 
> 

[jira] [Commented] (MESOS-5426) Relax version compatibility requirement for some modules

2016-05-20 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294234#comment-15294234
 ] 

Adam B commented on MESOS-5426:
---

Not a blocker for 0.29, but I'd like to track it for the release until we 
decide to defer it.

> Relax version compatibility requirement for some modules
> 
>
> Key: MESOS-5426
> URL: https://issues.apache.org/jira/browse/MESOS-5426
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Affects Versions: 0.29.0
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> Some module interfaces, such as authenticatee, have not changed for a while, 
> so we should be able to relax the version compatibility checks. This 
> needs to be done on a case-by-case basis.
> I also hope this change will provide a framework for updating the 
> version requirement for other modules as we go towards a stable module API.
> [cc: [~adam-mesos] [~tillt] ]





[jira] [Updated] (MESOS-5426) Relax version compatibility requirement for some modules

2016-05-20 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5426:
--
Labels: mesosphere security  (was: mesosphere)

> Relax version compatibility requirement for some modules
> 
>
> Key: MESOS-5426
> URL: https://issues.apache.org/jira/browse/MESOS-5426
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Affects Versions: 0.29.0
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> Some module interfaces, such as authenticatee, have not changed for a while, 
> so we should be able to relax the version compatibility checks. This 
> needs to be done on a case-by-case basis.
> I also hope this change will provide a framework for updating the 
> version requirement for other modules as we go towards a stable module API.
> [cc: [~adam-mesos] [~tillt] ]





[jira] [Updated] (MESOS-5426) Relax version compatibility requirement for some modules

2016-05-20 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5426:
--
Fix Version/s: 0.29.0

> Relax version compatibility requirement for some modules
> 
>
> Key: MESOS-5426
> URL: https://issues.apache.org/jira/browse/MESOS-5426
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Affects Versions: 0.29.0
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> Some module interfaces, such as authenticatee, have not changed for a while, 
> so we should be able to relax the version compatibility checks. This 
> needs to be done on a case-by-case basis.
> I also hope this change will provide a framework for updating the 
> version requirement for other modules as we go towards a stable module API.
> [cc: [~adam-mesos] [~tillt] ]





[jira] [Created] (MESOS-5426) Relax version compatibility requirement for some modules

2016-05-20 Thread Kapil Arya (JIRA)
Kapil Arya created MESOS-5426:
-

 Summary: Relax version compatibility requirement for some modules
 Key: MESOS-5426
 URL: https://issues.apache.org/jira/browse/MESOS-5426
 Project: Mesos
  Issue Type: Task
  Components: modules
Affects Versions: 0.29.0
Reporter: Kapil Arya
Assignee: Kapil Arya


Some module interfaces, such as authenticatee, have not changed for a while, 
so we should be able to relax the version compatibility checks. This needs to 
be done on a case-by-case basis.

I also hope this change will provide a framework for updating the 
version requirement for other modules as we go towards a stable module API.

[cc: [~adam-mesos] [~tillt] ]





[jira] [Commented] (MESOS-5421) Mesos Docker executor taskHealthUpdated removes information about job ipAddresses

2016-05-20 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293988#comment-15293988
 ] 

Joseph Wu commented on MESOS-5421:
--

[~dfedorov], can you check if MESOS-5294 is the same issue?  (There isn't 
enough information in the bug description.)

> Mesos Docker executor taskHealthUpdated removes information about job 
> ipAddresses
> -
>
> Key: MESOS-5421
> URL: https://issues.apache.org/jira/browse/MESOS-5421
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.1
>Reporter: Dmitry Fedorov
>Priority: Minor
> Fix For: 0.28.2
>
>
> When you create a job with a command health check, the status is correct 
> right after the job is launched and the ipAddresses field is present in it. 
> But after the health status is updated, the ipAddresses field is missing.





[jira] [Commented] (MESOS-4697) Consolidate cgroup isolators into one single isolator.

2016-05-20 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293970#comment-15293970
 ] 

Charles Allen commented on MESOS-4697:
--

General feedback from someone who has written frameworks and deployed other 
frameworks.

Whenever I'm deploying a framework on Mesos, I rarely care about what isolation 
group it is using. Usually I simply want to have an understanding of how my 
resources are going to be requested/handled. This comes into play largely with 
frameworks that have different levels of resource awareness. Some know about 
memory and cpu, and a select few about disk needs.

As the capabilities of resource isolation expand, I do NOT want to have to go 
back and update older frameworks to make sure they play nice with more-modern 
frameworks with better resource awareness.

My current approach to handling this is through [roles | 
http://mesos.apache.org/documentation/latest/roles/] where a role is really a 
pre-agreed upon set of resource expectations.

What I would love to see is a way for me to have different cgroup roots per 
role. Or at least more clear expectations on how to have such a scenario. This 
way I can tune cgroups at a system level regardless of how aware mesos is of 
the node's capabilities.

As a discrete example, I would like to have blkio tuned on a node such that all 
tasks from a particular mesos role have some expectations of blkio, all tasks 
from a DIFFERENT mesos role have some other expectations, and a THIRD group of 
tasks which are NOT part of mesos might have a third set of tunings. This 
could be accomplished within mesos IFF mesos were aware of all potential 
cgroups my kernel supports, AND all my frameworks had ways of running through 
mesos, but neither one of those is a guaranteed assumption.

My ask here is that the intended behavior is clarified for when a cgroup is 
present on a system, but the version of mesos running is not aware of how to 
use such a cgroup (blkio or maybe 
https://issues.apache.org/jira/browse/MESOS-4424 or something else even).

> Consolidate cgroup isolators into one single isolator.
> --
>
> Key: MESOS-4697
> URL: https://issues.apache.org/jira/browse/MESOS-4697
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: haosdent
> Attachments: cgroup_v2.pdf
>
>
> There are two motivations for this:
> 1) It's very verbose to add a new isolator. For cgroup isolators (e.g., cpu, 
> mem, net_cls, etc.), much of the logic is the same. We are currently 
> duplicating a lot of the code.
> 2) Initially, we decided to use a separate isolator for each cgroup subsystem 
> because we wanted each subsystem to be mounted under a 
> different hierarchy. This has gradually become untrue with the unified cgroup 
> hierarchy introduced in kernel 3.16 ([The unified control group hierarchy in 
> 3.16|https://lwn.net/Articles/601840/], 
> [cgroup-v2|https://github.com/torvalds/linux/blob/master/Documentation/cgroup-v2.txt]).
>  Also, on some popular linux distributions, some subsystems are co-mounted 
> within the same hierarchy (e.g., net_cls and net_prio, cpu and cpuacct). It 
> becomes very hard to co-manage a hierarchy with two isolators.
> We can still introduce subsystem-specific code under the unified cgroup 
> isolator by introducing a Subsystem abstraction. But we don't plan to support 
> cgroup v2 in this ticket.





[jira] [Commented] (MESOS-5041) Add cgroups unified isolator

2016-05-20 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293876#comment-15293876
 ] 

haosdent commented on MESOS-5041:
-

[~anandmazumdar], thank you very much for pointing this out. Just updated.

[~drcrallen], we list the motivations in 
[MESOS-4697](https://issues.apache.org/jira/browse/MESOS-4697). Feel free to 
ping me in that ticket or leave comments in the design document if you have any 
problems or concerns. Any comments and feedback are highly appreciated!

> Add cgroups unified isolator
> 
>
> Key: MESOS-5041
> URL: https://issues.apache.org/jira/browse/MESOS-5041
> Project: Mesos
>  Issue Type: Task
>  Components: cgroups, isolation
>Reporter: haosdent
>Assignee: haosdent
>
> Implement the cgroups unified isolator and enable it in Mesos containerizer.





[jira] [Commented] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-05-20 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293844#comment-15293844
 ] 

haosdent commented on MESOS-970:


[~vinodkone] Thank you so much for the detailed comments.

So for compatibility, the test cases are

* Master (built with leveldb 1.4) writes the replicated log, then stop it and 
use Master (built with leveldb 1.18) to test recovery.
* Master (built with leveldb 1.18) writes the replicated log, then stop it and 
use Master (built with leveldb 1.4) to test recovery.

Let me test these the way you suggested.
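
The two cases above are every ordered (writer, reader) pair of the two leveldb versions. As a rough sketch only, a test driver could enumerate them as below; `LEVELDB_VERSIONS` and `compatibility_cases` are hypothetical names, and actually building and starting masters is deliberately left out.

```python
from itertools import permutations

# Versions named in this thread; everything else here is illustrative.
LEVELDB_VERSIONS = ("1.4", "1.18")


def compatibility_cases(versions=LEVELDB_VERSIONS):
    """Return (writer, reader) pairs: one master build writes the log, another recovers it."""
    return list(permutations(versions, 2))


for writer, reader in compatibility_cases():
    print(f"write replicated log with leveldb {writer}, recover with leveldb {reader}")
```

If a third leveldb version ever needs checking, adding it to the tuple would extend the matrix without further changes.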

> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>Assignee: Tomasz Janiszewski
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.
> Upgrading to 1.18 could solve the problems with building Mesos on some 
> non-x86 CPU architectures.





[jira] [Commented] (MESOS-5041) Add cgroups unified isolator

2016-05-20 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293815#comment-15293815
 ] 

Charles Allen commented on MESOS-5041:
--

Since this is listed in the mesos roadmap, can there be a little more flavor in 
the master comment about why this is needed and what it is doing differently 
than the current cgroup impl?

> Add cgroups unified isolator
> 
>
> Key: MESOS-5041
> URL: https://issues.apache.org/jira/browse/MESOS-5041
> Project: Mesos
>  Issue Type: Task
>  Components: cgroups, isolation
>Reporter: haosdent
>Assignee: haosdent
>
> Implement the cgroups unified isolator and enable it in Mesos containerizer.





[jira] [Commented] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-05-20 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293812#comment-15293812
 ] 

haosdent commented on MESOS-970:


Sorry for typo, should be 1.18

> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>Assignee: Tomasz Janiszewski
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.
> Upgrading to 1.18 could solve the problems with building Mesos on some 
> non-x86 CPU architectures.





[jira] [Comment Edited] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-05-20 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293788#comment-15293788
 ] 

Vinod Kone edited comment on MESOS-970 at 5/20/16 5:33 PM:
---

Great to hear!

Agents do not use leveldb, so no point in building agents with different 
versions of leveldb.

We need to test masters with different versions of leveldb. For testing wire 
format compatibility, make sure a master writes to the replicated log using 
leveldb 1.4 and tries to recover with leveldb 1.18, and vice versa. This is 
probably easiest to do with some masters using 1.4 and some masters using 1.18 
in your quorum. Note that you need to start masters with the replicated log 
registry. For testing recovery, you can just induce leader elections by 
restarting masters.


was (Author: vinodkone):
Great to hear!

Agents do not use leveldb, so no point in building agents with different 
versions of leveldb.

We need to test masters with different version of level db. For testing wire 
format compatibility, make sure master writes to replicated log using leveldb 
1.4 and tries to recover with leveldb 1.18. And vice-versa. This is easiest to 
do with some masters using 1.14 and some masters using 1.18 in your quorum. 
That way you can just induce leader elections by restarting masters.

> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>Assignee: Tomasz Janiszewski
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.
> Upgrading to 1.18 could solve the problems with building Mesos on some 
> non-x86 CPU architectures.





[jira] [Commented] (MESOS-3051) performance issues with port ranges comparison

2016-05-20 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293757#comment-15293757
 ] 

Joseph Wu commented on MESOS-3051:
--

Filed [MESOS-5425] to follow up on further performance improvements.

> performance issues with port ranges comparison
> --
>
> Key: MESOS-3051
> URL: https://issues.apache.org/jira/browse/MESOS-3051
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 0.22.1
>Reporter: James Peach
>Assignee: Joerg Schad
>  Labels: mesosphere
> Fix For: 0.25.0, 0.24.2
>
>
> When testing in an environment with lots of frameworks (>200), where the 
> frameworks permanently decline resources they don't need, the allocator ends 
> up spending a lot of time figuring out whether offers are refused (the code 
> path through {{HierarchicalAllocatorProcess::isFiltered()}}).
> In profiling a synthetic benchmark, it turns out that comparing port ranges 
> is very expensive, involving many temporary allocations. 61% of 
> Resources::contains() run time is in operator -= (Resource). 35% of 
> Resources::contains() run time is in Resources::_contains().
> The heaviest call chain through {{Resources::_contains}} is:
> {code}
> Running Time      Self (ms)  Symbol Name
> 7237.0ms   35.5%      4.0    mesos::Resources::_contains(mesos::Resource const&) const
> 7200.0ms   35.3%      1.0    mesos::contains(mesos::Resource const&, mesos::Resource const&)
> 7133.0ms   35.0%    121.0    mesos::operator<=(mesos::Value_Ranges const&, mesos::Value_Ranges const&)
> 6319.0ms   31.0%      7.0    mesos::coalesce(mesos::Value_Ranges*, mesos::Value_Ranges const&)
> 6240.0ms   30.6%    161.0    mesos::coalesce(mesos::Value_Ranges*, mesos::Value_Range const&)
> 1867.0ms    9.1%     25.0    mesos::Value_Ranges::add_range()
> 1694.0ms    8.3%      4.0    mesos::Value_Ranges::~Value_Ranges()
> 1495.0ms    7.3%     16.0    mesos::Value_Ranges::operator=(mesos::Value_Ranges const&)
>  445.0ms    2.1%     94.0    mesos::Value_Range::MergeFrom(mesos::Value_Range const&)
>  154.0ms    0.7%     24.0    mesos::Value_Ranges::range(int) const
>  103.0ms    0.5%     24.0    mesos::Value_Ranges::range_size() const
>   95.0ms    0.4%      2.0    mesos::Value_Range::Value_Range(mesos::Value_Range const&)
>   59.0ms    0.2%      4.0    mesos::Value_Ranges::Value_Ranges()
>   50.0ms    0.2%     50.0    mesos::Value_Range::begin() const
>   28.0ms    0.1%     28.0    mesos::Value_Range::end() const
>   26.0ms    0.1%      0.0    mesos::Value_Range::~Value_Range()
> {code}
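
To illustrate why the containment check in the profile above does not need repeated temporary copies, here is a minimal sketch in Python. It is not the Mesos implementation and not the fix applied for this issue; the function names and the use of inclusive `(begin, end)` tuples are assumptions made for the example. Each side is coalesced once, and the comparison itself is a single pass with one moving index.

```python
def coalesce(ranges):
    """Sort and merge overlapping or adjacent inclusive (begin, end) ranges."""
    merged = []
    for begin, end in sorted(ranges):
        if merged and begin <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it in place.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def contains(superset, subset):
    """Return True if every port in `subset` is covered by `superset`."""
    sup = coalesce(superset)
    i = 0
    for begin, end in coalesce(subset):
        # Skip superset ranges that end before this subset range starts.
        while i < len(sup) and sup[i][1] < begin:
            i += 1
        # After coalescing, a contiguous subset range must fit inside a
        # single superset range, because superset ranges never touch.
        if i == len(sup) or sup[i][0] > begin or sup[i][1] < end:
            return False
    return True


if __name__ == "__main__":
    offered = [(31000, 32000)]
    print(contains(offered, [(31005, 31010), (31500, 31600)]))
    print(contains(offered, [(31990, 32010)]))
```

The design choice is to pay for sorting and merging once per operand, so the comparison loop allocates nothing per range.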
> mesos::coalesce(Value_Ranges) gets done a lot and ends up being really 
> expensive. The heaviest parts of the inverted call chain are:
> {code}
> Running Time      Self (ms)  Symbol Name
> 3209.0ms   15.7%   3209.0    mesos::Value_Range::~Value_Range()
> 3209.0ms   15.7%      0.0    google::protobuf::internal::GenericTypeHandler::Delete(mesos::Value_Range*)
> 3209.0ms   15.7%      0.0    void google::protobuf::internal::RepeatedPtrFieldBase::Destroy()
> 3209.0ms   15.7%      0.0    google::protobuf::RepeatedPtrField::~RepeatedPtrField()
> 3209.0ms   15.7%      0.0    google::protobuf::RepeatedPtrField::~RepeatedPtrField()
> 3209.0ms   15.7%      0.0    mesos::Value_Ranges::~Value_Ranges()
> 3209.0ms   15.7%      0.0    mesos::Value_Ranges::~Value_Ranges()
> 2441.0ms   11.9%      0.0    mesos::coalesce(mesos::Value_Ranges*, mesos::Value_Range const&)
>  452.0ms    2.2%      0.0    mesos::remove(mesos::Value_Ranges*, mesos::Value_Range const&)
>  169.0ms    0.8%      0.0    mesos::operator<=(mesos::Value_Ranges const&, mesos::Value_Ranges const&)
>   82.0ms    0.4%      0.0    mesos::operator-=(mesos::Value_Ranges&, mesos::Value_Ranges const&)
>   65.0ms    0.3%      0.0    mesos::Value_Ranges::~Value_Ranges()
> 2541.0ms   12.4%   2541.0    google::protobuf::internal::GenericTypeHandler::New()
> 2541.0ms   12.4%      0.0    google::protobuf::RepeatedPtrField::TypeHandler::Type* google::protobuf::internal::RepeatedPtrFieldBase::Add()
> 2305.0ms   11.3%      0.0    google::protobuf::RepeatedPtrField::Add()
> 2305.0ms   11.3%      0.0

[jira] [Updated] (MESOS-5423) Updating the website section in release-guide is out of dated

2016-05-20 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-5423:

Labels: documentation  (was: )

> Updating the website section in release-guide is out of dated
> -
>
> Key: MESOS-5423
> URL: https://issues.apache.org/jira/browse/MESOS-5423
> Project: Mesos
>  Issue Type: Bug
>  Components: project website
>Reporter: haosdent
>Assignee: haosdent
>Priority: Minor
>  Labels: documentation
>
> This part is out of date:
> {code}
> ## Updating the website
> 1. After a successful release, please update the website pointing to the new 
> release.
>See our [website 
> README](https://github.com/apache/mesos/blob/master/site/README.md/) and
>the general [Apache project website 
> guide](https://www.apache.org/dev/project-site.html)
>for details on how to build and publish the website.
> $ svn co https://svn.apache.org/repos/asf/mesos/site mesos-site
> 2. Update doxygen and javadoc pages for the website. For more information, see
>[website 
> README](https://github.com/apache/mesos/blob/master/site/README.md/).
> 3. Write a blog post announcing the new release and its features and major 
> bug fixes.
> 4. Update the Getting Started guide to use the latest release link.
> {code}





[jira] [Updated] (MESOS-5423) Updating the website section in release-guide is out of dated

2016-05-20 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-5423:

Component/s: project website

> Updating the website section in release-guide is out of dated
> -
>
> Key: MESOS-5423
> URL: https://issues.apache.org/jira/browse/MESOS-5423
> Project: Mesos
>  Issue Type: Bug
>  Components: project website
>Reporter: haosdent
>Assignee: haosdent
>Priority: Minor
>
> This part is out of date:
> {code}
> ## Updating the website
> 1. After a successful release, please update the website pointing to the new 
> release.
>See our [website 
> README](https://github.com/apache/mesos/blob/master/site/README.md/) and
>the general [Apache project website 
> guide](https://www.apache.org/dev/project-site.html)
>for details on how to build and publish the website.
> $ svn co https://svn.apache.org/repos/asf/mesos/site mesos-site
> 2. Update doxygen and javadoc pages for the website. For more information, see
>[website 
> README](https://github.com/apache/mesos/blob/master/site/README.md/).
> 3. Write a blog post announcing the new release and its features and major 
> bug fixes.
> 4. Update the Getting Started guide to use the latest release link.
> {code}





[jira] [Commented] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-05-20 Thread Tomasz Janiszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293522#comment-15293522
 ] 

Tomasz Janiszewski commented on MESOS-970:
--

Why 1.8? Did you mean 1.18?

> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>Assignee: Tomasz Janiszewski
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.
> Upgrading to 1.18 could solve the problems with building Mesos on some 
> non-x86 CPU architectures.





[jira] [Commented] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-05-20 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293495#comment-15293495
 ] 

haosdent commented on MESOS-970:


For compatibility, I will post the results for these cases:

* Master built with leveldb 1.4, and all agents built with leveldb 1.8
* Master built with leveldb 1.8, and all agents built with leveldb 1.4
* Master built with leveldb 1.4, and a mix of agents built with leveldb 1.4 and 1.8
* Master built with leveldb 1.8, and a mix of agents built with leveldb 1.4 and 1.8

For performance regressions, I will run

{code}
make bench GTEST_FILTER="*BENCHMARK*" 
{code}
with both leveldb 1.4 and leveldb 1.8 and compare the results.

Let me know if you need more test cases in other scenarios. :-)

> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>Assignee: Tomasz Janiszewski
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.
> Upgrading to 1.18 could solve the problems with building Mesos on some 
> non-x86 CPU architectures.





[jira] [Created] (MESOS-5423) Updating the website section in release-guide is out of dated

2016-05-20 Thread haosdent (JIRA)
haosdent created MESOS-5423:
---

 Summary: Updating the website section in release-guide is out of 
dated
 Key: MESOS-5423
 URL: https://issues.apache.org/jira/browse/MESOS-5423
 Project: Mesos
  Issue Type: Bug
Reporter: haosdent
Assignee: haosdent
Priority: Minor


This part is out of date:
{code}
## Updating the website

1. After a successful release, please update the website pointing to the new 
release.
   See our [website 
README](https://github.com/apache/mesos/blob/master/site/README.md/) and
   the general [Apache project website 
guide](https://www.apache.org/dev/project-site.html)
   for details on how to build and publish the website.

$ svn co https://svn.apache.org/repos/asf/mesos/site mesos-site

2. Update doxygen and javadoc pages for the website. For more information, see
   [website 
README](https://github.com/apache/mesos/blob/master/site/README.md/).

3. Write a blog post announcing the new release and its features and major bug 
fixes.

4. Update the Getting Started guide to use the latest release link.
{code}





[jira] [Created] (MESOS-5422) Website README.md is out of dated

2016-05-20 Thread haosdent (JIRA)
haosdent created MESOS-5422:
---

 Summary: Website README.md is out of dated
 Key: MESOS-5422
 URL: https://issues.apache.org/jira/browse/MESOS-5422
 Project: Mesos
  Issue Type: Bug
  Components: project website
Reporter: haosdent


{quote}
Tomek Janiszewski via mesos.apache.org 
10:15 PM (32 minutes ago)

to dev 
Hi

I think the website README is out of date.
1. It doesn't mention mesos-website-container.
2. support/generate-help-site.py does not exist.
Am I right? How do I generate the full site (with the documentation and
getting started sections)?

Thanks
{quote}





[jira] [Commented] (MESOS-5377) Improve DRF behavior with scarce resources.

2016-05-20 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293421#comment-15293421
 ] 

Guangya Liu commented on MESOS-5377:


Another thought, similar to MESOS-4923: what about introducing a new sorter to 
handle scarce resources? The cluster admin could define a list of scarce 
resources at startup, and the allocator could sort those scarce resources in a 
separate sorter.

> Improve DRF behavior with scarce resources.
> ---
>
> Key: MESOS-5377
> URL: https://issues.apache.org/jira/browse/MESOS-5377
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>
> The allocator currently uses the notion of Weighted [Dominant Resource 
> Fairness|https://www.cs.berkeley.edu/~alig/papers/drf.pdf] (WDRF) to 
> establish a linear notion of fairness across allocation roles.
> DRF behaves well for resources that are present within each machine in a 
> cluster (e.g. CPUs, memory, disk). However, some resources (e.g. GPUs) are 
> only present on a subset of machines in the cluster.
> Consider the behavior when there are the following agents in a cluster:
> 1000 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> If a role wishes to use both GPU and non-GPU resources for tasks, consuming 1 
> GPU will lead DRF to consider the role to have a 100% share of the cluster, 
> since it consumes 100% of the GPUs in the cluster. This framework will then 
> not receive any other offers.
> Among possible improvements, fairness could take resource packages into 
> account. In a sense, roles compete for 1 GPU package and for 1000 non-GPU 
> packages, and ideally a role's consumption of the single GPU package does not 
> have a large effect on the role's access to the other 1000 non-GPU packages.
> In the interim, we should consider having a recommended way to deal with 
> scarce resources in the current model.
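
The arithmetic behind the description above can be checked with a short Python sketch. This is a simplified, unweighted dominant-share calculation, not the Mesos allocator or sorter code. The task size given to the role and the `exclude` parameter are assumptions made for illustration; `exclude` is not an existing Mesos option, it only shows how the share changes if a scarce resource is left out of the calculation.

```python
def dominant_share(allocated, total, exclude=()):
    """Return the largest allocated/total fraction over all resources.

    `exclude` is a hypothetical knob for this sketch: resource names listed
    there are ignored when computing the share.
    """
    shares = [
        allocated.get(name, 0) / amount
        for name, amount in total.items()
        if amount > 0 and name not in exclude
    ]
    return max(shares, default=0.0)


# Cluster from the description: 1000 non-GPU agents plus 1 GPU agent,
# each with cpus:4, mem:1024, disk:1024.
total = {"cpus": 1001 * 4, "mem": 1001 * 1024, "disk": 1001 * 1024, "gpus": 1}

# Hypothetical allocation for one role: a single task using the only GPU.
role = {"gpus": 1, "cpus": 1, "mem": 128}

share = dominant_share(role, total)
share_without_gpus = dominant_share(role, total, exclude=("gpus",))

if __name__ == "__main__":
    print(f"dominant share with GPUs counted:  {share:.6f}")
    print(f"dominant share with GPUs excluded: {share_without_gpus:.6f}")
```

With GPUs counted, the role's dominant resource is the single GPU, so its share is 100% even though it holds 1 of 4004 CPUs. With GPUs left out, the dominant resource becomes CPUs.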





[jira] [Created] (MESOS-5421) Mesos Docker executor taskHealthUpdated removes information about job ipAddresses

2016-05-20 Thread Dmitry Fedorov (JIRA)
Dmitry Fedorov created MESOS-5421:
-

 Summary: Mesos Docker executor taskHealthUpdated removes 
information about job ipAddresses
 Key: MESOS-5421
 URL: https://issues.apache.org/jira/browse/MESOS-5421
 Project: Mesos
  Issue Type: Bug
  Components: slave
Affects Versions: 0.28.1
Reporter: Dmitry Fedorov
Priority: Minor
 Fix For: 0.28.2


When you create a job with a command health check, the status is correct right 
after the job is launched, and the ipAddresses field is present in it. 
But after the health status is updated, the ipAddresses field is missing.
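
The symptom can be pictured with a small Python sketch. This is purely illustrative and is not the Docker executor code; the dictionary layout, the `healthy` key, and both function names are assumptions. It contrasts a health update built from scratch, which loses earlier fields, with one that copies the previous status first.

```python
def health_update_from_scratch(previous_status, healthy):
    """Build a fresh status: fields such as ipAddresses are not carried over."""
    return {"state": previous_status["state"], "healthy": healthy}


def health_update_preserving(previous_status, healthy):
    """Copy the previous status first so existing fields survive the update."""
    updated = dict(previous_status)  # shallow copy keeps ipAddresses
    updated["healthy"] = healthy
    return updated


if __name__ == "__main__":
    status = {"state": "TASK_RUNNING", "ipAddresses": ["10.0.0.5"]}
    print(health_update_from_scratch(status, True))
    print(health_update_preserving(status, True))
```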



