[jira] [Commented] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Bruce Merry (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862297#comment-15862297
 ] 

Bruce Merry commented on MESOS-2369:


This was on a test lab server that already has some Mesos configuration in place. 
I've got a Vagrantfile with a similar setup, so when I'm back at work next week 
I'll see if I can reproduce it with that.

The one bit of configuration that is quite likely to make a difference is that 
/etc/mesos-slave/containerizers contains "docker,mesos".



> Segfault when mesos-slave tries to clean up docker containers on startup
> 
>
> Key: MESOS-2369
> URL: https://issues.apache.org/jira/browse/MESOS-2369
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.21.1, 1.2.0, 1.3.0
> Environment: Debian Jessie, mesos package 0.21.1-1.2.debian77 
> docker 1.3.2 build 39fa2fa
>Reporter: Pas
>Assignee: Anand Mazumdar
>
> I took a gdb backtrace; it looks like a stack overflow caused by excessive 
> recursion.
> Interestingly, when I ran mesos-slave under strace -f -b execve, it proceeded 
> with the docker cleanup successfully. However, in a few strace sessions (on 
> other slaves) I was able to observe the SIGSEGV. It occurred around (or a bit 
> before) the "docker ps -a" call: docker got a broken pipe shortly afterwards 
> and was then killed by the propagating SIGSEGV signal.
> {code}
> 
> #59296 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59297 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59298 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59299 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59300 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59301 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59302 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59303 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59304 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59305 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59306 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59307 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59308 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59309 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59310 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59311 0x76e53000 in 
> std::tr1::_Fu

[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.

2017-02-10 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5967:
---
Target Version/s: 1.3.0

> Add support for 'docker image inspect' in our docker abstraction.
> -
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7061) Re-persist tasks/executors with allocation info during agent recovery.

2017-02-10 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-7061:
--

Assignee: Michael Park

[~mcypark] can you still take this?

> Re-persist tasks/executors with allocation info during agent recovery.
> --
>
> Key: MESOS-7061
> URL: https://issues.apache.org/jira/browse/MESOS-7061
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>
> When the agent is upgraded, it will need to re-persist resources for 
> recovered active tasks and executors, but with Resource.allocation_info.role 
> set to FrameworkInfo.role.
> If this agent receives new tasks from an old master (because the master has 
> not been upgraded yet), it will also augment the resources to have 
> Resource.allocation_info.role set prior to persisting on disk. This is 
> necessary to ensure we continue to charge existing tasks / executors if the 
> framework changes its role(s).
> Importantly, re-persisting will not prevent downgrading the agent since a 
> downgraded agent will simply ignore the unknown fields.
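
The re-persist step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Resource` and `AllocationInfo` here are simplified stand-ins for the protobuf messages, and `injectAllocationRole` is a hypothetical helper name, not a function from the Mesos code base.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-ins for the protobuf messages named above; the real
// messages carry many more fields.
struct AllocationInfo {
  bool hasRole = false;
  std::string role;
};

struct Resource {
  std::string name;
  AllocationInfo allocationInfo;
};

// Hypothetical helper: fills in the allocation role for every resource that
// lacks one and returns how many resources were changed. Resources that
// already carry a role are left untouched, so running it twice is harmless.
std::size_t injectAllocationRole(std::vector<Resource>& resources,
                                 const std::string& frameworkRole) {
  assert(!frameworkRole.empty());
  std::size_t changed = 0;
  for (Resource& r : resources) {
    if (!r.allocationInfo.hasRole) {
      r.allocationInfo.hasRole = true;
      r.allocationInfo.role = frameworkRole;
      ++changed;
    }
  }
  return changed;
}
```

Skipping resources that already have a role set is what would let the same routine run both during recovery and when tasks arrive from a master that has not been upgraded yet.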





[jira] [Assigned] (MESOS-6938) Libprocess reinitialization is flaky, can segfault

2017-02-10 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-6938:


Assignee: (was: Greg Mann)

> Libprocess reinitialization is flaky, can segfault
> --
>
> Key: MESOS-6938
> URL: https://issues.apache.org/jira/browse/MESOS-6938
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, tests
> Environment: ASF CI, CentOS 7, libevent and SSL enabled
>Reporter: Greg Mann
>  Labels: libprocess, tests
>
> This was observed on ASF CI. Based on the placement of the stacktrace, the 
> segfault seems to occur during libprocess reinitialization, when 
> {{process::initialize}} is called:
> {code}
> [--] 4 tests from Encryption/NetSocketTest
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0117 15:18:35.320691 27596 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0117 15:18:35.320714 27596 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0117 15:18:35.320719 27596 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0117 15:18:35.320726 27596 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0117 15:18:35.335141 27596 process.cpp:1234] libprocess is initialized on 
> 172.17.0.3:46415 with 16 worker threads
> [   OK ] Encryption/NetSocketTest.EOFBeforeRecv/0 (422 ms)
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/1
> I0117 15:18:35.390697 27596 process.cpp:1234] libprocess is initialized on 
> 172.17.0.3:39822 with 16 worker threads
> [   OK ] Encryption/NetSocketTest.EOFBeforeRecv/1 (6 ms)
> [ RUN  ] Encryption/NetSocketTest.EOFAfterRecv/0
> I0117 15:18:35.998528 27596 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0117 15:18:35.998559 27596 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0117 15:18:35.998566 27596 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0117 15:18:35.998572 27596 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0117 15:18:36.010643 27596 process.cpp:1234] libprocess is initialized on 
> 172.17.0.3:47429 with 16 worker threads
> [   OK ] Encryption/NetSocketTest.EOFAfterRecv/0 (664 ms)
> [ RUN  ] Encryption/NetSocketTest.EOFAfterRecv/1
> I0117 15:18:36.079453 27596 process.cpp:1234] libprocess is initialized on 
> 172.17.0.3:38149 with 16 worker threads
> [   OK ] Encryption/NetSocketTest.EOFAfterRecv/1 (19 ms)
> *** Aborted at 1484666316 (unix time) try "date -d @1484666316" if you are 
> using GNU date ***
> PC: @ 0x7f7643ad7c56 __memcpy_ssse3_back
> *** SIGSEGV (@0x57c10f8) received by PID 27596 (TID 0x7f76393c2700) from PID 
> 92016888; stack trace: ***
> @ 0x7f7644ba0370 (unknown)
> @ 0x7f7643ad7c56 __memcpy_ssse3_back
> @ 0x7f76443248e0 (unknown)
> @ 0x7f7644324f8c (unknown)
> @   0x422a4d process::UPID::UPID()
> I0117 15:18:36.090376 27596 process.cpp:1234] libprocess is initialized on 
> 172.17.0.3:43835 with 16 worker threads
> [--] 4 tests from Encryption/NetSocketTest (1116 ms total)
> [--] 6 tests from SSLVerifyIPAdd/SSLTest
> [ RUN  ] SSLVerifyIPAdd/SSLTest.BasicSameProcess/0
> @   0x8ae4a8 process::DispatchEvent::DispatchEvent()
> @   0x8a6a5e process::internal::dispatch()
> @   0x8c0b44 process::dispatch<>()
> @   0x8a598a process::ProcessBase::route()
> @   0x98be53 process::ProcessBase::route<>()
> @   0x988096 process::Help::initialize()
> @   0x89ef2a process::ProcessManager::resume()
> @   0x89b976 
> _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
> @   0x8adb3c 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @   0x8ada80 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
> @   0x8ada0a 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f764431b230 (unknown)
> @ 0x7f7644b98dc5 start_thread
> @ 0x7f7643a8473d __clone
> make[7]: *** [check-local] Segmentation fault
> {code}





[jira] [Updated] (MESOS-7057) Consider using the relink functionality of libprocess in the executor driver.

2017-02-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7057:
--
Target Version/s: 1.1.2, 1.3.0, 1.2.1  (was: 1.1.2, 1.3.0)

> Consider using the relink functionality of libprocess in the executor driver.
> -
>
> Key: MESOS-7057
> URL: https://issues.apache.org/jira/browse/MESOS-7057
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> As outlined in the root cause analysis for MESOS-5332, it is possible for an 
> iptables firewall to terminate an idle connection after a timeout (the 
> default is 5 days). Once this happens, the executor driver is not notified of 
> the disconnection and keeps assuming that it is still connected to the 
> agent.
> When the agent process is restarted, the executor still tries to reuse the 
> old broken connection to send the re-register message to the agent. Only then 
> does it realize that the connection is broken (due to the nature of TCP); it 
> calls the {{exited}} callback and terminates itself 15 minutes later, when 
> the recovery timeout expires.
> To avoid this, an executor should always {{relink}} when it receives a 
> reconnect request from the agent.





[jira] [Commented] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862082#comment-15862082
 ] 

Anand Mazumdar commented on MESOS-2369:
---

[~bmerry] I gave it a try on a couple of Ubuntu 16.04 VMs but couldn't 
reproduce it with the steps you mentioned. Could you give us a stack trace to 
help us debug the issue further, or double-check whether anything is missing 
from the steps to reproduce? 
This was with {{ulimit -s 4096}}.

[~bbannier] managed to reproduce the issue on {{HEAD}} today, but the stack 
trace turned out to be the one from MESOS-7102, which has already been fixed 
today. 
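
For readers following the backtrace quoted for this issue: its frames repeat in a fixed cycle ({{then}}, {{_read}}, {{__read}}, {{thenf}}, {{onAny}}), which is the shape a stack can take when each completed read runs the continuation for the next read synchronously. The sketch below illustrates that general pattern only; it is not the libprocess code and not a diagnosis, and `ReadySource`, `drainRecursive` and `drainIterative` are hypothetical names. It contrasts a continuation-style drain, whose stack depth grows with the number of chunks, with a loop whose depth stays constant.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a file descriptor whose data is always ready,
// so every read completes immediately.
struct ReadySource {
  std::size_t remaining;  // bytes left before EOF

  // Consumes up to `max` bytes and returns how many were consumed (0 at EOF).
  std::size_t read(std::size_t max) {
    std::size_t n = remaining < max ? remaining : max;
    remaining -= n;
    return n;
  }
};

// Continuation style: the "next read" runs inside the handler of the previous
// one, so one stack frame is added per chunk. `maxDepth` records the deepest
// nesting reached.
std::size_t drainRecursive(ReadySource& src, std::size_t chunk,
                           std::size_t depth, std::size_t& maxDepth) {
  assert(chunk > 0);
  if (depth > maxDepth) maxDepth = depth;
  std::size_t n = src.read(chunk);
  if (n == 0) return 0;  // EOF ends the chain
  return n + drainRecursive(src, chunk, depth + 1, maxDepth);
}

// Loop style: same result, constant stack depth regardless of input size.
std::size_t drainIterative(ReadySource& src, std::size_t chunk) {
  assert(chunk > 0);
  std::size_t total = 0;
  for (std::size_t n = src.read(chunk); n != 0; n = src.read(chunk)) {
    total += n;
  }
  return total;
}
```

With the recursive variant, the number of frames scales with input size divided by chunk size, so a smaller stack limit would be exhausted sooner; that may be why the {{ulimit -s}} setting matters when trying to reproduce.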



[jira] [Commented] (MESOS-6638) Update Suppress and Revive to be per-role.

2017-02-10 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862069#comment-15862069
 ] 

Guangya Liu commented on MESOS-6638:


{noformat}
commit f40e3d5fb167a691f6a3071f504b77e0def29604
Author: Guangya Liu gy...@apache.org
Date:   Sat Feb 11 08:24:26 2017 +0800

Added roles field to framework.

Added roles field to framework.

Review: https://reviews.apache.org/r/56499/
{noformat}

> Update Suppress and Revive to be per-role.
> --
>
> Key: MESOS-6638
> URL: https://issues.apache.org/jira/browse/MESOS-6638
> Project: Mesos
>  Issue Type: Task
>  Components: framework api
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role, i.e., 
> they should include {{Revive.role}} and {{Suppress.role}} fields indicating 
> which role the operation applies to.
> {{Revive}} and {{Suppress}} messages do not currently exist, so these need to 
> be added. To support old-style schedulers, we will make the role fields 
> optional.
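
One way the optional role field could behave is sketched below. The names `RoleCall` and `SuppressionState` are hypothetical, and treating an unset role as "all of the framework's roles" is an assumption made here to show how old-style schedulers might keep working; the issue itself does not specify that behavior.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical shape of a call carrying an optional role. An unset role is
// interpreted here as "all roles" (an assumption, see the note above).
struct RoleCall {
  bool hasRole;
  std::string role;
};

// Tracks which of a framework's roles currently have offers suppressed.
class SuppressionState {
public:
  explicit SuppressionState(const std::set<std::string>& roles)
    : roles_(roles) {
    assert(!roles.empty());
  }

  void suppress(const RoleCall& call) {
    if (!call.hasRole) {
      suppressed_ = roles_;             // legacy call: every role
    } else if (roles_.count(call.role) > 0) {
      suppressed_.insert(call.role);    // per-role call
    }
  }

  void revive(const RoleCall& call) {
    if (!call.hasRole) {
      suppressed_.clear();              // legacy call: every role
    } else {
      suppressed_.erase(call.role);     // per-role call
    }
  }

  bool isSuppressed(const std::string& role) const {
    return suppressed_.count(role) > 0;
  }

private:
  std::set<std::string> roles_;
  std::set<std::string> suppressed_;
};
```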





[jira] [Updated] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7102:
--
Target Version/s: 1.2.1

> Crash when sending a SIGUSR1 signal to the agent.
> -
>
> Key: MESOS-7102
> URL: https://issues.apache.org/jira/browse/MESOS-7102
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.3.0
>
>
> Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a 
> regression; it used to work fine in the 1.1 release. Note that the agent does 
> unregister with the master, and the crash happens after that.
> Steps to reproduce:
> - Start the agent.
> - Send it a {{SIGUSR1}} signal.
> The agent should crash with a stack trace similar to this:
> {noformat}
> I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from 
> user gmann; unregistering and shutting down
> I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
> *** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
> using GNU date ***
> PC: @ 0x7fffbc4904fc _pthread_key_global_init
> *** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: 
> ***
> @ 0x7fffbc488bba _sigtramp
> @ 0x7fe8a5d03f38 (unknown)
> @0x10b6d67d9 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
> @0x10b6d67b8 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
> @0x10b6d6889 Synchronized<>::Synchronized()
> @0x10b6d678d Synchronized<>::Synchronized()
> @0x10b6a708a synchronize<>()
> @0x10e2f148d process::ProcessManager::wait()
> @0x10e2e9a78 process::wait()
> @0x10b30614f process::wait()
> @0x10c9619dc 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10c961a55 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10b1ab035 main
> @ 0x7fffbc27b255 start
> [1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
> {noformat}





[jira] [Assigned] (MESOS-7104) Mesos-slave aborts after failed CNI network event

2017-02-10 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-7104:
-

Assignee: (was: Jie Yu)

> Mesos-slave aborts after failed CNI network event
> -
>
> Key: MESOS-7104
> URL: https://issues.apache.org/jira/browse/MESOS-7104
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Dan Osborne
>
> I'm trying to network a task with a CNI plugin. I'm not sure what I've done 
> wrong; it's probably a bad CNI config, missing plugins, or something similar. 
> But the error has managed to kill the entire slave process.
> {code}
>  I0209 22:57:56.841765  3949 containerizer.cpp:992] Starting container 
> f6d069b8-62f2-42c4-a8ff-145e366409f0 for executor 
> 'calico-cni-marathon-test.1b93ed5d-ef45-11e6-906e-024245c0fe5e' of framework 
> fbf5b855-bdf9-4a12-aeee-cc341ffe0f1b-
>  mesos-slave: ../../3rdparty/stout/include/stout/option.hpp:111: const T& 
> Option::get() const & [with T = std::basic_string]: Assertion 
> `isSome()' failed.
>  *** Aborted at 1486699076 (unix time) try "date -d @1486699076" if you are 
> using GNU date ***
>  PC: @ 0x7effbb6121d7 __GI_raise
>  *** SIGABRT (@0xf5a) received by PID 3930 (TID 0x7effb3cdf700) from PID 
> 3930; stack trace: ***
>  @ 0x7effbbecc370 (unknown)
>  @ 0x7effbb6121d7 __GI_raise
>  @ 0x7effbb6138c8 __GI_abort
>  @ 0x7effbb60b146 __assert_fail_base
> slave.service: main process exited, code=killed, status=6/ABRT
> {code}
> Will update with more info when possible.
> This is using the latest nightly mesos build at time of writing.
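
The assertion in the log above fires when {{get()}} is called on an option that holds no value. The sketch below shows that failure mode and a guarded alternative. `Option` here is a minimal stand-in written for this example, not the stout implementation, and `describeNetwork` is a hypothetical caller; which field was actually unset in the agent is not known from the report.

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for an optional value with the same isSome()/get() shape
// as in the assertion message; this is not the stout implementation.
template <typename T>
class Option {
public:
  Option() : some_(false), value_() {}
  Option(const T& value) : some_(true), value_(value) {}

  bool isSome() const { return some_; }
  bool isNone() const { return !some_; }

  // Aborts the process (in builds with assertions enabled) when no value is
  // present, which matches the failure seen in the log.
  const T& get() const {
    assert(isSome());
    return value_;
  }

  // Non-aborting alternative: fall back to a caller-supplied default.
  T getOrElse(const T& fallback) const { return some_ ? value_ : fallback; }

private:
  bool some_;
  T value_;
};

// Hypothetical caller: reports a missing field as an error string instead of
// calling get() unconditionally.
std::string describeNetwork(const Option<std::string>& name) {
  if (name.isNone()) {
    return "error: network name not set";
  }
  return "network: " + name.get();
}
```

Checking for the empty case at the call site turns a bad configuration into a task-level error rather than an abort of the whole process.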





[jira] [Commented] (MESOS-7104) Mesos-slave aborts after failed CNI network event

2017-02-10 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862011#comment-15862011
 ] 

Jie Yu commented on MESOS-7104:
---

Hmm, can you reproduce this? There isn't enough information here to tell.

Can you paste the full agent log please?






[jira] [Updated] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7102:
--
Shepherd: Joseph Wu
  Sprint: Mesosphere Sprint 51
Story Points: 2






[jira] [Commented] (MESOS-7086) Tighten up rules on IDs used in Mesos

2017-02-10 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861592#comment-15861592
 ] 

Yan Xu commented on MESOS-7086:
---

Some precursors: 
https://reviews.apache.org/r/56526/
https://reviews.apache.org/r/56527/

> Tighten up rules on IDs used in Mesos
> -
>
> Key: MESOS-7086
> URL: https://issues.apache.org/jira/browse/MESOS-7086
> Project: Mesos
>  Issue Type: Task
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> We currently have pretty relaxed rules on validity of IDs (e.g., TaskID, 
> ExecutorID, PersistenceID):
> https://github.com/apache/mesos/blob/7a3df44eb6a59bd95604fd38a18dc745363d468d/src/common/validation.cpp
> https://github.com/apache/mesos/blob/7a3df44eb6a59bd95604fd38a18dc745363d468d/src/slave/validation.cpp#L40
> We should tighten up the restrictions to prevent misleading and exploitable 
> IDs, and document these rules.
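
As a rough idea of what tighter rules might look like, the sketch below rejects a few obviously problematic IDs. The specific rules and the default length limit are assumptions chosen for illustration, not the rules the project adopted, and `validateId` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative rules only (assumptions, not the project's rules): an ID must
// be non-empty, bounded in length, free of path separators and control
// characters, and must not be "." or "..", since an ID may end up as a path
// component.
// Returns an empty string when `id` is acceptable, otherwise a reason.
std::string validateId(const std::string& id, std::size_t maxLength = 255) {
  assert(maxLength > 0);
  if (id.empty()) return "ID must not be empty";
  if (id.size() > maxLength) return "ID is too long";
  if (id == "." || id == "..") return "ID must not be '.' or '..'";
  for (std::string::size_type i = 0; i < id.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(id[i]);
    if (c == '/' || c == '\\') return "ID must not contain a path separator";
    if (c < 0x20 || c == 0x7f) return "ID must not contain control characters";
  }
  return "";
}
```

Returning a reason string instead of a boolean keeps the rejection message available for the error sent back to the framework.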





[jira] [Created] (MESOS-7115) Agent should prefer LOG(FATAL) over EXIT().

2017-02-10 Thread James Peach (JIRA)
James Peach created MESOS-7115:
--

 Summary: Agent should prefer LOG(FATAL) over EXIT().
 Key: MESOS-7115
 URL: https://issues.apache.org/jira/browse/MESOS-7115
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: James Peach
Priority: Minor


I saw the agent exit with an auth failure:
{noformat}
I0210 14:16:49.731459  9503 authenticatee.cpp:259] Received SASL authentication 
step
Master master@17.174.144.199:5050 refused authentication
{noformat}

Note the lack of log metadata on the exit message. This message (from 
{{slave.cpp}}) and a number of others in the same file should all use 
{{LOG(FATAL)}} so that log aggregation can pick up the timestamp, error 
severity, etc.
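
To illustrate why the missing metadata matters to a log aggregator, the sketch below checks whether a line starts with a prefix shaped like the one on the first line of the excerpt above. `hasLogPrefix` is a hypothetical helper, and the format it accepts is inferred only from the log lines quoted in this thread, not from a format specification.

```cpp
#include <cctype>
#include <string>

// Returns true when `line` starts with a prefix shaped like the one on the
// first log line above: a severity letter (I, W, E or F), four digits for
// month and day, a space, then HH:MM:SS.
bool hasLogPrefix(const std::string& line) {
  if (line.size() < 14) return false;
  const std::string severities = "IWEF";
  if (severities.find(line[0]) == std::string::npos) return false;
  for (int i = 1; i <= 4; ++i) {
    if (!std::isdigit(static_cast<unsigned char>(line[i]))) return false;
  }
  if (line[5] != ' ') return false;
  // Expect HH:MM:SS in positions 6..13, with colons at positions 8 and 11.
  for (int i = 6; i <= 13; ++i) {
    const bool colon = (i == 8 || i == 11);
    if (colon ? line[i] != ':'
              : !std::isdigit(static_cast<unsigned char>(line[i]))) {
      return false;
    }
  }
  return true;
}
```

Applied to the two lines in the excerpt, a check like this accepts the SASL line and rejects the "refused authentication" line, so an aggregator keyed on the prefix would have no timestamp or severity for the message that explains the exit.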





[jira] [Assigned] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-2369:
-

Assignee: Anand Mazumdar

> Segfault when mesos-slave tries to clean up docker containers on startup
> 
>
> Key: MESOS-2369
> URL: https://issues.apache.org/jira/browse/MESOS-2369
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.21.1, 1.2.0, 1.3.0
> Environment: Debian Jessie, mesos package 0.21.1-1.2.debian77 
> docker 1.3.2 build 39fa2fa
>Reporter: Pas
>Assignee: Anand Mazumdar
>
> I did a gdb backtrace, it seems like a stack overflow due to a bit too much 
> recursion.
> The interesting aspect is that after running mesos-slave with strace -f -b 
> execve it successfully proceeded with the docker cleanup. However, there were 
> a few strace sessions (on other slaves) where I was able to observe the 
> SIGSEGV, and it was around (or a bit before) the "docker ps -a" call, because 
> docker got a broken pipe shortly, then got killed by the propagating SIGSEGV 
> signal.
> {code}
> 
> #59296 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59297 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59298 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59299 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59300 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59301 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59302 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59303 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59304 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59305 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59306 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59307 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59308 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59309 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59310 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59311 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0

[jira] [Commented] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861444#comment-15861444
 ] 

Anand Mazumdar commented on MESOS-2369:
---------------------------------------

Thanks [~bmerry] for the reproduction steps. I am assigning this to myself to 
carry out further root cause analysis.


[jira] [Commented] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861443#comment-15861443
 ] 

Benjamin Bannier commented on MESOS-2369:
-----------------------------------------

[~bmerry]: Thanks for the great reproducer. I was able to produce a segfault 
with around 2.2k terminated containers on Fedora 25 with today's master 
({{8e2e52c}}).
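
The dependence on the number of terminated containers fits the stack-overflow 
reading of the backtrace: if each completed read runs the continuation for the 
next read inline, the stack depth presumably grows with the amount of output 
to consume. The following is only a minimal sketch of that pattern, not 
libprocess code. ReadySource, DepthProbe and the read_all_* functions are 
hypothetical names, and the depth is counted logically instead of actually 
overflowing the stack.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a source whose reads always complete immediately
// (for example a pipe that already holds all of its output).
struct ReadySource {
  std::size_t chunks_left;

  // Consumes one chunk; returns false once the source is exhausted (EOF).
  bool read_chunk() {
    if (chunks_left == 0) return false;
    --chunks_left;
    return true;
  }
};

// Records how deeply the read handlers are nested on the call stack.
struct DepthProbe {
  std::size_t current = 0;
  std::size_t max = 0;
  void enter() { if (++current > max) max = current; }
  void leave() { --current; }
};

// Recursive style: the handler for the next chunk is invoked from inside the
// handler for the previous one, so every chunk adds another level of nesting.
void read_all_recursive(ReadySource& src, DepthProbe& probe) {
  probe.enter();
  if (src.read_chunk()) {
    read_all_recursive(src, probe);
  }
  probe.leave();
}

// Iterative style: each handler returns before the next read is started, so
// the nesting depth stays constant regardless of the amount of data.
void read_all_iterative(ReadySource& src, DepthProbe& probe) {
  bool more = true;
  while (more) {
    probe.enter();
    more = src.read_chunk();
    probe.leave();
  }
}

std::size_t max_depth_recursive(std::size_t chunks) {
  ReadySource src{chunks};
  DepthProbe probe;
  read_all_recursive(src, probe);
  return probe.max;
}

std::size_t max_depth_iterative(std::size_t chunks) {
  ReadySource src{chunks};
  DepthProbe probe;
  read_all_iterative(src, probe);
  return probe.max;
}
```

With 1000 chunks the recursive variant nests 1001 levels deep (one per chunk 
plus the final EOF call), while the loop never goes deeper than 1.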


[jira] [Updated] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-2369:
------------------------------------
    Affects Version/s: 1.3.0, 1.2.0


[jira] [Updated] (MESOS-6606) Reject optimized builds with libcxx before 3.9

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-6606:
------------------------------------
    Target Version/s: 1.1.0, 1.0.2, 1.2.0, 1.3.0

> Reject optimized builds with libcxx before 3.9
> ----------------------------------------------
>
>                 Key: MESOS-6606
>                 URL: https://issues.apache.org/jira/browse/MESOS-6606
>             Project: Mesos
>          Issue Type: Bug
>          Components: build
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>              Labels: newbie
>
> Recent clang versions optimize more aggressively, which leads to runtime 
> errors in valid code (see e.g. MESOS-5745) because libcxx-3.8 and earlier 
> contain code that exposes undefined behavior. This was fixed upstream in 
> libcxx-3.9. See https://reviews.llvm.org/D20786 for the patch and 
> https://llvm.org/bugs/show_bug.cgi?id=28469 for the code example extracted 
> from our code base.
> We should consider rejecting builds if libcxx-3.8 or older is detected, since 
> not all users compiling Mesos might run the test suite. In our decision to 
> reject, we could also take the clang version in use into account (which 
> would just ensure we don't run into the known problems from the UB in 
> libcxx).
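
One possible shape for such a check, as a compile-time sketch only (this is 
not the patch for this ticket). It assumes that libc++ reports its version 
through the _LIBCPP_VERSION macro with 3.9 encoded as 3900, and that the 
compiler defines __OPTIMIZE__ for optimized builds, as GCC and Clang do. The 
names reject_build, kLibcxxVersion and kOptimized are made up for the example.

```cpp
#include <cassert>
#include <cstddef>  // any standard header makes libc++ define _LIBCPP_VERSION

// Decision rule from the description: reject only optimized builds against a
// libc++ older than 3.9. A version of 0 means "not libc++ at all".
constexpr bool reject_build(long libcxx_version, bool optimized) {
  return optimized && libcxx_version != 0 && libcxx_version < 3900;
}

#if defined(_LIBCPP_VERSION)
constexpr long kLibcxxVersion = _LIBCPP_VERSION;
#else
constexpr long kLibcxxVersion = 0;  // some other standard library
#endif

#if defined(__OPTIMIZE__)
constexpr bool kOptimized = true;
#else
constexpr bool kOptimized = false;
#endif

// Fails the compilation instead of producing a binary that may misbehave.
static_assert(!reject_build(kLibcxxVersion, kOptimized),
              "optimized builds require libc++ 3.9 or newer");
```

Doing the check in a header keeps it independent of the build system. A 
configure-time test could apply the same rule and would additionally allow 
taking the clang version into account.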





[jira] [Commented] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Bruce Merry (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861315#comment-15861315
 ] 

Bruce Merry commented on MESOS-2369:
------------------------------------

Confirmed on another machine, this time running Ubuntu 16.04, again with Mesos 
1.1.0 from the Mesosphere PPA. To reproduce (as root):
1. for i in {1..4000}; do docker run ubuntu:xenial-20161010 /bin/true; done
2. service mesos-slave stop
3. MESOS_RECOVER=cleanup mesos-init-wrapper slave
I found that 3000 wasn't enough to trigger the segfault, but 4000 was.

If I run "ulimit -s 32768" first, then the segfault does not occur, so it is 
presumably a stack overflow.
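
For reference, the limit changed by "ulimit -s" can also be read from inside 
a process. A small sketch using POSIX getrlimit; the helper names are made up 
for illustration, and it assumes "ulimit -s" takes its argument in KiB, as 
bash does.

```cpp
#include <cassert>
#include <sys/resource.h>

// Converts a "ulimit -s" style value (KiB) to bytes.
constexpr unsigned long long kib_to_bytes(unsigned long long kib) {
  return kib * 1024ULL;
}

// Returns the soft stack size limit of the calling process in bytes.
// The value is RLIM_INFINITY when the stack is unlimited, and 0 is used here
// to signal that getrlimit itself failed.
rlim_t soft_stack_limit() {
  struct rlimit limit;
  if (getrlimit(RLIMIT_STACK, &limit) != 0) {
    return 0;
  }
  return limit.rlim_cur;
}
```

Under that assumption, "ulimit -s 32768" corresponds to a soft limit of 
33554432 bytes (32 MiB).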

I'm not sure if this is related to MESOS_RECOVER=cleanup at all; I included it 
since that is where I first encountered the issue.


[jira] [Commented] (MESOS-7076) libprocess tests fail when using libevent 2.1.8

2017-02-10 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861234#comment-15861234
 ] 

Jan Schlicht commented on MESOS-7076:
-------------------------------------

Seems to be related to {{send}} and {{recv}}. The tests that are failing all 
send or receive on a socket.

> libprocess tests fail when using libevent 2.1.8
> -----------------------------------------------
>
>                 Key: MESOS-7076
>                 URL: https://issues.apache.org/jira/browse/MESOS-7076
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, tests
>         Environment: macOS 10.12.3, libevent 2.1.8 (installed via Homebrew)
>            Reporter: Jan Schlicht
>
> When running {{libprocess-tests}} on Mesos compiled with {{--enable-libevent 
> --enable-ssl}} on macOS with libevent 2.1.8 installed through Homebrew, 
> SSL-related tests fail like this:
> {noformat}
> [ RUN  ] SSLTest.SSLSocket
> I0207 15:20:46.017881 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.017904 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.017918 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.017923 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0207 15:20:46.033001 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.033179 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.033196 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.033201 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> ../../../3rdparty/libprocess/src/tests/ssl_tests.cpp:257: Failure
> Failed to wait 15secs for Socket(socket.get()).recv()
> [  FAILED  ] SSLTest.SSLSocket (15196 ms)
> {noformat}
> Tests failing are {{SSLTest.SSLSocket}}, {{SSLTest.NoVerifyBadCA}}, 
> {{SSLTest.VerifyCertificate}}, {{SSLTest.ProtocolMismatch}}, 
> {{SSLTest.HTTPSGet}}, {{SSLTest.HTTPSPost}}, {{SSLTest.SilentSocket}} 
> (hangs), {{HTTPTest.Endpoints}}, {{HTTPTest.EndpointsHelp}}, 
> {{HTTPTest.Get}}, {{HTTPTest.NestedGet}}, {{HTTPTest.StreamingGetComplete}}, 
> {{HTTPTest.StreamingGetFailure}}, {{HTTPTest.Post}}, {{HTTPTest.Delete}}, 
> {{HTTPTest.Request}}, {{NetSocketTest.EOFBeforeRecv}}, 
> {{NetSocketTest.EOFAfterRecv}}, {{SSLTest.BasicSameProcess}}, 
> {{SSLTest.BasicSameProcessUnix}}, and {{SSLTest.RequireCertificate}}. It 
> hasn't been tested whether Linux builds are affected as well.





[jira] [Updated] (MESOS-7114) GroupTest.GroupCancelWithDisconnect fails on Mac OS.

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7114:
------------------------------------
    Issue Type: Bug  (was: Task)

> GroupTest.GroupCancelWithDisconnect fails on Mac OS.
> ----------------------------------------------------
>
>                 Key: MESOS-7114
>                 URL: https://issues.apache.org/jira/browse/MESOS-7114
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>         Environment: OS X
>            Reporter: Benjamin Bannier
>              Labels: mesosphere
>
> We recently saw {{GroupTest.GroupCancelWithDisconnect}} fail on a recent OS X 
> in an SSL build in our internal CI:
> {code}
> [ RUN  ] GroupTest.GroupCancelWithDisconnect
> I0209 19:22:17.574175 1985630208 zookeeper_test_server.cpp:156] Started 
> ZooKeeperTestServer on port 55440
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@726: Client 
> environment:zookeeper.version=zookeeper C client 3.4.8
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@730: Client 
> environment:host.name=Jenkinss-Mac-mini.local
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@737: Client 
> environment:os.name=Darwin
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@738: Client 
> environment:os.arch=15.6.0
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@739: Client 
> environment:os.version=Darwin Kernel Version 15.6.0: Mon Jan  9 23:07:29 PST 
> 2017; root:xnu-3248.60.11.2.1~1/RELEASE_X86_64
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@747: Client 
> environment:user.name=jenkins
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@755: Client 
> environment:user.home=/Users/jenkins
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@767: Client 
> environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@zookeeper_init@800: 
> Initiating client connection, host=127.0.0.1:55440 sessionTimeout=1 
> watcher=0x10eba31e0 sessionId=0 sessionPasswd= context=0x7f8edb6dee50 
> flags=0
> 2017-02-09 19:22:17,574:84405(0x70cb4000):ZOO_INFO@check_events@1728: 
> initiated connection to server [127.0.0.1:55440]
> 2017-02-09 19:22:17,578:84405(0x70cb4000):ZOO_INFO@check_events@1775: 
> session establishment complete on server [127.0.0.1:55440], 
> sessionId=0x15a260af865, negotiated timeout=1
> I0209 19:22:17.578824 3211264 group.cpp:340] Group process 
> (zookeeper-group(44)@10.0.90.182:54133) connected to ZooKeeper
> I0209 19:22:17.578876 3211264 group.cpp:830] Syncing group operations: queue 
> size (joins, cancels, datas) = (1, 0, 0)
> I0209 19:22:17.578893 3211264 group.cpp:418] Trying to create path '/test' in 
> ZooKeeper
> I0209 19:22:17.582217 2674688 group.cpp:699] Trying to get '/test/00' 
> in ZooKeeper
> I0209 19:22:17.582960 1985630208 zookeeper_test_server.cpp:116] Shutting down 
> ZooKeeperTestServer on port 55440
> 2017-02-09 
> 19:22:17,583:84405(0x70cb4000):ZOO_ERROR@handle_socket_error_msg@1746: 
> Socket [127.0.0.1:55440] zk retcode=-4, errno=64(Host is down): failed while 
> receiving a server response
> I0209 19:22:17.583799 1601536 group.cpp:451] Lost connection to ZooKeeper, 
> attempting to reconnect ...
> I0209 19:22:17.584373 1601536 group.cpp:656] Trying to remove 
> '/test/00' in ZooKeeper
> 2017-02-09 19:22:17,584:84405(0x70cb4000):ZOO_INFO@check_events@1728: 
> initiated connection to server [127.0.0.1:55440]
> 2017-02-09 
> 19:22:17,585:84405(0x70cb4000):ZOO_ERROR@handle_socket_error_msg@1746: 
> Socket [127.0.0.1:55440] zk retcode=-4, errno=64(Host is down): failed while 
> receiving a server response
> I0209 19:22:17.586333 1985630208 zookeeper_test_server.cpp:156] Started 
> ZooKeeperTestServer on port 55440
> 2017-02-09 
> 19:23:05,168:84405(0x70cb4000):ZOO_WARN@zookeeper_interest@1570: Exceeded 
> deadline by 44249ms
> 2017-02-09 19:23:05,196:84405(0x70cb4000):ZOO_INFO@check_events@1728: 
> initiated connection to server [127.0.0.1:55440]
> 2017-02-09 
> 19:23:05,232:84405(0x70cb4000):ZOO_ERROR@handle_socket_error_msg@1764: 
> Socket [127.0.0.1:55440] zk retcode=-112, errno=70(Stale NFS file handle): 
> sessionId=0x15a260af865 has expired.
> I0209 19:23:05.232564 2138112 group.cpp:830] Syncing group operations: queue 
> size (joins, cancels, datas) = (0, 1, 0)
> I0209 19:23:05.243890 2138112 group.cpp:656] Trying to remove 
> '/test/00' in ZooKeeper
> W0209 19:23:05.257310 2138112 group.cpp:494] Timed out waiting to connect to 
> ZooKeeper. Forcing ZooKeeper session (sessionId=15a260af865) expiration
> I0209 19:23:05.258072 2138112 group.cpp:510] ZooKeeper session expired
> ../../src/tests/group_tests.cpp:183: Failure
> Failed to wait 15secs for cancellation
> 2017

[jira] [Created] (MESOS-7114) GroupTest.GroupCancelWithDisconnect fails on Mac OS.

2017-02-10 Thread Benjamin Bannier (JIRA)

Benjamin Bannier created MESOS-7114:
---------------------------------------

             Summary: GroupTest.GroupCancelWithDisconnect fails on Mac OS.
                 Key: MESOS-7114
                 URL: https://issues.apache.org/jira/browse/MESOS-7114
             Project: Mesos
          Issue Type: Task
          Components: test
         Environment: OS X
            Reporter: Benjamin Bannier


We recently saw {{GroupTest.GroupCancelWithDisconnect}} fail on a recent OS X 
in an SSL build in our internal CI:

{code}
[ RUN  ] GroupTest.GroupCancelWithDisconnect
I0209 19:22:17.574175 1985630208 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 55440
2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@726: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@730: Client 
environment:host.name=Jenkinss-Mac-mini.local
2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@737: Client 
environment:os.name=Darwin
2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@738: Client 
environment:os.arch=15.6.0
2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@739: Client 
environment:os.version=Darwin Kernel Version 15.6.0: Mon Jan  9 23:07:29 PST 
2017; root:xnu-3248.60.11.2.1~1/RELEASE_X86_64
2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@747: Client 
environment:user.name=jenkins
2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@755: Client 
environment:user.home=/Users/jenkins
2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@767: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@zookeeper_init@800: 
Initiating client connection, host=127.0.0.1:55440 sessionTimeout=1 
watcher=0x10eba31e0 sessionId=0 sessionPasswd= context=0x7f8edb6dee50 
flags=0
2017-02-09 19:22:17,574:84405(0x70cb4000):ZOO_INFO@check_events@1728: 
initiated connection to server [127.0.0.1:55440]
2017-02-09 19:22:17,578:84405(0x70cb4000):ZOO_INFO@check_events@1775: 
session establishment complete on server [127.0.0.1:55440], 
sessionId=0x15a260af865, negotiated timeout=1
I0209 19:22:17.578824 3211264 group.cpp:340] Group process 
(zookeeper-group(44)@10.0.90.182:54133) connected to ZooKeeper
I0209 19:22:17.578876 3211264 group.cpp:830] Syncing group operations: queue 
size (joins, cancels, datas) = (1, 0, 0)
I0209 19:22:17.578893 3211264 group.cpp:418] Trying to create path '/test' in 
ZooKeeper
I0209 19:22:17.582217 2674688 group.cpp:699] Trying to get '/test/00' 
in ZooKeeper
I0209 19:22:17.582960 1985630208 zookeeper_test_server.cpp:116] Shutting down 
ZooKeeperTestServer on port 55440
2017-02-09 
19:22:17,583:84405(0x70cb4000):ZOO_ERROR@handle_socket_error_msg@1746: 
Socket [127.0.0.1:55440] zk retcode=-4, errno=64(Host is down): failed while 
receiving a server response
I0209 19:22:17.583799 1601536 group.cpp:451] Lost connection to ZooKeeper, 
attempting to reconnect ...
I0209 19:22:17.584373 1601536 group.cpp:656] Trying to remove 
'/test/00' in ZooKeeper
2017-02-09 19:22:17,584:84405(0x70cb4000):ZOO_INFO@check_events@1728: 
initiated connection to server [127.0.0.1:55440]
2017-02-09 
19:22:17,585:84405(0x70cb4000):ZOO_ERROR@handle_socket_error_msg@1746: 
Socket [127.0.0.1:55440] zk retcode=-4, errno=64(Host is down): failed while 
receiving a server response
I0209 19:22:17.586333 1985630208 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 55440
2017-02-09 19:23:05,168:84405(0x70cb4000):ZOO_WARN@zookeeper_interest@1570: 
Exceeded deadline by 44249ms
2017-02-09 19:23:05,196:84405(0x70cb4000):ZOO_INFO@check_events@1728: 
initiated connection to server [127.0.0.1:55440]
2017-02-09 
19:23:05,232:84405(0x70cb4000):ZOO_ERROR@handle_socket_error_msg@1764: 
Socket [127.0.0.1:55440] zk retcode=-112, errno=70(Stale NFS file handle): 
sessionId=0x15a260af865 has expired.
I0209 19:23:05.232564 2138112 group.cpp:830] Syncing group operations: queue 
size (joins, cancels, datas) = (0, 1, 0)
I0209 19:23:05.243890 2138112 group.cpp:656] Trying to remove 
'/test/00' in ZooKeeper
W0209 19:23:05.257310 2138112 group.cpp:494] Timed out waiting to connect to 
ZooKeeper. Forcing ZooKeeper session (sessionId=15a260af865) expiration
I0209 19:23:05.258072 2138112 group.cpp:510] ZooKeeper session expired
../../src/tests/group_tests.cpp:183: Failure
Failed to wait 15secs for cancellation
2017-02-09 19:23:05,260:84405(0x7020a000):ZOO_INFO@zookeeper_close@2543: 
Freeing zookeeper resources for sessionId=0x15a260af865

2017-02-09 19:23:05,260:84405(0x7028d000):ZOO_INFO@log_env@726: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2017-02-09 19:23:05,260:84405(0x7028d000):ZOO_INFO@log_env@730: Client 
environment:host.name=Jenkinss-Mac-mini.local
2

[jira] [Created] (MESOS-7113) Ensure Mesos can be built and tests successfully on Mac OS.

2017-02-10 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7113:
---

 Summary: Ensure Mesos can be built and tests successfully on Mac 
OS.
 Key: MESOS-7113
 URL: https://issues.apache.org/jira/browse/MESOS-7113
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Bannier


This is a tracking bug collecting issues in building or running Mesos on recent 
Mac OS.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7083) No master is currently leading

2017-02-10 Thread hemanth makaraju (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861063#comment-15861063
 ] 

hemanth makaraju commented on MESOS-7083:
-

Thank you, it works with 1.1.1.

> No master is currently leading
> --
>
> Key: MESOS-7083
> URL: https://issues.apache.org/jira/browse/MESOS-7083
> Project: Mesos
>  Issue Type: Bug
>  Components: master, webui
>Affects Versions: 1.1.0
>Reporter: hemanth makaraju
>Assignee: haosdent
>
> When I open http://127.0.0.1:5050 in a web browser I see "No master is currently 
> leading", but the mesos-resolve command detects the master:
> mesos-resolve zk://172.17.0.2:2181/mesos
> I0208 11:17:33.489379 24715 zookeeper.cpp:259] A new leading master 
> (UPID=master@127.0.0.1:5050) is detected
> This is the command I used to run mesos-master:
> mesos-master --zk=zk://127.0.0.1:2181/mesos --quorum=1 
> --advertise_ip=127.0.0.1 --advertise_port=5050 --work_dir=/mesos/master





[jira] [Created] (MESOS-7112) Ensure Mesos can be built and tests successfully on Debian8

2017-02-10 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7112:
---

 Summary: Ensure Mesos can be built and tests successfully on 
Debian8
 Key: MESOS-7112
 URL: https://issues.apache.org/jira/browse/MESOS-7112
 Project: Mesos
  Issue Type: Task
 Environment: debian8
Reporter: Benjamin Bannier


This is a tracking bug collecting issues in building or running Mesos on 
Debian8.





[jira] [Updated] (MESOS-7108) Ensure Mesos can be built and tests successfully on Ubuntu12

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7108:

Issue Type: Task  (was: Story)

> Ensure Mesos can be built and tests successfully on Ubuntu12
> 
>
> Key: MESOS-7108
> URL: https://issues.apache.org/jira/browse/MESOS-7108
> Project: Mesos
>  Issue Type: Task
> Environment: ubuntu-12
>Reporter: Benjamin Bannier
>  Labels: mesosphere
>
> This is a tracking bug collecting issues in building or running Mesos on 
> ubuntu-12.





[jira] [Updated] (MESOS-7109) Ensure Mesos can be built and tests successfully on Ubuntu14

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7109:

Issue Type: Task  (was: Story)

> Ensure Mesos can be built and tests successfully on Ubuntu14
> 
>
> Key: MESOS-7109
> URL: https://issues.apache.org/jira/browse/MESOS-7109
> Project: Mesos
>  Issue Type: Task
> Environment: ubuntu-14
>Reporter: Benjamin Bannier
>  Labels: mesosphere
>
> This is a tracking bug collecting issues in building or running Mesos on 
> ubuntu-14.





[jira] [Updated] (MESOS-7107) Ensure Mesos can be built and tests successfully on Centos7

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7107:

Issue Type: Task  (was: Story)

> Ensure Mesos can be built and tests successfully on Centos7
> ---
>
> Key: MESOS-7107
> URL: https://issues.apache.org/jira/browse/MESOS-7107
> Project: Mesos
>  Issue Type: Task
> Environment: centos-7
>Reporter: Benjamin Bannier
>  Labels: mesosphere
>
> This is a tracking bug collecting issues in building or running Mesos on 
> centos-7.





[jira] [Updated] (MESOS-7110) Ensure Mesos can be built and tests successfully on Ubuntu16

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7110:

Issue Type: Task  (was: Story)

> Ensure Mesos can be built and tests successfully on Ubuntu16
> 
>
> Key: MESOS-7110
> URL: https://issues.apache.org/jira/browse/MESOS-7110
> Project: Mesos
>  Issue Type: Task
> Environment: ubuntu-16
>Reporter: Benjamin Bannier
>  Labels: mesosphere
>
> This is a tracking bug collecting issues in building or running Mesos on 
> ubuntu-16.





[jira] [Created] (MESOS-7111) HttpFaultToleranceTest.SchedulerFailoverFrameworkToExecutorMessage segfaults

2017-02-10 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7111:
---

 Summary: 
HttpFaultToleranceTest.SchedulerFailoverFrameworkToExecutorMessage segfaults
 Key: MESOS-7111
 URL: https://issues.apache.org/jira/browse/MESOS-7111
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 1.2.0
 Environment: ubuntu-16
Reporter: Benjamin Bannier


We observed a segfault in 
{{HttpFaultToleranceTest.SchedulerFailoverFrameworkToExecutorMessage}} in 
internal CI on an ubuntu16 machine. Note that ubuntu16 uses gcc-6.
{code}
[ RUN  ] HttpFaultToleranceTest.SchedulerFailoverFrameworkToExecutorMessage
I0210 02:47:31.260174 19578 cluster.cpp:160] Creating default 'local' authorizer
I0210 02:47:31.261225 19597 master.cpp:383] Master 
d8129420-2a04-48e7-9b28-6b0a0af73168 (ip-10-150-111-24.ec2.internal) started on 
10.150.111.24:33608
I0210 02:47:31.261281 19597 master.cpp:385] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="false" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/fBrqHi/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/fBrqHi/master" 
--zk_session_timeout="10secs"
I0210 02:47:31.261404 19597 master.cpp:437] Master allowing unauthenticated 
frameworks to register
I0210 02:47:31.261411 19597 master.cpp:449] Master only allowing authenticated 
agents to register
I0210 02:47:31.261415 19597 master.cpp:462] Master only allowing authenticated 
HTTP frameworks to register
I0210 02:47:31.261420 19597 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/fBrqHi/credentials'
I0210 02:47:31.261488 19597 master.cpp:507] Using default 'crammd5' 
authenticator
I0210 02:47:31.261530 19597 http.cpp:919] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0210 02:47:31.261591 19597 http.cpp:919] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0210 02:47:31.261631 19597 http.cpp:919] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0210 02:47:31.261698 19597 master.cpp:587] Authorization enabled
I0210 02:47:31.261754 19601 whitelist_watcher.cpp:77] No whitelist given
I0210 02:47:31.261754 19602 hierarchical.cpp:161] Initialized hierarchical 
allocator process
I0210 02:47:31.262462 19597 master.cpp:2124] Elected as the leading master!
I0210 02:47:31.262482 19597 master.cpp:1646] Recovering from registrar
I0210 02:47:31.262545 19603 registrar.cpp:329] Recovering registrar
I0210 02:47:31.262774 19602 registrar.cpp:362] Successfully fetched the 
registry (0B) in 201984ns
I0210 02:47:31.262809 19602 registrar.cpp:461] Applied 1 operations in 2963ns; 
attempting to update the registry
I0210 02:47:31.263062 19599 registrar.cpp:506] Successfully updated the 
registry in 214016ns
I0210 02:47:31.263119 19599 registrar.cpp:392] Successfully recovered registrar
I0210 02:47:31.263267 19597 master.cpp:1762] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to re-register
I0210 02:47:31.263295 19598 hierarchical.cpp:188] Skipping recovery of 
hierarchical allocator: nothing to recover
I0210 02:47:31.264645 19578 cluster.cpp:446] Creating default 'local' authorizer
I0210 02:47:31.265029 19598 slave.cpp:211] Mesos agent started on 
(105)@10.150.111.24:33608
I0210 02:47:31.265187 19578 scheduler.cpp:184] Version: 1.3.0
I0210 02:47:31.265043 19598 slave.cpp:212] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers

[jira] [Created] (MESOS-7110) Ensure Mesos can be built and tests successfully on Ubuntu16

2017-02-10 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7110:
---

 Summary: Ensure Mesos can be built and tests successfully on 
Ubuntu16
 Key: MESOS-7110
 URL: https://issues.apache.org/jira/browse/MESOS-7110
 Project: Mesos
  Issue Type: Story
 Environment: ubuntu-16
Reporter: Benjamin Bannier


This is a tracking bug collecting issues in building or running Mesos on 
ubuntu-16.





[jira] [Updated] (MESOS-7108) Ensure Mesos can be built and tests successfully on Ubuntu12

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7108:

Labels: mesosphere  (was: )

> Ensure Mesos can be built and tests successfully on Ubuntu12
> 
>
> Key: MESOS-7108
> URL: https://issues.apache.org/jira/browse/MESOS-7108
> Project: Mesos
>  Issue Type: Story
> Environment: ubuntu-12
>Reporter: Benjamin Bannier
>  Labels: mesosphere
>
> This is a tracking bug collecting issues in building or running Mesos on 
> ubuntu-12.





[jira] [Updated] (MESOS-7109) Ensure Mesos can be built and tests successfully on Ubuntu14

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7109:

Labels: mesosphere  (was: )

> Ensure Mesos can be built and tests successfully on Ubuntu14
> 
>
> Key: MESOS-7109
> URL: https://issues.apache.org/jira/browse/MESOS-7109
> Project: Mesos
>  Issue Type: Story
> Environment: ubuntu-14
>Reporter: Benjamin Bannier
>  Labels: mesosphere
>
> This is a tracking bug collecting issues in building or running Mesos on 
> ubuntu-14.





[jira] [Updated] (MESOS-7108) Ensure Mesos can be built and tests successfully on Ubuntu12

2017-02-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7108:

Description: 
This is a tracking bug collecting issues in building or running Mesos on 
ubuntu-12.


  was:
This is a tracking bug collecting issues in building or running Mesos on 
centos-7.



> Ensure Mesos can be built and tests successfully on Ubuntu12
> 
>
> Key: MESOS-7108
> URL: https://issues.apache.org/jira/browse/MESOS-7108
> Project: Mesos
>  Issue Type: Story
> Environment: ubuntu-12
>Reporter: Benjamin Bannier
>
> This is a tracking bug collecting issues in building or running Mesos on 
> ubuntu-12.





[jira] [Created] (MESOS-7109) Ensure Mesos can be built and tests successfully on Ubuntu14

2017-02-10 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7109:
---

 Summary: Ensure Mesos can be built and tests successfully on 
Ubuntu14
 Key: MESOS-7109
 URL: https://issues.apache.org/jira/browse/MESOS-7109
 Project: Mesos
  Issue Type: Story
 Environment: ubuntu-14
Reporter: Benjamin Bannier


This is a tracking bug collecting issues in building or running Mesos on 
ubuntu-14.





[jira] [Created] (MESOS-7108) Ensure Mesos can be built and tests successfully on Ubuntu12

2017-02-10 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7108:
---

 Summary: Ensure Mesos can be built and tests successfully on 
Ubuntu12
 Key: MESOS-7108
 URL: https://issues.apache.org/jira/browse/MESOS-7108
 Project: Mesos
  Issue Type: Story
 Environment: ubuntu-12
Reporter: Benjamin Bannier


This is a tracking bug collecting issues in building or running Mesos on 
centos-7.






[jira] [Created] (MESOS-7107) Ensure Mesos can be built and tests successfully on Centos7

2017-02-10 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7107:
---

 Summary: Ensure Mesos can be built and tests successfully on 
Centos7
 Key: MESOS-7107
 URL: https://issues.apache.org/jira/browse/MESOS-7107
 Project: Mesos
  Issue Type: Story
 Environment: centos-7
Reporter: Benjamin Bannier


This is a tracking bug collecting issues in building or running Mesos on 
centos-7.





[jira] [Updated] (MESOS-4828) XFS disk quota isolator

2017-02-10 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4828:
--
Target Version/s:   (was: 1.2.0)

> XFS disk quota isolator
> ---
>
> Key: MESOS-4828
> URL: https://issues.apache.org/jira/browse/MESOS-4828
> Project: Mesos
>  Issue Type: Epic
>  Components: isolation
>Reporter: James Peach
>Assignee: James Peach
>
> Implement a disk resource isolator using XFS project quotas. Compared to the 
> {{posix/disk}} isolator, this doesn't need to scan the filesystem 
> periodically, and applications receive an {{EDQUOT}} error instead of being 
> summarily killed.
> This initial implementation only isolates sandbox directory resources, since 
> isolation doesn't have any visibility into the lifecycle of volumes, 
> which is needed to assign and track project IDs.
> The build dependencies for this are XFS header (from xfsprogs-devel) and 
> libblkid. We need libblkid or the equivalent to map filesystem paths to block 
> devices in order to apply quota.
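
The practical difference described above is that an application under an XFS project quota sees a failed write with errno set to EDQUOT rather than being killed. The following is a minimal sketch of how an application might detect that case; it is not Mesos code, and the names WriteOutcome, classifyErrno and writeAll are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <unistd.h>

// Hypothetical result type for this sketch (not part of Mesos).
enum class WriteOutcome { Ok, QuotaExceeded, OtherError };

// Maps an errno value to an outcome. EDQUOT is the error a process
// receives when a write would exceed a disk quota.
WriteOutcome classifyErrno(int err) {
  if (err == 0) {
    return WriteOutcome::Ok;
  }
  if (err == EDQUOT) {
    return WriteOutcome::QuotaExceeded;
  }
  return WriteOutcome::OtherError;
}

// Writes the whole buffer to fd, retrying on EINTR and short writes.
// On failure, reports whether the cause was a quota violation.
WriteOutcome writeAll(int fd, const std::string& data) {
  assert(fd >= 0);
  std::size_t offset = 0;
  while (offset < data.size()) {
    ssize_t n = ::write(fd, data.data() + offset, data.size() - offset);
    if (n < 0) {
      if (errno == EINTR) {
        continue;  // Interrupted before writing anything; retry.
      }
      return classifyErrno(errno);
    }
    offset += static_cast<std::size_t>(n);
  }
  return WriteOutcome::Ok;
}
```

A task that receives QuotaExceeded can clean up or report the condition itself, which is the behavior the description contrasts with the {{posix/disk}} isolator.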



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-4828) XFS disk quota isolator

2017-02-10 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860901#comment-15860901
 ] 

Adam B commented on MESOS-4828:
---

Looks like the bulk of this Epic was completed 10 months ago, so it's not 
reasonable to call it out in the 1.2.0 CHANGELOG, and the work is ongoing, so 
I'm just going to remove the targetVersion, and you can set the fixVersion 
whenever you do resolve the Epic.

> XFS disk quota isolator
> ---
>
> Key: MESOS-4828
> URL: https://issues.apache.org/jira/browse/MESOS-4828
> Project: Mesos
>  Issue Type: Epic
>  Components: isolation
>Reporter: James Peach
>Assignee: James Peach
>
> Implement a disk resource isolator using XFS project quotas. Compared to the 
> {{posix/disk}} isolator, this doesn't need to scan the filesystem 
> periodically, and applications receive an {{EDQUOT}} error instead of being 
> summarily killed.
> This initial implementation only isolates sandbox directory resources, since 
> isolation doesn't have any visibility into the lifecycle of volumes, 
> which is needed to assign and track project IDs.
> The build dependencies for this are XFS header (from xfsprogs-devel) and 
> libblkid. We need libblkid or the equivalent to map filesystem paths to block 
> devices in order to apply quota.





[jira] [Commented] (MESOS-6913) AgentAPIStreamingTest.AttachInputToNestedContainerSession fails on Mac OS.

2017-02-10 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860897#comment-15860897
 ] 

Adam B commented on MESOS-6913:
---

[~anandmazumdar], please keep me up to date on your investigation. If it's 
small and low-risk, we can cherry-pick it into 1.2.0-rc2. If not, let's 
retarget to 1.3.0. Or maybe you discover that this issue is indeed fixed in 
1.2.0 already.

> AgentAPIStreamingTest.AttachInputToNestedContainerSession fails on Mac OS.
> --
>
> Key: MESOS-6913
> URL: https://issues.apache.org/jira/browse/MESOS-6913
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: Mac OS 10.11.6 with Apple clang-703.0.31
>Reporter: Alexander Rukletsov
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] 
> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0
> make[3]: *** [check-local] Illegal instruction: 4
> make[2]: *** [check-am] Error 2
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> {noformat}





[jira] [Commented] (MESOS-7050) IOSwitchboard FDs leaked when containerizer launch fails -- leads to deadlock

2017-02-10 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860893#comment-15860893
 ] 

Adam B commented on MESOS-7050:
---

I removed the targetVersion. It didn't make it into 1.2.0-rc1, and it sounds 
like the real fix would be non-trivial.
Feel free to retarget to 1.3.0 if/when you like.

> IOSwitchboard FDs leaked when containerizer launch fails -- leads to deadlock
> -
>
> Key: MESOS-7050
> URL: https://issues.apache.org/jira/browse/MESOS-7050
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>Priority: Critical
>  Labels: debugging, mesosphere
>
> If the containerizer launch path fails before actually
> launching the container, the FDs allocated to the container by the
> IOSwitchboard isolator are leaked. This leads to deadlock in
> the destroy path because the IOSwitchboard does not shut down until the
> FDs it allocates to the container have been closed. Since the
> switchboard doesn't shut down, the future returned by its 'cleanup()'
> function is never satisfied. 
> We need a general-purpose method for closing the IOSwitchboard FDs when 
> failing in the launch path.
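
One common way to avoid this kind of leak is a scope guard that closes the descriptors unless the launch path explicitly hands them over. The sketch below illustrates that general pattern only; it is not the fix adopted in Mesos, and FdGuard, isOpen and launchWithPipe are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>
#include <utility>
#include <vector>

// Hypothetical guard: closes every FD it still owns when it goes out
// of scope, so any early return in a launch path cannot leak them.
class FdGuard {
public:
  explicit FdGuard(std::vector<int> fds) : fds_(std::move(fds)) {}
  FdGuard(const FdGuard&) = delete;
  FdGuard& operator=(const FdGuard&) = delete;

  ~FdGuard() {
    for (int fd : fds_) {
      ::close(fd);
    }
  }

  // Gives up ownership once the container has taken over the FDs.
  std::vector<int> release() {
    std::vector<int> out;
    out.swap(fds_);
    return out;
  }

private:
  std::vector<int> fds_;
};

// Returns true if fd refers to an open file descriptor.
bool isOpen(int fd) {
  return ::fcntl(fd, F_GETFD) != -1;
}

// Hypothetical launch path: allocates a pipe, then either fails early
// (the guard closes both ends) or succeeds (ownership is released).
// The read end's number is stored in *readFd so callers can inspect it.
bool launchWithPipe(bool failBeforeLaunch, int* readFd) {
  assert(readFd != nullptr);
  int fds[2];
  if (::pipe(fds) != 0) {
    return false;
  }
  std::vector<int> owned{fds[0], fds[1]};
  FdGuard guard(std::move(owned));
  *readFd = fds[0];

  if (failBeforeLaunch) {
    return false;  // guard's destructor closes both FDs here.
  }

  guard.release();  // The launched container now owns the FDs.
  return true;
}
```

With such a guard, a failure before the container starts leaves no FDs open, so a component that waits for them to be closed would not block on them.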





[jira] [Updated] (MESOS-7050) IOSwitchboard FDs leaked when containerizer launch fails -- leads to deadlock

2017-02-10 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7050:
--
Target Version/s:   (was: 1.2.0)

> IOSwitchboard FDs leaked when containerizer launch fails -- leads to deadlock
> -
>
> Key: MESOS-7050
> URL: https://issues.apache.org/jira/browse/MESOS-7050
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>Priority: Critical
>  Labels: debugging, mesosphere
>
> If the containerizer launch path fails before actually
> launching the container, the FDs allocated to the container by the
> IOSwitchboard isolator are leaked. This leads to deadlock in
> the destroy path because the IOSwitchboard does not shut down until the
> FDs it allocates to the container have been closed. Since the
> switchboard doesn't shut down, the future returned by its 'cleanup()'
> function is never satisfied. 
> We need a general-purpose method for closing the IOSwitchboard FDs when 
> failing in the launch path.





[jira] [Commented] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky

2017-02-10 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860891#comment-15860891
 ] 

Adam B commented on MESOS-6784:
---

[~anandmazumdar], should we resolve as fixed in 1.2.0, or continue 
investigating for 1.3?

> IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
> 
>
> Key: MESOS-6784
> URL: https://issues.apache.org/jira/browse/MESOS-6784
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Neil Conway
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
> I1212 13:57:02.641043  2211 containerizer.cpp:220] Using isolation: 
> posix/cpu,filesystem/posix,network/cni
> W1212 13:57:02.641438  2211 backend.cpp:76] Failed to create 'overlay' 
> backend: OverlayBackend requires root privileges, but is running as user nrc
> W1212 13:57:02.641559  2211 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1212 13:57:02.642822  2268 containerizer.cpp:594] Recovering containerizer
> I1212 13:57:02.643975  2253 provisioner.cpp:253] Provisioner recovery complete
> I1212 13:57:02.644953  2255 containerizer.cpp:986] Starting container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework
> I1212 13:57:02.647004  2245 switchboard.cpp:430] Allocated pseudo terminal 
> '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.652305  2245 switchboard.cpp:596] Created I/O switchboard 
> server (pid: 2705) listening on socket file 
> '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.655513  2267 launcher.cpp:133] Forked child with pid '2706' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f'
> I1212 13:57:02.655732  2267 containerizer.cpp:1621] Checkpointing container's 
> forked pid 2706 to 
> '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid'
> I1212 13:57:02.726306  2265 containerizer.cpp:2463] Container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f has exited
> I1212 13:57:02.726352  2265 containerizer.cpp:2100] Destroying container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state
> E1212 13:57:02.726495  2243 switchboard.cpp:861] Unexpected termination of 
> I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.726563  2265 launcher.cpp:149] Asked to destroy container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f
> E1212 13:57:02.783607  2228 switchboard.cpp:799] Failed to remove unix domain 
> socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or 
> directory
> ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure
> Value of: wait.get()->reasons().size() == 1
>   Actual: false
> Expected: true
> *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are 
> using GNU date ***
> PC: @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; 
> stack trace: ***
> @ 0x7faecf855100 (unknown)
> @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> @  0x1be6247 testing::internal::AssertHelper::operator=()
> @  0x19ed751 
> mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody()
> @  0x1c0ed8c 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c09e74 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1beb505 testing::Test::Run()
> @  0x1bebc88 testing::TestInfo::Run()
> @  0x1bec2ce testing::TestCase::Run()
> @  0x1bf2ba8 testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c0f9b1 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c0a9f2 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1bf18ee testing::UnitTest::Run()
> @  0x11bc9e3 RUN_ALL_TESTS()
> @  0x11bc599 main
> @ 0x7faece663b15 __libc_start_main
> @   0xa9c219 (unknown)
> {noformat}





[jira] [Updated] (MESOS-6886) Add authorization tests for debug API handlers

2017-02-10 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6886:
--
Target Version/s: 1.3.0  (was: 1.2.0)

> Add authorization tests for debug API handlers
> --
>
> Key: MESOS-6886
> URL: https://issues.apache.org/jira/browse/MESOS-6886
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Vinod Kone
>Assignee: Alexander Rojas
>  Labels: security
>
> Should test authz of all 3 debug calls.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6913) AgentAPIStreamingTest.AttachInputToNestedContainerSession fails on Mac OS.

2017-02-10 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6913:
--
Fix Version/s: (was: 1.2.0)

> AgentAPIStreamingTest.AttachInputToNestedContainerSession fails on Mac OS.
> --
>
> Key: MESOS-6913
> URL: https://issues.apache.org/jira/browse/MESOS-6913
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: Mac OS 10.11.6 with Apple clang-703.0.31
>Reporter: Alexander Rukletsov
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] 
> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0
> make[3]: *** [check-local] Illegal instruction: 4
> make[2]: *** [check-am] Error 2
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> {noformat}


