[jira] [Commented] (MESOS-10243) MAC Address changes from link::setMAC may not stick, leading to container launch failure with port mapping isolator.

2024-07-15 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866134#comment-17866134
 ] 

Benjamin Mahler commented on MESOS-10243:
-

Landed a fix for the host network namespace veth interface.

Let's leave this open and mark it as fixed once we also set the container 
network namespace eth0 interface's MAC address on creation, or update the 
script to stop setting it.
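
For readers landing here: the workaround described in the issue below boils 
down to a set-then-verify-then-retry loop over the SIOCSIFHWADDR / 
SIOCGIFHWADDR ioctls. A minimal sketch of that pattern (a hypothetical helper, 
not the actual link::setMAC patch in r/75057):

{noformat}
#include <cstring>
#include <net/if.h>
#include <net/if_arp.h>
#include <sys/ioctl.h>

// Write the MAC with SIOCSIFHWADDR, read it back with SIOCGIFHWADDR,
// and retry if the kernel reports a different address than we set.
bool setMacWithRetry(
    int fd, const char* ifname, const unsigned char mac[6], int retries)
{
  for (int i = 0; i < retries; ++i) {
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
    memcpy(ifr.ifr_hwaddr.sa_data, mac, 6);

    if (ioctl(fd, SIOCSIFHWADDR, &ifr) < 0) {
      continue; // The set itself failed; retry.
    }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    if (ioctl(fd, SIOCGIFHWADDR, &ifr) == 0 &&
        memcmp(ifr.ifr_hwaddr.sa_data, mac, 6) == 0) {
      return true; // The address stuck.
    }
  }

  return false;
}
{noformat}

Note that, per scenario 5 in the issue below, this kind of verification cannot 
catch a MAC address that is overwritten after the function returns.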

> MAC Address changes from link::setMAC may not stick, leading to container 
> launch failure with port mapping isolator.
> 
>
> Key: MESOS-10243
> URL: https://issues.apache.org/jira/browse/MESOS-10243
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Jason Zhou
>Assignee: Jason Zhou
>Priority: Major
>
> It seems that there are scenarios where mesos containers cannot communicate 
> with agents because the MAC addresses are set incorrectly, leading to dropped 
> packets. A workaround for this behavior is to check that the MAC address is 
> set correctly after the ioctl call, and to retry setting the address if 
> necessary.
> In our tests, this workaround appears to reduce the frequency of this issue, 
> but does not seem to prevent all such failures.
> Reviewboard ticket for the workaround: [https://reviews.apache.org/r/75057/] 
> Observed scenarios with incorrectly assigned MAC addresses:
> 1. ioctl returns the correct MAC address, but not net::mac
> 2. both net::mac and ioctl return the same MAC address, but are both wrong
> 3. There are no cases where ioctl/net::mac come back with the same MAC
>    address as before setting, i.e. no no-op is observed.
> 4. There is a possibility that ioctl/net::mac results disagree with each
>    other even before attempting to set our desired MAC address. As such, we
>    check that the results agree before we set, and log a warning if we find
>    a mismatch
> 5. There is a possibility that the MAC address we set ends up overwritten by
>    a garbage value after setMAC has already completed and checked that the
>    MAC address was set correctly. Since this error happens after the
>    function has finished, we can neither detect nor log it in setMAC. Our
>    workaround cannot deal with this scenario, as it occurs outside setMAC.
> Notes:
> 1. We have observed this behavior only on CentOS 9 systems at the moment.
>    We have tried kernels 5.15.147, 5.15.160, and 5.15.161, all of which
>    have this issue.
>    CentOS 7 systems do not seem to have this issue with setMAC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (MESOS-9045) LogZooKeeperTest.WriteRead can segfault

2024-07-02 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862647#comment-17862647
 ] 

Benjamin Mahler commented on MESOS-9045:


Very different case, but also a segfault:

{noformat}
[--] 2 tests from LogZooKeeperTest
I0703 02:50:24.968773 185149 zookeeper.cpp:82] Using Java classpath: 
-Djava.class.path=/tmp/SRC/build/mesos-1.12.0/_build/sub/3rdparty/zookeeper-3.4.8/zookeeper-3.4.8.jar:/tmp/SRC/build/mesos-1.12.0/_build/sub/3rdparty/zookeeper-3.4.8/lib/log4j-1.2.16.jar:/tmp/SRC/build/mesos-1.12.0/_build/sub/3rdparty/zookeeper-3.4.8/lib/jline-0.9.94.jar:/tmp/SRC/build/mesos-1.12.0/_build/sub/3rdparty/zookeeper-3.4.8/lib/slf4j-log4j12-1.6.1.jar:/tmp/SRC/build/mesos-1.12.0/_build/sub/3rdparty/zookeeper-3.4.8/lib/netty-3.7.0.Final.jar:/tmp/SRC/build/mesos-1.12.0/_build/sub/3rdparty/zookeeper-3.4.8/lib/slf4j-api-1.6.1.jar
[ RUN  ] LogZooKeeperTest.WriteRead
I0703 02:50:25.058761 185149 jvm.cpp:590] Looking up method 
(Ljava/lang/String;)V
I0703 02:50:25.059170 185149 jvm.cpp:590] Looking up method deleteOnExit()V
I0703 02:50:25.060112 185149 jvm.cpp:590] Looking up method 
(Ljava/io/File;Ljava/io/File;)V
log4j:WARN No appenders could be found for logger 
(org.apache.zookeeper.server.persistence.FileTxnSnapLog).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
I0703 02:50:25.206512 185149 jvm.cpp:590] Looking up method ()V
I0703 02:50:25.207417 185149 jvm.cpp:590] Looking up method 
(Lorg/apache/zookeeper/server/persistence/FileTxnSnapLog;Lorg/apache/zookeeper/server/ZooKeeperServer$DataTreeBuilder;)V
*** Aborted at 1719975025 (unix time) try "date -d @1719975025" if you are 
using GNU date ***
PC: @ 0x7f5edf914ccd OopStorage::Block::release_entries()
*** SIGSEGV (@0x238) received by PID 185149 (TID 0x7f5f5a15cb40) from PID 568; 
stack trace: ***
@ 0x7f5edf923929 os::Linux::chained_handler()
@ 0x7f5edf92963b JVM_handle_linux_signal
@ 0x7f5edf91c1dc signalHandler()
@ 0x7f5f5b7af420 (unknown)
@ 0x7f5edf914ccd OopStorage::Block::release_entries()
@ 0x7f5edf914f26 OopStorage::release()
@ 0x7f5edf617b21 jni_DeleteGlobalRef
@ 0x7f5f6aeffaf2 JNIEnv_::DeleteGlobalRef()
@ 0x7f5f6aefdc3a Jvm::deleteGlobalRef()
@ 0x55f0ed05a2ea Jvm::Object::~Object()
@ 0x55f0ed05f110 
org::apache::zookeeper::server::ZooKeeperServer::DataTreeBuilder::~DataTreeBuilder()
@ 0x55f0ed061a14 
org::apache::zookeeper::server::ZooKeeperServer::BasicDataTreeBuilder::~BasicDataTreeBuilder()
@ 0x55f0ed05da2f 
mesos::internal::tests::ZooKeeperTestServer::ZooKeeperTestServer()
@ 0x55f0eba05ef6 mesos::internal::tests::ZooKeeperTest::ZooKeeperTest()
@ 0x55f0eba0823f 
mesos::internal::tests::LogZooKeeperTest::LogZooKeeperTest()
@ 0x55f0eba08350 
mesos::internal::tests::LogZooKeeperTest_WriteRead_Test::LogZooKeeperTest_WriteRead_Test()
@ 0x55f0eba73252 testing::internal::TestFactoryImpl<>::CreateTest()
@ 0x55f0ed0a42bc 
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@ 0x55f0ed09d88d 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x55f0ed079ac5 testing::TestInfo::Run()
@ 0x55f0ed07a1cd testing::TestCase::Run()
@ 0x55f0ed081567 testing::internal::UnitTestImpl::RunAllTests()
@ 0x55f0ed0a54ea 
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@ 0x55f0ed09e3f3 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x55f0ed08007f testing::UnitTest::Run()
@ 0x55f0eba90e05 RUN_ALL_TESTS()
@ 0x55f0eba907cc main
@ 0x7f5f5b5cd083 __libc_start_main
@ 0x55f0eaad675e _start
{noformat}


> LogZooKeeperTest.WriteRead can segfault
> ---
>
> Key: MESOS-9045
> URL: https://issues.apache.org/jira/browse/MESOS-9045
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.1
> Environment: macOS
>Reporter: Jan Schlicht
>Priority: Major
>  Labels: flaky-test, segfault
>
> The following segfault occured when testing the {{1.5.x}} branch (SHA 
> {{64341865d}}) on macOS:
> {noformat}
> [ RUN  ] LogZooKeeperTest.WriteRead
> I0702 00:49:46.259831 2560127808 jvm.cpp:590] Looking up method 
> (Ljava/lang/String;)V
> I0702 00:49:46.260002 2560127808 jvm.cpp:590] Looking up method 
> deleteOnExit()V
> I0702 00:49:46.260550 2560127808 jvm.cpp:590] Looking up method 
> (Ljava/io/File;Ljava/io/File;)V
> log4j:WARN No appenders could be found for logger 
> (org.apache.zookeeper.server.persistence.FileTxnSnapLog).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> 

[jira] [Assigned] (MESOS-8867) CMake: Bundled libevent v2.1.5-beta doesn't compile with OpenSSL 1.1.0

2024-07-02 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-8867:
--

Assignee: Jason Zhou

> CMake: Bundled libevent v2.1.5-beta doesn't compile with OpenSSL 1.1.0
> --
>
> Key: MESOS-8867
> URL: https://issues.apache.org/jira/browse/MESOS-8867
> Project: Mesos
>  Issue Type: Bug
>  Components: cmake
> Environment: Fedora 28 with OpenSSL 1.1.0h, {{cmake -G Ninja -D 
> ENABLE_LIBEVENT=ON -D ENABLE_SSL=ON}}
>Reporter: Jan Schlicht
>Assignee: Jason Zhou
>Priority: Major
>
> Compiling libevent 2.1.5 beta with OpenSSL 1.1.0 fails with errors like
> {noformat}
> /home/vagrant/mesos/build/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
>  In function ‘bio_bufferevent_new’:
> /home/vagrant/mesos/build/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3:
>  error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’}
>   b->init = 0;
>^~
> {noformat}
> As this is the version currently bundled by CMake, builds with 
> {{ENABLE_LIBEVENT=ON, ENABLE_SSL=ON}} will fail to compile.
> Libevent supports OpenSSL 1.1.0 beginning with v2.1.7-rc (see 
> https://github.com/libevent/libevent/pull/397)
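
For context, this error is a symptom of OpenSSL 1.1.0 making its core structs 
opaque. A minimal sketch of the incompatibility (illustrative only, not 
libevent's actual patch):

{noformat}
#include <openssl/bio.h>

// OpenSSL 1.1.0 made `struct bio_st` an incomplete type, so direct field
// access no longer compiles; accessor functions must be used instead.
static void markBioInitialized(BIO* b)
{
#if OPENSSL_VERSION_NUMBER < 0x10100000L
  b->init = 1;          // Pre-1.1.0: BIO fields are directly visible.
#else
  BIO_set_init(b, 1);   // 1.1.0+: must go through the accessor.
#endif
}
{noformat}

Libevent gained OpenSSL 1.1.0 support in v2.1.7-rc, which is why upgrading the 
bundled version fixes the build.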



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (MESOS-7187) Master can neglect to update agent metadata in a re-registration corner case.

2024-04-22 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839811#comment-17839811
 ] 

Benjamin Mahler commented on MESOS-7187:


Added a mitigation of the bug I commented on above: 
https://github.com/apache/mesos/pull/558
It does not fix the overall issue here, due to the lack of a connection 
construct, but it prevents the agent from getting stuck sending TASK_DROPPED 
for all incoming tasks.

> Master can neglect to update agent metadata in a re-registration corner case.
> -
>
> Key: MESOS-7187
> URL: https://issues.apache.org/jira/browse/MESOS-7187
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: tech-debt
>
> If the agent is re-registering with the master for the first time, the master 
> will drop any re-registration messages that arrive while the registry 
> operation is in progress.
> These dropped messages can have different metadata (e.g. version, 
> capabilities, etc) that gets dropped. Since the master doesn't distinguish 
> between different instances of the agent (both share the same UPID and there 
> is no instance identifying information), the master can't tell whether this 
> is a retry from the original instance of the agent or a re-registration from 
> a new instance of the agent.
> The following is an example:
> (1) Master restarts.
> (2) Agent re-registers with OLD_VERSION / OLD_CAPABILITIES.
> (3) While registry operation is in progress, agent is upgraded and 
> re-registers with NEW_VERSION / NEW_CAPABILITIES.
> (4) Registry operation completes, new agent receives the re-registration 
> acknowledgement message and so, does not retry.
> (5) Now, the master's memory reflects OLD_VERSION / OLD_CAPABILITIES for the 
> agent which remains inconsistent until a later re-registration occurs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (MESOS-7187) Master can neglect to update agent metadata in a re-registration corner case.

2024-04-12 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836731#comment-17836731
 ] 

Benjamin Mahler commented on MESOS-7187:


Observed an actual instance of this; it occurred due to the following sequence:

1. ZK session expired
2. Master failover
3. Agent run 1 sends re-registration message to new master with UUID 1.
4. Agent fails over (for upgrade)
5. Agent run 2 sends re-registration message to new master
6. Master receives run 1 re-registration message.
7. Master ignores run 2 re-registration message (as agent is already 
re-registering).
8. Master completes re-registration and stores resource UUID 1 and notifies 
agent.
9. Agent receives re-registration completion, sends resource update with UUID 2.
10. Master *does not update* the agent's resource UUID (not because it ignores 
the update message, but because the logic simply doesn't make any update to it, 
which looks like a bug), so it remains UUID 1.

At this point, any tasks launched on the agent will go to TASK_DROPPED due to 
"Task assumes outdated resource state". The agent must be restarted at this 
point to fix the issue.


> Master can neglect to update agent metadata in a re-registration corner case.
> -
>
> Key: MESOS-7187
> URL: https://issues.apache.org/jira/browse/MESOS-7187
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: tech-debt
>
> If the agent is re-registering with the master for the first time, the master 
> will drop any re-registration messages that arrive while the registry 
> operation is in progress.
> These dropped messages can have different metadata (e.g. version, 
> capabilities, etc) that gets dropped. Since the master doesn't distinguish 
> between different instances of the agent (both share the same UPID and there 
> is no instance identifying information), the master can't tell whether this 
> is a retry from the original instance of the agent or a re-registration from 
> a new instance of the agent.
> The following is an example:
> (1) Master restarts.
> (2) Agent re-registers with OLD_VERSION / OLD_CAPABILITIES.
> (3) While registry operation is in progress, agent is upgraded and 
> re-registers with NEW_VERSION / NEW_CAPABILITIES.
> (4) Registry operation completes, new agent receives the re-registration 
> acknowledgement message and so, does not retry.
> (5) Now, the master's memory reflects OLD_VERSION / OLD_CAPABILITIES for the 
> agent which remains inconsistent until a later re-registration occurs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (MESOS-10208) Support multiple I/O threads in libprocess for libev.

2021-01-22 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10208:
---

Assignee: Benjamin Mahler

> Support multiple I/O threads in libprocess for libev.
> -
>
> Key: MESOS-10208
> URL: https://issues.apache.org/jira/browse/MESOS-10208
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> The current approach to I/O in libprocess, with a single thread performing 
> both the I/O polling and I/O syscalls, cannot keep up with the I/O load on 
> massive scale mesos clusters (which use libev rather than libevent).
> Libev supports running multiple loops within a process, so it is possible to 
> support a customizable number of I/O threads in libprocess, at least for 
> libev to start.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10208) Support multiple I/O threads in libprocess for libev.

2021-01-05 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10208:
---

 Summary: Support multiple I/O threads in libprocess for libev.
 Key: MESOS-10208
 URL: https://issues.apache.org/jira/browse/MESOS-10208
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Benjamin Mahler


The current approach to I/O in libprocess, with a single thread performing both 
the I/O polling and I/O syscalls, cannot keep up with the I/O load on massive 
scale mesos clusters (which use libev rather than libevent).

Libev supports running multiple loops within a process, so it is possible to 
support a customizable number of I/O threads in libprocess, at least for libev 
to start.
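
To illustrate the libev capability being referenced, here is a minimal sketch 
(one loop per thread created via ev_loop_new; this is a sketch of the idea, 
not the libprocess implementation):

{noformat}
#include <ev.h>

#include <thread>
#include <vector>

int main()
{
  const size_t numIoThreads = 4; // Would be made configurable.

  std::vector<struct ev_loop*> loops;
  std::vector<std::thread> threads;

  for (size_t i = 0; i < numIoThreads; ++i) {
    // Each thread gets its own independent event loop; sockets would be
    // assigned to loops (e.g. round-robin) so that polling and I/O
    // syscalls are spread across threads instead of funneled through one.
    struct ev_loop* loop = ev_loop_new(EVFLAG_AUTO);
    loops.push_back(loop);

    // ev_run() services the loop's watchers; with no watchers registered
    // it returns immediately, so real code would register socket (ev_io)
    // watchers before or while running the loop.
    threads.emplace_back([loop]() { ev_run(loop, 0); });
  }

  for (std::thread& t : threads) { t.join(); }
  for (struct ev_loop* loop : loops) { ev_loop_destroy(loop); }
}
{noformat}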



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10186) Segmentation fault while running mesos in SSL mode

2020-10-08 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210343#comment-17210343
 ] 

Benjamin Mahler commented on MESOS-10186:
-

[~jpreddy] have you looked at what the error:1408F09C:SSL 
routines:ssl3_get_record:http request error means?

It sounds like something is trying to speak the wrong protocol. My suggestion 
would be to take a look at the bytes being sent, to see if they look right for 
TLS.
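
As a concrete illustration of "look at the bytes": a TLS connection should 
begin with a handshake record, whose first byte is the content type 0x16 
followed by a 0x03 version byte, whereas plaintext HTTP begins with ASCII like 
"GET " (which is what OpenSSL's "http request" error indicates it saw). A 
hedged sketch of such a check (a hypothetical helper, not libprocess code):

{noformat}
#include <sys/socket.h>

#include <cstdint>

// Peek at the first bytes a client sends and report whether they look
// like the start of a TLS handshake record. MSG_PEEK leaves the bytes
// in the socket buffer for the real reader.
bool looksLikeTls(int fd)
{
  uint8_t buf[2] = {0, 0};
  ssize_t n = recv(fd, buf, sizeof(buf), MSG_PEEK);
  return n == 2 && buf[0] == 0x16 && buf[1] == 0x03;
}
{noformat}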

> Segmentation fault while running mesos in SSL mode
> --
>
> Key: MESOS-10186
> URL: https://issues.apache.org/jira/browse/MESOS-10186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0, 1.9.0
>Reporter: Jaya Prasad Reddy G
>Priority: Blocker
>
> Hello,
> I've been running into segmentation faults while running mesos in SSL mode.
> After backtracing the coredump, this is what I found:
> {code:java}
> #0  0x7f93060b7592 in free () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9303ebe8cd in CRYPTO_free () from 
> /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #2  0x7f9303f7bfde in EVP_CIPHER_CTX_cleanup () from 
> /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #3  0x7f93042e4275 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #4  0x7f93042e59ca in SSL_set_accept_state () from 
> /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #5  0x7f93059a5c72 in ?? () from 
> /usr/lib/x86_64-linux-gnu/libevent_openssl-2.0.so.5
> #6  0x7f930a801a74 in 
> process::network::internal::LibeventSSLSocketImpl::accept_SSL_callback 
> (request=request@entry=0x7f92f8000a80) at 
> src/posix/libevent/libevent_ssl_socket.cpp:1172
> #7  0x7f930a8021ea in 
> process::network::internal::LibeventSSLSocketImpl::accept_callback 
> (this=this@entry=0x55569161b430, request=request@entry=0x7f92f8000a80) at 
> src/posix/libevent/libevent_ssl_socket.cpp:1124
> #8  0x7f930a8027bc in 
> process::network::internal::LibeventSSLSocketImpl::<lambda(evconnlistener*, 
> int, sockaddr*, int, void*)>::operator()(evconnlistener *, int, sockaddr *, 
> void *, int) (listener=0x555691667980, socket=24, 
> addr=, arg=0x55569161be30, addr_length=, 
> __closure=) at src/posix/libevent/libevent_ssl_socket.cpp:988
> #9  0x7f9305e0829c in ?? () from 
> /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
> #10 0x7f9305dfa639 in event_base_loop () from 
> /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
> #11 0x7f930a81c41d in process::EventLoop::run () at 
> src/posix/libevent/libevent.cpp:98
> #12 0x7f93066cbd00 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #13 0x7f930699c6ba in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #14 0x7f930613a60d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> {code}
> The environment variables exported before running mesos are:
> {code:java}
> MESOS_SSL_CIPHERS=ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA
>  
> LIBPROCESS_SSL_ENABLED=true 
> LIBPROCESS_SSL_SUPPORT_DOWNGRADE=false
> LIBPROCESS_SSL_CA_DIR=/opt/mesos/secrets/certs/
> LIBPROCESS_SSL_KEY_FILE=/opt/mesos/secrets/certs/mesos.private_key 
> LIBPROCESS_SSL_CERT_FILE=/opt/mesos/secrets/certs/mesos.certificate 
> LIBPROCESS_SSL_VERIFY_CERT=false
> LIBPROCESS_SSL_REQUIRE_CERT=false 
> LIBPROCESS_SSL_VERIFY_IPADD=false 
> LIBPROCESS_SSL_CA_FILE=/etc/ssl_ca/ca_list.pem 
> LIBPROCESS_SSL_CIPHERS=ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA
>  
> LIBPROCESS_SSL_ENABLE_SSL_V3=false 
> LIBPROCESS_SSL_ENABLE_TLS_V1_0=false 
> LIBPROCESS_SSL_ENABLE_TLS_V1_1=false 
> LIBPROCESS_SSL_ENABLE_TLS_V1_2=true 
> ZOOKEEPER_SSL_ENABLED=0 
> ZOOKEEPER_SSL_VERIFY_CERT=false
> ZOOKEEPER_SSL_REQUIRE_CERT=false 
> ZOOKEEPER_SSL_CA_FILE=/etc/ssl_ca/ca_list.pem 
> ZOOKEEPER_SSL_KEY_FILE=/opt/mesos/secrets/certs/mesos.private_key 
> 

[jira] [Created] (MESOS-10191) Finish 1.10.0 release process.

2020-09-30 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10191:
---

 Summary: Finish 1.10.0 release process.
 Key: MESOS-10191
 URL: https://issues.apache.org/jira/browse/MESOS-10191
 Project: Mesos
  Issue Type: Task
  Components: release
Reporter: Benjamin Mahler


The 1.10.0 release needs the following:

* Update homebrew to 1.10.0
* Add 1.10.0 on bintray
* Add a blog post for the release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10143) Outstanding Offers accumulating

2020-09-30 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10143:
---

  Assignee: Benjamin Mahler
Resolution: Not A Problem

> Outstanding Offers accumulating
> ---
>
> Key: MESOS-10143
> URL: https://issues.apache.org/jira/browse/MESOS-10143
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler driver
>Affects Versions: 1.7.0
> Environment: Mesos Version 1.7.0
> JDK 8.0
>Reporter: Puneet Kumar
>Assignee: Benjamin Mahler
>Priority: Minor
>
> We manage an Apache Mesos cluster version 1.7.0. We have written a framework 
> in Java that schedules tasks to Mesos master at a rate of 300 TPS. Everything 
> works fine for almost 24 hours but then outstanding offers accumulate & 
> saturate within 15 minutes. Outstanding offers aren't reclaimed by Mesos 
> master. We observe "RescindOffer" messages in verbose (GLOG v=3) framework 
> logs but outstanding offers don't reduce. New resources aren't offered to 
> framework when outstanding offers saturate. We have to restart the scheduler 
> to reset outstanding offers to zero.
> Any suggestions to debug this issue are welcome.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts

2020-09-24 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201746#comment-17201746
 ] 

Benjamin Mahler commented on MESOS-10190:
-

cc [~qianzhang]

> libprocess fails with "Failed to obtain the IP address for " when using 
> CNI on some hosts
> ---
>
> Key: MESOS-10190
> URL: https://issues.apache.org/jira/browse/MESOS-10190
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0
>Reporter: acecile555
>Priority: Major
>
> Hello,
>  
> We deployed CNI support, and 3 of our hosts (all identical) are failing to 
> start containers with CNI enabled. The log file is:
> {noformat}
> E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> So I tried enforcing LIBPROCESS_IP using an env variable, but I saw Mesos 
> overwrites it. So I rebuilt Mesos with additional debugging and here is the 
> log:
> {noformat}
> Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to 
> '0.0.0.0'
> E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I 
> tried to understand why libprocess attempts to resolve a container run uuid 
> instead of the hostname; here is the libprocess code:
>  
> {noformat}
> // Resolve the hostname if ip is 0.0.0.0 in case we actually have
> // a valid external IP address. Note that we need only one IP
> // address, so that other processes can send and receive and
> // don't get confused as to whom they are sending to.
> if (__address__.ip.isAny()) {
>   char hostname[512];
>
>   if (gethostname(hostname, sizeof(hostname)) < 0) {
>     PLOG(FATAL) << "Failed to initialize, gethostname";
>   }
>
>   // Lookup an IP address of local hostname, taking the first result.
>   Try<net::IP> ip = net::getIP(hostname, __address__.ip.family());
>
>   if (ip.isError()) {
>     EXIT(EXIT_FAILURE)
>       << "Failed to obtain the IP address for '" << hostname << "';"
>       << " the DNS service may not be able to resolve it: " << ip.error();
>   }
>
>   __address__.ip = ip.get();
> }
> {noformat}
>  
> Well actually this is perfectly fine, except that "gethostname" returns the 
> container UUID instead of a valid host IP address. How is that even 
> possible?
>  
> Any help would be greatly appreciated.
> Regards, Adam.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts

2020-09-24 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10190:
---

Assignee: Benjamin Mahler

> libprocess fails with "Failed to obtain the IP address for " when using 
> CNI on some hosts
> ---
>
> Key: MESOS-10190
> URL: https://issues.apache.org/jira/browse/MESOS-10190
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0
>Reporter: acecile555
>Assignee: Benjamin Mahler
>Priority: Major
>
> Hello,
>  
> We deployed CNI support, and 3 of our hosts (all identical) are failing to 
> start containers with CNI enabled. The log file is:
> {noformat}
> E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> So I tried enforcing LIBPROCESS_IP using an env variable, but I saw Mesos 
> overwrites it. So I rebuilt Mesos with additional debugging and here is the 
> log:
> {noformat}
> Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to 
> '0.0.0.0'
> E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I 
> tried to understand why libprocess attempts to resolve a container run uuid 
> instead of the hostname; here is the libprocess code:
>  
> {noformat}
> // Resolve the hostname if ip is 0.0.0.0 in case we actually have
> // a valid external IP address. Note that we need only one IP
> // address, so that other processes can send and receive and
> // don't get confused as to whom they are sending to.
> if (__address__.ip.isAny()) {
>   char hostname[512];
>
>   if (gethostname(hostname, sizeof(hostname)) < 0) {
>     PLOG(FATAL) << "Failed to initialize, gethostname";
>   }
>
>   // Lookup an IP address of local hostname, taking the first result.
>   Try<net::IP> ip = net::getIP(hostname, __address__.ip.family());
>
>   if (ip.isError()) {
>     EXIT(EXIT_FAILURE)
>       << "Failed to obtain the IP address for '" << hostname << "';"
>       << " the DNS service may not be able to resolve it: " << ip.error();
>   }
>
>   __address__.ip = ip.get();
> }
> {noformat}
>  
> Well actually this is perfectly fine, except that "gethostname" returns the 
> container UUID instead of a valid host IP address. How is that even 
> possible?
>  
> Any help would be greatly appreciated.
> Regards, Adam.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts

2020-09-24 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10190:
---

Assignee: (was: Benjamin Mahler)

> libprocess fails with "Failed to obtain the IP address for " when using 
> CNI on some hosts
> ---
>
> Key: MESOS-10190
> URL: https://issues.apache.org/jira/browse/MESOS-10190
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0
>Reporter: acecile555
>Priority: Major
>
> Hello,
>  
> We deployed CNI support, and 3 of our hosts (all identical) are failing to 
> start containers with CNI enabled. The log file is:
> {noformat}
> E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> So I tried enforcing LIBPROCESS_IP using an env variable, but I saw Mesos 
> overwrites it. So I rebuilt Mesos with additional debugging and here is the 
> log:
> {noformat}
> Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to 
> '0.0.0.0'
> E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I 
> tried to understand why libprocess attempts to resolve a container run uuid 
> instead of the hostname; here is the libprocess code:
>  
> {noformat}
> // Resolve the hostname if ip is 0.0.0.0 in case we actually have
> // a valid external IP address. Note that we need only one IP
> // address, so that other processes can send and receive and
> // don't get confused as to whom they are sending to.
> if (__address__.ip.isAny()) {
>   char hostname[512];
>
>   if (gethostname(hostname, sizeof(hostname)) < 0) {
>     PLOG(FATAL) << "Failed to initialize, gethostname";
>   }
>
>   // Lookup an IP address of local hostname, taking the first result.
>   Try<net::IP> ip = net::getIP(hostname, __address__.ip.family());
>
>   if (ip.isError()) {
>     EXIT(EXIT_FAILURE)
>       << "Failed to obtain the IP address for '" << hostname << "';"
>       << " the DNS service may not be able to resolve it: " << ip.error();
>   }
>
>   __address__.ip = ip.get();
> }
> {noformat}
>  
> Well actually this is perfectly fine, except that "gethostname" returns the 
> container UUID instead of a valid host IP address. How is that even 
> possible?
>  
> Any help would be greatly appreciated.
> Regards, Adam.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10182) Mesos failed to build due to error C1083: Cannot open include file: 'csi/state.pb.h': No such file or directory on windows with MSVC

2020-09-24 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201745#comment-17201745
 ] 

Benjamin Mahler commented on MESOS-10182:
-

[~QuellaZhang] [~934341445] can you tell us more about how you're using mesos 
on windows? We currently don't have a lot of resources to put on active windows 
development. If you can submit PRs to fix this, that will help get this 
resolved.

cc [~asekretenko] for the re2 error.

> Mesos failed to build due to error C1083: Cannot open include file: 
> 'csi/state.pb.h': No such file or directory on windows with MSVC
> 
>
> Key: MESOS-10182
> URL: https://issues.apache.org/jira/browse/MESOS-10182
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
>Reporter: QuellaZhang
>Priority: Major
> Attachments: build.log
>
>
> Hi All,
> I tried to build Mesos on Windows with VS2019. It failed to build due to 
> error C1083: Cannot open include file: 'csi/state.pb.h': No such file or 
> directory on Windows using MSVC. It can be reproduced on the latest revision 
> d4634f4 on the master branch. Could you please take a look at this issue? 
> Thanks a lot!
>  
> *Reproduce steps:*
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos 
> F:\gitP\apache\mesos
> 2. Open a VS 2019 x64 command prompt as admin and browse to 
> F:\gitP\apache\mesos
> 3. mkdir build_amd64 && pushd build_amd64
> 4. cmake -G "Visual Studio 16 2019" -A x64 
> -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 ..
> 5. set CL=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING %CL%
> 6. msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln 
> /t:Rebuild
> *ErrorMessage:*
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file F:\gitP\apache\mesos\src\slave\csi_server.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher_tracker.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\composing.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file F:\gitP\apache\mesos\src\slave\slave.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\task_status_update_manager.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10186) Segmentation fault while running mesos in SSL mode

2020-09-24 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201743#comment-17201743
 ] 

Benjamin Mahler commented on MESOS-10186:
-

There seems to be something unique to your environment, since this isn't being 
reported more broadly (this is the only report of this crash as far as I'm 
aware). Can you give us more information to try to understand why this is 
happening only for you? When did you first encounter this? Which versions of 
the libraries were working before you upgraded and started seeing this crash?

For what it's worth, here's an example of a working master from one of our 
clusters:

{noformat}
$ ldd 
/opt/mesosphere/packages/mesos--694f19b8312fb86ff1d095c71497cd7f538e4ba1/bin/mesos-master
linux-vdso.so.1 =>  (0x7ffed2d7f000)
libmesos-1.8.2.so => /opt/mesosphere/lib/libmesos-1.8.2.so 
(0x7f3a263e5000)
libdl.so.2 => /lib64/libdl.so.2 (0x7f3a261db000)
libevent_core-2.0.so.5 => /opt/mesosphere/lib/libevent_core-2.0.so.5 
(0x7f3a25fa8000)
librt.so.1 => /lib64/librt.so.1 (0x7f3a25da)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x7f3a25a97000)
libm.so.6 => /lib64/libm.so.6 (0x7f3a25795000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f3a2557f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7f3a25362000)
libc.so.6 => /lib64/libc.so.6 (0x7f3a24f94000)
/lib64/ld-linux-x86-64.so.2 (0x55f76698)
libevent-2.0.so.5 => 
/opt/mesosphere/packages/libevent--a158693c2ee9f62e39d67980959f0952494b4313/lib/libevent-2.0.so.5
 (0x7f3a24d3d000)
libseccomp.so.2 => 
/opt/mesosphere/packages/libseccomp--e22c027075351745483adc9c10058ec376c65d3f/lib/libseccomp.so.2
 (0x7f3a24aec000)
libz.so.1 => /lib64/libz.so.1 (0x7f3a248d6000)
libsvn_delta-1.so.1 => /opt/mesosphere/lib/libsvn_delta-1.so.1 
(0x7f3a246c2000)
libsvn_subr-1.so.1 => /opt/mesosphere/lib/libsvn_subr-1.so.1 
(0x7f3a2439e000)
libevent_openssl-2.0.so.5 => 
/opt/mesosphere/packages/libevent--a158693c2ee9f62e39d67980959f0952494b4313/lib/libevent_openssl-2.0.so.5
 (0x7f3a24197000)
libssl.so.1.0.0 => /opt/mesosphere/lib/libssl.so.1.0.0 
(0x7f3a23f26000)
libcrypto.so.1.0.0 => /opt/mesosphere/lib/libcrypto.so.1.0.0 
(0x7f3a23ab4000)
libsasl2.so.2 => /opt/mesosphere/lib/libsasl2.so.2 (0x7f3a23899000)
libevent_pthreads-2.0.so.5 => 
/opt/mesosphere/packages/libevent--a158693c2ee9f62e39d67980959f0952494b4313/lib/libevent_pthreads-2.0.so.5
 (0x7f3a23695000)
libcurl.so.4 => 
/opt/mesosphere/packages/curl--e1ad33f11031799e5da6452429494e02e3198a45/lib/libcurl.so.4
 (0x7f3a23427000)
libapr-1.so.0 => /opt/mesosphere/lib/libapr-1.so.0 (0x7f3a231f5000)
libaprutil-1.so.0 => /opt/mesosphere/lib/libaprutil-1.so.0 
(0x7f3a22fcd000)
libexpat.so.1 => /lib64/libexpat.so.1 (0x7f3a22da3000)
libsqlite3.so.0 => /lib64/libsqlite3.so.0 (0x7f3a22aed000)
libuuid.so.1 => /lib64/libuuid.so.1 (0x7f3a228e8000)
libcrypt.so.1 => /lib64/libcrypt.so.1 (0x7f3a226b)
libfreebl3.so => /lib64/libfreebl3.so (0x7f3a224ad000)

$ sudo xargs -0 -L1 -a /proc/17934/environ | grep LIBPROCESS
LIBPROCESS_SSL_CA_FILE=/run/dcos/pki/CA/ca-bundle.crt
LIBPROCESS_NUM_WORKER_THREADS=16
LIBPROCESS_SSL_VERIFY_CERT=false
LIBPROCESS_SSL_KEY_FILE=/run/dcos/pki/tls/private/mesos-master.key
LIBPROCESS_SSL_REQUIRE_CERT=false
LIBPROCESS_SSL_CIPHERS=ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:AES128-SHA:AES256-SHA
LIBPROCESS_SSL_ENABLED=true
LIBPROCESS_SSL_CERT_FILE=/run/dcos/pki/tls/certs/mesos-master.crt
LIBPROCESS_SSL_SUPPORT_DOWNGRADE=false
{noformat}

> Segmentation fault while running mesos in SSL mode
> --
>
> Key: MESOS-10186
> URL: https://issues.apache.org/jira/browse/MESOS-10186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0, 1.9.0
>Reporter: Jaya Prasad Reddy G
>Priority: Blocker
>
> Hello,
> I've been running into segmentation faults while running mesos in SSL mode.
> After backtracing the coredump, this is what I found:
> {code:java}
> #0  0x7f93060b7592 in free () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9303ebe8cd in CRYPTO_free () from 
> /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #2  0x7f9303f7bfde in EVP_CIPHER_CTX_cleanup () from 
> /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #3  0x7f93042e4275 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #4  0x7f93042e59ca in SSL_set_accept_state () from 
> /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #5  0x7f93059a5c72 in ?? () from 
> 

[jira] [Comment Edited] (MESOS-10164) mesos-dns only showing the 2nd network

2020-09-24 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201729#comment-17201729
 ] 

Benjamin Mahler edited comment on MESOS-10164 at 9/24/20, 6:56 PM:
---

mesos-dns is not maintained by the mesos developers; please contact the 
mesos-dns project directly.


was (Author: bmahler):
mesos-dns is not maintained by the mesos developers

> mesos-dns only showing the 2nd network
> --
>
> Key: MESOS-10164
> URL: https://issues.apache.org/jira/browse/MESOS-10164
> Project: Mesos
>  Issue Type: Bug
>Reporter: none
>Priority: Major
>
> Have launched a task like this:
> "networks": [
>     { "mode": "container", "name": "cni-apps" },
>     { "mode": "container", "name": "cni-apps-public", "labels": {"CNI_ARGS": "IP=x.x.x.x"} }
> ],
> dig +short only shows me the ip address of the last network.
> curl 
> http://192.168.10.151:5050/tasks?task_id=sendmail.instance-257020f4-b574-11e9-bdb3-0050563001a1._app.7
>  | jq . shows both ip addresses.
> Using v0.7.0-rc2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9609) Master check failure when marking agent unreachable

2020-08-31 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9609:
--

Assignee: Benjamin Mahler

> Master check failure when marking agent unreachable
> ---
>
> Key: MESOS-9609
> URL: https://issues.apache.org/jira/browse/MESOS-9609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: foundations, mesosphere
>
> {code}
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81543313 
> http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81558813 
> master.cpp:5467] Processing DECLINE call for offers: [ 
> 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 
> 5e57f633-a69c-4009-b7
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81569313 
> master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82014210 
> master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82036710 
> registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the 
> registry
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82057210 
> registrar.cpp:552] Successfully updated the registry in 175872ns
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82064211 
> master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 
> hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49
> Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.85196111 
> master.cpp:10018] Check failed: 'framework' Must be non NULL
> Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d  
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830  
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663  
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259  
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14  
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8  
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2  
> mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11  
> process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a  
> process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80  (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba  start_thread
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d  (unknown)
> Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) 
> try "date -d @1520762676" if you are using GNU date ***
> Mar 11 10:04:36 research docker[4503]: PC: @ 0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 
> (TID 0x7f96b986d700) from PID 0; stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2df1390 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c604ce2c 
> google::DumpStackTraceAndExit()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d 
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 
> mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 
> process::ProcessBase::consume()
> Mar 

[jira] [Commented] (MESOS-10143) Outstanding Offers accumulating

2020-07-28 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166505#comment-17166505
 ] 

Benjamin Mahler commented on MESOS-10143:
-

[~puneetku287] It looks like the scheduler native library is getting 
backlogged; this can happen when the scheduler cannot process messages as fast 
as they come in from the master. (In your example I see 
Scheduler::resourceOffers took 6.8 ms, which is long.)

If you want to check this while it's happening the next time, you can hit 
{{http://IP:PORT/metrics/snapshot}} on the scheduler library, and it shouldn't 
return, because the scheduler metrics cannot be computed in a timely manner. 
You can also specify a timeout via 
{{http://IP:PORT/metrics/snapshot?timeout=10secs}} and should see a response 
without the {{scheduler/event_queue_messages}} metric present.

You may want to pin the port of the scheduler library in order to do this, by 
setting LIBPROCESS_PORT=X in your environment before instantiating the library.

> Outstanding Offers accumulating
> ---
>
> Key: MESOS-10143
> URL: https://issues.apache.org/jira/browse/MESOS-10143
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler driver
>Affects Versions: 1.7.0
> Environment: Mesos Version 1.7.0
> JDK 8.0
>Reporter: Puneet Kumar
>Priority: Minor
>
> We manage an Apache Mesos cluster version 1.7.0. We have written a framework 
> in Java that schedules tasks to Mesos master at a rate of 300 TPS. Everything 
> works fine for almost 24 hours but then outstanding offers accumulate & 
> saturate within 15 minutes. Outstanding offers aren't reclaimed by Mesos 
> master. We observe "RescindOffer" messages in verbose (GLOG v=3) framework 
> logs but outstanding offers don't reduce. New resources aren't offered to 
> framework when outstanding offers saturate. We have to restart the scheduler 
> to reset outstanding offers to zero.
> Any suggestions to debug this issue are welcome.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10137) Mesos failed to build due to error C2668 on windows with MSVC

2020-07-27 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10137:
---

Fix Version/s: 1.11.0
 Assignee: Benjamin Mahler
   Resolution: Fixed

Ah, I misread this issue initially. To get it compiling again I committed 
[~chris.d.holman]'s suggestion to disambiguate. It must depend on the compiler 
version, since there wasn't an ambiguity issue in our windows CI.

{noformat}
commit da08b0cc33d3b6b70a507348783a70ac863cb1dd
Author: Benjamin Mahler 
Date:   Mon Jul 27 13:20:54 2020 -0400

Fixed a compilation issue on Windows with os::spawn.

Per MESOS-10137, the call to os::spawn touched here can be seen
as ambiguous by the compiler, given that windows adds a default
argument to os::spawn. Probably we need to remove the default
argument and explicitly implement os::spawn which calls into the
additional windows os::spawn with the extra environment argument.
{noformat}
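
A minimal illustration of this class of ambiguity (simplified signatures, not 
the actual stout declarations): when one overload's extra parameter is 
defaulted, a call supplying only the shared parameters matches both 
declarations, producing C2668.

{noformat}
#include <map>
#include <string>
#include <vector>

namespace os {

int spawn(const std::string& command,
          const std::vector<std::string>& arguments)
{
  return 0;
}

// A Windows-only overload whose extra parameter is defaulted: now a
// two-argument call matches both declarations equally well.
int spawn(const std::string& command,
          const std::vector<std::string>& arguments,
          const std::map<std::string, std::string>& environment = {})
{
  return 0;
}

} // namespace os

int main()
{
  // os::spawn("sh", {"-c", "true"});         // Ambiguous: error C2668.
  return os::spawn("sh", {"-c", "true"}, {}); // Explicit third argument
                                              // disambiguates the call.
}
{noformat}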

> Mesos failed to build due to error C2668 on windows with MSVC
> -
>
> Key: MESOS-10137
> URL: https://issues.apache.org/jira/browse/MESOS-10137
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
>Reporter: LinGao
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: windows
> Fix For: 1.11.0
>
> Attachments: build.log
>
>
> Hi All,
> I tried to build Mesos on Windows with VS2019. It failed to build due to 
> error C2668: 'os::spawn': ambiguous call to overloaded function on Windows 
> using MSVC. It can be reproduced on the latest revision d4634f4 on the 
> master branch. Could you please take a look at this issue? Thanks a lot!
>  
> Reproduce steps:
> 1.  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> F:\gitP\apache\mesos
>  2.  Open a VS 2019 x64 command prompt as admin and browse to 
> F:\gitP\apache\mesos
>  3.  mkdir build_amd64 && pushd build_amd64
> 4.  cmake -G "Visual Studio 16 2019" -A x64 
> -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 ..
> 5.  set _CL_=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING %_CL_%
> 6.  msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln 
> /t:Rebuild
>  
> ErrorMessage:
> F:\gitP\apache\mesos\3rdparty\stout\include\stout/os/windows/shell.hpp(168,68):
>  error C2668: 'os::spawn': ambiguous call to overloaded function (compiling 
> source file F:\gitP\apache\mesos\3rdparty\libprocess\src\authenticator.cpp) 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\libprocess\src\process.vcxproj]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10137) Mesos failed to build due to error C2668 on windows with MSVC

2020-06-09 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129441#comment-17129441
 ] 

Benjamin Mahler commented on MESOS-10137:
-

Strange... can you try changing {{spawn}} to {{process::spawn}} on the spawn 
call in the authenticator.cpp file?

> Mesos failed to build due to error C2668 on windows with MSVC
> -
>
> Key: MESOS-10137
> URL: https://issues.apache.org/jira/browse/MESOS-10137
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
>Reporter: LinGao
>Priority: Major
> Attachments: build.log
>
>
> Hi All,
> I tried to build Mesos on Windows with VS2019. It failed to build due to 
> error C2668: 'os::spawn': ambiguous call to overloaded function on Windows 
> using MSVC. It can be reproduced on the latest revision d4634f4 on the 
> master branch. Could you please take a look at this issue? Thanks a lot!
>  
> Reproduce steps:
> 1.  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> F:\gitP\apache\mesos
>  2.  Open a VS 2019 x64 command prompt as admin and browse to 
> F:\gitP\apache\mesos
>  3.  mkdir build_amd64 && pushd build_amd64
> 4.  cmake -G "Visual Studio 16 2019" -A x64 
> -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 ..
> 5.  set _CL_=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING %_CL_%
> 6.  msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln 
> /t:Rebuild
>  
> ErrorMessage:
> F:\gitP\apache\mesos\3rdparty\stout\include\stout/os/windows/shell.hpp(168,68):
>  error C2668: 'os::spawn': ambiguous call to overloaded function (compiling 
> source file F:\gitP\apache\mesos\3rdparty\libprocess\src\authenticator.cpp) 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\libprocess\src\process.vcxproj]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10133) Task launched with 0 scalar value for persistent volume.

2020-05-26 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10133:
---

 Summary: Task launched with 0 scalar value for persistent volume.
 Key: MESOS-10133
 URL: https://issues.apache.org/jira/browse/MESOS-10133
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Benjamin Mahler


We saw the following task launch message:

{noformat}
I0520 17:58:11.559875  2913 master.cpp:4985] Launching task T of framework 
288ebd4e-8bbf-4f2a-ac3a-2e5eb2885266 (name) with resources 
[{"allocation_info":{"role":"role"},"disk":{"persistence":{"id":"ID","principal":"p"},"volume":{"container_path":"path","mode":"RW"}},"name":"disk","reservations":[{"labels":{"labels":[{"key":"marathon_framework_id","value":"16fa03ca-0048-4124-bdac-aff56e679c95-"},{"key":"marathon_task_id","value":"T"}]},"principal":"p","role":"role","type":"DYNAMIC"}],"scalar":{"value":0.0},"type":"SCALAR"},{"allocation_info":{"role":"role"},"name":"mem","scalar":{"value":544.0},"type":"SCALAR"},{"allocation_info":{"role":"role"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"}]
 on agent 16fa03ca-0048-4124-bdac-aff56e679c95-S49 at slave(1)@IP:5051 (IP) on  
new executor
{noformat}

Here the persistent volume is being used with a 0 scalar value. This should 
have been considered invalid, since we require the entire persistent volume to 
be used; perhaps it gets treated as not being used because the value is 0 
(e.g. cpus:1;foobars:0 == cpus:1).
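
A hedged sketch of the suspected semantics (a toy model, not the actual 
Resources class): if comparison and containment strip zero-valued resources 
first, a 0-sized persistent volume is indistinguishable from no volume at all, 
so validation can pass vacuously.

{noformat}
#include <map>
#include <string>

using Resources = std::map<std::string, double>;

// Drop zero-valued entries before comparing, mimicking the behavior
// described above where cpus:1;foobars:0 == cpus:1.
Resources normalize(const Resources& resources)
{
  Resources result;
  for (const auto& [name, value] : resources) {
    if (value != 0.0) {
      result[name] = value;
    }
  }
  return result;
}

bool equivalent(const Resources& a, const Resources& b)
{
  return normalize(a) == normalize(b);
}

// equivalent({{"cpus", 1}, {"disk", 0}}, {{"cpus", 1}}) is true, which
// is how a 0-scalar persistent volume could slip past a "volume is
// fully used" validation check.
{noformat}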



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10093) Libprocess does not properly escape subprocess argument strings on Windows

2020-05-01 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097699#comment-17097699
 ] 

Benjamin Mahler commented on MESOS-10093:
-

Cleanups have landed. I have to move on to other work, but here are two WIP 
patches of fixes at the stout and libprocess layers to make this easier when 
we pick this back up:

https://reviews.apache.org/r/72460/
https://reviews.apache.org/r/72461/

Also needed is a patch for Mesos in the docker wrapper code and in the 
containerizer launch-related code. Then we need to document for users in 
CommandInfo how shell=true and shell=false work on Windows in particular.
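
For background on why that command breaks: Windows has no argv, so each 
process re-parses a single command-line string, and the launcher must quote 
each argument according to the MSVC runtime's rules. A sketch of the standard 
per-argument quoting algorithm (the general technique, not the WIP patches 
above):

{noformat}
#include <string>

// Quote one argument for the Windows (CommandLineToArgvW / MSVC CRT)
// parsing rules: wrap in double quotes, escape embedded quotes, and
// double any run of backslashes that ends up before a quote.
std::string quoteWindowsArg(const std::string& arg)
{
  std::string out = "\"";
  size_t backslashes = 0;

  for (char c : arg) {
    if (c == '\\') {
      ++backslashes;
    } else if (c == '"') {
      // Backslashes preceding a quote must be doubled, and the quote
      // itself escaped.
      out.append(2 * backslashes + 1, '\\');
      backslashes = 0;
      out += '"';
    } else {
      // Backslashes not preceding a quote are literal.
      out.append(backslashes, '\\');
      backslashes = 0;
      out += c;
    }
  }

  // Backslashes before the closing quote must also be doubled.
  out.append(2 * backslashes, '\\');
  out += '"';
  return out;
}

// quoteWindowsArg("print('hello world')") yields "\"print('hello world')\"",
// so the embedded space survives re-parsing intact.
{noformat}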

> Libprocess does not properly escape subprocess argument strings on Windows
> --
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10093) Libprocess does not properly escape subprocess argument strings on Windows

2020-05-01 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10093:
---

Assignee: (was: Benjamin Mahler)

> Libprocess does not properly escape subprocess argument strings on Windows
> --
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-6568) JSON serialization should not omit empty arrays in HTTP APIs

2020-04-22 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089867#comment-17089867
 ] 

Benjamin Mahler commented on MESOS-6568:


The code is ready to ship but has not been committed yet (as you can see in the 
linked reviews); a user reported that it broke some of their tests, so I've 
been giving them time to fix their tests before I land this.

> JSON serialization should not omit empty arrays in HTTP APIs
> 
>
> Key: MESOS-6568
> URL: https://issues.apache.org/jira/browse/MESOS-6568
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Neil Conway
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere
>
> When using the JSON content type with the HTTP APIs, a {{repeated}} protobuf 
> field is omitted entirely from the JSON serialization of the message. For 
> example, this is a response to the {{GetTasks}} call:
> {noformat}
> {
>   "get_tasks": {
> "tasks": [{...}]
>   },
>   "type": "GET_TASKS"
> }
> {noformat}
> I think it would be better to include empty arrays for the other fields of 
> the message ({{pending_tasks}}, {{completed_tasks}}, etc.). Advantages:
> # Consistency with the old HTTP endpoints, e.g., /state
> # Semantically, an empty array is more accurate. The master's response should 
> be interpreted as saying it doesn't know about any pending/completed tasks; 
> that is more accurately conveyed by explicitly including an empty array, not 
> by omitting the key entirely.
> *NOTE: The 
> [asV1Protobuf|https://github.com/apache/mesos/blob/d10a33acc426dda9e34db995f16450faf898bb3b/src/common/http.cpp#L172-L423]
>  copy needs to also be updated.*
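
For illustration, the proposed serialization of the same {{GetTasks}} response 
with empty arrays included would look like this (the extra field names are the 
ones mentioned above):

{noformat}
{
  "get_tasks": {
    "tasks": [{...}],
    "pending_tasks": [],
    "completed_tasks": []
  },
  "type": "GET_TASKS"
}
{noformat}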



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10113) OpenSSLSocketImpl with 'support_downgrade' waits for incoming bytes before accepting new connection.

2020-04-21 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10113:
---

Assignee: Benjamin Mahler

https://reviews.apache.org/r/72352/

> OpenSSLSocketImpl with 'support_downgrade' waits for incoming bytes before 
> accepting new connection.
> 
>
> Key: MESOS-10113
> URL: https://issues.apache.org/jira/browse/MESOS-10113
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Critical
>
> The accept loop in OpenSSLSocketImpl in the case of {{support_downgrade}} 
> enabled will wait for incoming bytes on the accepted socket before allowing 
> another socket to be accepted. This will lead to significant throughput 
> issues for accepting new connections (e.g. during a master failover), or may 
> block entirely if a client doesn't send any data for whatever reason.
> Marking as a bug due to the potential for blocking incoming connections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10124) OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling for read readiness.

2020-04-21 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10124:
---

Assignee: Benjamin Mahler

> OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling 
> for read readiness.
> 
>
> Key: MESOS-10124
> URL: https://issues.apache.org/jira/browse/MESOS-10124
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: windows
>
> OpenSSLSocket is currently using the zero byte read trick on Windows to poll 
> for read readiness when peeking at the data to determine whether the incoming 
> connection is performing an SSL handshake. However, io::read is designed to 
> provide consistent semantics for a zero byte read across posix and windows, 
> which is to return immediately.
> To fix this, we can either:
> (1) Have different semantics for zero byte io::read on posix / windows, where 
> we just let it fall through to the system calls. This might be confusing for 
> users, but it's unlikely that a caller would perform a zero byte read in 
> typical code so the confusion is probably avoided.
> (2) Implement io::poll for reads on windows. This would make the caller code 
> consistent and is probably less confusing to users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10124) OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling for read readiness.

2020-04-21 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10124:
---

 Summary: OpenSSLSocketImpl on Windows with 'support_downgrade' is 
incorrectly polling for read readiness.
 Key: MESOS-10124
 URL: https://issues.apache.org/jira/browse/MESOS-10124
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler


OpenSSLSocket is currently using the zero byte read trick on Windows to poll 
for read readiness when peeking at the data to determine whether the incoming 
connection is performing an SSL handshake. However, io::read is designed to 
provide consistent semantics for a zero byte read across posix and windows, 
which is to return immediately.

To fix this, we can either:

(1) Have different semantics for zero byte io::read on posix / windows, where 
we just let it fall through to the system calls. This might be confusing for 
users, but it's unlikely that a caller would perform a zero byte read in 
typical code so the confusion is probably avoided.

(2) Implement io::poll for reads on windows. This would make the caller code 
consistent and is probably less confusing to users.
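
For option (2), a minimal sketch of read-readiness polling on Windows using 
the standard {{WSAPoll}} call (plain Win32, not libprocess code; {{pollRead}} 
is a hypothetical helper name):

{code}
#include <winsock2.h>

#pragma comment(lib, "ws2_32.lib")

// Returns true if `s` is readable (or the peer hung up) within
// `timeoutMs`; assumes WSAStartup has already been called.
bool pollRead(SOCKET s, INT timeoutMs)
{
  WSAPOLLFD fd = {};
  fd.fd = s;
  fd.events = POLLRDNORM;

  int result = WSAPoll(&fd, 1, timeoutMs);

  return result == 1 && (fd.revents & (POLLRDNORM | POLLHUP)) != 0;
}
{code}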



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10123) Windows overlapped IO discard handling can drop data.

2020-04-21 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10123:
---

 Summary: Windows overlapped IO discard handling can drop data.
 Key: MESOS-10123
 URL: https://issues.apache.org/jira/browse/MESOS-10123
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


When getting a discard request for an io operation on windows, a cancellation 
is requested [1] and when the io operation completes we check whether the 
future had a discard request to decide whether to discard it [2]:

{code}
template <typename T>
static void set_io_promise(Promise<T>* promise, const T& data, DWORD error)
{
  if (promise->future().hasDiscard()) {
promise->discard();
  } else if (error == ERROR_SUCCESS) {
promise->set(data);
  } else {
promise->fail("IO failed with error code: " + WindowsError(error).message);
  }
}
{code}

However, it's possible the operation completed successfully, in which case we 
did not succeed at canceling it. We need to check for 
{{ERROR_OPERATION_ABORTED}} [3]:

{code}
template <typename T>
static void set_io_promise(Promise<T>* promise, const T& data, DWORD error)
{
  if (promise->future().hasDiscard() && error == ERROR_OPERATION_ABORTED) {
promise->discard();
  } else if (error == ERROR_SUCCESS) {
promise->set(data);
  } else {
promise->fail("IO failed with error code: " + WindowsError(error).message);
  }
}
{code}

I don't think there are currently any major consequences to this issue, since 
most callers tend to be discarding only when they're essentially abandoning the 
entire process of reading or writing.

[1] 
https://github.com/apache/mesos/blob/1.9.0/3rdparty/libprocess/src/windows/libwinio.cpp#L448
[2] 
https://github.com/apache/mesos/blob/1.9.0/3rdparty/libprocess/src/windows/libwinio.cpp#L141-L151
[3] https://docs.microsoft.com/en-us/windows/win32/fileio/cancelioex-func



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU

2020-04-20 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087975#comment-17087975
 ] 

Benjamin Mahler commented on MESOS-10119:
-

Marking as a duplicate of MESOS-8038.

> failure to destroy container can cause the agent to "leak" a GPU
> 
>
> Key: MESOS-10119
> URL: https://issues.apache.org/jira/browse/MESOS-10119
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Charles Natali
>Priority: Major
>
> At work we hit the following problem:
>  # cgroup for a task using the GPU isolation failed to be destroyed on OOM
>  # the agent continued advertising the GPU as available
>  # all subsequent attempts to start tasks using a GPU fails with "Requested 1 
> gpus but only 0 available"
> Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950) so can 
> be tackled separately, however the fact that the agent basically leaks the 
> GPU is pretty bad, because it basically turns into /dev/null, failing all 
> subsequent tasks requesting a GPU.
>  
> See the logs:
>  
>  
> {noformat}
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 
> slave.cpp:6994] Termination of executor 
> 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an 
> isolator when destroying container: Failed to destroy cgroups: Failed to get 
> nested cgroups: Failed to determine canonical path of 
> '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such 
> file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 
> containerizer.cpp:2567] Skipping status for container 
> 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 8ef00748-b640-4620-97dc-f719e9775e88
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 
> slave.cpp:6994] Termination of executor 
> 'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device 
> or resource busy
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 
> containerizer.cpp:2567] Skipping status for container 
> 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 
> slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor 
> 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus 
> but only 0 available
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 
> memory.cpp:637] Listening on OOM events failed for container 
> 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 
> containerizer.cpp:2421] Ignoring update for unknown container 
> 87253521-8d39-47ea-b4d1-febe527d230c
> Apr 17 17:00:08 engpuc006 mesos-slave[2068]: W0417 17:00:08.109067 2154 
> process.cpp:1480] Failed to link to '172.16.22.201:34059', connect: Failed 
> connect: connection closed
> Apr 17 17:00:10 engpuc006 mesos-slave[2068]: E0417 17:00:10.310817 2141 
> 

[jira] [Assigned] (MESOS-10114) OpenSSLSocketImpl with 'support_downgrade' can silently stop accepting sockets.

2020-04-10 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10114:
---

Assignee: Benjamin Mahler

> OpenSSLSocketImpl with 'support_downgrade' can silently stop accepting 
> sockets.
> ---
>
> Key: MESOS-10114
> URL: https://issues.apache.org/jira/browse/MESOS-10114
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> Related to MESOS-10113, the accept code for OpenSSLSocketImpl will return a 
> failed future from the accept loop body if the initial io::read/io::poll 
> fails when checking for downgrade.
> This would cause the accept loop to silently stop, and incoming connections 
> would no longer be accepted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10114) OpenSSLSocketImpl with 'support_downgrade' can silently stop accepting sockets.

2020-04-10 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10114:
---

 Summary: OpenSSLSocketImpl with 'support_downgrade' can silently 
stop accepting sockets.
 Key: MESOS-10114
 URL: https://issues.apache.org/jira/browse/MESOS-10114
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler


Related to MESOS-10113, the accept code for OpenSSLSocketImpl will return a 
failed future from the accept loop body if the initial io::read/io::poll fails 
when checking for downgrade.

This would cause the accept loop to silently stop, and incoming connections 
would no longer be accepted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10113) OpenSSLSocketImpl with 'support_downgrade' waits for incoming bytes before accepting new connection.

2020-04-10 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10113:
---

 Summary: OpenSSLSocketImpl with 'support_downgrade' waits for 
incoming bytes before accepting new connection.
 Key: MESOS-10113
 URL: https://issues.apache.org/jira/browse/MESOS-10113
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler


The accept loop in OpenSSLSocketImpl in the case of {{support_downgrade}} 
enabled will wait for incoming bytes on the accepted socket before allowing 
another socket to be accepted. This will lead to significant throughput issues 
for accepting new connections (e.g. during a master failover), or may block 
entirely if a client doesn't send any data for whatever reason.

Marking as a bug due to the potential for blocking incoming connections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10112) Log peer address during SSL handshake failures.

2020-04-10 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10112:
---

 Summary: Log peer address during SSL handshake failures.
 Key: MESOS-10112
 URL: https://issues.apache.org/jira/browse/MESOS-10112
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


Currently, when accepting a socket from a LibeventSSLSocket or OpenSSLSocket, 
the caller cannot log the peer address in the case of a failure. We do not 
include the peer address in the accept failure, which makes debugging 
more difficult (e.g. who is trying to connect with a bad SSL handshake? who is 
disconnecting in the middle of an SSL handshake?).

This should be backported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10110) Libprocess ignores most protobuf (de)serialisation failure cases.

2020-04-07 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10110:
---

Assignee: Charles Natali

> Libprocess ignores most protobuf (de)serialisation failure cases.
> -
>
> Key: MESOS-10110
> URL: https://issues.apache.org/jira/browse/MESOS-10110
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Charles
>Assignee: Charles Natali
>Priority: Major
>
> Previously, the code didn't check the return value of
>  {{Message::SerializeToString}} at all, which can fail for various reasons,
>  e.g. out-of-memory, message too large, or an invalid UTF-8 string.
>  Also, the way deserialisation was checked for errors using
>  {{Message::IsInitialized}} doesn't detect errors such as the above;
>  we need to check the {{Message::ParseFromString}} return value.
> We noticed this at work because our custom executor had a bug causing it to
> send an invalid/non-UTF8 {{mesos.TaskID}}. It was successfully serialised by
> the executor (driver) and deserialised by the framework, causing it to blow
> up at a later point far from the original source of the problem.
> More generally, we want to catch such invalid messages - which can happen for
> a variety of reasons - as early as possible.
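
A minimal sketch of the checks the fix calls for (generic protobuf C++; 
{{roundTrip}} is a hypothetical helper, not Mesos code):

{code}
#include <iostream>
#include <string>

#include <google/protobuf/message.h>

// Serializes `message` and parses it back into `parsed`, checking both
// return values instead of relying on IsInitialized().
bool roundTrip(
    const google::protobuf::Message& message,
    google::protobuf::Message* parsed)
{
  std::string wire;

  // SerializeToString can fail (e.g. on the cases listed above),
  // so its return value must be checked.
  if (!message.SerializeToString(&wire)) {
    std::cerr << "Serialization failed" << std::endl;
    return false;
  }

  // ParseFromString reports failures that IsInitialized() misses.
  if (!parsed->ParseFromString(wire)) {
    std::cerr << "Deserialization failed" << std::endl;
    return false;
  }

  return true;
}
{code}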



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10110) libprocess: check protobuf (de)serialisation success

2020-04-07 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077628#comment-17077628
 ] 

Benjamin Mahler commented on MESOS-10110:
-

[~Charle] for some reason I seem to be unable to add you to the contributors 
role so that I can assign this to you, filed INFRA-20092.

> libprocess: check protobuf (de)serialisation success
> 
>
> Key: MESOS-10110
> URL: https://issues.apache.org/jira/browse/MESOS-10110
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Charles
>Priority: Major
>
> Previously, the code didn't check the return value of
>  {{Message::SerializeToString}} at all, which can fail for various reasons,
>  e.g. out-of-memory, message too large, or an invalid UTF-8 string.
>  Also, the way deserialisation was checked for errors using
>  {{Message::IsInitialized}} doesn't detect errors such as the above;
>  we need to check the {{Message::ParseFromString}} return value.
> We noticed this at work because our custom executor had a bug causing it to
> send an invalid/non-UTF8 {{mesos.TaskID}}. It was successfully serialised by
> the executor (driver) and deserialised by the framework, causing it to blow
> up at a later point far from the original source of the problem.
> More generally, we want to catch such invalid messages - which can happen for
> a variety of reasons - as early as possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10093) Libprocess does not properly escape subprocess argument strings on Windows

2020-04-01 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073265#comment-17073265
 ] 

Benjamin Mahler commented on MESOS-10093:
-

Did some research and synced with [~greggomann] and [~asekretenko]. On Windows, 
processes can't actually receive an argument array; they're started with a 
full command line string, and it's up to them to parse that command line into 
arguments. Most programs use {{CommandLineToArgvW}}, especially programs 
that are not Windows-only (e.g. a libc, java, or python program). However, 
cmd.exe *does not* use {{CommandLineToArgvW}}, and therefore applying that 
quoting to cmd.exe command lines is not correct.

The fix will be to surface direct command line string utilities (windows only) 
alongside the argv style, and callers will have to choose when to use the 
command line string versions (e.g. if using cmd.exe).

We may need to do something special for docker-style commands (e.g. "docker run 
... "), but we are not sure yet.

Some cleanup patches to prepare for a fix:

https://reviews.apache.org/r/72273/
https://reviews.apache.org/r/72285/
https://reviews.apache.org/r/72286/
https://reviews.apache.org/r/72303/
https://reviews.apache.org/r/72304/
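
For reference, a sketch of the quoting scheme that 
{{CommandLineToArgvW}}-consuming programs expect (the well-known ArgvQuote 
algorithm; illustrative only, not one of the patches above):

{code}
#include <string>

// Quotes a single argument so that CommandLineToArgvW reconstructs it
// verbatim. Does not apply to cmd.exe, which has its own rules.
std::wstring quoteArg(const std::wstring& arg)
{
  // No quoting needed if non-empty and free of whitespace and quotes.
  if (!arg.empty() && arg.find_first_of(L" \t\n\v\"") == std::wstring::npos) {
    return arg;
  }

  std::wstring quoted = L"\"";

  for (auto it = arg.begin(); ; ++it) {
    size_t backslashes = 0;

    while (it != arg.end() && *it == L'\\') {
      ++it;
      ++backslashes;
    }

    if (it == arg.end()) {
      // Backslashes before the closing quote must be doubled so that
      // the quote is not escaped.
      quoted.append(backslashes * 2, L'\\');
      break;
    } else if (*it == L'"') {
      // Backslashes preceding a literal quote are doubled, and the
      // quote itself is escaped.
      quoted.append(backslashes * 2 + 1, L'\\');
      quoted.push_back(*it);
    } else {
      // Backslashes not followed by a quote are literal.
      quoted.append(backslashes, L'\\');
      quoted.push_back(*it);
    }
  }

  quoted.push_back(L'"');
  return quoted;
}
{code}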

> Libprocess does not properly escape subprocess argument strings on Windows
> --
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10042) Mesos UI template not always rendered

2020-03-31 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072205#comment-17072205
 ] 

Benjamin Mahler commented on MESOS-10042:
-

Need more to go on to debug this, as I can't seem to reproduce it. Are there 
any errors in the browser's inspection tools? Are any requests taking very 
long or never getting a response?

> Mesos UI template not always rendered
> -
>
> Key: MESOS-10042
> URL: https://issues.apache.org/jira/browse/MESOS-10042
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.9.0
> Environment: Linux Vivaldi & Firefox
> ubuntu 18.04
>Reporter: Damien Gerard
>Priority: Minor
> Attachments: image-2019-11-27-17-34-29-733.png, 
> image-2019-11-27-17-37-18-679.png, image-2019-11-27-17-39-06-984.png, 
> image-2019-11-27-17-39-16-491.png, image-2019-11-27-17-39-37-341.png, 
> image-2019-11-27-17-39-44-306.png
>
>
> When opening the webui directly or when  switching tabs (by clicking on 
> "Frameworks"/"Agents"/whatever back to the main page), the page is not always 
> rendered (see as below).
>   !image-2019-11-27-17-39-37-341.png!
> Also, the cluster name is never replaced (the same in our mesos 1.6) even if 
> --cluster "some-value" is set
>   !image-2019-11-27-17-39-44-306.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9337) Hook manager implementation is missing mutex acquisition in several places.

2020-03-30 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9337:
--

Assignee: Dong Zhu

> Hook manager implementation is missing mutex acquisition in several places.
> ---
>
> Key: MESOS-9337
> URL: https://issues.apache.org/jira/browse/MESOS-9337
> Project: Mesos
>  Issue Type: Bug
>  Components: modules
>Reporter: Benjamin Mahler
>Assignee: Dong Zhu
>Priority: Major
>
> The hook manager uses a mutex to protect availableHooks from writing during 
> read (probably this should be a read/write mutex).
> However, this mutex is not acquired in many of the reads. For example:
> (mutex acquired)
> https://github.com/apache/mesos/blob/1.7.0/src/hook/manager.cpp#L108-L138
> (mutex not acquired!)
> https://github.com/apache/mesos/blob/1.7.0/src/hook/manager.cpp#L141-L150
> Also, the mutex and map are non-POD statics, which are banned:
> https://github.com/apache/mesos/blob/1.9.0/src/hook/manager.cpp#L50-L51
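
A hedged sketch of the locking discipline the ticket calls for (generic C++17 
with {{std::shared_mutex}}; the class and members here are illustrative, not 
the HookManager code):

{code}
#include <shared_mutex>
#include <string>
#include <unordered_map>

class HookRegistry
{
public:
  void add(const std::string& name, int hook)
  {
    // Writers take the exclusive lock.
    std::unique_lock<std::shared_mutex> lock(mutex_);
    hooks_[name] = hook;
  }

  bool available(const std::string& name) const
  {
    // Readers must also lock; this is what the ticket says is missing
    // from several of the read paths.
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return hooks_.count(name) > 0;
  }

private:
  mutable std::shared_mutex mutex_;
  std::unordered_map<std::string, int> hooks_;
};
{code}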



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9337) Hook manager implementation is missing mutex acquisition in several places.

2020-03-30 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9337:
--

Assignee: (was: Benjamin Mahler)

> Hook manager implementation is missing mutex acquisition in several places.
> ---
>
> Key: MESOS-9337
> URL: https://issues.apache.org/jira/browse/MESOS-9337
> Project: Mesos
>  Issue Type: Bug
>  Components: modules
>Reporter: Benjamin Mahler
>Priority: Major
>
> The hook manager uses a mutex to protect availableHooks from writing during 
> read (probably this should be a read/write mutex).
> However, this mutex is not acquired in many of the reads. For example:
> (mutex acquired)
> https://github.com/apache/mesos/blob/1.7.0/src/hook/manager.cpp#L108-L138
> (mutex not acquired!)
> https://github.com/apache/mesos/blob/1.7.0/src/hook/manager.cpp#L141-L150
> Also, the mutex and map are non-POD statics, which are banned:
> https://github.com/apache/mesos/blob/1.9.0/src/hook/manager.cpp#L50-L51



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9337) Hook manager implementation is missing mutex acquisition in several places.

2020-03-30 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9337:
--

Assignee: Benjamin Mahler

> Hook manager implementation is missing mutex acquisition in several places.
> ---
>
> Key: MESOS-9337
> URL: https://issues.apache.org/jira/browse/MESOS-9337
> Project: Mesos
>  Issue Type: Bug
>  Components: modules
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> The hook manager uses a mutex to protect availableHooks from writing during 
> read (probably this should be a read/write mutex).
> However, this mutex is not acquired in many of the reads. For example:
> (mutex acquired)
> https://github.com/apache/mesos/blob/1.7.0/src/hook/manager.cpp#L108-L138
> (mutex not acquired!)
> https://github.com/apache/mesos/blob/1.7.0/src/hook/manager.cpp#L141-L150
> Also, the mutex and map are non-POD statics, which are banned:
> https://github.com/apache/mesos/blob/1.9.0/src/hook/manager.cpp#L50-L51



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9942) Deprecate framework sorter.

2020-03-24 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9942:
--

Assignee: (was: Meng Zhu)

> Deprecate framework sorter.
> ---
>
> Key: MESOS-9942
> URL: https://issues.apache.org/jira/browse/MESOS-9942
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Given the flat structure of frameworks, there is no need to store and sort 
> frameworks in the sorter tree structure. We should deprecate the framework 
> sorter. This would dedicate the sorter to roles, opening up room for 
> optimization and cleanup. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9943) Dedicate sorter for roles.

2020-03-24 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9943:
--

Assignee: (was: Meng Zhu)

> Dedicate sorter for roles.
> --
>
> Key: MESOS-9943
> URL: https://issues.apache.org/jira/browse/MESOS-9943
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Once MESOS-9942 has landed, we can clean up and optimize the sorter for 
> roles. Specifically, each node in the tree (except the root and virtual leaf 
> node) will carry a back pointer to the role tree structure in the allocator. 
> This will eliminate all the state duplication and unnecessary tracking 
> currently done inside the sorter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-7245) Add a Windows segfault handler for stacktraces

2020-03-16 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-7245:
--

Assignee: Benjamin Mahler

> Add a Windows segfault handler for stacktraces
> --
>
> Key: MESOS-7245
> URL: https://issues.apache.org/jira/browse/MESOS-7245
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joseph Wu
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere, windows
>
> Currently, in the Windows builds, if we segfault, the program will exit 
> immediately with no other output.  
> For example, when you add a failing test to {{stout-tests}} and run 
> {{3rdparty/stout/tests/Debug/stout-tests --gtest_break_on_failure}}, gtest 
> simulates a segfault to exit immediately.  On Posix, this also prints a 
> stacktrace.
> We may be able to re-use the code added to glog for print stacktraces 
> (MESOS-6815).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10103) MasterAPITest.ReservationUpdate crashes on windows.

2020-03-12 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10103:
---

 Summary: MasterAPITest.ReservationUpdate crashes on windows.
 Key: MESOS-10103
 URL: https://issues.apache.org/jira/browse/MESOS-10103
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


The windows CI was crashing in this test, adding stack traces revealed the 
following:

{noformat}
E 00:00:00.00  5928 logging.cpp:308] RAW:   
google::protobuf::internal::RepeatedPtrIterator::operator* 
[7FF676253FEE+14] 
(c:\users\administrator\workspace\mesos\mesos_ci_windows-build-wip\mesos\build\3rdparty\protobuf-
3.5.0\src\protobuf-3.5.0\src\google\protobuf\repeated_field.h:2266)
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[00D03DFFCA40+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   
mesos::authorization::ActionObject::reserve [7FF6778AFE40+720] 
(c:\users\administrator\workspace\mesos\mesos_ci_windows-build-wip\mesos\src\master\authorization.cpp:236)
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[00D03DFFC8A8+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[00D03DFFC9E8+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[00D03DFFC558+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[02ADCACC50F0+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[02ADC8AA0690+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[02ADC8AA0790+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[02ADC8AA0790+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[00D03DFFC2D8+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 00:00:00.00  5928 logging.cpp:308] RAW:   (No symbol) 
[+0]
E 

[jira] [Commented] (MESOS-10102) MasterAPITest.ReservationUpdate is flaky

2020-03-12 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058265#comment-17058265
 ] 

Benjamin Mahler commented on MESOS-10102:
-

[~asekretenko] can you add the patch you committed so far here?

> MasterAPITest.ReservationUpdate is flaky
> 
>
> Key: MESOS-10102
> URL: https://issues.apache.org/jira/browse/MESOS-10102
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Major
>
> There seem to be two kinds of flakes:
> {noformat}
> [ RUN  ] ContentType/MasterAPITest.ReservationUpdate/0
> I0306 13:09:52.496989  3096 cluster.cpp:186] Creating default 'local' 
> authorizer
> I0306 13:09:52.498562  3112 master.cpp:443] Master 
> 6ca18692-2b4b-4219-8e2f-78bdf56dc5d8 (core1.hw.ca1.mesosphere.com) started on 
> 66.70.182.167:35079
> I0306 13:09:52.498643  3112 master.cpp:446] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1000secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/VOU6Oj/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_operator_event_stream_subscribers="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
> --publish_per_framework_metrics="true" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/VOU6Oj/master" --zk_session_timeout="10secs"
> I0306 13:09:52.498890  3112 master.cpp:495] Master only allowing 
> authenticated frameworks to register
> I0306 13:09:52.498986  3112 master.cpp:501] Master only allowing 
> authenticated agents to register
> I0306 13:09:52.499089  3112 master.cpp:507] Master only allowing 
> authenticated HTTP frameworks to register
> I0306 13:09:52.499109  3112 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/VOU6Oj/credentials'
> I0306 13:09:52.499182  3112 master.cpp:551] Using default 'crammd5' 
> authenticator
> I0306 13:09:52.499311  3112 http.cpp:1265] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0306 13:09:52.499446  3112 http.cpp:1265] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0306 13:09:52.499575  3112 http.cpp:1265] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0306 13:09:52.499667  3112 master.cpp:632] Authorization enabled
> I0306 13:09:52.500351  3112 hierarchical.cpp:567] Initialized hierarchical 
> allocator process
> I0306 13:09:52.500371  3153 whitelist_watcher.cpp:77] No whitelist given
> I0306 13:09:52.500718  3149 master.cpp:2165] Elected as the leading master!
> I0306 13:09:52.500874  3149 master.cpp:1661] Recovering from registrar
> I0306 13:09:52.500926  3149 registrar.cpp:339] Recovering registrar
> I0306 13:09:52.501082  3149 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 117248ns
> I0306 13:09:52.501142  3149 registrar.cpp:487] Applied 1 operations in 
> 14299ns; attempting to update the registry
> I0306 13:09:52.501288  3142 registrar.cpp:544] Successfully updated the 
> registry in 127744ns
> I0306 13:09:52.501333  3142 registrar.cpp:416] Successfully recovered 
> registrar
> I0306 13:09:52.501436  3142 master.cpp:1814] Recovered 0 agents from the 
> registry (181B); allowing 10mins for agents to reregister
> I0306 13:09:52.501466  3148 hierarchical.cpp:606] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W0306 13:09:52.503216  3096 process.cpp:2877] Attempted to spawn already 
> running process files@66.70.182.167:35079
> I0306 13:09:52.503903  3096 containerizer.cpp:317] Using isolation { 
> environment_secret, filesystem/posix, network/cni, posix/cpu, posix/mem }

[jira] [Assigned] (MESOS-10100) Recently introduced PathTest.Relative and PathTest.PathIteration fail on windows.

2020-03-05 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10100:
---

Assignee: Benjamin Mahler

> Recently introduced PathTest.Relative and PathTest.PathIteration fail on 
> windows.
> -
>
> Key: MESOS-10100
> URL: https://issues.apache.org/jira/browse/MESOS-10100
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> These were introduced recently and never passed on windows:
> {noformat}
> 14:53:09 [ RUN  ] PathTest.Relative
> 14:53:09 
> C:\Users\Administrator\workspace\mesos\Mesos_CI_Windows-build-WIP\mesos\3rdparty\stout\tests\path_tests.cpp(361):
>  error: Value of: path::relative("a", "/a").isError()
> 14:53:09   Actual: false
> 14:53:09 Expected: true
> 14:53:09 
> C:\Users\Administrator\workspace\mesos\Mesos_CI_Windows-build-WIP\mesos\3rdparty\stout\tests\path_tests.cpp(362):
>  error: Value of: path::relative("/a", "a").isError()
> 14:53:09   Actual: false
> 14:53:09 Expected: true
> 14:53:09 
> C:\Users\Administrator\workspace\mesos\Mesos_CI_Windows-build-WIP\mesos\3rdparty\stout\tests\path_tests.cpp(369):
>  error: Value of: (path::relative("/a/b", "/a")).get()
> 14:53:09   Actual: "..\\/a/b"
> 14:53:09 Expected: "b"
> 14:53:09 Which is: "b"
> 14:53:09 [  FAILED  ] PathTest.Relative (3 ms)
> 14:53:09 [ RUN  ] PathTest.Comparison
> 14:53:09 [   OK ] PathTest.Comparison (1 ms)
> 14:53:09 [ RUN  ] PathTest.FromURI
> 14:53:09 [   OK ] PathTest.FromURI (0 ms)
> 14:53:09 [ RUN  ] PathTest.PathIteration
> 14:53:09 
> C:\Users\Administrator\workspace\mesos\Mesos_CI_Windows-build-WIP\mesos\3rdparty\stout\tests\path_tests.cpp(456):
>  error: Value of: absolute_path.is_absolute()
> 14:53:09   Actual: false
> 14:53:09 Expected: true
> 14:53:09 [  FAILED  ] PathTest.PathIteration (0 ms)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10100) Recently introduced PathTest.Relative and PathTest.PathIteration fail on windows.

2020-03-05 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10100:
---

 Summary: Recently introduced PathTest.Relative and 
PathTest.PathIteration fail on windows.
 Key: MESOS-10100
 URL: https://issues.apache.org/jira/browse/MESOS-10100
 Project: Mesos
  Issue Type: Bug
  Components: stout
Reporter: Benjamin Mahler


These were introduced recently and never passed on windows:

{noformat}
14:53:09 [ RUN  ] PathTest.Relative
14:53:09 
C:\Users\Administrator\workspace\mesos\Mesos_CI_Windows-build-WIP\mesos\3rdparty\stout\tests\path_tests.cpp(361):
 error: Value of: path::relative("a", "/a").isError()
14:53:09   Actual: false
14:53:09 Expected: true
14:53:09 
C:\Users\Administrator\workspace\mesos\Mesos_CI_Windows-build-WIP\mesos\3rdparty\stout\tests\path_tests.cpp(362):
 error: Value of: path::relative("/a", "a").isError()
14:53:09   Actual: false
14:53:09 Expected: true
14:53:09 
C:\Users\Administrator\workspace\mesos\Mesos_CI_Windows-build-WIP\mesos\3rdparty\stout\tests\path_tests.cpp(369):
 error: Value of: (path::relative("/a/b", "/a")).get()
14:53:09   Actual: "..\\/a/b"
14:53:09 Expected: "b"
14:53:09 Which is: "b"
14:53:09 [  FAILED  ] PathTest.Relative (3 ms)
14:53:09 [ RUN  ] PathTest.Comparison
14:53:09 [   OK ] PathTest.Comparison (1 ms)
14:53:09 [ RUN  ] PathTest.FromURI
14:53:09 [   OK ] PathTest.FromURI (0 ms)
14:53:09 [ RUN  ] PathTest.PathIteration
14:53:09 
C:\Users\Administrator\workspace\mesos\Mesos_CI_Windows-build-WIP\mesos\3rdparty\stout\tests\path_tests.cpp(456):
 error: Value of: absolute_path.is_absolute()
14:53:09   Actual: false
14:53:09 Expected: true
14:53:09 [  FAILED  ] PathTest.PathIteration (0 ms)
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-5255) Add GPUs to container resource consumption metrics.

2020-03-02 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049448#comment-17049448
 ] 

Benjamin Mahler commented on MESOS-5255:


[~jomach] we currently don't have Nvidia GPU support in the docker 
containerizer, see MESOS-5795.

> Add GPUs to container resource consumption metrics.
> ---
>
> Key: MESOS-5255
> URL: https://issues.apache.org/jira/browse/MESOS-5255
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Kevin Klues
>Priority: Major
>  Labels: gpu
>
> Currently the usage callback in the Nvidia GPU isolator is unimplemented:
> {noformat}
> src/slave/containerizer/mesos/isolators/cgroups/devices/gpus/nvidia.cpp
> {noformat}
> It should use functionality from NVML to gather the current GPU usage and add 
> it to a ResourceStatistics object. It is still an open question as to exactly 
> what information we want to expose here (power, memory consumption, current 
> load, etc.). Whatever we decide on should be standard across different GPU 
> types, different GPU vendors, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10096) Reactivating a draining agent leaves the agent in draining state.

2020-02-12 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10096:
---

Assignee: Benjamin Mahler

> Reactivating a draining agent leaves the agent in draining state.
> -
>
> Key: MESOS-10096
> URL: https://issues.apache.org/jira/browse/MESOS-10096
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> When reactivating an agent that's in the draining state, the master erases it 
> from its draining maps, and erases its estimated drain time.
> However, it doesn't send any message to the agent, so if the agent is still 
> draining and waiting for tasks to terminate, it will stay in that state, 
> ultimately causing any tasks that are then launched to be DROPPED because 
> the agent is still in a draining state.
> Seems like we should either:
> * Disallow the user from reactivating if still in draining, or
> * Send a message to the agent, and have the agent move itself out of draining.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10095) Agent draining logging makes it hard to tell which tasks did not terminate.

2020-02-12 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10095:
---

Assignee: Benjamin Mahler

> Agent draining logging makes it hard to tell which tasks did not terminate.
> ---
>
> Key: MESOS-10095
> URL: https://issues.apache.org/jira/browse/MESOS-10095
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> When draining an agent, it's hard to tell which tasks failed to terminate.
> The master prints a count of the tasks remaining (only as VLOG(1) however), 
> but not the IDs:
> {noformat}
> I1223 13:19:49.021764 30480 master.cpp:6367] DRAINING Agent 
> c0146010-8af6-4a9d-bcdb-99e30a778663-S6 has 0 pending tasks, 1 tasks, and 0 
> operations
> {noformat}
> The agent does not print how many or which ones.
> It would be helpful to at least see which tasks need to be drained when it 
> begins, and possibly, upon each check, which ones remain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10096) Reactivating a draining agent leaves the agent in draining state.

2020-02-11 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10096:
---

 Summary: Reactivating a draining agent leaves the agent in 
draining state.
 Key: MESOS-10096
 URL: https://issues.apache.org/jira/browse/MESOS-10096
 Project: Mesos
  Issue Type: Bug
  Components: agent, master
Reporter: Benjamin Mahler


When reactivating an agent that's in the draining state, the master erases it 
from its draining maps, and erases its estimated drain time.

However, it doesn't send any message to the agent, so if the agent is still 
draining and waiting for tasks to terminate, it will stay in that state, 
ultimately causing any tasks that are then launched to be DROPPED because the 
agent is still in a draining state.

Seems like we should either:

* Disallow the user from reactivating if still in draining, or
* Send a message to the agent, and have the agent move itself out of draining.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10095) Agent draining logging makes it hard to tell which tasks did not terminate.

2020-02-11 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10095:
---

 Summary: Agent draining logging makes it hard to tell which tasks 
did not terminate.
 Key: MESOS-10095
 URL: https://issues.apache.org/jira/browse/MESOS-10095
 Project: Mesos
  Issue Type: Improvement
  Components: agent, master
Reporter: Benjamin Mahler


When draining an agent, it's hard to tell which tasks failed to terminate.

The master prints a count of the tasks remaining (only as VLOG(1) however), but 
not the IDs:

{noformat}
I1223 13:19:49.021764 30480 master.cpp:6367] DRAINING Agent 
c0146010-8af6-4a9d-bcdb-99e30a778663-S6 has 0 pending tasks, 1 tasks, and 0 
operations
{noformat}

The agent does not print how many or which ones.

It would be helpful to at least see which tasks need to be drained when it 
begins, and possibly, upon each check, which ones remain.
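
A hedged sketch of the kind of logging improvement suggested (plain C++ with 
illustrative task IDs; not the committed patch):

{code}
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Joins task IDs into a single string for log output.
std::string join(const std::vector<std::string>& ids)
{
  std::ostringstream out;

  for (size_t i = 0; i < ids.size(); ++i) {
    if (i > 0) {
      out << ", ";
    }
    out << ids[i];
  }

  return out.str();
}

int main()
{
  std::vector<std::string> remaining = {"task-1", "task-2"};

  // Log the IDs, not just the count, so operators can tell which
  // tasks failed to terminate during draining.
  std::cout << "DRAINING: waiting on " << remaining.size()
            << " task(s): " << join(remaining) << std::endl;

  return 0;
}
{code}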



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10094) Master's agent draining VLOG prints incorrect task counts.

2020-02-11 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10094:
---

Assignee: Benjamin Mahler

> Master's agent draining VLOG prints incorrect task counts.
> --
>
> Key: MESOS-10094
> URL: https://issues.apache.org/jira/browse/MESOS-10094
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.9.0
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> This logic is printing the framework counts of these maps rather than the 
> task counts:
> https://github.com/apache/mesos/blob/4575c9b452c25f64e6c6cc3eddc12ed3b1f8538b/src/master/master.cpp#L6318-L6319
> {code}
>   // Check if the agent has any tasks running or operations pending.
>   if (!slave->pendingTasks.empty() ||
>   !slave->tasks.empty() ||
>   !slave->operations.empty()) {
> VLOG(1)
>   << "DRAINING Agent " << slaveId << " has "
>   << slave->pendingTasks.size() << " pending tasks, "
>   << slave->tasks.size() << " tasks, and "
>   << slave->operations.size() << " operations";
> return;
>   }
> {code}
> Since these are nested {{hashmap}}s keyed by framework, {{size()}} on the 
> outer map yields the framework count.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10094) Master's agent draining VLOG prints incorrect task counts.

2020-02-11 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10094:
---

 Summary: Master's agent draining VLOG prints incorrect task counts.
 Key: MESOS-10094
 URL: https://issues.apache.org/jira/browse/MESOS-10094
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.9.0
Reporter: Benjamin Mahler


This logic is printing the framework counts of these maps rather than the task 
counts:

https://github.com/apache/mesos/blob/4575c9b452c25f64e6c6cc3eddc12ed3b1f8538b/src/master/master.cpp#L6318-L6319

{code}
  // Check if the agent has any tasks running or operations pending.
  if (!slave->pendingTasks.empty() ||
  !slave->tasks.empty() ||
  !slave->operations.empty()) {
VLOG(1)
  << "DRAINING Agent " << slaveId << " has "
  << slave->pendingTasks.size() << " pending tasks, "
  << slave->tasks.size() << " tasks, and "
  << slave->operations.size() << " operations";
return;
  }
{code}

Since these are nested {{hashmap}}s keyed by framework, {{size()}} on the 
outer map yields the framework count.
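
A hedged sketch of a fix (generic C++ with illustrative types, not the 
committed patch): sum the inner map sizes so the log reports task counts 
rather than framework counts.

{code}
#include <cstddef>
#include <string>
#include <unordered_map>

// Illustrative stand-ins: FrameworkID and TaskID as strings, the task
// payload as an int placeholder.
using TaskMap = std::unordered_map<
    std::string, std::unordered_map<std::string, int>>;

size_t taskCount(const TaskMap& tasks)
{
  size_t count = 0;

  for (const auto& entry : tasks) {
    // entry.second is the per-framework task map; its size is the
    // number of tasks, unlike the outer size() used in the VLOG above.
    count += entry.second.size();
  }

  return count;
}
{code}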



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10026) Improve v1 operator API read performance.

2020-01-30 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027015#comment-17027015
 ] 

Benjamin Mahler commented on MESOS-10026:
-

For the agent's v1 calls, I only did this for GET_METRICS, GET_TASKS, 
GET_EXECUTORS, GET_FRAMEWORKS, and GET_STATE, since those seem like the most 
expensive:

https://reviews.apache.org/r/72056/
https://reviews.apache.org/r/72064/
https://reviews.apache.org/r/72065/
https://reviews.apache.org/r/72066/
https://reviews.apache.org/r/72067/

> Improve v1 operator API read performance.
> -
>
> Key: MESOS-10026
> URL: https://issues.apache.org/jira/browse/MESOS-10026
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
> Fix For: 1.10.0
>
>
> Currently, the v1 operator API has poor performance relative to the v0 json 
> API. The following initial numbers were provided by [~Will Mahler] from our 
> state serving benchmark:
>  
> |OPTIMIZED - Master (baseline)| | | | |
> |Test setup|1000 agents with a total of 1 running tasks and 1 
> completed tasks|1 agents with a total of 10 running tasks and 10 
> completed tasks|2 agents with a total of 20 running tasks and 20 
> completed tasks|4 agents with a total of 40 running tasks and 40 
> completed tasks|
> |v0 'state' response|0.17|1.66|8.96|12.42|
> |v1 x-protobuf|0.35|3.21|9.47|19.09|
> |v1 json|0.45|4.72|10.81|31.43|
> There is quite a lot of variance, but v1 protobuf is consistently slower than 
> v0 (sometimes significantly so), and v1 json is consistently slower than v1 
> protobuf (sometimes significantly so).
> The reason that the v1 operator API is slower is that it does the following:
> (1) Construct temporary unversioned state response object by copying 
> in-memory un-versioned state into overall response object. (expensive!)
> (2) Evolve it to v1: serialize, de-serialize into v1 overall state object. 
> (expensive!)
> (3) Serialize the overall v1 state object to protobuf or json.
> (4) Destruct the temporaries (expensive! but is done after response starts 
> serving)
> On the other hand, the v0 jsonify approach does the following:
> (1) Serialize the in-memory unversioned state into json, by traversing state 
> and accumulating the overall serialized json.
> This means that v1 has substantial overhead vs v0, and we need to remove it 
> to bring v1 on-par or better than v0. v1 should serialize directly to json 
> (straightforward with jsonify) or protobuf (this can be done via a 
> io::CodedOutputStream).
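
A minimal sketch of the direct-serialization idea for the protobuf case 
(generic protobuf C++; the field number and message list are illustrative 
assumptions, not the actual patch):

{code}
#include <cstdint>
#include <string>
#include <vector>

#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl_lite.h>
#include <google/protobuf/message.h>

// Streams repeated message fields into a buffer, so the overall
// response protobuf is never materialized as a temporary.
// `fieldNumber` is the tag number of the repeated message field in the
// enclosing (never constructed) response type.
std::string serializeRepeated(
    const std::vector<const google::protobuf::Message*>& messages,
    int fieldNumber)
{
  std::string output;

  {
    google::protobuf::io::StringOutputStream stream(&output);
    google::protobuf::io::CodedOutputStream out(&stream);

    for (const google::protobuf::Message* message : messages) {
      // Length-delimited wire type is 2: tag = (field_number << 3) | 2.
      out.WriteVarint32((fieldNumber << 3) | 2);
      out.WriteVarint32(
          static_cast<uint32_t>(message->ByteSizeLong()));

      // ByteSizeLong() above cached the sizes used here.
      message->SerializeWithCachedSizes(&out);
    }
  } // CodedOutputStream flushes into `output` on destruction.

  return output;
}
{code}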



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10026) Improve v1 operator API read performance.

2020-01-27 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024705#comment-17024705
 ] 

Benjamin Mahler commented on MESOS-10026:
-

Note that the same approach still needs to be applied to the agent's v1 calls.

> Improve v1 operator API read performance.
> -
>
> Key: MESOS-10026
> URL: https://issues.apache.org/jira/browse/MESOS-10026
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
> Fix For: 1.10.0
>
>
> Currently, the v1 operator API has poor performance relative to the v0 json 
> API. The following initial numbers were provided by [~Will Mahler] from our 
> state serving benchmark:
>  
> |OPTIMIZED - Master (baseline)| | | | |
> |Test setup|1000 agents with a total of 1 running tasks and 1 
> completed tasks|1 agents with a total of 10 running tasks and 10 
> completed tasks|2 agents with a total of 20 running tasks and 20 
> completed tasks|4 agents with a total of 40 running tasks and 40 
> completed tasks|
> |v0 'state' response|0.17|1.66|8.96|12.42|
> |v1 x-protobuf|0.35|3.21|9.47|19.09|
> |v1 json|0.45|4.72|10.81|31.43|
> There is quite a lot of variance, but v1 protobuf is consistently slower than 
> v0 (sometimes significantly so), and v1 json is consistently slower than v1 
> protobuf (sometimes significantly so).
> The reason that the v1 operator API is slower is that it does the following:
> (1) Construct temporary unversioned state response object by copying 
> in-memory un-versioned state into overall response object. (expensive!)
> (2) Evolve it to v1: serialize, de-serialize into v1 overall state object. 
> (expensive!)
> (3) Serialize the overall v1 state object to protobuf or json.
> (4) Destruct the temporaries (expensive! but is done after response starts 
> serving)
> On the other hand, the v0 jsonify approach does the following:
> (1) Serialize the in-memory unversioned state into json, by traversing state 
> and accumulating the overall serialized json.
> This means that v1 has substantial overhead vs v0, and we need to remove it 
> to bring v1 on-par or better than v0. v1 should serialize directly to json 
> (straightforward with jsonify) or protobuf (this can be done via a 
> io::CodedOutputStream).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10042) Mesos UI template not always rendered

2020-01-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020546#comment-17020546
 ] 

Benjamin Mahler commented on MESOS-10042:
-

[~milipili] is this a transient thing? Or does it get stuck like that? Can you 
make a screen recording?

> Mesos UI template not always rendered
> -
>
> Key: MESOS-10042
> URL: https://issues.apache.org/jira/browse/MESOS-10042
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.9.0
> Environment: Linux Vivaldi & Firefox
> ubuntu 18.04
>Reporter: Damien Gerard
>Priority: Minor
> Attachments: image-2019-11-27-17-34-29-733.png, 
> image-2019-11-27-17-37-18-679.png, image-2019-11-27-17-39-06-984.png, 
> image-2019-11-27-17-39-16-491.png, image-2019-11-27-17-39-37-341.png, 
> image-2019-11-27-17-39-44-306.png
>
>
> When opening the webui directly or when switching tabs (by clicking on 
> "Frameworks"/"Agents"/whatever back to the main page), the page is not always 
> rendered (see below).
>   !image-2019-11-27-17-39-37-341.png!
> Also, the cluster name is never replaced (the same as in our mesos 1.6), even 
> if --cluster "some-value" is set
>   !image-2019-11-27-17-39-44-306.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020545#comment-17020545
 ] 

Benjamin Mahler commented on MESOS-10068:
-

The first thing to comment on is that we don't yet have a formalized agent 
lifecycle in the API, we have AgentAdded / AgentRemoved but internally there is 
also the notion of disconnecting, becoming unreachable, getting transitioned to 
gone. So the API and internals are at a bit of a mismatch here and more broadly 
of this particular ticket we would need to make them consistent to have events 
that make sense.

[~daltonmatos] It looks like the reason you're seeing no AGENT_REMOVED is that 
the agent became unreachable, and we don't send it in that case. The first 
case goes through a different path where we never were able to communicate with 
the agent, but we don't know that and the agent retries its registration, upon 
seeing this we remove the previous version of that agent and try to register 
the new one. You may see this repeating itself over and over.

[~greggomann] looks like we don't send AGENT_REMOVED when an agent is marked as 
gone? Seems like a bug due to {{__removeSlave}} being used for both marking 
unreachable and gone?



> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal 
> state
> ---
>
> Key: MESOS-10068
> URL: https://issues.apache.org/jira/browse/MESOS-10068
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.3, 1.8.2, 1.9.1
>Reporter: Dalton Matos Coelho Barreto
>Priority: Major
> Attachments: master-full-logs.log
>
>
> Hello,
>  
> Looking at the documentation of the master {{/api/v1}} endpoint, the 
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are 
> supported for this endpoint, but when a new agent joins the cluster a 
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped the {{AGENT_REMOVED}} is not 
> received by clients subscribed to the master API.
>  
> I tested this behavior with versions: {{1.7.3}}, {{1.8.2}} and {{1.9.1}}. All 
> using the docker image {{mesos/mesos-centos}}.
> The only way I saw a {{AGENT_REMOVED}} event was when a new agent joined the 
> cluster but the master couldn't communicate with this agent, in this specific 
> test there was a firewall blocking port {{5051}} on the slave, that is, nobody 
> was able to talk to the slave on port {{5051}}.
>  
> h2. Here are the steps to reproduce the problem
>  * Start a new mesos master
>  * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
>  ** 
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: 
> application/json" http://MASTER_IP:5050/api/v1{noformat}
>  * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
>  * Stop this slave;
>  * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the 
> field {{active=false}}.
>  * Wait for the mesos master to stop listing this slave, that is, 
> {{/slaves?slave_id=AGENT_ID}} returns an empty response;
> Even after the empty response, the event never reaches the subscriber.
>  
> The mesos master logs shows this:
> {noformat}
>  I1213 15:03:10.33893513 master.cpp:1297] Agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964) disconnected
> I1213 15:03:10.33908913 master.cpp:3399] Disconnecting agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> I1213 15:03:10.33920713 master.cpp:3418] Deactivating agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.72667015 process.cpp:1917] Failed to send 
> 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to 
> connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1   {noformat}
>  
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>  
> I will attach the full master logs also.
>  
> Do you think this could be a bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-1807) Disallow executors with cpu only or memory only resources

2020-01-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020530#comment-17020530
 ] 

Benjamin Mahler commented on MESOS-1807:


[~Charle] the command executor is a special case, it's implicitly generated and 
we oversubscribe a little bit to make room for it: 
https://github.com/apache/mesos/blob/1.9.0/src/slave/slave.cpp#L6663-L6676

I think the expectation for CUSTOM or the new DEFAULT executors is that they 
specify their resource requirements. Since it didn't break any backwards 
compatibility, we enforce it for the new DEFAULT case: 
https://github.com/apache/mesos/blob/1.9.0/src/master/validation.cpp#L1842-L1859

[~greggomann] is also working on cpu/mem requests vs limits (see MESOS-10001), 
so that may provide you with the flexibility you desire depending on what 
you're looking to do.
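
For illustration, a minimal sketch of the kind of check this ticket asks for, 
using a simplified map of scalar resources rather than the actual {{Resources}} 
class (the MIN_CPUS / MIN_MEM_MB values below are placeholders standing in for 
the constants referenced in src/master/validation.cpp):

{noformat}
#include <iostream>
#include <map>
#include <string>

// Placeholder minimums, standing in for MIN_CPUS / MIN_BYTES.
const double MIN_CPUS = 0.01;
const double MIN_MEM_MB = 32;

// Returns an error message if the executor's resources are invalid,
// or an empty string if they pass validation.
std::string validateExecutorResources(
    const std::map<std::string, double>& resources)
{
  auto cpus = resources.find("cpus");
  auto mem = resources.find("mem");

  if (cpus == resources.end() || mem == resources.end()) {
    return "Executor must specify both 'cpus' and 'mem' resources";
  }
  if (cpus->second < MIN_CPUS) {
    return "Executor 'cpus' is below the minimum";
  }
  if (mem->second < MIN_MEM_MB) {
    return "Executor 'mem' is below the minimum";
  }
  return "";
}

int main()
{
  // A memory-only executor is rejected.
  std::cout << validateExecutorResources({{"mem", 128}}) << std::endl;

  // Specifying both cpus and mem above the minimums passes (prints nothing).
  std::cout << validateExecutorResources({{"cpus", 0.1}, {"mem", 128}})
            << std::endl;
  return 0;
}
{noformat}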

> Disallow executors with cpu only or memory only resources
> -
>
> Key: MESOS-1807
> URL: https://issues.apache.org/jira/browse/MESOS-1807
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Priority: Major
> Attachments: Screenshot 2015-07-28 14.40.35.png
>
>
> Currently master allows executors to be launched with either only cpus or 
> only memory but we shouldn't allow that.
> This is because executor is an actual unix process that is launched by the 
> slave. If an executor doesn't specify cpus, what should the cpu limits be for 
> that executor when there are no tasks running on it? If no cpu limits are set 
> then it might starve other executors/tasks on the slave violating isolation 
> guarantees. Same goes with memory. Moreover, the current 
> containerizer/isolator code will throw failures when using such an executor, 
> e.g., when the last task on the executor finishes and Containerizer::update() 
> is called with 0 cpus or 0 mem.
> According to a [TODO in the source code | 
> https://github.com/apache/mesos/blob/0226620747e1769434a1a83da547bfc3470a9549/src/master/validation.cpp#L400]
>  this should also include checking whether requested resources are greater 
> than MIN_CPUS/MIN_BYTES.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9497) Parallel reads for master v1 state calls

2020-01-16 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017367#comment-17017367
 ] 

Benjamin Mahler commented on MESOS-9497:


{noformat}
commit fae9ed4626e8073ff5f17f3d87448053252e2046
Author: Benjamin Mahler 
Date:   Mon Jan 6 12:56:19 2020 -0500

Updated all GET_* v1 master calls to be served in parallel.

Using the same approach taken for the v0 read-only endpoints,
this enables parallel reads for the GET_* v1 master calls.

Review: https://reviews.apache.org/r/71978
{noformat}

> Parallel reads for master v1 state calls
> 
>
> Key: MESOS-9497
> URL: https://issues.apache.org/jira/browse/MESOS-9497
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API, master
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere, performance
>
> Similar to MESOS-9158 - we should make the operator API calls which serve 
> master state perform computation of multiple such responses in parallel to 
> reduce the performance impact on the master actor.
> Note that this includes the initial expensive SUBSCRIBE payload for the event 
> streaming API, which is less straightforward to incorporate into the parallel 
> serving logic since it performs writes (to track the subscriber) and produces 
> an infinite response, unlike the other state related calls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9497) Parallel reads for master v1 state calls

2020-01-09 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9497:
--

Assignee: Benjamin Mahler

> Parallel reads for master v1 state calls
> 
>
> Key: MESOS-9497
> URL: https://issues.apache.org/jira/browse/MESOS-9497
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API, master
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere, performance
>
> Similar to MESOS-9158 - we should make the operator API calls which serve 
> master state perform computation of multiple such responses in parallel to 
> reduce the performance impact on the master actor.
> Note that this includes the initial expensive SUBSCRIBE payload for the event 
> streaming API, which is less straightforward to incorporate into the parallel 
> serving logic since it performs writes (to track the subscriber) and produces 
> an infinite response, unlike the other state related calls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9497) Parallel reads for master v1 state calls

2020-01-09 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012078#comment-17012078
 ] 

Benjamin Mahler commented on MESOS-9497:


Parallel reads for GET_* calls, SUBSCRIBE will be done separately since it 
needs to mutate state to add subscribers: https://reviews.apache.org/r/71978/

> Parallel reads for master v1 state calls
> 
>
> Key: MESOS-9497
> URL: https://issues.apache.org/jira/browse/MESOS-9497
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API, master
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere, performance
>
> Similar to MESOS-9158 - we should make the operator API calls which serve 
> master state perform computation of multiple such responses in parallel to 
> reduce the performance impact on the master actor.
> Note that this includes the initial expensive SUBSCRIBE payload for the event 
> streaming API, which is less straightforward to incorporate into the parallel 
> serving logic since it performs writes (to track the subscriber) and produces 
> an infinite response, unlike the other state related calls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-6568) JSON serialization should not omit empty arrays in HTTP APIs

2019-12-06 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-6568:
--

Assignee: Benjamin Mahler

> JSON serialization should not omit empty arrays in HTTP APIs
> 
>
> Key: MESOS-6568
> URL: https://issues.apache.org/jira/browse/MESOS-6568
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Neil Conway
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere
>
> When using the JSON content type with the HTTP APIs, a {{repeated}} protobuf 
> field is omitted entirely from the JSON serialization of the message. For 
> example, this is a response to the {{GetTasks}} call:
> {noformat}
> {
>   "get_tasks": {
> "tasks": [{...}]
>   },
>   "type": "GET_TASKS"
> }
> {noformat}
> I think it would be better to include empty arrays for the other fields of 
> the message ({{pending_tasks}}, {{completed_tasks}}, etc.). Advantages:
> # Consistency with the old HTTP endpoints, e.g., /state
> # Semantically, an empty array is more accurate. The master's response should 
> be interpreted as saying it doesn't know about any pending/completed tasks; 
> that is more accurately conveyed by explicitly including an empty array, not 
> by omitting the key entirely.
> *NOTE: The 
> [asV1Protobuf|https://github.com/apache/mesos/blob/d10a33acc426dda9e34db995f16450faf898bb3b/src/common/http.cpp#L172-L423]
>  copy needs to also be updated.*
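
For illustration, a sketch of the desired behavior using a plain 
{{std::ostringstream}} rather than the actual stout jsonify facilities: an 
empty repeated field is still written as {{[]}} instead of being omitted.

{noformat}
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Serializes a GetTasks-like response; an empty vector still yields "[]",
// never a missing key, matching the behavior of the old /state endpoint.
std::string serializeGetTasks(
    const std::vector<std::string>& tasks,
    const std::vector<std::string>& completedTasks)
{
  std::ostringstream out;

  auto writeArray = [&](const std::string& key,
                        const std::vector<std::string>& items) {
    out << "\"" << key << "\":[";
    for (size_t i = 0; i < items.size(); ++i) {
      if (i > 0) {
        out << ",";
      }
      out << "\"" << items[i] << "\"";
    }
    out << "]";
  };

  out << "{";
  writeArray("tasks", tasks);
  out << ",";
  writeArray("completed_tasks", completedTasks);
  out << "}";

  return out.str();
}

int main()
{
  // Prints {"tasks":["t1"],"completed_tasks":[]} -- the empty array is
  // explicitly present.
  std::cout << serializeGetTasks({"t1"}, {}) << std::endl;
  return 0;
}
{noformat}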



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10057) Perform synchronous authorization for outgoing events on event stream.

2019-11-30 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10057:
---

 Summary: Perform synchronous authorization for outgoing events on 
event stream.
 Key: MESOS-10057
 URL: https://issues.apache.org/jira/browse/MESOS-10057
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Benjamin Mahler


The master event stream outgoing message authorization becomes very expensive 
for a large number of subscribers (cc [~greggomann]).

Instead of asynchronously obtaining object approvers for every event multiplied 
by every subscriber, we can instead always hold on to valid object approvers so 
that we could synchronously (and cheaply) authorize the outgoing events. These 
object approvers would be kept up to date in the background, and if 
authorization fails to keep them up to date, we would treat that the same way an 
authorization failure is currently treated.

This should improve the performance dramatically.
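
As a rough sketch of the idea (hypothetical types; the actual Mesos 
ObjectApprover interface differs): the approver is swapped in by a background 
refresh, and the hot path does only a cheap synchronous check per event.

{noformat}
#include <atomic>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for an authorizer decision over a set of objects.
struct Approver
{
  std::function<bool(const std::string& object)> approved;
};

class CachedApprover
{
public:
  // Called from a background loop whenever fresh authorization results
  // arrive; a failure to refresh would be treated like an authorization
  // failure.
  void refresh(std::shared_ptr<Approver> fresh)
  {
    std::atomic_store(&approver_, std::move(fresh));
  }

  // Hot path: synchronous and cheap, no authorizer round trip per event
  // per subscriber.
  bool approved(const std::string& object) const
  {
    std::shared_ptr<Approver> approver = std::atomic_load(&approver_);
    return approver != nullptr && approver->approved(object);
  }

private:
  std::shared_ptr<Approver> approver_;
};

int main()
{
  CachedApprover cache;
  cache.refresh(std::make_shared<Approver>(
      Approver{[](const std::string& object) { return object != "secret"; }}));

  // Each outgoing event is filtered synchronously for each subscriber.
  bool sendTask = cache.approved("task-123");  // true
  bool sendSecret = cache.approved("secret");  // false
  return (sendTask && !sendSecret) ? 0 : 1;
}
{noformat}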



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10056) Perform synchronous authorization for scheduler calls.

2019-11-30 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10056:
---

 Summary: Perform synchronous authorization for scheduler calls.
 Key: MESOS-10056
 URL: https://issues.apache.org/jira/browse/MESOS-10056
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Benjamin Mahler


After chatting with [~asekretenko] about how best to resolve MESOS-10023, as an 
alternative to making all scheduler calls get sequenced through an asynchronous 
authorization step, I brought up the old idea of making authorization 
synchronous.

This came up (although I can't find a ticket for it) in the past because the 
master event stream outgoing message authorization becomes very expensive for a 
large number of subscribers (cc [~greggomann]). Back then, I suggested that we 
always hold on to valid object approvers so that we could synchronously (and 
cheaply) authorize the outgoing events. These object approvers would be kept up 
to date in the background, and if authorization fails to keep them up to date, 
we would treat that the same way an authorization failure is currently treated.

We can apply the same idea (although we haven't applied it to the master's 
event stream yet) to scheduler API calls, which should help resolve MESOS-10023 
since we're no longer mixing asynchronously authorized calls with calls that 
don't go through authorization.

This will also yield a performance improvement, scheduler calls no longer get 
delayed by asynchronous authorization, and an extra trip through the master 
queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10026) Improve v1 operator API read performance.

2019-11-22 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980530#comment-16980530
 ] 

Benjamin Mahler commented on MESOS-10026:
-

All of the GET_ reads are done: 

{noformat}
commit b275a032f217794f20dae15d86f188e55d43ce59
Author: Benjamin Mahler 
Date:   Fri Nov 8 16:13:40 2019 -0800

Support jsonifying v0 protobuf to v1 protobuf.

This allows us to jsonify a v0 protobuf directly to a v1 protobuf
efficiently, with no need to `evolve()` the message (which is rather
expensive).

The way this works is by converting all "slave" and "SLAVE" strings
in fields and enum values, respectively, to "agent" and "AGENT".

Our current v0 to v1 conversion for the v1 operator API simply
serializes the v0 message and de-serializes into a v1 message, which
means all field tags and message structures are the same, except
for field names. The only difference with field names is the use
of "agent" in place of "slave".

Review: https://reviews.apache.org/r/71748
{noformat}

{noformat}
commit 8bfbcab09be82d3a443697925fbf3c4f31333060
Author: Benjamin Mahler 
Date:   Fri Nov 8 16:49:36 2019 -0800

Added a test for AsV1Protobuf.

Review: https://reviews.apache.org/r/71749
{noformat}

{noformat}
commit 715035b24cb90ba17f9d92217f6556a2f66979e8
Author: Benjamin Mahler 
Date:   Fri Nov 8 16:52:37 2019 -0800

Improved performance of v1 operator API GetAgents call.

This updates the handling to serialize directly to protobuf or json
from the in-memory v0 state, bypassing expensive intermediate
serialization / de-serialization / object construction / object
destruction.

This initial patch shows the approach that will be used for the
other expensive calls. Note that this type of manual writing is
more brittle and complex, but it can be mostly eliminated if we
keep an up-to-date v1 GetState in memory in the future.

When this approach is applied fully to GetState, it leads to the
following improvement:

Before:
v0 '/state' response took 6.55 secs
v1 'GetState' application/x-protobuf response took 24.08 secs
v1 'GetState' application/json response took 22.76 secs

After:
v0 '/state' response took 8.00 secs
v1 'GetState' application/x-protobuf response took 5.73 secs
v1 'GetState' application/json response took 9.62 secs

Review: https://reviews.apache.org/r/71750
{noformat}

{noformat}
commit 4f4dab961bd45ca444d13b831cdb2541dd10ced8
Author: Benjamin Mahler 
Date:   Fri Nov 8 16:56:16 2019 -0800

Improved performance of v1 operator API GetFrameworks call.

This follow the same approach used in the GetAgents call;
serializing directly to protobuf or json from the in-memory
v0 state.

Review: https://reviews.apache.org/r/71751
{noformat}

{noformat}
commit 6ab835459a452e53fec8982a5aaab7e78094bbcb
Author: Benjamin Mahler 
Date:   Fri Nov 8 16:57:28 2019 -0800

Improved performance of v1 operator API GetExecutors call.

This follow the same approach used in the GetAgents call;
serializing directly to protobuf or json from the in-memory
v0 state.

Review: https://reviews.apache.org/r/71752
{noformat}

{noformat}
commit d7dd4d0e8493331d7b7a21b504ebeab702ff06d5
Author: Benjamin Mahler 
Date:   Fri Nov 8 16:58:47 2019 -0800

Improved performance of v1 operator API GetTasks call.

This follow the same approach used in the GetAgents call;
serializing directly to protobuf or json from the in-memory
v0 state.

Review: https://reviews.apache.org/r/71753
{noformat}

{noformat}
commit 1c60f0e4acbac96c34bd90e265150cdd3844f915
Author: Benjamin Mahler 
Date:   Fri Nov 8 16:59:44 2019 -0800

Improved performance of v1 operator API GetState call.

This follow the same approach used in the GetAgents call;
serializing directly to protobuf or json from the in-memory
v0 state.

Before:
v0 '/state' response took 6.55 secs
v1 'GetState' application/x-protobuf response took 24.08 secs
v1 'GetState' application/json response took 22.76 secs

After:
v0 '/state' response took 8.00 secs
v1 'GetState' application/x-protobuf response took 5.73 secs
v1 'GetState' application/json response took 9.62 secs

Review: https://reviews.apache.org/r/71754
{noformat}

{noformat}
commit 469f2ebaf65b1642d1eb4a1df81abfc2c94889dd
Author: Benjamin Mahler 
Date:   Fri Nov 8 17:00:37 2019 -0800

Improved performance of v1 operator API GetMetrics call.

This follow the same approach used in the GetAgents call;
serializing directly to protobuf or json from the in-memory
v0 state.

Review: https://reviews.apache.org/r/71755
{noformat}

SUBSCRIBE will be next.
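
A simplified illustration of the "slave" to "agent" name conversion described 
in the first commit above. Note this sketch rewrites a finished JSON string, 
which would also touch occurrences inside string values; the real patch only 
converts field names and enum values while jsonifying the v0 message.

{noformat}
#include <iostream>
#include <string>

// Rewrites "slave"/"SLAVE" to "agent"/"AGENT" in a JSON string.
std::string evolveNames(std::string json)
{
  auto replaceAll = [](std::string& s,
                       const std::string& from,
                       const std::string& to) {
    for (size_t pos = s.find(from); pos != std::string::npos;
         pos = s.find(from, pos + to.size())) {
      s.replace(pos, from.size(), to);
    }
  };

  replaceAll(json, "slave", "agent");
  replaceAll(json, "SLAVE", "AGENT");
  return json;
}

int main()
{
  // Prints {"agent_id":"S1","state":"AGENT_ACTIVE"}
  std::cout << evolveNames(R"({"slave_id":"S1","state":"SLAVE_ACTIVE"})")
            << std::endl;
  return 0;
}
{noformat}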

> Improve v1 operator API read performance.
> -
>
> Key: MESOS-10026
> URL: 

[jira] [Created] (MESOS-10040) Keep an up-to-date v1 GetState object in the master's memory.

2019-11-21 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10040:
---

 Summary: Keep an up-to-date v1 GetState object in the master's 
memory.
 Key: MESOS-10040
 URL: https://issues.apache.org/jira/browse/MESOS-10040
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API, master
Reporter: Benjamin Mahler


In MESOS-10026, the operator API read performance was improved by directly 
serializing out unversioned protobuf held in memory to serialized v1 protobuf. 
This required a lot of manual writing out of serialized protobuf or json that 
we could eliminate if we had v1 state held in memory.

Holding v1 state in memory could be tackled by "self-subscribing" (although 
probably without going through http) to the master's event stream and updating 
an overall GetState object based on each event.

Note that MESOS-8163 suggests adding a "state actor"; such an actor would be 
implemented using the approach outlined in this ticket.
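
A rough sketch of the self-subscribing idea, with hypothetical stand-ins for 
the event and state types rather than the actual v1 protobufs:

{noformat}
#include <map>
#include <string>
#include <variant>

struct TaskAdded { std::string taskId; std::string state; };
struct TaskUpdated { std::string taskId; std::string state; };

using Event = std::variant<TaskAdded, TaskUpdated>;

struct GetState
{
  std::map<std::string, std::string> tasks;  // Task ID -> task state.
};

// Applied once per event the master emits; the cached GetState can then be
// served (or serialized once and reused) without rebuilding it per request.
void apply(GetState& state, const Event& event)
{
  if (const TaskAdded* added = std::get_if<TaskAdded>(&event)) {
    state.tasks[added->taskId] = added->state;
  } else if (const TaskUpdated* updated = std::get_if<TaskUpdated>(&event)) {
    state.tasks[updated->taskId] = updated->state;
  }
}

int main()
{
  GetState cached;
  apply(cached, TaskAdded{"t1", "TASK_STAGING"});
  apply(cached, TaskUpdated{"t1", "TASK_RUNNING"});
  return cached.tasks.at("t1") == "TASK_RUNNING" ? 0 : 1;
}
{noformat}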



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9896) Consider using protobuf provided json conversion facilities rather than custom ones.

2019-11-08 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970596#comment-16970596
 ] 

Benjamin Mahler commented on MESOS-9896:


Started a thread here on their mailing list:
[https://groups.google.com/forum/#!topic/protobuf/4qmUqGE5-oQ]

> Consider using protobuf provided json conversion facilities rather than 
> custom ones.
> 
>
> Key: MESOS-9896
> URL: https://issues.apache.org/jira/browse/MESOS-9896
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> Currently, stout provides custom JSON to protobuf conversion facilities, some 
> of which use protobuf reflection.
> When upgrading protobuf to 3.7.x in MESOS-9755, we found that the v0 /state 
> response of the master slowed down, and it appears to be due to a performance 
> regression in the protobuf reflection code.
> We should file an issue with protobuf, but we should also look into using the 
> json conversion code that protobuf provides to see if that can help avoid the 
> regression. It may be the case that using the built-in facilities actually 
> provides a significant performance benefit, given they don't use reflection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10026) Improve v1 operator API read performance.

2019-11-08 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970594#comment-16970594
 ] 

Benjamin Mahler commented on MESOS-10026:
-

Some preliminary numbers from a prototype 
https://github.com/bmahler/mesos/tree/bmahler_v1_operator_api_read_performance

{noformat}
Before:
v0 '/state' response took 6.549942141secs
v1 'master::call::GetState' application/x-protobuf response took 
24.081624381secs
v1 'master::call::GetState' application/json response took 22.760332466secs
{noformat}
{noformat}
After:
v0 '/state' response took 7.57313099secs
v1 'master::call::GetState' application/x-protobuf response took 5.240223816secs
v1 'master::call::GetState' application/json response took 1.76133347258333mins
{noformat}

However, as you can see, it turns out protobuf’s built-in json conversion is 
extremely slow at least for going from serialized protobuf to serialized json 
(I haven’t run perf to see why). This means we can’t really use the built-in 
json facilities (see MESOS-9896), and we have to have two code paths, one doing 
direct protobuf serialization and one doing direct json serialization via 
jsonify. I implemented that and got the following:

{noformat}
After:
v0 '/state' response took 7.743768168secs
v1 'master::call::GetState' application/x-protobuf response took 5.640594663secs
v1 'master::call::GetState' application/json response took 11.795411549secs
{noformat}

> Improve v1 operator API read performance.
> -
>
> Key: MESOS-10026
> URL: https://issues.apache.org/jira/browse/MESOS-10026
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> Currently, the v1 operator API has poor performance relative to the v0 json 
> API. The following initial numbers were provided by [~Will Mahler] from our 
> state serving benchmark:
>  
> |OPTIMIZED - Master (baseline)| | | | |
> |Test setup|1000 agents with a total of 10000 running tasks and 10000 
> completed tasks|10000 agents with a total of 100000 running tasks and 100000 
> completed tasks|20000 agents with a total of 200000 running tasks and 200000 
> completed tasks|40000 agents with a total of 400000 running tasks and 400000 
> completed tasks|
> |v0 'state' response|0.17|1.66|8.96|12.42|
> |v1 x-protobuf|0.35|3.21|9.47|19.09|
> |v1 json|0.45|4.72|10.81|31.43|
> There is quite a lot of variance, but v1 protobuf is consistently slower than v0 
> (sometimes significantly so) and v1 json is consistently slower than v1 
> protobuf (sometimes significantly so).
> The reason that the v1 operator API is slower is that it does the following:
> (1) Construct temporary unversioned state response object by copying 
> in-memory un-versioned state into overall response object. (expensive!)
> (2) Evolve it to v1: serialize, de-serialize into v1 overall state object. 
> (expensive!)
> (3) Serialize the overall v1 state object to protobuf or json.
> (4) Destruct the temporaries (expensive! but is done after response starts 
> serving)
> On the other hand, the v0 jsonify approach does the following:
> (1) Serialize the in-memory unversioned state into json, by traversing state 
> and accumulating the overall serialized json.
> This means that v1 has substantial overhead vs v0, and we need to remove it 
> to bring v1 on par with or better than v0. v1 should serialize directly to 
> json (straightforward with jsonify) or protobuf (this can be done via an 
> io::CodedOutputStream).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9497) Parallel reads for master v1 state calls

2019-11-06 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9497:
--

Assignee: Meng Zhu

> Parallel reads for master v1 state calls
> 
>
> Key: MESOS-9497
> URL: https://issues.apache.org/jira/browse/MESOS-9497
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API, master
>Reporter: Greg Mann
>Assignee: Meng Zhu
>Priority: Major
>  Labels: foundations, mesosphere, performance
>
> Similar to MESOS-9158 - we should make the operator API calls which serve 
> master state perform computation of multiple such responses in parallel to 
> reduce the performance impact on the master actor.
> Note that this includes the initial expensive SUBSCRIBE payload for the event 
> streaming API, which is less straightforward to incorporate into the parallel 
> serving logic since it performs writes (to track the subscriber) and produces 
> an infinite response, unlike the other state related calls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10026) Improve v1 operator API read performance.

2019-11-06 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10026:
---

Assignee: Benjamin Mahler

> Improve v1 operator API read performance.
> -
>
> Key: MESOS-10026
> URL: https://issues.apache.org/jira/browse/MESOS-10026
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> Currently, the v1 operator API has poor performance relative to the v0 json 
> API. The following initial numbers were provided by [~Will Mahler] from our 
> state serving benchmark:
>  
> |OPTIMIZED - Master (baseline)| | | | |
> |Test setup|1000 agents with a total of 10000 running tasks and 10000 
> completed tasks|10000 agents with a total of 100000 running tasks and 100000 
> completed tasks|20000 agents with a total of 200000 running tasks and 200000 
> completed tasks|40000 agents with a total of 400000 running tasks and 400000 
> completed tasks|
> |v0 'state' response|0.17|1.66|8.96|12.42|
> |v1 x-protobuf|0.35|3.21|9.47|19.09|
> |v1 json|0.45|4.72|10.81|31.43|
> There is quite a lot of variance, but v1 protobuf is consistently slower than v0 
> (sometimes significantly so) and v1 json is consistently slower than v1 
> protobuf (sometimes significantly so).
> The reason that the v1 operator API is slower is that it does the following:
> (1) Construct temporary unversioned state response object by copying 
> in-memory un-versioned state into overall response object. (expensive!)
> (2) Evolve it to v1: serialize, de-serialize into v1 overall state object. 
> (expensive!)
> (3) Serialize the overall v1 state object to protobuf or json.
> (4) Destruct the temporaries (expensive! but is done after response starts 
> serving)
> On the other hand, the v0 jsonify approach does the following:
> (1) Serialize the in-memory unversioned state into json, by traversing state 
> and accumulating the overall serialized json.
> This means that v1 has substantial overhead vs v0, and we need to remove it 
> to bring v1 on par with or better than v0. v1 should serialize directly to 
> json (straightforward with jsonify) or protobuf (this can be done via an 
> io::CodedOutputStream).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10023) Allocator method dispatches can be reordered (relative to scheduler API calls which triggered them).

2019-11-06 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10023:
---

Assignee: Andrei Sekretenko

> Allocator method dispatches can be reordered (relative to scheduler API calls 
> which triggered them).
> 
>
> Key: MESOS-10023
> URL: https://issues.apache.org/jira/browse/MESOS-10023
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Major
>  Labels: foundations
>
> Observed an example of such reordering on a testing cluster with a V1 
> framework.
> Framework side:
>  - framework issues ACCEPT for a slave with no operations and a 365+ days 
> filter 
>  - framework issues REVIVE call for all roles (which should clear all filters)
>  - framework waits for an offer for that slave and never receives it
> Master side:
>  - master receives ACCEPT, processes the first part and starts authorization
>  - master receives REVIVE and dispatches reviveOffers() to the allocator
>  - master receives a response from authorizer (for ACCEPT) and dispatches 
> recoverResources() with a 365-day filter to the allocator
> *We need to provide an ability for frameworks to avoid this kind of 
> reordering.*
> Things to consider:
>  - v1 framework are not required to use a single connection for API requests; 
> even if they were, there still is a reconnection case, during which the views 
> of the framework and the master on the state of connection might differ. This 
> means that we cannot completely avoid this problem by sequencing processing 
> of requests from the same connection.
> - Currently, all calls directly influencing allocator (except for 
> UPDATE_FRAMEWORK) return `202 ACCEPTED` at an early stage of processing. 
> _Unconditionally_ changing this might break compatibility with some existing 
> frameworks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9896) Consider using protobuf provided json conversion facilities rather than custom ones.

2019-11-05 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967948#comment-16967948
 ] 

Benjamin Mahler commented on MESOS-9896:


At least for the serialized protobuf to serialized json conversion, the 
provided conversion function appears to be extremely slow:

{noformat}
Before:
v0 '/state' response took 6.549942141secs
v1 'master::call::GetState' application/x-protobuf response took 
24.081624381secs
v1 'master::call::GetState' application/json response took 22.760332466secs

After:
v0 '/state' response took 7.57313099secs
v1 'master::call::GetState' application/x-protobuf response took 5.240223816secs
v1 'master::call::GetState' application/json response took 1.76133347258333mins
{noformat}

These are numbers from a change where we directly serialize v1 GetState to 
protobuf, and then use protobuf's {{BinaryToJsonString(...)}} utility to 
convert to json. So it might not be possible to use the built-in facilities in 
lieu of jsonify.
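
For reference, a minimal sketch of the conversion path measured here, assuming 
protobuf 3.x (error handling mostly elided):

{noformat}
#include <memory>
#include <string>

#include <google/protobuf/descriptor.h>
#include <google/protobuf/util/json_util.h>
#include <google/protobuf/util/type_resolver.h>
#include <google/protobuf/util/type_resolver_util.h>

// Converts a serialized protobuf message to a JSON string given the
// message's full name (e.g. "mesos.v1.master.Response").
std::string binaryToJson(
    const std::string& binary,
    const std::string& messageFullName)
{
  using namespace google::protobuf;

  // The resolver maps type URLs to descriptors from the generated pool.
  std::unique_ptr<util::TypeResolver> resolver(
      util::NewTypeResolverForDescriptorPool(
          "type.googleapis.com", DescriptorPool::generated_pool()));

  std::string json;
  util::JsonPrintOptions options;

  // This is the call that turned out to be extremely slow for large
  // messages in the measurements above.
  util::Status status = util::BinaryToJsonString(
      resolver.get(),
      "type.googleapis.com/" + messageFullName,
      binary,
      &json,
      options);

  return status.ok() ? json : "";
}
{noformat}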

> Consider using protobuf provided json conversion facilities rather than 
> custom ones.
> 
>
> Key: MESOS-9896
> URL: https://issues.apache.org/jira/browse/MESOS-9896
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> Currently, stout provides custom JSON to protobuf conversion facilities, some 
> of which use protobuf reflection.
> When upgrading protobuf to 3.7.x in MESOS-9755, we found that the v0 /state 
> response of the master slowed down, and it appears to be due to a performance 
> regression in the protobuf reflection code.
> We should file an issue with protobuf, but we should also look into using the 
> json conversion code that protobuf provides to see if that can help avoid the 
> regression. It may be the case that using the built-in facilities actually 
> provides a significant performance benefit, given they don't use reflection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-7694) Discarding process::loop doesn't stop the loop

2019-11-04 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966905#comment-16966905
 ] 

Benjamin Mahler commented on MESOS-7694:


[~xujyan] thanks, yeah I think it would be consistent with the current discard 
semantics (i.e. implicit discard handling as done with {{Future::then}}) for 
the loop to stop itself when a discard is requested. It also means that much 
like {{Future::then}} stopping the chain, any code needs to handle the case 
where an iteration value is produced but no longer guaranteed to be 
processed by the body. I'm still a bit torn on this type of implicit discard 
handling, since it's quite easy to get bugs with it vs only having explicit 
server handling of discards, but it is currently the approach that 
{{Future::then}} takes: MESOS-8448.

> Discarding process::loop doesn't stop the loop
> --
>
> Key: MESOS-7694
> URL: https://issues.apache.org/jira/browse/MESOS-7694
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Yan Xu
>Priority: Major
>
> When a loop's returned future is discarded, the loop would attempt to stop 
> itself from progressing by propagating the discard to the future that 
> {{iteration}} or {{body}} are blocked on. However for cases such as the run 
> is dispatched and pending, or {{iteration}} or {{body}} are not blocked at 
> all, discarding the loop doesn't stop it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10026) Improve v1 operator API read performance.

2019-11-02 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10026:
---

 Summary: Improve v1 operator API read performance.
 Key: MESOS-10026
 URL: https://issues.apache.org/jira/browse/MESOS-10026
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API
Reporter: Benjamin Mahler


Currently, the v1 operator API has poor performance relative to the v0 json 
API. The following initial numbers were provided by [~Will Mahler] from our 
state serving benchmark:

 
|OPTIMIZED - Master (baseline)| | | | |
|Test setup|1000 agents with a total of 10000 running tasks and 10000 completed 
tasks|10000 agents with a total of 100000 running tasks and 100000 completed 
tasks|20000 agents with a total of 200000 running tasks and 200000 completed 
tasks|40000 agents with a total of 400000 running tasks and 400000 completed 
tasks|
|v0 'state' response|0.17|1.66|8.96|12.42|
|v1 x-protobuf|0.35|3.21|9.47|19.09|
|v1 json|0.45|4.72|10.81|31.43|


There is quite a lot of variance, but v1 protobuf is consistently slower than v0 
(sometimes significantly so) and v1 json is consistently slower than v1 
protobuf (sometimes significantly so).

The reason that the v1 operator API is slower is that it does the following:

(1) Construct temporary unversioned state response object by copying in-memory 
un-versioned state into overall response object. (expensive!)
(2) Evolve it to v1: serialize, de-serialize into v1 overall state object. 
(expensive!)
(3) Serialize the overall v1 state object to protobuf or json.
(4) Destruct the temporaries (expensive! but is done after response starts 
serving)

On the other hand, the v0 jsonify approach does the following:

(1) Serialize the in-memory unversioned state into json, by traversing state 
and accumulating the overall serialized json.

This means that v1 has substantial overhead vs v0, and we need to remove it to 
bring v1 on par with or better than v0. v1 should serialize directly to json 
(straightforward with jsonify) or protobuf (this can be done via an 
io::CodedOutputStream).
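
To make the io::CodedOutputStream option concrete, a small sketch that writes 
one length-delimited field by hand; the actual patches do this for the full 
response message:

{noformat}
#include <cstdint>
#include <string>

#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl_lite.h>

// Appends `payload` (an already-serialized submessage) as field number
// `fieldNumber` of the enclosing message being written to `out`.
void writeSubmessageField(
    int fieldNumber,
    const std::string& payload,
    google::protobuf::io::CodedOutputStream* out)
{
  // Tag = (field number << 3) | wire type; wire type 2 = length-delimited.
  out->WriteTag(static_cast<uint32_t>((fieldNumber << 3) | 2));
  out->WriteVarint32(static_cast<uint32_t>(payload.size()));
  out->WriteString(payload);
}

std::string serialize(const std::string& submessage)
{
  std::string result;
  {
    google::protobuf::io::StringOutputStream raw(&result);
    google::protobuf::io::CodedOutputStream out(&raw);

    // Write the submessage directly under field 1 of the enclosing
    // message, with no intermediate message object constructed.
    writeSubmessageField(1, submessage, &out);
  }  // CodedOutputStream flushes on destruction.

  return result;
}
{noformat}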



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-6568) JSON serialization should not omit empty arrays in HTTP APIs

2019-11-02 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965511#comment-16965511
 ] 

Benjamin Mahler commented on MESOS-6568:


Linking this with MESOS-9896, since as far as I could tell, protobuf always 
omits empty lists (it's burned in 
[here|https://github.com/protocolbuffers/protobuf/blob/v3.10.1/src/google/protobuf/util/internal/default_value_objectwriter.cc#L67]).

> JSON serialization should not omit empty arrays in HTTP APIs
> 
>
> Key: MESOS-6568
> URL: https://issues.apache.org/jira/browse/MESOS-6568
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Neil Conway
>Priority: Major
>  Labels: mesosphere
>
> When using the JSON content type with the HTTP APIs, a {{repeated}} protobuf 
> field is omitted entirely from the JSON serialization of the message. For 
> example, this is a response to the {{GetTasks}} call:
> {noformat}
> {
>   "get_tasks": {
> "tasks": [{...}]
>   },
>   "type": "GET_TASKS"
> }
> {noformat}
> I think it would be better to include empty arrays for the other fields of 
> the message ({{pending_tasks}}, {{completed_tasks}}, etc.). Advantages:
> # Consistency with the old HTTP endpoints, e.g., /state
> # Semantically, an empty array is more accurate. The master's response should 
> be interpreted as saying it doesn't know about any pending/completed tasks; 
> that is more accurately conveyed by explicitly including an empty array, not 
> by omitting the key entirely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10015) HierarchicalAllocatorProcess::updateAllocation() can stall the allocator with a huge number of reservations on an agent.

2019-10-30 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963527#comment-16963527
 ] 

Benjamin Mahler commented on MESOS-10015:
-

{noformat}
commit 3f753e77a9e00b884000b59df8797e5422da8ccd
Author: Andrei Sekretenko 
Date:   Wed Oct 30 19:45:33 2019 -0400

Fixed allocator performance issue in updateAllocation().

This patch addresses poor performance of
`HierarchicalAllocatorProcess::updateAllocation()` for agents with
a huge number of non-addable resources in a many-framework case
(see MESOS-10015).

Sorter methods for totals tracking that modify `Resources` of an agent
in the Sorter are replaced with methods that add/remove resource
quantities of an agent as a whole (which was actually the only use case
of the old methods). Thus, subtracting/adding `Resources` of a whole
agent no longer occurs when updating resources of an agent in a Sorter.

Further, this patch completely removes agent resource tracking logic
from the random sorter (which by itself makes no use of them) by
implementing cluster totals tracking in the allocator.

Results of `*BENCHMARK_WithReservationParam.UpdateAllocation*`
(for the DRF sorter):

Master:
Agent resources size: 200 (50 frameworks)
Made 20 reserve and unreserve operations in 2.08586secs
Agent resources size: 400 (100 frameworks)
Made 20 reserve and unreserve operations in 13.8449005secs
Agent resources size: 800 (200 frameworks)
Made 20 reserve and unreserve operations in 2.19253121188333mins

Master + this patch:
Agent resources size: 200 (50 frameworks)
Made 20 reserve and unreserve operations in 468.482366ms
Agent resources size: 400 (100 frameworks)
Made 20 reserve and unreserve operations in 925.725947ms
Agent resources size: 800 (200 frameworks)
Made 20 reserve and unreserve operations in 2.110337109secs
...
Agent resources size: 6400 (1600 frameworks)
Made 20 reserve and unreserve operations in 1.50141861756667mins

Review: https://reviews.apache.org/r/71646/
{noformat}

> HierarchicalAllocatorProcess::updateAllocation() can stall the allocator with 
> a huge number of reservations on an agent.
> 
>
> Key: MESOS-10015
> URL: https://issues.apache.org/jira/browse/MESOS-10015
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.3, 1.6.2, 1.7.2, 1.8.1, 1.9.0
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Critical
>  Labels: resource-management
> Fix For: 1.10.0
>
> Attachments: out.svg
>
>
> Currently, updateAllocation() called for a single-object Resources for a 
> single framework on a single slave requires `(total number of frameworks) * 
> (number of resource objects per this slave)^2` calls of `Resource::addable()`.
> In a cluster with a large number of frameworks this results in severe 
> degradation of allocator performance  when a bunch of RESERVE/UNRESERVE 
> operations occurs for an agent with hundreds of unique resources. 
> On our testing cluster we observed task scheduling delays of up to 30 
> minutes due to allocator being occupied with processing UNRESERVE operations.
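
To put the stated complexity in perspective: with 200 frameworks and 800 
resource objects on the agent (the largest case benchmarked on master in the 
comment above), a single update already works out to roughly 200 * 800^2 = 
128,000,000 `Resource::addable()` calls, which is why a burst of 
RESERVE/UNRESERVE operations can occupy the allocator for minutes.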



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-4809) Allow parallel execution of tests

2019-10-28 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961600#comment-16961600
 ] 

Benjamin Mahler commented on MESOS-4809:


[~bbannier] at this point should we resolve this and pull the improvements out 
of the epic?

> Allow parallel execution of tests
> -
>
> Key: MESOS-4809
> URL: https://issues.apache.org/jira/browse/MESOS-4809
> Project: Mesos
>  Issue Type: Epic
>Reporter: Benjamin Bannier
>Priority: Minor
>  Labels: mesosphere
>
> We should allow parallel execution of tests. There are two flavors to this:
> (a) tests are run in parallel in the same process, or
> (b) tests are run in parallel with separate processes (e.g., with 
> gtest-parallel).
> While (a) likely has overall better performance, it depends on tests being 
> independent of global state (e.g., current directory, and others). On the 
> other hand, (b) already improves execution time, and has much smaller 
> requirements.
> This epic tracks efforts to fix tests to allow scenario (b) above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10017) Log all reverse DNS lookup failures in 'legacy' TLS (SSL) hostname validation scheme.

2019-10-23 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958062#comment-16958062
 ] 

Benjamin Mahler commented on MESOS-10017:
-

{noformat}
commit 87bfb76640fbcd83ce1dc48e3ba62d00bb1a4613
Author: Benjamin Mahler 
Date:   Mon Oct 21 19:57:54 2019 -0400

Logged failed TLS reverse DNS lookups as warnings for 'legacy' scheme.

These were getting logged at VLOG(2), whereas we want all networking
related errors to be logged as warnings (or errors if appropriate).

Review: https://reviews.apache.org/r/71643
{noformat}

> Log all reverse DNS lookup failures in 'legacy' TLS (SSL) hostname validation 
> scheme.
> -
>
> Key: MESOS-10017
> URL: https://issues.apache.org/jira/browse/MESOS-10017
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
> Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.2, 1.9.1, 1.10.0
>
>
> These were being logged at VLOG(2):
> https://github.com/apache/mesos/blob/1.9.0/3rdparty/libprocess/src/openssl.cpp#L859-L860
> In the same spirit as MESOS-9340, we'd like to log all networking related 
> errors as warnings and include any relevant information (IP address, etc).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10007) random "Failed to get exit status for Command" for short-lived commands

2019-10-22 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10007:
---

  Sprint: Foundations: RI-19 57
Story Points: 1
Assignee: Benjamin Mahler

> random "Failed to get exit status for Command" for short-lived commands
> ---
>
> Key: MESOS-10007
> URL: https://issues.apache.org/jira/browse/MESOS-10007
> Project: Mesos
>  Issue Type: Bug
>  Components: executor, libprocess
>Reporter: Charles
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
> Attachments: 
> 0001-Avoid-double-reaping-race-in-the-command-executor.patch, 
> executor_race_reprod.diff, test_scheduler.py
>
>
> Hi,
> While testing Mesos to see if we could use it at work, I encountered a random 
> bug which I believe happens when a command exits really quickly, when run via 
> the command executor.
> See the attached test case, but basically all it does is constantly start 
> "exit 0" tasks.
> At some point, a task randomly fails with the error "Failed to get exit 
> status for Command":
>  
> {noformat}
> 'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command', 
> 'source': 'SOURCE_EXECUTOR',{noformat}
>   
> I've had a look at the code, and I found something which could potentially 
> explain it - it's the first time I've looked at the code, so apologies if I'm 
> missing something.
>  We can see the error originates from `reaped`:
> [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017]
> {noformat}
> } else if (status_->isNone()) {
>   taskState = TASK_FAILED;
>   message = "Failed to get exit status for Command";
> } else {{noformat}
>  
> Looking at the code, we can see that the `status_` future can be set to 
> `None` in `ReaperProcess::reap`:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69]
>  
>  
> {noformat}
> Future<Option<int>> ReaperProcess::reap(pid_t pid)
> {
>   // Check to see if this pid exists.
>   if (os::exists(pid)) {
>     Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
>     promises.put(pid, promise);
>     return promise->future();
>   } else {
>     return None();
>   }
> }{noformat}
>  
>  
> So we could have this if the process has already been reaped (`kill -0` will 
> fail).
>  
> Now, looking at the code path which spawns the process:
> `launchTaskSubprocess`
> [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724]
>  
> calls `subprocess`:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315]
>  
> If we look at the bottom of the function we can see the following:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462]
>  
>  
> {noformat}
>   // We need to bind a copy of this Subprocess into the onAny callback
>   // below to ensure that we don't close the file descriptors before
>   // the subprocess has terminated (i.e., because the caller doesn't
>   // keep a copy of this Subprocess around themselves).
>   process::reap(process.data->pid)
> .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));  
> return process;{noformat}
>  
>  
> So at this point we've already called `process::reap`.
>  
> And after that, the executor also calls `process::reap`:
> [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801]
>  
>  
> {noformat}
> // Monitor this process.
> process::reap(pid.get())
>   .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));{noformat}
>  
>  
> But if we look at the implementation of `process::reap`:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152]
>  
>  
> {noformat}
> Future<Option<int>> reap(pid_t pid)
> {
>   // The reaper process is instantiated in `process::initialize`.
>   process::initialize();
>
>   return dispatch(
>       internal::reaper,
>       &internal::ReaperProcess::reap,
>       pid);
> }{noformat}
> We can see that `ReaperProcess::reap` is going to get called asynchronously.
>  
> Doesn't this mean that it's possible that the first call to `reap` set up by 
> `subprocess` 
> ([https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462])
> will get executed first, and if the task has already exited by that time, the 
> child will get reaped before the call to `reap` set up by the executor 
> ([https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801])
>  gets a chance to run?
>  
> In that case, when it runs
>  
> {noformat}
> if (os::exists(pid)) {{noformat}
> would return false, `reap` would set the future to None which would result in 
> this error.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (MESOS-10017) Log all reverse DNS lookup failures in 'legacy' TLS (SSL) hostname validation scheme.

2019-10-21 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10017:
---

Assignee: Benjamin Mahler

> Log all reverse DNS lookup failures in 'legacy' TLS (SSL) hostname validation 
> scheme.
> -
>
> Key: MESOS-10017
> URL: https://issues.apache.org/jira/browse/MESOS-10017
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> These were being logged at VLOG(2):
> https://github.com/apache/mesos/blob/1.9.0/3rdparty/libprocess/src/openssl.cpp#L859-L860
> In the same spirit as MESOS-9340, we'd like to log all networking related 
> errors as warnings and include any relevant information (IP address, etc).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10017) Log all reverse DNS lookup failures in 'legacy' TLS (SSL) hostname validation scheme.

2019-10-21 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10017:
---

 Summary: Log all reverse DNS lookup failures in 'legacy' TLS (SSL) 
hostname validation scheme.
 Key: MESOS-10017
 URL: https://issues.apache.org/jira/browse/MESOS-10017
 Project: Mesos
  Issue Type: Task
  Components: libprocess
Reporter: Benjamin Mahler


These were being logged at VLOG(2):

https://github.com/apache/mesos/blob/1.9.0/3rdparty/libprocess/src/openssl.cpp#L859-L860

In the same spirit as MESOS-9340, we'd like to log all networking related 
errors as warnings and include any relevant information (IP address, etc).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-10-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956543#comment-16956543
 ] 

Benjamin Mahler commented on MESOS-9767:


[~ggarg] is this issue still affecting you or should we close it?

> Add self health monitoring in Mesos master
> --
>
> Key: MESOS-9767
> URL: https://issues.apache.org/jira/browse/MESOS-9767
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Gaurav Garg
>Priority: Major
>
> We have seen an issue where the Mesos master got stuck and was not responding 
> to HTTP endpoints like "/metrics/snapshot". This causes calls to the master by 
> frameworks and the metrics collector to hang. Currently we emit a 'master 
> alive' metric using prometheus. If the master hangs, this metric is not 
> published and we detect the hang using alerts on top of it. By the time 
> someone has received the alert and restarted the master process, 15-30 mins 
> may have passed. This results in SLA violations for Mesos cluster users.
> It would be nice to implement self health monitoring to detect whether the 
> Mesos master is hung/stuck. This would help us to quickly crash the master 
> process so that another member of the quorum can acquire the ZK leadership 
> lock.
> We can use the "/master/health" endpoint for health checks. 
> Health checks can be initiated in 
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
>  just after the child master process is 
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the 
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
>  for this one. One downside is that HealthChecker currently takes a TaskId as 
> an input, which is not valid for a master health check. 
> We can add following flags to control the self heath checking:
>  # self_monitoring_enabled: Whether self monitoring is enabled.
>  # self_monitoring_consecutive_failures: After this many number of health 
> failures, master is crashed.
>  # self_monitoring_interval_secs: Interval at which health checks are 
> performed.
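
A rough sketch of the proposed monitoring loop; checkHealth() is a hypothetical 
stand-in for an HTTP GET against /master/health, and the parameters mirror the 
proposed flags above:

{noformat}
#include <chrono>
#include <cstdlib>
#include <thread>

// Hypothetical stub: a real implementation would perform an HTTP GET
// against the master's /master/health endpoint with a timeout.
bool checkHealth()
{
  return true;
}

void selfMonitor(
    bool enabled,                // --self_monitoring_enabled
    int maxConsecutiveFailures,  // --self_monitoring_consecutive_failures
    int intervalSecs)            // --self_monitoring_interval_secs
{
  if (!enabled) {
    return;
  }

  int consecutiveFailures = 0;

  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(intervalSecs));

    if (checkHealth()) {
      consecutiveFailures = 0;
      continue;
    }

    if (++consecutiveFailures >= maxConsecutiveFailures) {
      // Crash quickly so that another member of the quorum can acquire
      // the ZK leadership lock.
      std::abort();
    }
  }
}
{noformat}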



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10007) random "Failed to get exit status for Command" for short-lived commands

2019-10-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956524#comment-16956524
 ] 

Benjamin Mahler commented on MESOS-10007:
-

Hi [~Charle], thanks for the nice ticket! I attached a patch that avoids the 
double reaping; please try it out and let me know if it works for your test 
case.

[~gilbert] [~qianzhang] note that the launcher also has a double reap, but it 
does not depend on the status, so the only issue in that case is if the pid is 
reused before SubprocessLauncher::destroy calls destroy. You may want to audit 
the containerization code for double reap issues.
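
For context, a hedged sketch of the single-reap pattern (illustrative usage, not the attached patch itself): observe the child through the `Subprocess::status()` future that `subprocess()` already registers, instead of calling `process::reap` a second time on the same pid.

{code}
#include <iostream>

#include <process/future.hpp>
#include <process/subprocess.hpp>

#include <stout/option.hpp>
#include <stout/try.hpp>

// Hedged sketch: a single reap, owned by subprocess(), observed via the
// status() future. Illustrative usage only.
int main()
{
  Try<process::Subprocess> s = process::subprocess("exit 0");

  if (s.isError()) {
    std::cerr << "Failed to launch: " << s.error() << std::endl;
    return 1;
  }

  // status() resolves to the exit status from the one and only reap.
  process::Future<Option<int>> status = s->status();
  status.await();

  if (status.isReady() && status->isSome()) {
    std::cout << "Exit status: " << status->get() << std::endl;
  } else {
    std::cout << "Failed to get exit status" << std::endl;
  }

  return 0;
}
{code}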

> random "Failed to get exit status for Command" for short-lived commands
> ---
>
> Key: MESOS-10007
> URL: https://issues.apache.org/jira/browse/MESOS-10007
> Project: Mesos
>  Issue Type: Bug
>  Components: executor, libprocess
>Reporter: Charles
>Priority: Major
>  Labels: foundations
> Attachments: 
> 0001-Avoid-double-reaping-race-in-the-command-executor.patch, 
> executor_race_reprod.diff, test_scheduler.py
>
>
> Hi,
> While testing Mesos to see if we could use it at work, I encountered a random 
> bug which I believe happens when a command exits really quickly, when run via 
> the command executor.
> See the attached test case, but basically all it does is constantly start 
> "exit 0" tasks.
> At some point, a task randomly fails with the error "Failed to get exit 
> status for Command":
>  
> {noformat}
> 'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command', 
> 'source': 'SOURCE_EXECUTOR',{noformat}
>   
> I've had a look at the code, and I found something which could potentially 
> explain it - it's the first time I've looked at the code, so apologies if I'm 
> missing something.
>  We can see the error originates from `reaped`:
> [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017]
> {noformat}
> } else if (status_->isNone()) {
>   taskState = TASK_FAILED;
>   message = "Failed to get exit status for Command";
> } else {{noformat}
>  
> Looking at the code, we can see that the `status_` future can be set to 
> `None` in `ReaperProcess::reap`:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69]
>  
>  
> {noformat}
> Future<Option<int>> ReaperProcess::reap(pid_t pid)
> {
>   // Check to see if this pid exists.
>   if (os::exists(pid)) {
>     Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
>     promises.put(pid, promise);
>     return promise->future();
>   } else {
>     return None();
>   }
> }{noformat}
>  
>  
> So we could have this if the process has already been reaped (`kill -0` will 
> fail).
>  
> Now, looking at the code path which spawns the process:
> `launchTaskSubprocess`
> [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724]
>  
> calls `subprocess`:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315]
>  
> If we look at the bottom of the function we can see the following:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462]
>  
>  
> {noformat}
>   // We need to bind a copy of this Subprocess into the onAny callback
>   // below to ensure that we don't close the file descriptors before
>   // the subprocess has terminated (i.e., because the caller doesn't
>   // keep a copy of this Subprocess around themselves).
>   process::reap(process.data->pid)
>     .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));
> 
>   return process;{noformat}
>  
>  
> So at this point we've already called `process::reap`.
>  
> And after that, the executor also calls `process::reap`:
> [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801]
>  
>  
> {noformat}
> // Monitor this process.
> process::reap(pid.get())
>   .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));{noformat}
>  
>  
> But if we look at the implementation of `process::reap`:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152]
>  
>  
> {noformat}
> Future<Option<int>> reap(pid_t pid)
> {
>   // The reaper process is instantiated in `process::initialize`.
>   process::initialize();
> 
>   return dispatch(
>       internal::reaper,
>       &internal::ReaperProcess::reap,
>       pid);
> }{noformat}
> We can see that `ReaperProcess::reap` is going to get called asynchronously.
>  
> Doesn't this mean that it's possible that the first call to `reap` set up by 
> `subprocess` 
> (https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462)
> will get executed first, and if the task has already exited by that time, the 
> child will get reaped there, so the executor's later `reap` finds the pid 
> gone and returns `None()`?

[jira] [Commented] (MESOS-10014) `tryUntrackFrameworkUnderRole` check failed in `HierarchicalAllocatorProcess::removeFramework`.

2019-10-18 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954979#comment-16954979
 ] 

Benjamin Mahler commented on MESOS-10014:
-

Here it is with the allocation printed out:

{noformat}
F1018 14:25:28.148012 38564 hierarchical.cpp:745] Check failed: 
tryUntrackFrameworkUnderRole(framework, role)  Framework: 
62c69c24-83be-4b12-bbf7-ee84585eb18e- role: default-role allocation: { 
62c69c24-83be-4b12-bbf7-ee84585eb18e-S0: disk(allocated: 
default-role)[RAW(,,profile)]:200 }
{noformat}

> `tryUntrackFrameworkUnderRole` check failed in 
> `HierarchicalAllocatorProcess::removeFramework`.
> ---
>
> Key: MESOS-10014
> URL: https://issues.apache.org/jira/browse/MESOS-10014
> Project: Mesos
>  Issue Type: Bug
>  Components: master, test
>Affects Versions: 1.10
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, resource-management
> Attachments: AgentPendingOperationAfterMasterFailover-badrun.txt
>
>
> `ContentType/OperationReconciliationTest.AgentPendingOperationAfterMasterFailover/0`
>  test failed:
> {code:java}
> F1018 09:05:14.310616 21391 hierarchical.cpp:745] Check failed: 
> tryUntrackFrameworkUnderRole(framework, role)  Framework: 
> e6284079-cb6a-4a47-8f9a-ea9b84ff622a- role: default-role
> *** Check failure stack trace: ***
> @ 0x7f40fff0a1f6  google::LogMessage::Fail()
> @ 0x7f40fff0a14f  google::LogMessage::SendToLog()
> @ 0x7f40fff09a91  google::LogMessage::Flush()
> @ 0x7f40fff0d12f  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f410fd828ac  
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework()
> @  0x186b29f  
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_
> @  0x189c273  
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_11FrameworkIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_
> @  0x18990b7  
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi113invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1DTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_OSD_OSH_N5cpp1416integer_sequenceImJXspT2_SL_
> @  0x1896100  
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_IS9_St12_PlaceholderILi1clIISO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImILm0ELm1_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOSX_
> @  0x1895174  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_11FrameworkIDESD_EEvRKNS4_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS4_11ProcessBaseEE_ISB_St12_PlaceholderILi1EISQ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOSV_
> @  0x1894b2b  
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_11FrameworkIDESE_EEvRKNS5_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS5_11ProcessBaseEE_JSC_St12_PlaceholderILi1EJSR_EEEvOSG_DpOT0_
> @  0x18943bc  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_11FrameworkIDESH_EEvRKNS1_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_S3_E_ISF_St12_PlaceholderILi1EEclEOS3_
> @ 0x7f41016deb22  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> @ 0x7f410169620c  process::ProcessBase::consume()
> @ 0x7f41016c0696  
> _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> @  0x1822baa  process::ProcessBase::serve()
> @ 0x7f4101692af1  process::ProcessManager::resume()
> @ 0x7f410168ed68  
> _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
> @ 0x7f41016b81e2  
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f41016b7244  
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEclEv
> @ 0x7f41016b6088  
> 

[jira] [Commented] (MESOS-10005) Only broadcast framework update to agents associated with framework

2019-10-18 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954913#comment-16954913
 ] 

Benjamin Mahler commented on MESOS-10005:
-

Attached a patch that can be applied on 1.9.x to see if parallel broadcasting 
helps alleviate this.

> Only broadcast framework update to agents associated with framework
> ---
>
> Key: MESOS-10005
> URL: https://issues.apache.org/jira/browse/MESOS-10005
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.9.0
> Environment: Ubuntu Bionic 18.04, Mesos 1.9.0 on the master, Mesos 
> 1.4.1 on the agents. Spark 2.1.1 is the primary framework running.
>Reporter: Terra Field
>Priority: Major
> Attachments: 0001-Send-framework-updates-in-parallel.patch, 
> mesos-master.log.gz, mesos-master.stacks - 2 - 1.9.0.gz, mesos-master.stacks 
> - 3 - 1.9.0.gz, mesos-master.stacks - 4 - framework update - 1.9.0.gz, 
> mesos-master.stacks - 5 - new healthy master.gz
>
>
> We have at any given time ~100 frameworks connected to our Mesos Master with 
> agents spread across anywhere from 6,000 to 11,000 EC2 instances. We've been 
> encountering a crash (which I'll document separately), and when that happens, the 
> new Mesos Master will sometimes (but not always) struggle to catch up, and 
> eventually crash again. Usually the third or fourth crash will end with a 
> stable master (not ideal, but at least we can get to stable).
> Looking over the logs, I'm seeing hundreds of attempts to contact dead agents 
> each second (and presumably many contacts with healthy agents that don't 
> throw an error):
> {noformat}
> W1003 21:39:39.28  8618 process.cpp:1917] Failed to send 
> 'mesos.internal.UpdateFrameworkMessage' to '100.82.103.99:5051', connect: 
> Failed to connect to 100.82.103.99:5051: Connection refused W1003 
> 21:39:39.300143  8618 process.cpp:1917] Failed to send 
> 'mesos.internal.UpdateFrameworkMessage' to '100.85.122.190:5051', connect: 
> Failed to connect to 100.85.122.190:5051: Connection refused W1003 
> 21:39:39.300285  8618 process.cpp:1917] Failed to send 
> 'mesos.internal.UpdateFrameworkMessage' to '100.85.84.187:5051', connect: 
> Failed to connect to 100.85.84.187:5051: Connection refused W1003 
> 21:39:39.302122  8618 process.cpp:1917] Failed to send 
> 'mesos.internal.UpdateFrameworkMessage' to '100.82.163.228:5051', connect: 
> Failed to connect to 100.82.163.228:5051: Connection refused{noformat}
> I gave [~bmahler] a perf trace of the master on Slack at this point, and it 
> looks like the master is spending a significant amount of time doing 
> framework update broadcasting. I'll attach the perf dump to the ticket, as 
> well as the log of what the master did while it was alive.
> It sounds like currently, every framework update (100 total frameworks in our 
> case) results in a broadcast to all 6000-11000 agents (depending on how busy 
> the cluster is). Also, since our health checks rely on the UI currently, we 
> usually end up killing the master because it fails a health check for long 
> periods of time while overwhelmed by doing these broadcasts.
> Could optimizations be made to either throttle these broadcasts or to only 
> target nodes which need those framework updates?
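
A hedged sketch of the targeted-broadcast idea; every type and name below is an illustrative stand-in, not the master's actual internals:

{code}
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Stand-in for a registered agent; `frameworks` holds the frameworks
// with tasks or executors currently on the agent.
struct Agent
{
  std::string pid;
  std::unordered_set<std::string> frameworks;
};

// Send the framework update only to agents that currently run something
// for that framework, instead of to every registered agent.
void broadcastFrameworkUpdate(
    const std::string& frameworkId,
    const std::vector<Agent>& agents)
{
  for (const Agent& agent : agents) {
    if (agent.frameworks.count(frameworkId) == 0) {
      continue;  // Nothing running for this framework here.
    }

    // Stand-in for send(agent.pid, UpdateFrameworkMessage{...}).
    std::cout << "UpdateFrameworkMessage -> " << agent.pid << "\n";
  }
}
{code}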



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10008) Invalid quota config can crash master

2019-10-08 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10008:
---

Assignee: Benjamin Mahler

> Invalid quota config can crash master
> -
>
> Key: MESOS-10008
> URL: https://issues.apache.org/jira/browse/MESOS-10008
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Sekretenko
>Assignee: Benjamin Mahler
>Priority: Major
>
> We are observing the following crash on the 1.9.1 master:
> {code}
> I1008 10:12:15.148486  4687 http.cpp:1115] HTTP POST for 
> /master/api/v1?_ts=1570529541073_QUOTA from 10.0.7.253:35410 with 
> User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64) Ap>
> I1008 10:12:15.148665  4687 http.cpp:263] Processing call UPDATE_QUOTA
> I1008 10:12:15.148756  4687 quota_handler.cpp:1136] Authorizing principal 
> 'bootstrapuser' to update quota config for role 's1'
> I1008 10:12:15.149169  4685 registrar.cpp:487] Applied 1 operations in 
> 56277ns; attempting to update the registry
> I1008 10:12:15.149338  4681 coordinator.cpp:348] Coordinator attempting to 
> write APPEND action at position 13
> I1008 10:12:15.149467  4689 replica.cpp:541] Replica received write request 
> for position 13 from __req_res__(29)@10.0.7.253:5050
> I1008 10:12:15.151820  4683 replica.cpp:695] Replica received learned notice 
> for position 13 from log-network(2)@10.0.7.253:5050
> I1008 10:12:15.153559  4679 registrar.cpp:544] Successfully updated the 
> registry in 4.348928ms
> I1008 10:12:15.153592  4678 coordinator.cpp:348] Coordinator attempting to 
> write TRUNCATE action at position 14
> I1008 10:12:15.153715  4679 hierarchical.cpp:1619] Updated quota for role 
> 's1',  guarantees: {} limits: cpus:2; disk:-9.22337203685478e+15; gpus:3; 
> mem:1
> I1008 10:12:15.153796  4677 replica.cpp:541] Replica received write request 
> for position 14 from __req_res__(30)@10.0.7.253:5050
> I1008 10:12:15.155380  4691 replica.cpp:695] Replica received learned notice 
> for position 14 from log-network(2)@10.0.7.253:5050
> I1008 10:12:15.249722  4677 authenticator.cpp:324] dstip=10.0.7.253 
> type=audit timestamp=2019-10-08 10:12:15.249673984+00:00 reason="Valid 
> authentication token" uid="bootstrapuser" obje>
> I1008 10:12:15.249956  4682 http.cpp:1115] HTTP GET for 
> /master/state-summary?_ts=1570529541169 from 10.0.7.253:35414 with 
> User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebK>
> I1008 10:12:15.250633  4691 http.cpp:1132] HTTP GET for 
> /master/state-summary?_ts=1570529541169 from 10.0.7.253:35414: '200 OK' after 
> 1.72621ms
> I1008 10:12:15.570379  4689 hierarchical.cpp:1908] Before allocation, 
> required quota headroom is {} and available quota headroom is cpus:0.9; 
> disk:75853; mem:5507
> F1008 10:12:15.570580  4689 resource_quantities.cpp:330] Check failed: scalar 
> >= Value::Scalar() (-9.22337203685478e+15 vs. 0)
> *** Check failure stack trace: ***
> @ 0x7fc786f0148d  google::LogMessage::Fail()
> @ 0x7fc786f036e8  google::LogMessage::SendToLog()
> @ 0x7fc786f01023  google::LogMessage::Flush()
> @ 0x7fc786f04029  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fc785954dfa  mesos::ResourceQuantities::add()
> @ 0x7fc785954fb6  mesos::ResourceQuantities::fromScalarResource()
> @ 0x7fc78595e135  mesos::shrinkResources()
> @ 0x7fc785a874a9  
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::__allocate()
> @ 0x7fc785a88089  
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::_allocate()
> @ 0x7fc785a93882  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingN5mesos8internal6master9allocator8internal28Hier>
> @ 0x7fc786e49e21  process::ProcessBase::consume()
> @ 0x7fc786e6141b  process::ProcessManager::resume()
> @ 0x7fc786e670b6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7fc782a28b22  (unknown)
> @ 0x7fc7821be94a  (unknown)
> @ 0x7fc781eef07f  clone
> {code}
> Note that the value of disk quota limit is *logged* as "negative".
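
A hedged sketch of the kind of up-front validation this ticket implies (illustrative function, not the actual fix): reject non-finite or negative scalar values before they can reach the allocator's CHECK in resource_quantities.cpp.

{code}
#include <cmath>
#include <string>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

// Hedged sketch only: validate a single quota scalar, returning an Error
// for values that would later trip the allocator's CHECK.
Option<Error> validateQuotaScalar(const std::string& resource, double value)
{
  if (std::isnan(value) || std::isinf(value)) {
    return Error("Quota for '" + resource + "' is not a finite number");
  }

  if (value < 0.0) {
    return Error(
        "Quota for '" + resource + "' is negative: " + std::to_string(value));
  }

  return None();
}
{code}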



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9889) Master CPU high due to unexpected foreachkey behaviour in Master::__reregisterSlave

2019-10-03 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944011#comment-16944011
 ] 

Benjamin Mahler commented on MESOS-9889:


Targeting for backporting, since this seems to cause a serious performance 
issue if triggered.

> Master CPU high due to unexpected foreachkey behaviour in 
> Master::__reregisterSlave
> ---
>
> Key: MESOS-9889
> URL: https://issues.apache.org/jira/browse/MESOS-9889
> Project: Mesos
>  Issue Type: Bug
>Reporter: haosdent
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: foundations
>
> At 
> https://github.com/apache/mesos/blob/9932550e9632e7fbb9a45b217793c7f508f57001/src/master/master.cpp#L7707-L7708
> {code}
> void Master::__reregisterSlave(
> ...
> foreachkey (FrameworkID frameworkId,
>slaves.unreachableTasks.at(slaveInfo.id())) {
> ...
> foreach (TaskID taskId,
>  slaves.unreachableTasks.at(slaveInfo.id()).get(frameworkId)) 
> {
> {code}
> In our case, when the network flaps and 3~4 agents reregister, the master's 
> CPU becomes fully saturated and it cannot process any requests during that 
> period.
> After the following change:
> {code}
> -foreachkey (FrameworkID frameworkId,
> -   slaves.unreachableTasks.at(slaveInfo.id())) {
> +foreach (FrameworkID frameworkId,
> +   slaves.unreachableTasks.at(slaveInfo.id()).keys()) {
> {code}
> The problem is gone.
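
One plausible reading of the fix (an assumption; the ticket does not spell out the mechanism) is that iterating a snapshot of the keys, as the `.keys()` form does, leaves the loop body free to mutate the map, whereas iterating the live map while mutating it misbehaves. A minimal self-contained illustration of the safe pattern:

{code}
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Illustration only, with std containers standing in for the master's
// structures: snapshot the keys first, then mutate the map freely.
int main()
{
  std::map<std::string, int> unreachableTasks = {{"fw-1", 2}, {"fw-2", 3}};

  // Snapshot the keys (the analogue of `.keys()` in the patch).
  std::vector<std::string> frameworkIds;
  for (const auto& entry : unreachableTasks) {
    frameworkIds.push_back(entry.first);
  }

  for (const std::string& frameworkId : frameworkIds) {
    // Safe: we are not iterating `unreachableTasks` itself.
    unreachableTasks.erase(frameworkId);
  }

  std::cout << "Remaining: " << unreachableTasks.size() << std::endl;  // 0

  return 0;
}
{code}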



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9889) Master CPU high due to unexpected foreachkey behaviour in Master::__reregisterSlave

2019-10-03 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9889:
--

Assignee: Benjamin Mahler

> Master CPU high due to unexpected foreachkey behaviour in 
> Master::__reregisterSlave
> ---
>
> Key: MESOS-9889
> URL: https://issues.apache.org/jira/browse/MESOS-9889
> Project: Mesos
>  Issue Type: Bug
>Reporter: haosdent
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: foundations
>
> At 
> https://github.com/apache/mesos/blob/9932550e9632e7fbb9a45b217793c7f508f57001/src/master/master.cpp#L7707-L7708
> {code}
> void Master::__reregisterSlave(
> ...
> foreachkey (FrameworkID frameworkId,
>slaves.unreachableTasks.at(slaveInfo.id())) {
> ...
> foreach (TaskID taskId,
>  slaves.unreachableTasks.at(slaveInfo.id()).get(frameworkId)) 
> {
> {code}
> In our case, when the network flaps and 3~4 agents reregister, the master's 
> CPU becomes fully saturated and it cannot process any requests during that 
> period.
> After the following change:
> {code}
> -foreachkey (FrameworkID frameworkId,
> -   slaves.unreachableTasks.at(slaveInfo.id())) {
> +foreach (FrameworkID frameworkId,
> +   slaves.unreachableTasks.at(slaveInfo.id()).keys()) {
> {code}
> The problem is gone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9123) Expose quota consumption metrics.

2019-09-25 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937890#comment-16937890
 ] 

Benjamin Mahler commented on MESOS-9123:


https://reviews.apache.org/r/71489/
https://reviews.apache.org/r/71490/
https://reviews.apache.org/r/71491/

> Expose quota consumption metrics.
> -
>
> Key: MESOS-9123
> URL: https://issues.apache.org/jira/browse/MESOS-9123
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Andrei Sekretenko
>Priority: Major
>  Labels: allocator, mesosphere, metrics, resource-management
>
> Currently, the quota-related metrics expose the quota guarantee and allocated 
> quota. We should expose "consumed", which is allocated quota plus unallocated 
> reservations. We already have this info in the allocator as 
> `consumedQuotaScalarQuantities`; we just need to expose it.
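
A hedged sketch of the arithmetic described above (illustrative signature, not the allocator's code):

{code}
#include <iostream>

// Hedged sketch: consumed quota is allocated quota plus reservations
// that are not currently allocated. Parameter names are illustrative.
double consumedQuota(
    double allocated,          // Allocated quota for the role.
    double reservedTotal,      // Total reservations for the role.
    double reservedAllocated)  // Reservations currently allocated.
{
  return allocated + (reservedTotal - reservedAllocated);
}

int main()
{
  // E.g. 4 cpus allocated, 6 reserved of which 2 are allocated:
  // consumption is 4 + (6 - 2) = 8 cpus.
  std::cout << consumedQuota(4.0, 6.0, 2.0) << std::endl;  // 8

  return 0;
}
{code}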



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9939) PersistentVolumeEndpointsTest.DynamicReservation is flaky.

2019-08-28 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918160#comment-16918160
 ] 

Benjamin Mahler commented on MESOS-9939:


Just took a look with [~mzhu]; linking in the cause of the issue.

> PersistentVolumeEndpointsTest.DynamicReservation is flaky.
> --
>
> Key: MESOS-9939
> URL: https://issues.apache.org/jira/browse/MESOS-9939
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: resource-management
>
> {noformat}
> [ RUN  ] PersistentVolumeEndpointsTest.DynamicReservation
> I0813 20:55:33.670486 32445 cluster.cpp:177] Creating default 'local' 
> authorizer
> I0813 20:55:33.674396 32457 master.cpp:440] Master 
> 87e437ee-0796-49fd-bfab-e7866bb7a81d (6c6cd7a3b2c1) started on 
> 172.17.0.2:36761
> I0813 20:55:33.674434 32457 master.cpp:443] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1000secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/9zz3CO/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_operator_event_stream_subscribers="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
> --publish_per_framework_metrics="true" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --roles="role1" 
> --root_submissions="true" --version="false" 
> --webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/9zz3CO/master" --zk_session_timeout="10secs"
> I0813 20:55:33.674772 32457 master.cpp:492] Master only allowing 
> authenticated frameworks to register
> I0813 20:55:33.674784 32457 master.cpp:498] Master only allowing 
> authenticated agents to register
> I0813 20:55:33.674793 32457 master.cpp:504] Master only allowing 
> authenticated HTTP frameworks to register
> I0813 20:55:33.674800 32457 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/9zz3CO/credentials'
> I0813 20:55:33.675024 32457 master.cpp:548] Using default 'crammd5' 
> authenticator
> I0813 20:55:33.675189 32457 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0813 20:55:33.675369 32457 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0813 20:55:33.675529 32457 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0813 20:55:33.675685 32457 master.cpp:629] Authorization enabled
> W0813 20:55:33.675709 32457 master.cpp:692] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0813 20:55:33.676091 32460 whitelist_watcher.cpp:77] No whitelist given
> I0813 20:55:33.676143 32455 hierarchical.cpp:241] Initialized hierarchical 
> allocator process
> I0813 20:55:33.678655 32452 master.cpp:2168] Elected as the leading master!
> I0813 20:55:33.678683 32452 master.cpp:1664] Recovering from registrar
> I0813 20:55:33.678833 32454 registrar.cpp:339] Recovering registrar
> I0813 20:55:33.679450 32454 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 576us
> I0813 20:55:33.679579 32454 registrar.cpp:487] Applied 1 operations in 
> 46310ns; attempting to update the registry
> I0813 20:55:33.680164 32454 registrar.cpp:544] Successfully updated the 
> registry in 525824ns
> I0813 20:55:33.680292 32454 registrar.cpp:416] Successfully recovered 
> registrar
> I0813 20:55:33.680759 32447 master.cpp:1817] Recovered 0 agents from the 
> registry (143B); allowing 10mins for agents to reregister
> I0813 20:55:33.680793 32459 hierarchical.cpp:280] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W0813 20:55:33.687850 32445 process.cpp:2877] Attempted to spawn already 
> running process 

[jira] [Comment Edited] (MESOS-9939) PersistentVolumeEndpointsTest.DynamicReservation is flaky.

2019-08-27 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917090#comment-16917090
 ] 

Benjamin Mahler edited comment on MESOS-9939 at 8/27/19 8:41 PM:
-

This stumped me a bit. I do see a difference between a good run and a bad 
run in when the agent receives the CREATE operation (it seems later than 
expected in the bad run).

I fixed some race prone code in the path, but I don't see how it could cause 
the failure:
https://reviews.apache.org/r/71376/


was (Author: bmahler):
This stumped me a bit, I do see some difference between a good run and a bad 
run between when the agent receives the CREATE operation (it seems later than 
expected in the bad run). The only potential for strange racing I found was:

https://reviews.apache.org/r/71376/

After pushing this patch, I'm tempted to resolve this and re-open if we find 
it's still flaky, unless someone else has any other findings.

> PersistentVolumeEndpointsTest.DynamicReservation is flaky.
> --
>
> Key: MESOS-9939
> URL: https://issues.apache.org/jira/browse/MESOS-9939
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: resource-management
>
> {noformat}
> [ RUN  ] PersistentVolumeEndpointsTest.DynamicReservation
> I0813 20:55:33.670486 32445 cluster.cpp:177] Creating default 'local' 
> authorizer
> I0813 20:55:33.674396 32457 master.cpp:440] Master 
> 87e437ee-0796-49fd-bfab-e7866bb7a81d (6c6cd7a3b2c1) started on 
> 172.17.0.2:36761
> I0813 20:55:33.674434 32457 master.cpp:443] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1000secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/9zz3CO/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_operator_event_stream_subscribers="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
> --publish_per_framework_metrics="true" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --roles="role1" 
> --root_submissions="true" --version="false" 
> --webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/9zz3CO/master" --zk_session_timeout="10secs"
> I0813 20:55:33.674772 32457 master.cpp:492] Master only allowing 
> authenticated frameworks to register
> I0813 20:55:33.674784 32457 master.cpp:498] Master only allowing 
> authenticated agents to register
> I0813 20:55:33.674793 32457 master.cpp:504] Master only allowing 
> authenticated HTTP frameworks to register
> I0813 20:55:33.674800 32457 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/9zz3CO/credentials'
> I0813 20:55:33.675024 32457 master.cpp:548] Using default 'crammd5' 
> authenticator
> I0813 20:55:33.675189 32457 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0813 20:55:33.675369 32457 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0813 20:55:33.675529 32457 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0813 20:55:33.675685 32457 master.cpp:629] Authorization enabled
> W0813 20:55:33.675709 32457 master.cpp:692] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0813 20:55:33.676091 32460 whitelist_watcher.cpp:77] No whitelist given
> I0813 20:55:33.676143 32455 hierarchical.cpp:241] Initialized hierarchical 
> allocator process
> I0813 20:55:33.678655 32452 master.cpp:2168] Elected as the leading master!
> I0813 20:55:33.678683 32452 master.cpp:1664] Recovering from registrar
> I0813 20:55:33.678833 32454 registrar.cpp:339] Recovering registrar
> I0813 20:55:33.679450 32454 

[jira] [Commented] (MESOS-9939) PersistentVolumeEndpointsTest.DynamicReservation is flaky.

2019-08-27 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917090#comment-16917090
 ] 

Benjamin Mahler commented on MESOS-9939:


This stumped me a bit. I do see a difference between a good run and a bad 
run in when the agent receives the CREATE operation (it seems later than 
expected in the bad run). The only potential for strange racing I found was:

https://reviews.apache.org/r/71376/

After pushing this patch, I'm tempted to resolve this and re-open if we find 
it's still flaky, unless someone else has any other findings.

> PersistentVolumeEndpointsTest.DynamicReservation is flaky.
> --
>
> Key: MESOS-9939
> URL: https://issues.apache.org/jira/browse/MESOS-9939
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: resource-management
>
> {noformat}
> [ RUN  ] PersistentVolumeEndpointsTest.DynamicReservation
> I0813 20:55:33.670486 32445 cluster.cpp:177] Creating default 'local' 
> authorizer
> I0813 20:55:33.674396 32457 master.cpp:440] Master 
> 87e437ee-0796-49fd-bfab-e7866bb7a81d (6c6cd7a3b2c1) started on 
> 172.17.0.2:36761
> I0813 20:55:33.674434 32457 master.cpp:443] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1000secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/9zz3CO/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_operator_event_stream_subscribers="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
> --publish_per_framework_metrics="true" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --roles="role1" 
> --root_submissions="true" --version="false" 
> --webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/9zz3CO/master" --zk_session_timeout="10secs"
> I0813 20:55:33.674772 32457 master.cpp:492] Master only allowing 
> authenticated frameworks to register
> I0813 20:55:33.674784 32457 master.cpp:498] Master only allowing 
> authenticated agents to register
> I0813 20:55:33.674793 32457 master.cpp:504] Master only allowing 
> authenticated HTTP frameworks to register
> I0813 20:55:33.674800 32457 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/9zz3CO/credentials'
> I0813 20:55:33.675024 32457 master.cpp:548] Using default 'crammd5' 
> authenticator
> I0813 20:55:33.675189 32457 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0813 20:55:33.675369 32457 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0813 20:55:33.675529 32457 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0813 20:55:33.675685 32457 master.cpp:629] Authorization enabled
> W0813 20:55:33.675709 32457 master.cpp:692] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0813 20:55:33.676091 32460 whitelist_watcher.cpp:77] No whitelist given
> I0813 20:55:33.676143 32455 hierarchical.cpp:241] Initialized hierarchical 
> allocator process
> I0813 20:55:33.678655 32452 master.cpp:2168] Elected as the leading master!
> I0813 20:55:33.678683 32452 master.cpp:1664] Recovering from registrar
> I0813 20:55:33.678833 32454 registrar.cpp:339] Recovering registrar
> I0813 20:55:33.679450 32454 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 576us
> I0813 20:55:33.679579 32454 registrar.cpp:487] Applied 1 operations in 
> 46310ns; attempting to update the registry
> I0813 20:55:33.680164 32454 registrar.cpp:544] Successfully updated the 
> registry in 525824ns
> I0813 20:55:33.680292 32454 registrar.cpp:416] Successfully recovered 
> registrar
> I0813 

[jira] [Created] (MESOS-9953) Expose quota config in the GET_ROLES response.

2019-08-24 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-9953:
--

 Summary: Expose quota config in the GET_ROLES response.
 Key: MESOS-9953
 URL: https://issues.apache.org/jira/browse/MESOS-9953
 Project: Mesos
  Issue Type: Task
  Components: HTTP API
Reporter: Benjamin Mahler


Currently, GET_ROLES does not expose the quota configuration, unlike /roles.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the removal of quota role sorter.

2019-08-22 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913777#comment-16913777
 ] 

Benjamin Mahler commented on MESOS-9806:


{noformat}
commit de90b2b3078e06975ab2061db821cfe7dda8
Author: Benjamin Mahler 
Date:   Thu Aug 22 17:41:28 2019 -0400

Optimized Resources::shrink.

Master:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 30.37 secs
Made 0 allocation in 27.05 secs

Master + this patch:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 24.15 secs
Made 0 allocation in 20.48 secs

Review: https://reviews.apache.org/r/71353
{noformat}

{noformat}
commit 05e5ca4b3446e34447f632463efe9a34b4bace7f
Author: Benjamin Mahler 
Date:   Thu Aug 22 17:42:57 2019 -0400

Added ResourceQuantities::fromScalarResource.

Master + previous patches:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 24.15 secs
Made 0 allocation in 20.48 secs

Master + previous patches + this patch:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 23.37 secs
Made 0 allocation in 19.72 secs

Review: https://reviews.apache.org/r/71354
{noformat}

> Address allocator performance regression due to the removal of quota role 
> sorter.
> -
>
> Key: MESOS-9806
> URL: https://issues.apache.org/jira/browse/MESOS-9806
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: resource-management
>
> In MESOS-9802, we removed the quota role sorter, which was tech debt.
> However, this slows down the allocator. The problem is that in the first 
> stage, even though a cluster might have no active roles with non-default 
> quota, the allocator now has to sort and go through each and every role 
> in the cluster. Benchmark results show that for 1k roles with 2k frameworks, 
> the allocator could experience ~50% performance degradation.
> There are a couple of ways to address this issue. For example, we could make 
> the sorter aware of quota and add a method, say `sortQuotaRoles`, to return 
> all the roles with non-default quota. Alternatively, an even better approach 
> would be to deprecate the sorter concept and just have two standalone 
> functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree 
> structure (which does not yet exist in the allocator) and return the sorted 
> roles.
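
A hedged sketch of the standalone-function idea, with stand-in types (the role tree does not yet exist in the allocator, so everything here is illustrative):

{code}
#include <algorithm>
#include <string>
#include <vector>

// Stand-in for a role-tree node; `dominantShare` stands in for the DRF
// sort key.
struct Role
{
  std::string name;
  bool hasNonDefaultQuota;
  double dominantShare;
};

// Sort only the roles with non-default quota, so the first allocation
// stage can skip all other roles entirely.
std::vector<std::string> sortQuotaRoles(const std::vector<Role>& roles)
{
  std::vector<const Role*> quotaRoles;
  for (const Role& role : roles) {
    if (role.hasNonDefaultQuota) {
      quotaRoles.push_back(&role);
    }
  }

  std::sort(
      quotaRoles.begin(),
      quotaRoles.end(),
      [](const Role* a, const Role* b) {
        return a->dominantShare < b->dominantShare;
      });

  std::vector<std::string> result;
  for (const Role* role : quotaRoles) {
    result.push_back(role->name);
  }

  return result;
}
{code}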



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the removal of quota role sorter.

2019-08-22 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913709#comment-16913709
 ] 

Benjamin Mahler commented on MESOS-9806:


{noformat}
commit b6c87d7c44346b2497ace65b1d2060ee423aa772
Author: Benjamin Mahler 
Date:   Wed Aug 21 20:10:42 2019 -0400

Eliminated double lookups in the allocator.

Review: https://reviews.apache.org/r/71345
{noformat}

{noformat}
commit 790c4e72e1460035b13bf27f2cb8999709e9767e
Author: Benjamin Mahler 
Date:   Wed Aug 21 20:11:31 2019 -0400

Avoid duplicate allocatableTo call in the allocator.

Review: https://reviews.apache.org/r/71346
{noformat}

> Address allocator performance regression due to the removal of quota role 
> sorter.
> -
>
> Key: MESOS-9806
> URL: https://issues.apache.org/jira/browse/MESOS-9806
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: resource-management
>
> In MESOS-9802, we removed the quota role sorter, which was tech debt.
> However, this slows down the allocator. The problem is that in the first 
> stage, even though a cluster might have no active roles with non-default 
> quota, the allocator now has to sort and go through each and every role 
> in the cluster. Benchmark results show that for 1k roles with 2k frameworks, 
> the allocator could experience ~50% performance degradation.
> There are a couple of ways to address this issue. For example, we could make 
> the sorter aware of quota and add a method, say `sortQuotaRoles`, to return 
> all the roles with non-default quota. Alternatively, an even better approach 
> would be to deprecate the sorter concept and just have two standalone 
> functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree 
> structure (which does not yet exist in the allocator) and return the sorted 
> roles.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the removal of quota role sorter.

2019-08-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912823#comment-16912823
 ] 

Benjamin Mahler commented on MESOS-9806:


https://reviews.apache.org/r/71345/
https://reviews.apache.org/r/71346/
https://reviews.apache.org/r/71347/

Master branch:
{noformat}
HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 36.1185929secs
Made 0 allocation in 32.62218602secs
{noformat}

After all patches in review chain:
{noformat}
HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 21.389381617secs
Made 0 allocation in 18.593000222secs
{noformat}

> Address allocator performance regression due to the removal of quota role 
> sorter.
> -
>
> Key: MESOS-9806
> URL: https://issues.apache.org/jira/browse/MESOS-9806
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: resource-management
>
> In MESOS-9802, we removed the quota role sorter, which was tech debt.
> However, this slows down the allocator. The problem is that in the first 
> stage, even though a cluster might have no active roles with non-default 
> quota, the allocator now has to sort and go through each and every role 
> in the cluster. Benchmark results show that for 1k roles with 2k frameworks, 
> the allocator could experience ~50% performance degradation.
> There are a couple of ways to address this issue. For example, we could make 
> the sorter aware of quota and add a method, say `sortQuotaRoles`, to return 
> all the roles with non-default quota. Alternatively, an even better approach 
> would be to deprecate the sorter concept and just have two standalone 
> functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree 
> structure (which does not yet exist in the allocator) and return the sorted 
> roles.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9949) Track allocated/offered in the allocator's role tree.

2019-08-21 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-9949:
--

 Summary: Track allocated/offered in the allocator's role tree.
 Key: MESOS-9949
 URL: https://issues.apache.org/jira/browse/MESOS-9949
 Project: Mesos
  Issue Type: Task
  Components: allocation, master
Reporter: Benjamin Mahler


Currently the allocator's role tree only tracks the reserved resources for each 
role subtree. For metrics purposes, it would be ideal to track offered / 
allocated as well.

This requires augmenting the allocator's structs and recoverResources to hold 
the two categories independently and transition from offered -> allocated as 
applicable when recovering resources. This might require a slight change to the 
recoverResources interface.
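
A hedged sketch of the suggested tracking, with stand-in types rather than the allocator's actual structs:

{code}
#include <map>
#include <string>

// Stand-in for resource quantities (resource name -> amount); the real
// allocator would use ResourceQuantities.
using Quantities = std::map<std::string, double>;

// Track offered and allocated separately per role node.
struct RoleNode
{
  Quantities reserved;   // Already tracked today.
  Quantities offered;    // Outstanding offers.
  Quantities allocated;  // In use by running tasks/executors.
};

// The offered -> allocated transition that recoverResources() would
// perform when offered resources are reported as used.
void transitionToAllocated(RoleNode& role, const Quantities& used)
{
  for (const auto& entry : used) {
    role.offered[entry.first] -= entry.second;
    role.allocated[entry.first] += entry.second;
  }
}
{code}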



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9948) master::Slave::hasExecutor occupies 37% of a 150 second perf sample from a user.

2019-08-21 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-9948:
--

 Summary: master::Slave::hasExecutor occupies 37% of a 150 second 
perf sample from a user.
 Key: MESOS-9948
 URL: https://issues.apache.org/jira/browse/MESOS-9948
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Benjamin Mahler
 Attachments: long-fei-enable-debug-slow-master.gz

If you drop the attached perf stacks into flamescope, you can see that 
mesos::internal::master::Slave::hasExecutor occupies 37% of the overall samples!

This function does 3 hashmap lookups; 1 can be eliminated for a quick win. 
However, the larger improvement here will come from eliminating many of the 
calls to this function.

This was reported by [~carlone].
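
A hedged sketch of that quick win, using stand-in std containers rather than the master's types: replace the contains() + get() pair on the outer map, which costs two hash lookups, with a single find() whose iterator is reused.

{code}
#include <string>
#include <unordered_map>
#include <unordered_set>

// Illustrative stand-in for Slave::hasExecutor: one outer lookup via
// find() instead of contains() followed by get().
bool hasExecutor(
    const std::unordered_map<std::string, std::unordered_set<std::string>>&
        executors,
    const std::string& frameworkId,
    const std::string& executorId)
{
  auto it = executors.find(frameworkId);  // Single outer lookup.
  return it != executors.end() && it->second.count(executorId) > 0;
}
{code}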



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9123) Expose quota consumption metrics.

2019-08-14 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907416#comment-16907416
 ] 

Benjamin Mahler commented on MESOS-9123:


An alternative approach to consider (rather than using the role tree in the 
master), is to enhance the allocator interface to let the allocator know when 
resources transition from offered to allocated, which would enable the 
allocator to expose quota consumption.

> Expose quota consumption metrics.
> -
>
> Key: MESOS-9123
> URL: https://issues.apache.org/jira/browse/MESOS-9123
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: allocator, mesosphere, metrics, resource-management
>
> Currently, the quota-related metrics expose the quota guarantee and allocated 
> quota. We should expose "consumed", which is allocated quota plus unallocated 
> reservations. We already have this info in the allocator as 
> `consumedQuotaScalarQuantities`; we just need to expose it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

