[jira] [Assigned] (MESOS-9881) StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is flaky.

2019-07-05 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9881:
--

Assignee: Chun-Hung Hsiao

> StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is 
> flaky.
> --
>
> Key: MESOS-9881
> URL: https://issues.apache.org/jira/browse/MESOS-9881
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Assignee: Chun-Hung Hsiao
> Priority: Major
> Labels: flaky-test, storage
>
> This failed in CI:
> {noformat}
> 1 tests failed.
> FAILED:  
> CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0
> Error Message:
> ../../../3rdparty/libprocess/include/process/gmock.hpp:667
> Mock function called more times than expected - returning default value.
> Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, 
> @0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 
> 00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 
> 20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 
> 20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 
> 4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 
> 00-00 00-00 10-01 00-00 00-00 00-00>)
>   Returns: false
>  Expected: to be never called
>    Actual: called once - over-saturated and active
> Stack Trace:
> ../../../3rdparty/libprocess/include/process/gmock.hpp:667
> Mock function called more times than expected - returning default value.
> Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, 
> @0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 
> 00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 
> 20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 
> 20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 
> 4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 
> 00-00 00-00 10-01 00-00 00-00 00-00>)
>   Returns: false
>  Expected: to be never called
>    Actual: called once - over-saturated and active
> {noformat}
> Full test output:
> {noformat}
> [ RUN  ] 
> CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0
> I0702 06:51:02.172196  6961 cluster.cpp:176] Creating default 'local' 
> authorizer
> I0702 06:51:02.183229 17274 master.cpp:440] Master 
> c310f701-ca24-4ea8-a4be-df3aa3637194 (005dc56bde82) started on 
> 172.17.0.3:35735
> I0702 06:51:02.184095 17274 master.cpp:443] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="50ms" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/Pq6bYz/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_operator_event_stream_subscribers="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
> --publish_per_framework_metrics="true" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" 
> --webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/Pq6bYz/master" --zk_session_timeout="10secs"
> I0702 06:51:02.185236 17274 master.cpp:492] Master only allowing 
> authenticated frameworks to register
> I0702 06:51:02.185819 17274 master.cpp:498] Master only allowing 
> authenticated agents to register
> I0702 06:51:02.186395 17274 master.cpp:504] Master only allowing 
> authenticated HTTP frameworks to register
> I0702 06:51:02.186951 17274 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/Pq6bYz/credentials'
> I0702 06:51:02.187907 17274 master.cpp:548] Using default 

[jira] [Assigned] (MESOS-9811) Don't use reverse DNS for hostname validation

2019-07-05 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9811:
--

   Resolution: Fixed
 Assignee: Benno Evers
Fix Version/s: 1.9.0

{noformat}
commit 0a081e01a3f4af8141a8085ed2f97ee85ea48fe1
Author: Benno Evers 
Date:   Wed Jun 19 15:49:11 2019 +0200

Introduced RFC6125-compliant hostname validation scheme.

This commit introduces a new libprocess SSL flag
`hostname_validation_scheme`, which can be set to 'legacy'
to select the previous hostname validation behaviour or to
'openssl' to use standardized OpenSSL algorithms to handle
hostname validation as part of the TLS handshake.

As a nice side-effect, the new scheme gets rid of reverse DNS
lookups during TLS connection establishment, which used to be
a common source of hard-to-debug unresponsiveness in Mesos
components.

See `docs/ssl.md` in the follow-up commit for details of and
differences between the schemes.

Review: https://reviews.apache.org/r/70749
{noformat}

> Don't use reverse DNS for hostname validation
> -
>
> Key: MESOS-9811
> URL: https://issues.apache.org/jira/browse/MESOS-9811
> Project: Mesos
> Issue Type: Bug
> Reporter: Benno Evers
> Assignee: Benno Evers
> Priority: Major
> Labels: foundations, libprocess, ssl
> Fix For: 1.9.0
>
>
> Upon connection we first resolve the hostname and forget about it
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1462-L1504
> then later use reverse DNS on the remote address to get back a hostname
> https://github.com/apache/mesos/blob/4708c2a368e12a89669135f4d0dd05d9b0b2/3rdparty/libprocess/src/posix/libevent/libevent_ssl_socket.cpp#L548-L556
> and verify the server certificate against *that*.
> Instead, we should verify the server certificate against the hostname that 
> was used by the client to initiate the connection.
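
The difference between the two approaches described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Certificate`, `ReverseDns`, `verifyViaReverseDns()` and `verifyViaRequestedName()` are hypothetical names invented for this example, not libprocess or OpenSSL APIs, and the comparison is a plain exact match (RFC 6125 matching, e.g. wildcard handling, is deliberately left out).

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a server certificate: only the DNS name
// it was issued for.
struct Certificate {
  std::string dnsName;
};

// Hypothetical reverse-DNS table (IP address -> PTR name). The PTR name
// of an address may differ from the name the client originally dialed.
using ReverseDns = std::map<std::string, std::string>;

// Old behaviour as described in the ticket: derive the expected name
// from a reverse lookup of the peer address.
bool verifyViaReverseDns(
    const Certificate& cert,
    const std::string& peerIp,
    const ReverseDns& ptr)
{
  auto it = ptr.find(peerIp);
  if (it == ptr.end()) {
    return false; // No PTR record, so there is no name to compare against.
  }
  return cert.dnsName == it->second;
}

// Proposed behaviour: compare against the hostname the client used to
// initiate the connection. No DNS lookup is involved.
bool verifyViaRequestedName(
    const Certificate& cert,
    const std::string& requestedHostname)
{
  return cert.dnsName == requestedHostname;
}
```

With a certificate issued for `master.example.com` and a PTR record that maps the peer address to some other name, the first function rejects the peer while the second accepts it, which is the mismatch the ticket describes.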



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9867) Libevent fd cleanup failure may cause hangs in combination with client certificate validation

2019-07-05 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879280#comment-16879280
 ] 

Benno Evers commented on MESOS-9867:


Since this requires very specific conditions in order to happen, I just added a 
warning if these conditions are met.

Over time, this will solve itself as people are using newer and newer libevent 
versions.

{noformat}
commit 1a6760c60dc823b088ffbcf48909cf3e371570f3 (HEAD -> master, origin/master, 
mesosphere-private/ci/bevers/tls-hostname-validation)
Author: Benno Evers 
Date:   Wed Jun 26 16:30:12 2019 +0200

Added warnings about known problems with libevent epoll backend.

Some SSL options are known to cause issues in combination with
older versions of libevent. Detect and warn about this situation.

See MESOS-9867 for details.

Review: https://reviews.apache.org/r/70993
{noformat}

> Libevent fd cleanup failure may cause hangs in combination with client 
> certificate validation
> -
>
> Key: MESOS-9867
> URL: https://issues.apache.org/jira/browse/MESOS-9867
> Project: Mesos
> Issue Type: Bug
> Reporter: Benno Evers
> Priority: Major
> Labels: libevent, libprocess, ssl, tls
>
> A listening LibeventSSLSocket will check cryptographic certificate validity 
> during the OpenSSL handshake and afterwards call the `openssl::verify()` 
> function to perform hostname validation and other checks on the client 
> certificate. If these checks fail, the bufferevent is deleted and the 
> connection closed:
> {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
>   if (verify.isError()) {
>     VLOG(1) << "Failed accept, verification error: "
>             << verify.error();
>     request->promise.fail(verify.error());
>     SSL_free(ssl);
>     bufferevent_free(bev);
>     // TODO(jmlvanre): Clean up for readability. Consider RAII
>     // or constructing the impl earlier.
>     CHECK(request->socket >= 0);
>     Try<Nothing> close = os::close(request->socket);
>     if (close.isError()) {
>       LOG(FATAL)
>         << "Failed to close socket " << stringify(request->socket)
>         << ": " << close.error();
>     }
>     delete request;
>     return;
>   }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already 
> registered that file descriptor with epoll() to watch for read and write 
> events on that socket. Since the socket is closed, attempts to remove the 
> corresponding fd from the epoll() structs will fail: (See also: 
> https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
> {noformat}
> [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 
> (del); write change was 0 (none): Bad file descriptor
> [warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 
> (none); write change was 2 (del): Bad file descriptor
> {noformat}
> However, that in itself is harmless since the kernel will remove the kernel 
> object that was associated with fd 9 from the data structure associated with 
> that epoll instance in the kernel. So while we get an error attempting to 
> remove fd 9, there is actually nothing left to remove. However, in a case of 
> epoll failure, libevent does not adjust the number of readers and writers 
> on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_del()
> [...]
> if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
>     return (-1);
> [...]
> ctx->nread = nread;
> ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file 
> descriptor. That still wouldn't be so bad, but libevent also only adds 
> file descriptors to the `epoll()` struct the *first* time we attempt to create a 
> read or write event on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_add()
> if (ev->ev_events & EV_READ) {
>     if (++nread == 1)
>         res |= EV_READ;
> }
> if (ev->ev_events & EV_WRITE) {
>     if (++nwrite == 1)
>         res |= EV_WRITE;
> }
> [...]
> if (res) {
>     [...]
>     if (evsel->add(base, ev->ev_fd,
>         old, (ev->ev_events & EV_ET) | res, extra) == -1)
>         return (-1);
>     [...]
> }
> {noformat}
> So when libevent later tries to poll the same file descriptor number again 
> via epoll(), the process will hang because reads or writes on that 
> file descriptor are never noticed.
> This can be 
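
The bookkeeping problem described in the quoted report can be reproduced with a small model. This is a simplified sketch, not libevent code: `FakeEpoll`, `FdState`, `ioAdd()` and `ioDel()` are hypothetical names made up for this example, only the read counter is modeled, and the kernel's behaviour on `close()` is reduced to erasing the fd from a set.

```cpp
#include <set>

// Hypothetical stand-in for the kernel-side epoll interest list.
struct FakeEpoll {
  std::set<int> registered;

  // Registering succeeds only if the fd was not registered before.
  bool add(int fd) { return registered.insert(fd).second; }

  // Deleting an fd that is not registered fails, mirroring the
  // "Bad file descriptor" warnings quoted in the report.
  bool del(int fd) { return registered.erase(fd) == 1; }

  // Closing the fd drops it from the interest list without going
  // through del(), as the report describes for the kernel.
  void closeFd(int fd) { registered.erase(fd); }
};

// Simplified per-fd bookkeeping, loosely modeled on the `nread`
// counter shown in the evmap.c excerpts.
struct FdState {
  int nread = 0;
};

// Registers with the backend only when the reader count goes 0 -> 1.
bool ioAdd(FakeEpoll& ep, FdState& ctx, int fd) {
  int nread = ctx.nread;
  if (++nread == 1) {
    if (!ep.add(fd)) {
      return false;
    }
  }
  ctx.nread = nread;
  return true;
}

// Unregisters when the reader count goes 1 -> 0. On backend failure
// it returns early, so ctx.nread keeps its stale value.
bool ioDel(FakeEpoll& ep, FdState& ctx, int fd) {
  int nread = ctx.nread;
  if (--nread == 0) {
    if (!ep.del(fd)) {
      return false;
    }
  }
  ctx.nread = nread;
  return true;
}
```

In this model, calling `ioAdd()`, then `closeFd()`, then `ioDel()` leaves `ctx.nread` at 1. A later `ioAdd()` for a new socket that reuses the same fd number bumps the counter to 2 and never calls `add()`, so the fd is absent from the interest list and no events would be delivered for it.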

[jira] [Comment Edited] (MESOS-9867) Libevent fd cleanup failure may cause hangs in combination with client certificate validation

2019-07-05 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879280#comment-16879280
 ] 

Benno Evers edited comment on MESOS-9867 at 7/5/19 1:56 PM:


Since this requires very specific conditions in order to happen, I just added a 
warning if these conditions are met.

Over time, this will solve itself as people are using newer and newer libevent 
versions.

{noformat}
commit 1a6760c60dc823b088ffbcf48909cf3e371570f3
Author: Benno Evers 
Date:   Wed Jun 26 16:30:12 2019 +0200

Added warnings about known problems with libevent epoll backend.

Some SSL options are known to cause issues in combination with
older versions of libevent. Detect and warn about this situation.

See MESOS-9867 for details.

Review: https://reviews.apache.org/r/70993
{noformat}


was (Author: bennoe):
Since this requires very specific conditions in order to happen, I just added a 
warning if these conditions are met.

Over time, this will solve itself as people are using newer and newer libevent 
versions.

{noformat}
commit 1a6760c60dc823b088ffbcf48909cf3e371570f3 (HEAD -> master, origin/master, 
mesosphere-private/ci/bevers/tls-hostname-validation)
Author: Benno Evers 
Date:   Wed Jun 26 16:30:12 2019 +0200

Added warnings about known problems with libevent epoll backend.

Some SSL options are known to cause issues in combination with
older versions of libevent. Detect and warn about this situation.

See MESOS-9867 for details.

Review: https://reviews.apache.org/r/70993
{noformat}

> Libevent fd cleanup failure may cause hangs in combination with client 
> certificate validation
> -
>
> Key: MESOS-9867
> URL: https://issues.apache.org/jira/browse/MESOS-9867
> Project: Mesos
> Issue Type: Bug
> Reporter: Benno Evers
> Priority: Major
> Labels: libevent, libprocess, ssl, tls
>
> A listening LibeventSSLSocket will check cryptographic certificate validity 
> during the OpenSSL handshake and afterwards call the `openssl::verify()` 
> function to perform hostname validation and other checks on the client 
> certificate. If these checks fail, the bufferevent is deleted and the 
> connection closed:
> {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
>   if (verify.isError()) {
>     VLOG(1) << "Failed accept, verification error: "
>             << verify.error();
>     request->promise.fail(verify.error());
>     SSL_free(ssl);
>     bufferevent_free(bev);
>     // TODO(jmlvanre): Clean up for readability. Consider RAII
>     // or constructing the impl earlier.
>     CHECK(request->socket >= 0);
>     Try<Nothing> close = os::close(request->socket);
>     if (close.isError()) {
>       LOG(FATAL)
>         << "Failed to close socket " << stringify(request->socket)
>         << ": " << close.error();
>     }
>     delete request;
>     return;
>   }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already 
> registered that file descriptor with epoll() to watch for read and write 
> events on that socket. Since the socket is closed, attempts to remove the 
> corresponding fd from the epoll() structs will fail: (See also: 
> https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
> {noformat}
> [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 
> (del); write change was 0 (none): Bad file descriptor
> [warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 
> (none); write change was 2 (del): Bad file descriptor
> {noformat}
> However, that in itself is harmless since the kernel will remove the kernel 
> object that was associated with fd 9 from the data structure associated with 
> that epoll instance in the kernel. So while we get an error attempting to 
> remove fd 9, there is actually nothing left to remove. However, in a case of 
> epoll failure, libevent does not adjust the number of readers and writers 
> on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_del()
> [...]
> if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
>     return (-1);
> [...]
> ctx->nread = nread;
> ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file 
> descriptor. That still wouldn't be so bad, but libevent also only adds 
> file descriptors to the `epoll()` struct the *first* time we attempt to create a 
> read or write event on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_add()
> if (ev->ev_events & EV_READ) {

[jira] [Assigned] (MESOS-9878) Enable libprocess users to pass a custom SSL context when using Socket

2019-07-05 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9878:
--

   Resolution: Fixed
 Assignee: Benno Evers
Fix Version/s: 1.9.0

{noformat}
commit ec129665a346f86c738522536f89de7c519f3e0d
Author: Benno Evers 
Date:   Fri Jun 28 20:12:44 2019 +0200

Added ability to pass custom SSL context to `Socket::connect()`.

Users of libprocess can now pass a custom SSL context when
connecting a generic socket via the `Socket::connect()`
function.

Additionally the API of `Socket::connect()` was also reworked
according to the following boundary conditions requested by
libprocess maintainers:

 * When libprocess is compiled without SSL support, neither the
   declaration of the TLS configuration object nor the `connect()`
   overload that accepts the TLS configuration should be available.
 * Passing just the servername is not an acceptable short-hand for
   using the default TLS configuration together with that servername.
 * When the incorrect overload is selected (i.e. passing TLS config
   to a poll socket or omitting TLS configuration for a TLS socket),
   the program should abort.

The following changes are introduced according to the requirements
above:

 * A new class `openssl::TLSClientConfig` is introduced when libprocess
   is compiled with ssl support.
 * A new overload
   `Socket::connect(const Address&, const TLSClientConfig&)` is
   introduced when libprocess is compiled with ssl support.
 * All call sites are adjusted to check the socket kind before calling
   `connect()`.

Review: https://reviews.apache.org/r/70991
{noformat}
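
The call-site pattern described in the commit message (check the socket kind, then pick the matching `connect()` overload) might look roughly like the sketch below. Everything here is a hypothetical model written for illustration: the `Socket`, `Address` and `TLSClientConfig` types are minimal stand-ins rather than the libprocess classes, the `servername` member and the `connectTo()` helper are assumptions, the overloads return a description string instead of a future, and a mismatched overload throws `std::logic_error` where the commit message says the real program should abort.

```cpp
#include <stdexcept>
#include <string>

enum class SocketKind { POLL, SSL };

// Hypothetical stand-in for an address.
struct Address {
  std::string ip;
  int port;
};

// Hypothetical stand-in for `openssl::TLSClientConfig`. The members of
// the real class are not shown in the commit message.
struct TLSClientConfig {
  std::string servername;
};

class Socket {
public:
  explicit Socket(SocketKind kind) : kind_(kind) {}

  SocketKind kind() const { return kind_; }

  // Overload without TLS configuration: only valid for poll sockets.
  std::string connect(const Address& address) {
    if (kind_ != SocketKind::POLL) {
      throw std::logic_error("TLS socket requires a TLS configuration");
    }
    return "plain connection to " + address.ip + ":" +
           std::to_string(address.port);
  }

  // Overload with TLS configuration: only valid for TLS sockets.
  std::string connect(const Address& address, const TLSClientConfig& config) {
    if (kind_ != SocketKind::SSL) {
      throw std::logic_error("poll socket cannot take a TLS configuration");
    }
    return "TLS connection to " + address.ip + ":" +
           std::to_string(address.port) + " as " + config.servername;
  }

private:
  SocketKind kind_;
};

// Call-site helper: inspect the socket kind first, then call the
// overload that matches it.
std::string connectTo(
    Socket& socket,
    const Address& address,
    const TLSClientConfig& config)
{
  switch (socket.kind()) {
    case SocketKind::POLL:
      return socket.connect(address);
    case SocketKind::SSL:
      return socket.connect(address, config);
  }
  throw std::logic_error("unknown socket kind");
}
```

In a build without SSL support, the commit message states that neither the configuration type nor the second overload should be declared at all; the sketch omits that compile-time switch for brevity.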

> Enable libprocess users to pass a custom SSL context when using Socket
> --
>
> Key: MESOS-9878
> URL: https://issues.apache.org/jira/browse/MESOS-9878
> Project: Mesos
> Issue Type: Improvement
> Reporter: Benno Evers
> Assignee: Benno Evers
> Priority: Minor
> Labels: libprocess
> Fix For: 1.9.0
>
>
> Connections made through the `Socket::connect()` API will always use the 
> libprocess-global SSL configuration made through the `LIBPROCESS_SSL_*` 
> environment variables.
> Libprocess users might want to override these options while still using the 
> generic socket class.
> Therefore we should provide a way to pass custom configuration to the 
> `Socket::connect()` function.





[jira] [Commented] (MESOS-9878) Enable libprocess users to pass a custom SSL context when using Socket

2019-07-05 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879278#comment-16879278
 ] 

Benno Evers commented on MESOS-9878:


https://reviews.apache.org/r/70991/

> Enable libprocess users to pass a custom SSL context when using Socket
> --
>
> Key: MESOS-9878
> URL: https://issues.apache.org/jira/browse/MESOS-9878
> Project: Mesos
> Issue Type: Improvement
> Reporter: Benno Evers
> Priority: Minor
> Labels: libprocess
>
> Connections made through the `Socket::connect()` API will always use the 
> libprocess-global SSL configuration made through the `LIBPROCESS_SSL_*` 
> environment variables.
> Libprocess users might want to override these options while still using the 
> generic socket class.
> Therefore we should provide a way to pass custom configuration to the 
> `Socket::connect()` function.


