[jira] [Commented] (MESOS-9152) Close all file descriptors except whitelist_fds in posix/subprocess.

2018-09-05 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605138#comment-16605138
 ] 

Qian Zhang commented on MESOS-9152:
---

RR: https://reviews.apache.org/r/68642/

> Close all file descriptors except whitelist_fds in posix/subprocess.
> 
>
> Key: MESOS-9152
> URL: https://issues.apache.org/jira/browse/MESOS-9152
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gilbert Song
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerizer, mesosphere
>
> Close all file descriptors except whitelist_fds in posix/subprocess 
> (whitelist_fds is currently not honored). This would avoid leaking fds. 
> Please follow the steps from this commit to make the corresponding change:
>  
> https://issues.apache.org/jira/browse/MESOS-8917?focusedCommentId=16522629=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16522629
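
The idea in the description can be sketched as follows. This is a minimal illustration, not the Mesos implementation: the helper name closeAllExcept and its parameters are assumptions made for this example, and a real child process would run something like it between fork and exec.

```cpp
#include <cassert>
#include <set>
#include <unistd.h>

// Hypothetical helper (not the Mesos code): closes every descriptor in
// [0, limit) that is not listed in `whitelist`.
// Returns the number of descriptors that were actually closed.
int closeAllExcept(const std::set<int>& whitelist, int limit)
{
  assert(limit >= 0);

  int closed = 0;
  for (int fd = 0; fd < limit; ++fd) {
    if (whitelist.count(fd) > 0) {
      continue; // Explicitly allowed to stay open.
    }

    // close() fails with EBADF for descriptors that are not open;
    // that is expected here and safe to ignore.
    if (::close(fd) == 0) {
      ++closed;
    }
  }

  return closed;
}
```

A caller would typically pass stdin, stdout and stderr plus whatever descriptors the child still needs in the whitelist. One common choice for the upper bound is sysconf(_SC_OPEN_MAX), though that is an assumption here and not taken from the issue.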



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7076) libprocess tests fail when using libevent 2.1.8

2018-09-05 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605095#comment-16605095
 ] 

Till Toenshoff commented on MESOS-7076:
---

Autotools specific patches:
https://reviews.apache.org/r/68640/
https://reviews.apache.org/r/68641/

> libprocess tests fail when using libevent 2.1.8
> ---
>
> Key: MESOS-7076
> URL: https://issues.apache.org/jira/browse/MESOS-7076
> Project: Mesos
>  Issue Type: Bug
>  Components: build, libprocess, test
> Environment: macOS 10.12.3, libevent 2.1.8 (installed via Homebrew)
>Reporter: Jan Schlicht
>Assignee: Till Toenshoff
>Priority: Critical
>  Labels: ci
> Attachments: libevent-openssl11.patch
>
>
> When running {{libprocess-tests}} on Mesos compiled with {{--enable-libevent 
> --enable-ssl}} on an operating system using libevent 2.1.8, SSL-related tests 
> fail like this:
> {noformat}
> [ RUN  ] SSLTest.SSLSocket
> I0207 15:20:46.017881 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.017904 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.017918 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.017923 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0207 15:20:46.033001 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.033179 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.033196 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.033201 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> ../../../3rdparty/libprocess/src/tests/ssl_tests.cpp:257: Failure
> Failed to wait 15secs for Socket(socket.get()).recv()
> [  FAILED  ] SSLTest.SSLSocket (15196 ms)
> {noformat}
> Tests failing are
> {noformat}
> SSLTest.SSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ECDHESupport
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> SSLTest.SilentSocket
> SSLTest.ShutdownThenSend
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/1, where GetParam() = "true"
> {noformat}





[jira] [Commented] (MESOS-7947) Add GC capability to nested containers

2018-09-05 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605066#comment-16605066
 ] 

Joseph Wu commented on MESOS-7947:
--

Also backported into 1.7.0.

> Add GC capability to nested containers
> --
>
> Key: MESOS-7947
> URL: https://issues.apache.org/jira/browse/MESOS-7947
> Project: Mesos
>  Issue Type: Improvement
>  Components: executor
>Reporter: Chun-Hung Hsiao
>Assignee: Joseph Wu
>Priority: Major
> Fix For: 1.7.0, 1.8.0
>
>
> We should extend the existing API or add a new API for nested containers for 
> an executor to tell the Mesos agent that a nested container is no longer 
> needed and can be scheduled for GC.





[jira] [Commented] (MESOS-8907) curl fetcher fails with HTTP/2

2018-09-05 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604989#comment-16604989
 ] 

Till Toenshoff commented on MESOS-8907:
---

Just checked; that version was released around Oct 13, 2013.

> curl fetcher fails with HTTP/2
> --
>
> Key: MESOS-8907
> URL: https://issues.apache.org/jira/browse/MESOS-8907
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: James Peach
>Priority: Major
>
> {noformat}
> [ RUN  ] 
> ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
> ...
> I0510 20:52:00.209815 25010 registry_puller.cpp:287] Pulling image 
> 'quay.io/coreos/alpine-sh' from 
> 'docker-manifest://quay.iocoreos/alpine-sh?latest#https' to 
> '/tmp/ImageAlpine_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_2_wF7EfM/store/docker/staging/qit1Jn'
> E0510 20:52:00.756072 25003 slave.cpp:6176] Container 
> '5eb869c5-555c-4dc9-a6ce-ddc2e7dbd01a' for executor 
> 'ad9aa898-026e-47d8-bac6-0ff993ec5904' of framework 
> 7dbe7cd6-8ffe-4bcf-986a-17ba677b5a69- failed to start: Failed to decode 
> HTTP responses: Decoding failed
> HTTP/2 200
> server: nginx/1.13.12
> date: Fri, 11 May 2018 03:52:00 GMT
> content-type: application/vnd.docker.distribution.manifest.v1+prettyjws
> content-length: 4486
> docker-content-digest: 
> sha256:61bd5317a92c3213cfe70e2b629098c51c50728ef48ff984ce929983889ed663
> x-frame-options: DENY
> strict-transport-security: max-age=63072000; preload
> ...
> {noformat}
> Note that curl is saying the HTTP version is "HTTP/2". This happens on modern 
> curl that automatically negotiates HTTP/2, but the docker fetcher isn't 
> prepared to parse that.
> {noformat}
> $ curl -i --raw -L -s -S -o -  'http://quay.io/coreos/alpine-sh?latest#https'
> HTTP/1.1 301 Moved Permanently
> Content-Type: text/html
> Date: Fri, 11 May 2018 04:07:44 GMT
> Location: https://quay.io/coreos/alpine-sh?latest
> Server: nginx/1.13.12
> Content-Length: 186
> Connection: keep-alive
> HTTP/2 301
> server: nginx/1.13.12
> date: Fri, 11 May 2018 04:07:45 GMT
> content-type: text/html; charset=utf-8
> content-length: 287
> location: https://quay.io/coreos/alpine-sh/?latest
> x-frame-options: DENY
> strict-transport-security: max-age=63072000; preload
> {noformat}
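
The two status lines in the output above differ in shape: "HTTP/1.1 301 Moved Permanently" carries a minor version and a reason phrase, while "HTTP/2 301" carries neither. A tolerant parser could look roughly like the sketch below. The function name parseStatusCode is hypothetical, and this is not the decoder the fetcher actually uses.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: extracts the status code from an HTTP status line.
// Accepts both "HTTP/1.1 301 Moved Permanently" and "HTTP/2 301".
// Returns -1 if the line is not a status line.
int parseStatusCode(const std::string& line)
{
  std::istringstream in(line);
  std::string version;
  int code = 0;

  // The reason phrase is optional, so only the first two tokens are read.
  if (!(in >> version >> code)) {
    return -1;
  }

  // The first token must start with "HTTP/"; the minor version is optional.
  if (version.rfind("HTTP/", 0) != 0) {
    return -1;
  }

  if (code < 100 || code > 599) {
    return -1;
  }

  return code;
}
```

Header lines such as "server: nginx/1.13.12" are rejected because the second token is not a number.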





[jira] [Commented] (MESOS-8907) curl fetcher fails with HTTP/2

2018-09-05 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604976#comment-16604976
 ] 

Till Toenshoff commented on MESOS-8907:
---

{noformat}
--http1.1

(HTTP) Tells curl to use HTTP version 1.1.

This option overrides -0, --http1.0 and --http2. Added in 7.33.0.
{noformat}



[jira] [Created] (MESOS-9210) Mesos v1 scheduler library does not properly handle SUBSCRIBE retries

2018-09-05 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9210:
-

 Summary: Mesos v1 scheduler library does not properly handle 
SUBSCRIBE retries
 Key: MESOS-9210
 URL: https://issues.apache.org/jira/browse/MESOS-9210
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.6.1, 1.5.1, 1.7.0
Reporter: Vinod Kone
Assignee: Till Toenshoff


After the authentication-related refactor done as part of 
https://reviews.apache.org/r/62594/, the state of the scheduler is checked in 
`send` 
(https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L234) 
but it is changed in `_send` 
(https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L234). 
As a result, we can have 2 SUBSCRIBE calls in flight at the same time on the 
same connection! This is not good, and it is not spec-compliant behavior for an 
HTTP client that is expecting a streaming response.

We need to fix the library to either drop the retried SUBSCRIBE call if one is 
in progress (as it was before the refactor) or close the old connection and 
start a new connection to send the retried SUBSCRIBE call.
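
The first option (dropping a retried SUBSCRIBE while one is in progress) can be sketched as below. This is a simplified model with hypothetical names (State, Connection, sent); it does not mirror scheduler.cpp. The point it illustrates is that the state check and the state change happen in the same step.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model of the library's connection state.
enum class State { DISCONNECTED, CONNECTED, SUBSCRIBING, SUBSCRIBED };

class Connection
{
public:
  explicit Connection(State initial) : state(initial) {}

  // Returns true if the call was put on the wire, false if it was dropped.
  bool send(const std::string& call)
  {
    if (call == "SUBSCRIBE") {
      if (state != State::CONNECTED) {
        return false; // Not connected, or a SUBSCRIBE is already in flight.
      }

      // Transition in the same step as the check, so that a retry arriving
      // before the response sees SUBSCRIBING and is dropped.
      state = State::SUBSCRIBING;
    } else if (state != State::SUBSCRIBED) {
      return false;
    }

    sent.push_back(call);
    return true;
  }

  // Invoked when the SUBSCRIBED event arrives on the streaming response.
  void subscribed()
  {
    assert(state == State::SUBSCRIBING);
    state = State::SUBSCRIBED;
  }

  std::vector<std::string> sent; // Calls that actually went out.

private:
  State state;
};
```

The second option from the description (closing the old connection and reconnecting) is not covered by this sketch.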

 

 

 





[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-09-05 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604950#comment-16604950
 ] 

Qian Zhang commented on MESOS-8568:
---

commit ba370822c94c8e9881eff3f63a02b38e18335ae4
Author: Qian Zhang 
Date: Thu Aug 23 17:44:53 2018 +0800

Made command check always waits before removing the nested container.
 
 Review: [https://reviews.apache.org/r/68495]

 

commit b5c43f40b41b44ccae05d61e4aba8d004678cde1
Author: Qian Zhang 
Date: Wed Aug 29 11:22:41 2018 +0800

Made checker library retry to remove the previous check container.
 
 Previously when checker library fails to remove the previous check
 container, it will discard the promise and launch a new check container
 which will cause two problems:
 1. The discarded promise is used to launch the new check container,
 that means even the new check container is launched successfully,
 we still have no chance to process its check result since the
 promise has already been discarded.
 2. The previous check container will never get a chance to be removed
 which is leak, i.e., its runtime directory and sandbox directory
 will not be removed.
 
 Now in this patch, when checker library fails to remove the previous
 check container, we make it remove the previous check container again.
 
 Review: https://reviews.apache.org/r/68555

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1
>Reporter: Andrei Budnik
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been 
> launched, and the subsequent attempt to remove the container without calling 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.
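
The ordering requested here, combined with the removal retry described in the second commit above, can be sketched as follows. The names are hypothetical: `wait` and `remove` stand in for the WAIT_NESTED_CONTAINER and REMOVE_NESTED_CONTAINER calls, and the bounded retry count is an assumption made for this example, not something taken from the patch.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch: `wait` and `remove` return true on success.
// Returns true only once the previous check container has been removed.
bool waitThenRemove(
    const std::function<bool()>& wait,
    const std::function<bool()>& remove,
    int maxAttempts)
{
  assert(maxAttempts > 0);

  // Never attempt removal before the container is known to have terminated.
  if (!wait()) {
    return false;
  }

  // Retry the removal instead of giving up after the first failure.
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (remove()) {
      return true;
    }
  }

  // The caller should not launch the next check container in this case.
  return false;
}
```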





[jira] [Commented] (MESOS-8907) curl fetcher fails with HTTP/2

2018-09-05 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604932#comment-16604932
 ] 

Till Toenshoff commented on MESOS-8907:
---

That option would be {{--http1.1}} [~alexr]



[jira] [Assigned] (MESOS-9209) Include 'Connection: close' header in agent streaming API responses.

2018-09-05 Thread Benjamin Mahler (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9209:
--

Assignee: (was: Benjamin Mahler)

> Include 'Connection: close' header in agent streaming API responses.
> 
>
> Key: MESOS-9209
> URL: https://issues.apache.org/jira/browse/MESOS-9209
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Priority: Major
>
> We've seen some HTTP intermediaries (e.g. ELB) decide to re-use connections 
> to mesos as an optimization to avoid re-connection overhead. As a result, 
> when the end-client of the streaming API disconnects from the intermediary, 
> the intermediary leaves the connection to mesos open in an attempt to re-use 
> the connection for another request once the response completes. Mesos then 
> thinks that the subscriber never disconnected and the intermediary happily 
> continues to read the streaming events even though there's no end-client.
> To help indicate to intermediaries that the connection SHOULD NOT be re-used, 
> we can set the 'Connection: close' header for streaming API responses. It may 
> not be respected (since the language seems to be SHOULD NOT), but some 
> intermediaries may respect it and close the connection if the end-client 
> disconnects.
> Note that libprocess' http server currently doesn't close the connection 
> based on a handler setting this header, but it doesn't matter here since the 
> streaming API responses are infinite.





[jira] [Created] (MESOS-9209) Include 'Connection: close' header in agent streaming API responses.

2018-09-05 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-9209:
--

 Summary: Include 'Connection: close' header in agent streaming API 
responses.
 Key: MESOS-9209
 URL: https://issues.apache.org/jira/browse/MESOS-9209
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


We've seen some HTTP intermediaries (e.g. ELB) decide to re-use connections to 
mesos as an optimization to avoid re-connection overhead. As a result, when the 
end-client of the streaming API disconnects from the intermediary, the 
intermediary leaves the connection to mesos open in an attempt to re-use the 
connection for another request once the response completes. Mesos then thinks 
that the subscriber never disconnected and the intermediary happily continues 
to read the streaming events even though there's no end-client.

To help indicate to intermediaries that the connection SHOULD NOT be re-used, 
we can set the 'Connection: close' header for streaming API responses. It may 
not be respected (since the language seems to be SHOULD NOT), but some 
intermediaries may respect it and close the connection if the end-client 
disconnects.

Note that libprocess' http server currently doesn't close the connection 
based on a handler setting this header, but it doesn't matter here since the 
streaming API responses are infinite.
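
As a rough illustration of the proposal, the header could be set on the response before it is returned from the handler. The Response type and the markConnectionClose helper below are hypothetical stand-ins and do not reflect the libprocess types.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, minimal response type used only for this sketch.
struct Response
{
  int status;
  std::map<std::string, std::string> headers;
};

// Signals to intermediaries that the connection should not be re-used
// once this response ends.
void markConnectionClose(Response* response)
{
  assert(response != nullptr);
  response->headers["Connection"] = "close";
}
```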





[jira] [Commented] (MESOS-9196) Removing rootfs mounts may fail with EBUSY.

2018-09-05 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604832#comment-16604832
 ] 

Gilbert Song commented on MESOS-9196:
-

commit 2c29298f8cf4c96c68ed115acd5a8f335700d735
Author: Gilbert Song 
Date:   Sat Sep 1 00:13:43 2018 -0700

Added an unit test for rootfs cleanup EBUSY fix.

Review: https://reviews.apache.org/r/68599

> Removing rootfs mounts may fail with EBUSY.
> ---
>
> Key: MESOS-9196
> URL: https://issues.apache.org/jira/browse/MESOS-9196
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Jie Yu
>Priority: Blocker
>  Labels: containerizer
> Fix For: 1.5.2, 1.6.2, 1.7.0
>
>
> We observed in production environment that this
> {code}
> Failed to destroy the provisioned rootfs when destroying container: Collect 
> failed: Failed to destroy overlay-mounted rootfs 
> '/var/lib/mesos/slave/provisioner/containers/6332cf3d-9897-475b-88b3-40e983a2a531/containers/e8f36ad7-c9ae-40da-9d14-431e98174735/backends/overlay/rootfses/d601ef1b-11b9-445a-b607-7c6366cd21ec':
>  Failed to unmount 
> '/var/lib/mesos/slave/provisioner/containers/6332cf3d-9897-475b-88b3-40e983a2a531/containers/e8f36ad7-c9ae-40da-9d14-431e98174735/backends/overlay/rootfses/d601ef1b-11b9-445a-b607-7c6366cd21ec':
>  Device or resource busy
> {code}
> Consider fixing the issue by using detach unmount when unmounting container 
> rootfs. See MESOS-3349 for details.
> The root cause on why "Device or resource busy" is received when doing rootfs 
> unmount is still unknown.
> _UPDATE_: The production environment has a cronjob that scans filesystems to 
> build an index (updatedb for mlocate). This can explain the EBUSY we receive 
> when doing `unmount`.
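
The detach unmount suggested in the description corresponds to the Linux umount2() call with the MNT_DETACH flag. A minimal sketch follows, assuming a Linux host; the helper name detachUnmount is made up for this example and is not the function used in the fix.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <sys/mount.h>

// Hypothetical helper (Linux only): lazily detaches the mount at `target`.
// Returns 0 on success, or the errno value reported by umount2().
int detachUnmount(const std::string& target)
{
  assert(!target.empty());

  // MNT_DETACH makes the mount point unavailable for new accesses right
  // away and performs the actual unmount once it is no longer busy, so a
  // concurrent scanner such as updatedb no longer causes EBUSY.
  if (::umount2(target.c_str(), MNT_DETACH) != 0) {
    return errno;
  }

  return 0;
}
```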





[jira] [Issue Comment Deleted] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-09-05 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-8545:
-
Comment: was deleted

(was: `libprocess::finalize()` solves the problem, because it waits for 
termination of all libprocess actors (including `HttpProxy`) in 
`[ProcessManager::finalize()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/process.cpp#L2395-L2420]`.
 This guarantees that all responses are sent back to the agent before 
IOSwitchboard exits from its `main()` function.)

> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> ---
>
> Key: MESOS-8545
> URL: https://issues.apache.org/jira/browse/MESOS-8545
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0, 1.6.1, 1.7.0
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: Mesosphere, flaky-test
> Attachments: 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt, 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt
>
>
> {code:java}
> I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server 
> Error' for '/slave(974)/api/v1' (Disconnected)
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596:
>  Failure
> Value of: (response).get().status
> Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {code}





[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.

2018-09-05 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604170#comment-16604170
 ] 

Alexander Rukletsov commented on MESOS-8096:


Might be related to this issue, from {{clang-analyzer}}, courtesy [~mcypark]:
{noformat}
src/scheduler/scheduler.cpp:911:5: warning: Call to virtual function during 
destruction will not dispatch to derived class 
[clang-analyzer-optin.cplusplus.VirtualCall]
stop();
^
{noformat}
Likely a hypothetical control flow starting from 
{{src/tests/http_fault_tolerance_tests.cpp:872}}
{noformat}
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5:
 warning: Use of memory after it is freed [clang-analyzer-cplusplus.NewDelete]
return function_mocker_->AddNewExpectation(
^
/tmp/SRC/src/tests/http_fault_tolerance_tests.cpp:872:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*scheduler, connected(_))
  ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1845:32:
 note: expanded from macro 'EXPECT_CALL'
#define EXPECT_CALL(obj, call) GMOCK_EXPECT_CALL_IMPL_(obj, call)
   ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1844:5:
 note: expanded from macro 'GMOCK_EXPECT_CALL_IMPL_'
((obj).gmock_##call).InternalExpectedAt(__FILE__, __LINE__, #obj, #call)
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12:
 note: Calling 'FunctionMockerBase::AddNewExpectation'
return function_mocker_->AddNewExpectation(
   ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9:
 note: Memory is allocated
new TypedExpectation(this, file, line, source_text, m);
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:9:
 note: Assuming 'implicit_sequence' is equal to NULL
if (implicit_sequence != NULL) {
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:5:
 note: Taking false branch
if (implicit_sequence != NULL) {
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13:
 note: Calling '~linked_ptr'
return *expectation;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19:
 note: Calling 'linked_ptr::depart'
  ~linked_ptr() { depart(); }
  ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:5:
 note: Taking true branch
if (link_.depart()) delete value_;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:25:
 note: Memory is released
if (link_.depart()) delete value_;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19:
 note: Returning; memory was released
  ~linked_ptr() { depart(); }
  ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13:
 note: Returning from '~linked_ptr'
return *expectation;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12:
 note: Returning; memory was released
return function_mocker_->AddNewExpectation(
   ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5:
 note: Use of memory after it is freed
return function_mocker_->AddNewExpectation(
^
{noformat}
There is what seems to be equivalent output for the following places:
{noformat}
/tmp/SRC/src/tests/uri_fetcher_tests.cpp:140:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(server, test(_))
  ^
{noformat}
{noformat}
/tmp/SRC/src/tests/default_executor_tests.cpp:2042:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*scheduler, connected(_))
  ^
{noformat}
{noformat}
/tmp/SRC/src/tests/scheduler_tests.cpp:2037:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*scheduler, connected(_))
  ^
{noformat}
{noformat}
/tmp/SRC/src/tests/fetcher_tests.cpp:535:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*http.process, test(_))
  ^
{noformat}
Of all the {{EXPECT_CALL}}s in the codebase, these are the only instances that 
are pointed out. It is still unclear whether there's an issue here, but it seems 
worth checking out, especially since these files are known to be flaky.
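
For context on the first warning (the virtual call during destruction), the language rule it refers to can be shown with a small self-contained sketch. The classes below are hypothetical and unrelated to the actual scheduler code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical classes; they only illustrate the C++ dispatch rule.
struct Base
{
  explicit Base(std::vector<std::string>* log) : log(log)
  {
    assert(log != nullptr);
  }

  // By the time this destructor runs, the Derived part of the object is
  // already gone, so this call resolves to Base::stop().
  virtual ~Base() { stop(); }

  virtual void stop() { log->push_back("Base::stop"); }

  std::vector<std::string>* log;
};

struct Derived : Base
{
  explicit Derived(std::vector<std::string>* log) : Base(log) {}

  void stop() override { log->push_back("Derived::stop"); }
};

// Creates and destroys a Derived, returning which stop() overrides ran.
std::vector<std::string> destroyDerived()
{
  std::vector<std::string> log;
  {
    Derived d(&log);
  }
  return log;
}
```

Only the base-class override is recorded, which is why cleanup that a subclass needs cannot rely on a virtual call made from the base destructor.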

> Enqueueing events in MockHTTPScheduler can lead to segfaults.
> -
>
>