Re: mesos container cluster came across health check coredump log

2017-03-29 Thread Jie Yu
+ AlexR, haosdent

For posterity, the root cause of this problem is that when the agent is running
inside a Docker container and the `--docker_mesos_image` flag is specified, the
pid namespace of the executor container (which initiates the health check)
is different from the root pid namespace. Therefore, getting the network
namespace handle using `/proc/<pid>/ns/net` does not work, because the 'pid'
here is in the root pid namespace (as reported by the Docker daemon).
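
A minimal Python sketch of the lookup that fails may help here. It is not
the Mesos implementation: `netns_path` and `pid_visible` are hypothetical
helper names, and the `proc_root` parameter exists only so the check can be
pointed at a different procfs mount.

```python
import os


def netns_path(pid: int, proc_root: str = "/proc") -> str:
    """Return the path of the network namespace handle for `pid`."""
    return os.path.join(proc_root, str(pid), "ns", "net")


def pid_visible(pid: int, proc_root: str = "/proc") -> bool:
    """Return True if `pid` has an entry in the procfs mounted at `proc_root`.

    A procfs mount only lists processes of the pid namespace it belongs to.
    A pid number taken from the root pid namespace therefore usually has no
    entry when looked up from inside a container with its own pid namespace.
    """
    return os.path.isdir(os.path.join(proc_root, str(pid)))
```

Run inside the executor container, `pid_visible(18596)` would be expected to
return False for the pid in the log below, so the handle at
`netns_path(18596)` cannot be opened either.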

Alex and haosdent, I think we should fix this issue. As suggested above, we
can launch the executor container with --pid=host if `--docker_mesos_image`
is specified.
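
As a rough illustration of that proposal (a sketch only, not the actual
Mesos code; `executor_run_args` and its parameters are hypothetical names),
the executor's `docker run` arguments could be built like this:

```python
from typing import List, Optional


def executor_run_args(executor_image: str,
                      extra_args: List[str],
                      docker_mesos_image: Optional[str] = None) -> List[str]:
    """Build a `docker run` argv for the executor container.

    Assumption: `docker_mesos_image` mirrors the agent flag of the same
    name and is None when the flag is not set.
    """
    args = ["docker", "run"]
    if docker_mesos_image is not None:
        # Share the host pid namespace so that pids reported by the
        # Docker daemon resolve under /proc inside the executor container.
        args.append("--pid=host")
    args.extend(extra_args)
    args.append(executor_image)
    return args
```

With the flag set, the resulting argv contains `--pid=host`; without it,
the argv is unchanged.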

- Jie

On Wed, Mar 29, 2017 at 3:56 AM, tommy xiao wrote:

> It was resolved by adding --pid=host. Thanks to the community for the support.
> Thanks a lot.
>
> 2017-03-29 9:52 GMT+08:00 tommy xiao:
>
>> My environment:
>>
>> Mesos 1.2, running inside a Docker container.
>>
>> I launched a sample nginx Docker container with a Mesos native health check
>>
>> and then got a core dump in the sandbox.
>>
>> I dug up some more information for your reference:
>>
>> In the Mesos slave container, I can only see the task container's pid, but
>> I can't find the pid of the nginx process.
>>
>> On the host console, however, I can find the nginx pid. So how can I get
>> the pid in the container?
>>
>>
>>
>>
>> 2017-03-28 13:49 GMT+08:00 tommy xiao:
>>
>>> https://issues.apache.org/jira/browse/MESOS-6184
>>>
>>> Can anyone give me a hint?
>>>
>>> ```
>>>
>>> I0328 11:48:12.922181 48 exec.cpp:162] Version: 1.2.0
>>> I0328 11:48:12.929252 54 exec.cpp:237] Executor registered on agent
>>> a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4
>>> I0328 11:48:12.931640 54 docker.cpp:850] Running docker -H
>>> unix:///var/run/docker.sock run --cpu-shares 10 --memory 33554432
>>> --env-file /tmp/gvqGyb -v
>>> /data/mesos/slaves/a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4/frameworks/d7ef5d2b-f924-42d9-a274-c020afba6bce-/executors/0-hc-xychu-datamanmesos-2f3b47f9ffc048539c7b22baa6c32d8f/runs/458189b8-2ff4-4337-ad3a-67321e96f5cb:/mnt/mesos/sandbox
>>> --net bridge --label=USER_NAME=xychu --label=GROUP_NAME=groupautotest
>>> --label=APP_ID=hc --label=VCLUSTER=clusterautotest --label=USER=xychu
>>> --label=CLUSTER=datamanmesos --label=SLOT=0 --label=APP=hc -p 31000:80/tcp
>>> --name mesos-a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4.458189b8-2ff4-4337-ad3a-67321e96f5cb
>>> nginx
>>> I0328 11:48:16.145714 53 health_checker.cpp:196] Ignoring failure as
>>> health check still in grace period
>>> W0328 11:48:26.289958 49 health_checker.cpp:202] Health check failed 1
>>> times consecutively: HTTP health check failed: curl returned terminated
>>> with signal Aborted (core dumped): ABORT:
>>> (../../../3rdparty/libprocess/include/process/posix/subprocess.hpp:190):
>>> Failed to execute Subprocess::ChildHook: Failed to enter the net namespace
>>> of pid 18596: Pid 18596 does not exist
>>>
>>> *** Aborted at 1490672906 (unix time) try "date -d @1490672906" if you are
>>> using GNU date ***
>>> PC: @ 0x7f26bfb485f7 __GI_raise
>>> *** SIGABRT (@0x4a) received by PID 74 (TID 0x7f26ba152700) from PID 74;
>>> stack trace: ***
>>> @ 0x7f26c0703100 (unknown)
>>> @ 0x7f26bfb485f7 __GI_raise
>>> @ 0x7f26bfb49ce8 __GI_abort
>>> @ 0x7f26c315778e _Abort()
>>> @ 0x7f26c31577cc _Abort()
>>> @ 0x7f26c237a4b6 process::internal::childMain()
>>> @ 0x7f26c2379e9c std::_Function_handler<>::_M_invoke()
>>> @ 0x7f26c2379e53 process::internal::defaultClone()
>>> @ 0x7f26c237b951 process::internal::cloneChild()
>>> @ 0x7f26c237954f process::subprocess()
>>> @ 0x7f26c15a9fb1 mesos::internal::checks::HealthCheckerProcess::httpHealthCheck()
>>> @ 0x7f26c15ababd mesos::internal::checks::HealthCheckerProcess::performSingleCheck()
>>> @ 0x7f26c2331389 process::ProcessManager::resume()
>>> @ 0x7f26c233a3f7 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>>> @ 0x7f26c04a1220 (unknown)
>>> @ 0x7f26c06fbdc5 start_thread
>>> @ 0x7f26bfc0928d __clone
>>> W0328 11:48:36.340055 55 health_checker.cpp:202] Health check failed 2
>>> times consecutively: HTTP health check failed: curl returned terminated
>>> with signal Aborted (core dumped): ABORT:
>>> (../../../3rdparty/libprocess/include/process/posix/subprocess.hpp:190):
>>> Failed to execute Subprocess::ChildHook: Failed to enter the net namespace
>>> of pid 18596: Pid 18596 does not exist
>>> *** Aborted at 1490672916 (unix time) try "date -d @1490672916" if you are
>>> using GNU date ***
>>> PC: @ 0x7f26bfb485f7 __GI_raise
>>> *** SIGABRT (@0x4b) received by PID 75 (TID 0x7f26b9951700) from PID

Re: Mesos (and Marathon) port mapping

2017-03-29 Thread Jie Yu
Thomas,

I think you are confusing port mapping for NAT purposes with the port
mapping isolator. These are two very different things. The port mapping
isolator (unfortunate naming), as described in the doc, gives you a network
namespace per container without requiring an IP per container. No NAT is
involved. I think in your case you should not use it, and it does not work
with the DockerContainerizer.

- Jie

On Wed, Mar 29, 2017 at 2:22 AM, Thomas HUMMEL wrote:

>
>
> On 03/28/2017 06:53 PM, Tomek Janiszewski wrote:
>
> 1. The port range mentioned is the Mesos agent's default resource setting,
> so it is used if you don't explicitly define a port range.
> https://github.com/apache/mesos/blob/1.2.0/src/slave/constants.hpp#L86
>
> 2. With port mapping, two or more applications can bind to the same
> container port but are exposed on different host ports.
>
>
> Thanks for your answer.
>
> 1. So it's not specific to the network/port_mapping isolator, right? Even
> without it, non-ephemeral ports would be considered part of the offer and
> would be chosen from this range by default?
>
> 2. So containers, even with network/port_mapping isolation, *share* the
> non-ephemeral port range, although the doc states "The agent assigns each
> container a non-overlapping range of the ports", which I first read as "each
> container gets its own port range", right?
>
> So I am a bit confused, since what's described here
>
> http://mesos.apache.org/documentation/latest/port-mapping-isolator/
>
> in the "Configuring network ports" section seems to be valid even without
> the port mapping isolator.
>
> Am I getting this right this time?
>
> Thanks.
>
> --
> Thomas HUMMEL
>
>


Re: Mesos (and Marathon) port mapping

2017-03-29 Thread Thomas HUMMEL



On 03/29/2017 01:21 PM, Dick Davies wrote:

I should say this was tested around mesos 1.0, they may have changed
things - but yes this is vanilla networking, no CNI or anything like that.


As a matter of fact, that's what I experience.


But I'm guessing if you're using BRIDGE networking and specifying a
hostPort: you're causing work for yourself (unless you actually care what
port the slave is using).


Why would it make a difference regarding hostPort whether you're using
BRIDGE or HOST mode?


Still, it doesn't explain why I see a hostPort in the 31000-32000 range in
the Marathon UI when I specify a hostPort of 9090, and I can verify on the
slave that only 9090 is bound.


Thanks

--
TH


Systemd After=network.target vs. After=network-online.target

2017-03-29 Thread Petr Novak
Hello,

I have used the Mesosphere ZooKeeper RPM, and it uses After=network.target in
its zookeeper.service systemd unit file. This doesn't guarantee that the
ZooKeeper service will start after the network is available. To ensure this,
After=network-online.target should be used. I spent some time on this when
ZooKeeper was disconnected after reboots because of delays during reboot on
my VMs. If CentOS starts up fast, which it typically does on a clean
install, the problem doesn't manifest. Actually, I experienced it on
RHEL with a custom company layer on top, which introduced these delays.
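
Until the packaged unit changes, the ordering can be overridden locally with
a systemd drop-in. The Python sketch below only generates such a file; the
helper name `write_dropin` and the file name `10-network-online.conf` are
hypothetical choices, and pairing `Wants=` with `After=` follows the usual
systemd guidance for `network-online.target`.

```python
from pathlib import Path

# Wants= pulls network-online.target into the boot transaction;
# After= orders the service behind it.
DROPIN_TEXT = """\
[Unit]
Wants=network-online.target
After=network-online.target
"""


def write_dropin(systemd_dir: Path,
                 unit: str = "zookeeper.service",
                 name: str = "10-network-online.conf") -> Path:
    """Write the drop-in to <systemd_dir>/<unit>.d/<name> and return its path."""
    dropin_dir = systemd_dir / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / name
    path.write_text(DROPIN_TEXT)
    return path
```

On a real host this would be called as
`write_dropin(Path("/etc/systemd/system"))` with root privileges, followed by
`systemctl daemon-reload` so that systemd picks up the override.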

 

I would recommend changing it to avoid a bad user experience on slow VMs.
I also think it is better for production. Or are there any reasons not to?

I used mesosphere-zookeeper.x86_64, version 3.4.6-0.1.20141204175332.centos7.

Regards,

Petr


Re: Mesos (and Marathon) port mapping

2017-03-29 Thread Dick Davies
I should say this was tested around mesos 1.0, they may have changed
things - but yes this is vanilla networking, no CNI or anything like that.

But I'm guessing if you're using BRIDGE networking and specifying a
hostPort: you're causing work for yourself (unless you actually care what
port the slave is using).

On 29 March 2017 at 10:22, Thomas HUMMEL wrote:
>
>
> On 03/28/2017 06:53 PM, Tomek Janiszewski wrote:
>
> 1. The port range mentioned is the Mesos agent's default resource setting,
> so it is used if you don't explicitly define a port range.
> https://github.com/apache/mesos/blob/1.2.0/src/slave/constants.hpp#L86
>
> 2. With port mapping, two or more applications can bind to the same
> container port but are exposed on different host ports.
>
>
> Thanks for your answer.
>
> 1. So it's not specific to the network/port_mapping isolator, right? Even
> without it, non-ephemeral ports would be considered part of the offer and
> would be chosen from this range by default?
>
> 2. So containers, even with network/port_mapping isolation, *share* the
> non-ephemeral port range, although the doc states "The agent assigns each
> container a non-overlapping range of the ports", which I first read as "each
> container gets its own port range", right?
>
> So I am a bit confused, since what's described here
>
> http://mesos.apache.org/documentation/latest/port-mapping-isolator/
>
> in the "Configuring network ports" section seems to be valid even without
> the port mapping isolator.
>
> Am I getting this right this time?
>
> Thanks.
>
> --
> Thomas HUMMEL
>


Re: mesos container cluster came across health check coredump log

2017-03-29 Thread tommy xiao
It was resolved by adding --pid=host. Thanks to the community for the support.
Thanks a lot.

2017-03-29 9:52 GMT+08:00 tommy xiao:

> My environment:
>
> Mesos 1.2, running inside a Docker container.
>
> I launched a sample nginx Docker container with a Mesos native health check
>
> and then got a core dump in the sandbox.
>
> I dug up some more information for your reference:
>
> In the Mesos slave container, I can only see the task container's pid, but
> I can't find the pid of the nginx process.
>
> On the host console, however, I can find the nginx pid. So how can I get
> the pid in the container?
>
>
>
>
> 2017-03-28 13:49 GMT+08:00 tommy xiao:
>
>> https://issues.apache.org/jira/browse/MESOS-6184
>>
>> Can anyone give me a hint?
>>
>> ```
>>
>> I0328 11:48:12.922181 48 exec.cpp:162] Version: 1.2.0
>> I0328 11:48:12.929252 54 exec.cpp:237] Executor registered on agent
>> a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4
>> I0328 11:48:12.931640 54 docker.cpp:850] Running docker -H
>> unix:///var/run/docker.sock run --cpu-shares 10 --memory 33554432
>> --env-file /tmp/gvqGyb -v
>> /data/mesos/slaves/a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4/frameworks/d7ef5d2b-f924-42d9-a274-c020afba6bce-/executors/0-hc-xychu-datamanmesos-2f3b47f9ffc048539c7b22baa6c32d8f/runs/458189b8-2ff4-4337-ad3a-67321e96f5cb:/mnt/mesos/sandbox
>> --net bridge --label=USER_NAME=xychu --label=GROUP_NAME=groupautotest
>> --label=APP_ID=hc --label=VCLUSTER=clusterautotest --label=USER=xychu
>> --label=CLUSTER=datamanmesos --label=SLOT=0 --label=APP=hc -p 31000:80/tcp
>> --name mesos-a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4.458189b8-2ff4-4337-ad3a-67321e96f5cb
>> nginx
>> I0328 11:48:16.145714 53 health_checker.cpp:196] Ignoring failure as
>> health check still in grace period
>> W0328 11:48:26.289958 49 health_checker.cpp:202] Health check failed 1
>> times consecutively: HTTP health check failed: curl returned terminated
>> with signal Aborted (core dumped): ABORT:
>> (../../../3rdparty/libprocess/include/process/posix/subprocess.hpp:190):
>> Failed to execute Subprocess::ChildHook: Failed to enter the net namespace
>> of pid 18596: Pid 18596 does not exist
>>
>> *** Aborted at 1490672906 (unix time) try "date -d @1490672906" if you are
>> using GNU date ***
>> PC: @ 0x7f26bfb485f7 __GI_raise
>> *** SIGABRT (@0x4a) received by PID 74 (TID 0x7f26ba152700) from PID 74;
>> stack trace: ***
>> @ 0x7f26c0703100 (unknown)
>> @ 0x7f26bfb485f7 __GI_raise
>> @ 0x7f26bfb49ce8 __GI_abort
>> @ 0x7f26c315778e _Abort()
>> @ 0x7f26c31577cc _Abort()
>> @ 0x7f26c237a4b6 process::internal::childMain()
>> @ 0x7f26c2379e9c std::_Function_handler<>::_M_invoke()
>> @ 0x7f26c2379e53 process::internal::defaultClone()
>> @ 0x7f26c237b951 process::internal::cloneChild()
>> @ 0x7f26c237954f process::subprocess()
>> @ 0x7f26c15a9fb1 mesos::internal::checks::HealthCheckerProcess::httpHealthCheck()
>> @ 0x7f26c15ababd mesos::internal::checks::HealthCheckerProcess::performSingleCheck()
>> @ 0x7f26c2331389 process::ProcessManager::resume()
>> @ 0x7f26c233a3f7 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>> @ 0x7f26c04a1220 (unknown)
>> @ 0x7f26c06fbdc5 start_thread
>> @ 0x7f26bfc0928d __clone
>> W0328 11:48:36.340055 55 health_checker.cpp:202] Health check failed 2
>> times consecutively: HTTP health check failed: curl returned terminated
>> with signal Aborted (core dumped): ABORT:
>> (../../../3rdparty/libprocess/include/process/posix/subprocess.hpp:190):
>> Failed to execute Subprocess::ChildHook: Failed to enter the net namespace
>> of pid 18596: Pid 18596 does not exist
>> *** Aborted at 1490672916 (unix time) try "date -d @1490672916" if you are
>> using GNU date ***
>> PC: @ 0x7f26bfb485f7 __GI_raise
>> *** SIGABRT (@0x4b) received by PID 75 (TID 0x7f26b9951700) from PID 75;
>> stack trace: ***
>> @ 0x7f26c0703100 (unknown)
>> @ 0x7f26bfb485f7 __GI_raise
>> @ 0x7f26bfb49ce8 __GI_abort
>> @ 0x7f26c315778e _Abort()
>> @ 0x7f26c31577cc _Abort()
>> @ 0x7f26c237a4b6 process::internal::childMain()
>> @ 0x7f26c2379e9c std::_Function_handler<>::_M_invoke()
>> @ 0x7f26c2379e53 process::internal::defaultClone()
>> @ 0x7f26c237b951 process::internal::cloneChild()
>> @ 0x7f26c237954f process::subprocess()
>> @ 0x7f26c15a9fb1 mesos::internal::checks::HealthCheckerProcess::httpHealthCheck()
>> @ 0x7f26c15ababd mesos::internal::checks::HealthCheckerProcess::performSingleCheck()
>> @ 0x7f26c2331389 process::ProcessManager::resume()

Re: Mesos (and Marathon) port mapping

2017-03-29 Thread Thomas HUMMEL

Also,


Does the network/port_mapping isolator make sense if the containerizer is
Docker?






Re: Mesos (and Marathon) port mapping

2017-03-29 Thread Thomas HUMMEL



On 03/28/2017 06:53 PM, Tomek Janiszewski wrote:
1. The port range mentioned is the Mesos agent's default resource setting,
so it is used if you don't explicitly define a port range.
https://github.com/apache/mesos/blob/1.2.0/src/slave/constants.hpp#L86


2. With port mapping, two or more applications can bind to the same
container port but are exposed on different host ports.
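
To make the first point concrete, here is a small Python sketch that checks
whether a port falls inside an agent's `ports` resource. It assumes the
`[low-high, low-high]` range syntax used for agent resources and the default
of `[31000-32000]`; `parse_port_ranges` and `port_offered` are hypothetical
helper names, not Mesos or Marathon APIs.

```python
from typing import List, Tuple

# Default range when the agent's ports resource is not set explicitly.
DEFAULT_PORTS = "[31000-32000]"


def parse_port_ranges(text: str) -> List[Tuple[int, int]]:
    """Parse a ranges string such as "[31000-32000]" or "[1-5, 10-20]"."""
    inner = text.strip().lstrip("[").rstrip("]")
    ranges = []
    for part in inner.split(","):
        part = part.strip()
        if not part:
            continue
        low, high = part.split("-")
        ranges.append((int(low), int(high)))
    return ranges


def port_offered(port: int, ranges: List[Tuple[int, int]]) -> bool:
    """Return True if `port` lies inside one of the inclusive ranges."""
    return any(low <= port <= high for low, high in ranges)
```

Under these assumptions, `port_offered(31000, parse_port_ranges(DEFAULT_PORTS))`
is True, while a port such as 9090 lies outside the default range.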




Thanks for your answer.

1. So it's not specific to the network/port_mapping isolator, right? Even
without it, non-ephemeral ports would be considered part of the offer
and would be chosen from this range by default?


2. So containers, even with network/port_mapping isolation, *share* the
non-ephemeral port range, although the doc states "The agent assigns each
container a non-overlapping range of the ports", which I first read as
"each container gets its own port range", right?


So I am a bit confused, since what's described here

http://mesos.apache.org/documentation/latest/port-mapping-isolator/

in the "Configuring network ports" section seems to be valid even without
the port mapping isolator.


Am I getting this right this time?

Thanks.

--
Thomas HUMMEL