Re: Help needed (alas, urgently)

2016-01-15 Thread Paul Bell
In chasing down this problem, I stumbled upon something of moment: the
problem does NOT seem to happen with kernel 3.13.

Some weeks back, in the hope of getting past another problem wherein the
root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04
LTS). The kernel upgrade was done as shown here (there's some extra stuff
to get rid of Ubuntu desktop and liberate some disk space):

  apt-get update
  apt-get -y remove ubuntu-desktop
  apt-get -y purge lightdm
  rm -Rf /var/lib/lightdm-data
  apt-get -y remove --purge libreoffice-core
  apt-get -y remove --purge libreoffice-common

  echo "  Installing new kernel"

  apt-get -y install linux-generic-lts-vivid
  apt-get -y autoremove linux-image-3.13.0-32-generic
  apt-get -y autoremove linux-image-3.13.0-71-generic
  update-grub
  reboot

After the reboot, a "uname -r" shows kernel 3.19.0-42-generic.
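
As a quick sanity check (a sketch; assumes a stock Ubuntu/Debian system),
one can also confirm which kernel images remain installed:

  dpkg -l 'linux-image-*' | grep '^ii'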

Under this kernel I can now reliably reproduce the failure to stop a
MongoDB container. Specifically, any and all attempts to kill the
container, e.g., via

   - a Marathon HTTP DELETE (which leads to docker-mesos-executor issuing
   a "docker stop" command; see the curl sketch below)
   - getting a shell inside the running container and issuing "kill" or
   db.shutdownServer()

cause the mongod container

   - to show in its log that it's shutting down normally
   - to enter a 100% CPU loop
   - to become unkillable (only reboot "fixes" things)
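
A minimal sketch of that first kill path (the Marathon host and app id are
placeholders for this environment):

  curl -X DELETE http://<marathon-host>:8080/v2/apps/<mongo-app-id>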

Note finally that my conclusion about kernel 3.13 "working" is at present a
weak induction. But I do know that when I reverted to that kernel I could,
at least once, stop the containers w/o any problems; whereas at 3.19 I can
reliably reproduce the problem. I will try to make this induction stronger
as the day wears on.
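
For reference, reverting to 3.13 for a test boot might look like this (a
sketch; it assumes the 3.13 image was removed by the steps above, that
GRUB_DEFAULT=saved is set in /etc/default/grub, and that the menu entry
title matches what "grep menuentry /boot/grub/grub.cfg" reports):

  apt-get -y install linux-image-3.13.0-71-generic
  update-grub
  grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 3.13.0-71-generic"
  reboot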

Did I do something "wrong" in my kernel upgrade steps?

Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the
area of task termination & signal handling?

Thanks for your help.

-Paul


On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell  wrote:

> I spoke too soon, I'm afraid.
>
> The next time I did the stop (with zero timeout), I saw the same
> phenomenon: a mongo container repeatedly showing:
>
> killing docker task
> shutting down
>
>
> What else can I try?
>
> Thank you.
>
> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell  wrote:
>
>> Hi Tim,
>>
>> I set docker_stop_timeout to zero as you asked. I am pleased to report
>> (though a bit fearful about being pleased) that this change seems to have
>> shut everyone down pretty much instantly.
>>
>> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
>> the immediate use of "kill -9" as opposed to "kill -2"?
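
For reference, a sketch of the Docker side (the container name here is
hypothetical): "docker stop" sends SIGTERM, waits the timeout, then sends
SIGKILL, so a zero timeout means SIGKILL follows almost immediately.

  docker stop -t 0 mongo-container   # SIGTERM, then SIGKILL right away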
>>
>> I will keep testing the behavior.
>>
>> Thank you.
>>
>> -Paul
>>
>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell  wrote:
>>
>>> Hi Tim,
>>>
>>> Things have gotten slightly odder (if that's possible). When I now start
>>> the application (5 or so containers), only one, "ecxconfigdb", gets
>>> started - and even it took a few tries. That is, I see it failing, moving
>>> to deploying, then starting again. But I have no evidence (no STDOUT and
>>> no docker container logs) that shows why.
>>>
>>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>>> application I am seeing the phenomenon I posted before: "killing docker
>>> task" / "shutting down" repeated many times. The un-stopped container is
>>> now running at 100% CPU.
>>>
>>> I will try modifying docker_stop_timeout. Back shortly.
>>>
>>> Thanks again.
>>>
>>> -Paul
>>>
>>> PS: what do you make of the "broken pipe" error in the docker.log?
>>>
>>> *from /var/log/upstart/docker.log*
>>>
>>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>>> INFO[3054] GET
>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> ERRO[3054] Handler for GET
>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> returned error: No such image:
>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> ERRO[3054] HTTP Error
>>> err=No such image:
>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> statusCode=404
>>> INFO[3054] GET /v1.15/containers/weave/json
>>> INFO[3054] POST
>>> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> INFO[3054] POST
>>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1
>>> INFO[3054] POST
>>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
>>> INFO[3054] GET
>>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET
>>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET

installing a framework after teardown

2016-01-15 Thread Viktor Sadovnikov
Hello,

I have removed a framework from the Mesos cluster by

  curl -X POST -d 'frameworkId=-b036-4cb7-af53-4c837dc9521d-0002' \
    http://${MASTER_IP}:5050/master/teardown

This successfully removed all the framework tasks and the scheduler.
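
To confirm the removal on the master side, a quick check (a sketch; assumes
jq is installed):

  # list the ids of frameworks the master still knows about
  curl -s "http://${MASTER_IP}:5050/master/state.json" | jq '.frameworks[].id'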

However, the Mesos cluster now rejects my attempts to re-install the
framework. Is there a way to recover gracefully from this situation?

I0115 12:54:57.916470 28856 sched.cpp:1024] Got error 'Framework has been
removed'
I0115 12:54:57.916509 28856 sched.cpp:1805] Asked to abort the driver
I0115 12:54:57.916824 28856 sched.cpp:1070] Aborting framework
'8ca5c18f-b036-4cb7-af53-4c837dc9521d-0001'

With regards,
Viktor


Re: Help needed (alas, urgently)

2016-01-15 Thread Tim Chen
Hi Paul,

No problem; I hadn't even spent much time on it yet, and I'm glad you
resolved the problem yourself.

We always welcome people doing this :)

Tim

On Fri, Jan 15, 2016 at 10:48 AM, Paul Bell  wrote:

> Tim,
>
> I've tracked down the cause of this problem: it's the result of some kind
> of incompatibility between kernel 3.19 and "VMware Tools". I know little
> more than that.
>
> I installed VMware Tools via *apt-get install open-vm-tools-lts-trusty*.
> Everything worked fine on 3.13. But when I upgraded to 3.19, the error
> occurred quite reliably. Reverting back to 3.13 made the error go away.
>
> I looked high & low for some statement of kernel requirements for VMware
> Tools, but could find none.
>
> Sorry to have wasted your time.
>
> -Paul
>
> On Fri, Jan 15, 2016 at 9:19 AM, Paul Bell  wrote:
>
>> In chasing down this problem, I stumbled upon something of moment: the
>> problem does NOT seem to happen with kernel 3.13.
>>
>> Some weeks back, in the hope of getting past another problem wherein the
>> root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04
>> LTS). The kernel upgrade was done as shown here (there's some extra stuff
>> to get rid of Ubuntu desktop and liberate some disk space):
>>
>>   apt-get update
>>   apt-get -y remove ubuntu-desktop
>>   apt-get -y purge lightdm
>>   rm -Rf /var/lib/lightdm-data
>>   apt-get -y remove --purge libreoffice-core
>>   apt-get -y remove --purge libreoffice-common
>>
>>   echo "  Installing new kernel"
>>
>>   apt-get -y install linux-generic-lts-vivid
>>   apt-get -y autoremove linux-image-3.13.0-32-generic
>>   apt-get -y autoremove linux-image-3.13.0-71-generic
>>   update-grub
>>   reboot
>>
>> After the reboot, a "uname -r" shows kernel 3.19.0-42-generic.
>>
>> Under this kernel I can now reliably reproduce the failure to stop a
>> MongoDB container. Specifically, any and all attempts to kill the
>> container, e.g., via
>>
>>- a Marathon HTTP DELETE (which leads to docker-mesos-executor issuing
>>a "docker stop" command)
>>- getting a shell inside the running container and issuing "kill" or
>>db.shutdownServer()
>>
>> cause the mongod container
>>
>>- to show in its log that it's shutting down normally
>>- to enter a 100% CPU loop
>>- to become unkillable (only reboot "fixes" things)
>>
>> Note finally that my conclusion about kernel 3.13 "working" is at present
>> a weak induction. But I do know that when I reverted to that kernel I
>> could, at least once, stop the containers w/o any problems; whereas at 3.19
>> I can reliably reproduce the problem. I will try to make this induction
>> stronger as the day wears on.
>>
>> Did I do something "wrong" in my kernel upgrade steps?
>>
>> Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the
>> area of task termination & signal handling?
>>
>> Thanks for your help.
>>
>> -Paul
>>
>>
>> On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell  wrote:
>>
>>> I spoke too soon, I'm afraid.
>>>
>>> The next time I did the stop (with zero timeout), I saw the same
>>> phenomenon: a mongo container repeatedly showing:
>>>
>>> killing docker task
>>> shutting down
>>>
>>>
>>> What else can I try?
>>>
>>> Thank you.
>>>
>>> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell  wrote:
>>>
 Hi Tim,

 I set docker_stop_timeout to zero as you asked. I am pleased to report
 (though a bit fearful about being pleased) that this change seems to have
 shut everyone down pretty much instantly.

 Can you explain what's happening, e.g., does docker_stop_timeout=0
 cause the immediate use of "kill -9" as opposed to "kill -2"?

 I will keep testing the behavior.

 Thank you.

 -Paul

 On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell  wrote:

> Hi Tim,
>
> Things have gotten slightly odder (if that's possible). When I now
> start the application (5 or so containers), only one, "ecxconfigdb", gets
> started - and even it took a few tries. That is, I see it failing, moving
> to deploying, then starting again. But I have no evidence (no STDOUT and
> no docker container logs) that shows why.
>
> In any event, ecxconfigdb does start. Happily, when I try to stop the
> application I am seeing the phenomenon I posted before: "killing docker
> task" / "shutting down" repeated many times. The un-stopped container is
> now running at 100% CPU.
>
> I will try modifying docker_stop_timeout. Back shortly.
>
> Thanks again.
>
> -Paul
>
> PS: what do you make of the "broken pipe" error in the docker.log?
>
> *from /var/log/upstart/docker.log*
>
> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
> INFO[3054] GET
> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
> ERRO[3054] Handler for GET
> 

Re: RE: can mesos run in SUSE Linux 11?

2016-01-15 Thread Shuai Lin
1. For kernels < 3.10, process isolation would have problems. See the
discussion in https://issues.apache.org/jira/browse/MESOS-3974 for details.

2. From http://mesos.apache.org/gettingstarted/ , GCC 4.8.1+ or Clang 3.5+
is required to compile the source.
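
A quick pre-flight check before building (a sketch; assumes a typical Linux
shell):

  uname -r        # kernel version: < 3.10 hits the isolation caveats above
  gcc --version   # want GCC >= 4.8.1 (or Clang >= 3.5) for C++11 support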



On Sat, Jan 16, 2016 at 3:36 PM, Linyuxin  wrote:

> Thanks for the reply.
>
>
>
> I still have two questions:
>
> 1.   Is there any pitfall if the Linux kernel version is less than 3.10,
> the version recommended in the document?
>
> 2.   I compiled the source on SUSE 11 SP3 with g++ 4.7, but I
> encountered an error:
>
> configure: error: *** A compiler with support for C++11 language features
> is required.
>
> Any suggestion?
>
>
>
> *From:* Shuai Lin [mailto:linshuai2...@gmail.com]
> *Sent:* January 16, 2016 15:13
> *To:* user@mesos.apache.org
> *Subject:* Re: can mesos run in SUSE Linux 11?
>
>
>
> There is no official package for SUSE on the downloads page of mesosphere:
> https://open.mesosphere.com/downloads/mesos/#apache-mesos-0.26.0 . So I
> guess you have to either compile from source, or run mesos master/slave in
> docker containers.
>
>
>
> On Sat, Jan 16, 2016 at 1:47 PM, Linyuxin  wrote:
>
> Hi All,
>
>
>
>  I want to know if mesos can run in SUSE Linux 11.
>
> I cannot find any information about this in the documentation.
>
>
>
> Thanks.
>
>
>


Re: can mesos run in SUSE Linux 11?

2016-01-15 Thread Shuai Lin
There is no official package for SUSE on the downloads page of mesosphere:
https://open.mesosphere.com/downloads/mesos/#apache-mesos-0.26.0 . So I
guess you have to either compile from source, or run mesos master/slave in
docker containers.
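
For the docker route, a minimal sketch (the image name, tag, and flags here
are assumptions to adapt, not a tested recipe):

  # run a master in a container, sharing the host's network
  docker run -d --net=host mesosphere/mesos-master:0.26.0 \
    --work_dir=/var/lib/mesos --quorum=1 --zk=zk://<zk-host>:2181/mesos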

On Sat, Jan 16, 2016 at 1:47 PM, Linyuxin  wrote:

> Hi All,
>
>
>
>  I want to know if mesos can run in SUSE Linux 11.
>
> I cannot find any information about this in the documentation.
>
>
>
> Thanks.
>


Re: Share GPU resources via attributes or as custom resources (INTERNAL)

2016-01-15 Thread Klaus Ma
Yes, "attributes" is the way for now.
But after Marathon supporting Mesos' Multiple Roles (MESOS-1763), you can
use role info to define resource groups.
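
For example, an agent can be tagged at start-up (a sketch; the master
address is a placeholder):

  # advertise a custom attribute alongside the agent's resources
  mesos-slave --master=zk://<zk-host>:2181/mesos --attributes="hasGpu:true"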


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Fri, Jan 15, 2016 at 4:22 PM,  wrote:

> Thanks Haosdent,
>
>
>
> If what you say about Marathon is right (i.e., that Marathon’s constraints
> only work with Mesos’ attributes), then I cannot use --resources="gpu(*):4",
> since I have no way in Marathon to specify my job needs a GPU resource (at
> least using the web interface), right?
>
>
>
> I guess I will have to experiment with attributes.
>
>
>
> Cheers,
>
> Humberto
>
>
>
>
>
>
>
> *From:* haosdent [mailto:haosd...@gmail.com]
> *Sent:* 14. januar 2016 19:07
> *To:* user
> *Subject:* Re: Share GPU resources via attributes or as custom resources
> (INTERNAL)
>
>
>
> >Then, if a job is sent to the machine when the 4 GPUs are already busy,
> the job will fail to start, right?
>
> I am not sure about this. But if the job fails, Marathon would retry it,
> as you said.
>
>
>
> >a job is sent to the machine, all 4 GPUs will become busy
>
> If you specify that your task only uses 1 gpu in the resources field, I
> think Mesos could continue to provide offers that have gpu resources. And
> as I remember, Marathon constraints only work with --attributes.
>
>
>
> On Fri, Jan 15, 2016 at 1:02 AM,  wrote:
>
> I have a machine with 4 GPUs and want to use Mesos+Marathon to schedule
> the jobs to be run in the machine. Each job will use maximum 1 GPU and
> sharing 1 GPU between small jobs would be ok.
> I know Mesos does not directly support GPUs, but it seems I might use
> custom resources or attributes to do what I want. But how exactly should
> this be done?
>
> If I use --attributes="hasGpu:true", would a job be sent to the machine
> when another job is already running in the machine (and only using 1 GPU)?
> I would say all jobs requesting a machine with a hasGpu attribute would be
> sent to the machine (as long as it has free CPU and memory resources).
> Then, if a job is sent to the machine when the 4 GPUs are already busy, the
> job will fail to start, right? Could then Marathon be used to re-send the
> job after some time, until it is accepted by the machine?
>
> If I specify --resources="gpu(*):4", it is my understanding that once a
> job is sent to the machine, all 4 GPUs will become busy to the eyes of
> Mesos (even if this is not really true). If that is right, would this
> work-around work: specify 4 different resources: gpu:A, gpu:B, gpu:C and
> gpu:D; and use constraints in Marathon like this  "constraints": [["gpu",
> "LIKE", " [A-D]"]]?
>
> Cheers
>
>
>
>
>
> --
>
> Best Regards,
>
> Haosdent Huang
>


RE: can mesos run in SUSE Linux 11?

2016-01-15 Thread Linyuxin
Thanks for the reply.

I still have two questions:

1.   Is there any pitfall if the Linux kernel version is less than 3.10,
the version recommended in the document?

2.   I compiled the source on SUSE 11 SP3 with g++ 4.7, but I encountered
an error:

configure: error: *** A compiler with support for C++11 language features
is required.

Any suggestion?
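
One way around this (a sketch; it assumes a newer GCC, e.g. 4.8, is
installed alongside the system compiler):

  # point configure at a C++11-capable toolchain
  export CC=gcc-4.8 CXX=g++-4.8
  mkdir -p build && cd build
  ../configure
  make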

From: Shuai Lin [mailto:linshuai2...@gmail.com]
Sent: January 16, 2016 15:13
To: user@mesos.apache.org
Subject: Re: can mesos run in SUSE Linux 11?

There is no official package for SUSE on the downloads page of mesosphere: 
https://open.mesosphere.com/downloads/mesos/#apache-mesos-0.26.0 . So I guess 
you have to either compile from source, or run mesos master/slave in docker 
containers.

On Sat, Jan 16, 2016 at 1:47 PM, Linyuxin wrote:
Hi All,

 I want to know if mesos can run in SUSE Linux 11.
I cannot find any information about this in the documentation.

Thanks.



Re: Share GPU resources via attributes or as custom resources (INTERNAL)

2016-01-15 Thread Guangya Liu
I think that you can use 'curl' to test; please refer to
https://open.mesosphere.com/advanced-course/advanced-usage-of-marathon/
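
For instance, an app with an attribute constraint can be submitted like this
(a sketch; the Marathon host is a placeholder and the "hasGpu" attribute is
the one proposed earlier in this thread):

  curl -X POST http://<marathon-host>:8080/v2/apps \
    -H 'Content-Type: application/json' \
    -d '{"id": "gpu-job", "cmd": "sleep 3600", "cpus": 1, "mem": 512,
         "constraints": [["hasGpu", "LIKE", "true"]]}'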

On Fri, Jan 15, 2016 at 4:22 PM,  wrote:

> Thanks Haosdent,
>
>
>
> If what you say about Marathon is right (i.e., that Marathon’s constraints
> only work with Mesos’ attributes), then I cannot use --resources="gpu(*):4",
> since I have no way in Marathon to specify my job needs a GPU resource (at
> least using the web interface), right?
>
>
>
> I guess I will have to experiment with attributes.
>
>
>
> Cheers,
>
> Humberto
>
>
>
>
>
>
>
> *From:* haosdent [mailto:haosd...@gmail.com]
> *Sent:* 14. januar 2016 19:07
> *To:* user
> *Subject:* Re: Share GPU resources via attributes or as custom resources
> (INTERNAL)
>
>
>
> >Then, if a job is sent to the machine when the 4 GPUs are already busy,
> the job will fail to start, right?
>
> I am not sure about this. But if the job fails, Marathon would retry it,
> as you said.
>
>
>
> >a job is sent to the machine, all 4 GPUs will become busy
>
> If you specify that your task only uses 1 gpu in the resources field, I
> think Mesos could continue to provide offers that have gpu resources. And
> as I remember, Marathon constraints only work with --attributes.
>
>
>
> On Fri, Jan 15, 2016 at 1:02 AM,  wrote:
>
> I have a machine with 4 GPUs and want to use Mesos+Marathon to schedule
> the jobs to be run in the machine. Each job will use maximum 1 GPU and
> sharing 1 GPU between small jobs would be ok.
> I know Mesos does not directly support GPUs, but it seems I might use
> custom resources or attributes to do what I want. But how exactly should
> this be done?
>
> If I use --attributes="hasGpu:true", would a job be sent to the machine
> when another job is already running in the machine (and only using 1 GPU)?
> I would say all jobs requesting a machine with a hasGpu attribute would be
> sent to the machine (as long as it has free CPU and memory resources).
> Then, if a job is sent to the machine when the 4 GPUs are already busy, the
> job will fail to start, right? Could then Marathon be used to re-send the
> job after some time, until it is accepted by the machine?
>
> If I specify --resources="gpu(*):4", it is my understanding that once a
> job is sent to the machine, all 4 GPUs will become busy to the eyes of
> Mesos (even if this is not really true). If that is right, would this
> work-around work: specify 4 different resources: gpu:A, gpu:B, gpu:C and
> gpu:D; and use constraints in Marathon like this  "constraints": [["gpu",
> "LIKE", " [A-D]"]]?
>
> Cheers
>
>
>
>
>
> --
>
> Best Regards,
>
> Haosdent Huang
>


RE: Share GPU resources via attributes or as custom resources (INTERNAL)

2016-01-15 Thread Humberto.Castejon
Thanks Haosdent,

If what you say about Marathon is right (i.e., that Marathon’s constraints only 
work with Mesos’ attributes), then I cannot use --resources="gpu(*):4", since I 
have no way in Marathon to specify my job needs a GPU resource (at least using 
the web interface), right?

I guess I will have to experiment with attributes.

Cheers,
Humberto



From: haosdent [mailto:haosd...@gmail.com]
Sent: 14. januar 2016 19:07
To: user
Subject: Re: Share GPU resources via attributes or as custom resources 
(INTERNAL)

>Then, if a job is sent to the machine when the 4 GPUs are already busy, the 
>job will fail to start, right?
I am not sure about this. But if the job fails, Marathon would retry it, as
you said.

>a job is sent to the machine, all 4 GPUs will become busy
If you specify that your task only uses 1 gpu in the resources field, I think
Mesos could continue to provide offers that have gpu resources. And as I
remember, Marathon constraints only work with --attributes.

On Fri, Jan 15, 2016 at 1:02 AM,  wrote:
I have a machine with 4 GPUs and want to use Mesos+Marathon to schedule the 
jobs to be run in the machine. Each job will use maximum 1 GPU and sharing 1 
GPU between small jobs would be ok.
I know Mesos does not directly support GPUs, but it seems I might use custom 
resources or attributes to do what I want. But how exactly should this be done?

If I use --attributes="hasGpu:true", would a job be sent to the machine when 
another job is already running in the machine (and only using 1 GPU)? I would 
say all jobs requesting a machine with a hasGpu attribute would be sent to the 
machine (as long as it has free CPU and memory resources). Then, if a job is 
sent to the machine when the 4 GPUs are already busy, the job will fail to 
start, right? Could then Marathon be used to re-send the job after some time, 
until it is accepted by the machine?

If I specify --resources="gpu(*):4", it is my understanding that once a job is 
sent to the machine, all 4 GPUs will become busy to the eyes of Mesos (even if 
this is not really true). If that is right, would this work-around work: 
specify 4 different resources: gpu:A, gpu:B, gpu:C and gpu:D; and use 
constraints in Marathon like this  "constraints": [["gpu", "LIKE", " [A-D]"]]?

Cheers



--
Best Regards,
Haosdent Huang