Re: Help needed (alas, urgently)
In chasing down this problem, I stumbled upon something of moment: the
problem does NOT seem to happen with kernel 3.13.

Some weeks back, in the hope of getting past another problem wherein the
root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04
LTS). The kernel upgrade was done as shown here (there's some extra stuff
to get rid of Ubuntu desktop and liberate some disk space):

    apt-get update
    apt-get -y remove ubuntu-desktop
    apt-get -y purge lightdm
    rm -Rf /var/lib/lightdm-data
    apt-get -y remove --purge libreoffice-core
    apt-get -y remove --purge libreoffice-common

    echo " Installing new kernel"

    apt-get -y install linux-generic-lts-vivid
    apt-get -y autoremove linux-image-3.13.0-32-generic
    apt-get -y autoremove linux-image-3.13.0-71-generic
    update-grub
    reboot

After the reboot, a "uname -r" shows kernel 3.19.0-42-generic.

Under this kernel I can now reliably reproduce the failure to stop a
MongoDB container. Specifically, any & all attempts to kill the container,
e.g., via

- Marathon HTTP Delete (which leads to docker-mesos-executor presenting a
  "docker stop" command)
- getting inside the running container shell and issuing "kill" or
  db.shutDown()

cause the mongod container

- to show in its log that it's shutting down normally
- to enter a 100% CPU loop
- to become unkillable (only reboot "fixes" things)

Note finally that my conclusion about kernel 3.13 "working" is at present
a weak induction. But I do know that when I reverted to that kernel I
could, at least once, stop the containers w/o any problems; whereas at 3.19
I can reliably reproduce the problem. I will try to make this induction
stronger as the day wears on.

Did I do something "wrong" in my kernel upgrade steps?

Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the
area of task termination & signal handling?

Thanks for your help.

-Paul

On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell wrote:

> I spoke too soon, I'm afraid.
>
> Next time I did the stop (with zero timeout), I see the same phenomenon: a
> mongo container showing repeated:
>
>     killing docker task
>     shutting down
>
> What else can I try?
>
> Thank you.
>
> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell wrote:
>
>> Hi Tim,
>>
>> I set docker_stop_timeout to zero as you asked. I am pleased to report
>> (though a bit fearful about being pleased) that this change seems to have
>> shut everyone down pretty much instantly.
>>
>> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
>> the immediate use of "kill -9" as opposed to "kill -2"?
>>
>> I will keep testing the behavior.
>>
>> Thank you.
>>
>> -Paul
>>
>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell wrote:
>>
>>> Hi Tim,
>>>
>>> Things have gotten slightly odder (if that's possible). When I now start
>>> the application's 5 or so containers, only one "ecxconfigdb" gets started -
>>> and even he took a few tries. That is, I see him failing, moving to
>>> deploying, then starting again. But I've no evidence (no STDOUT, and no
>>> docker ctr logs) that shows why.
>>>
>>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>>> application I am seeing the phenomena I posted before: killing docker task,
>>> shutting down repeated many times. The UN-stopped container is now running
>>> at 100% CPU.
>>>
>>> I will try modifying docker_stop_timeout. Back shortly.
>>>
>>> Thanks again.
>>>
>>> -Paul
>>>
>>> PS: what do you make of the "broken pipe" error in the docker.log?
>>>
>>> *from /var/log/upstart/docker.log*
>>>
>>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>>> INFO[3054] GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> ERRO[3054] Handler for GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json returned error: No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> ERRO[3054] HTTP Error err=No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b statusCode=404
>>> INFO[3054] GET /v1.15/containers/weave/json
>>> INFO[3054] POST /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> INFO[3054] POST /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1
>>> INFO[3054] POST /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET
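On the question in this thread about docker_stop_timeout and "kill -9": "docker stop" is documented to send SIGTERM first and to send SIGKILL only once its grace period has run out, so a zero timeout would be expected to escalate to SIGKILL almost immediately. The Python sketch below shows that stop-then-kill pattern on an ordinary child process standing in for a container. It is an illustration only: STUBBORN_CHILD, stop_with_grace and demo are names invented here, not Docker or Mesos APIs, and it assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a container whose main process ignores SIGTERM.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_grace(proc, timeout_s):
    """Send SIGTERM, wait up to timeout_s seconds, then fall back to SIGKILL.

    Returns the child's return code; on POSIX a negative value is the
    number of the signal that ended the process.
    """
    proc.terminate()  # polite request: SIGTERM
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # grace period is over: SIGKILL
        proc.wait()
    return proc.returncode


def demo(timeout_s=0.2):
    """Start the stubborn child, then stop it with the given grace period."""
    proc = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the child has installed its handler
    code = stop_with_grace(proc, timeout_s)
    proc.stdout.close()
    return code


if __name__ == "__main__":
    print(demo())
```

A child that honours SIGTERM exits during the grace period and never sees SIGKILL; only the stubborn one is escalated. The sketch says nothing about why the mongod container in this thread stayed alive afterwards.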
installing a framework after teardown
Hello,

I have removed a framework from the Mesos cluster with

    curl -X POST -d 'frameworkId=-b036-4cb7-af53-4c837dc9521d-0002' http://${MASTER_IP}:5050/master/teardown

This successfully removed all of the framework's tasks and its scheduler.
However, the Mesos cluster now rejects my attempts to re-install the
framework. Is there a way to recover gracefully from this situation?

    I0115 12:54:57.916470 28856 sched.cpp:1024] Got error 'Framework has been removed'
    I0115 12:54:57.916509 28856 sched.cpp:1805] Asked to abort the driver
    I0115 12:54:57.916824 28856 sched.cpp:1070] Aborting framework '8ca5c18f-b036-4cb7-af53-4c837dc9521d-0001'

With regards,
Viktor
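For reference, the teardown call above can also be built programmatically. The Python sketch below only constructs the same POST request (endpoint and form field as in the curl command); it does not send it. The function name teardown_request, the master address 10.0.0.1 and the ID EXAMPLE-FRAMEWORK-ID are placeholders invented for this example. As far as I know, the master refuses a scheduler that tries to re-register under the ID of a framework that has been torn down, which would be consistent with the 'Framework has been removed' error, so the usual way forward is to register without a framework ID and let the master assign a fresh one; treat that as an assumption to verify against your Mesos version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def teardown_request(master_ip, framework_id, port=5050):
    """Build (but do not send) the POST request for /master/teardown.

    master_ip and framework_id are supplied by the caller; the values
    used in the example below are placeholders, not real identifiers.
    """
    body = urlencode({"frameworkId": framework_id}).encode("ascii")
    url = "http://{}:{}/master/teardown".format(master_ip, port)
    return Request(url, data=body, method="POST")


if __name__ == "__main__":
    req = teardown_request("10.0.0.1", "EXAMPLE-FRAMEWORK-ID")
    print(req.get_method(), req.full_url, req.data)
```

Passing the returned object to urllib.request.urlopen would actually send it; the sketch stops short of that so it can be read and run without a cluster.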
Re: Help needed (alas, urgently)
Hi Paul,

No problem, I haven't even spent much time on it yet, and I'm glad you
resolved the problem yourself. We always welcome people doing this :)

Tim

On Fri, Jan 15, 2016 at 10:48 AM, Paul Bell wrote:

> Tim,
>
> I've tracked down the cause of this problem: it's the result of some kind
> of incompatibility between kernel 3.19 and "VMware Tools". I know little
> more than that.
>
> I installed VMware Tools via *apt-get install open-vm-tools-lts-trusty*.
> Everything worked fine on 3.13. But when I upgrade to 3.19, the error
> occurs quite reliably. Revert back to 3.13 and the error goes away.
>
> I looked high & low for some statement of kernel requirements for VMware
> Tools, but can find none.
>
> Sorry to have wasted your time.
>
> -Paul
Re: RE: can mesos run in SUSE Linux 11?
1. For kernel < 3.10, process isolation would have problems. See the
   discussion in https://issues.apache.org/jira/browse/MESOS-3974 for
   details.
2. From http://mesos.apache.org/gettingstarted/ , GCC 4.8.1+ (or clang
   3.5+) is required to compile the source.

On Sat, Jan 16, 2016 at 3:36 PM, Linyuxin wrote:

> Thanks for the reply.
>
> I still have two questions:
>
> 1. Is there any pitfall if the Linux kernel version is less than 3.10,
>    which is recommended in the document?
> 2. I compiled the source in SUSE 11 SP3 with g++ 4.7, but I encountered
>    an error:
>
>    configure: error: *** A compiler with support for C++11 language features is required.
>
> Any suggestion?
>
> *From:* Shuai Lin [mailto:linshuai2...@gmail.com]
> *Sent:* January 16, 2016 15:13
> *To:* user@mesos.apache.org
> *Subject:* Re: can mesos run in SUSE Linux 11?
>
> There is no official package for SUSE on the downloads page of mesosphere:
> https://open.mesosphere.com/downloads/mesos/#apache-mesos-0.26.0 . So I
> guess you have to either compile from source, or run mesos master/slave in
> docker containers.
>
> On Sat, Jan 16, 2016 at 1:47 PM, Linyuxin wrote:
>
> Hi All,
>
> I want to know if mesos can run in SUSE Linux 11.
> I can not find any information from the document reference.
>
> Thanks.
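The two thresholds mentioned above (kernel 3.10 and GCC 4.8.1) can be checked mechanically before attempting a build. The Python sketch below compares dotted version strings numerically; version_tuple and meets_minimum are helper names made up for this example, and the SUSE kernel string in the usage lines is an assumed illustrative value, not one reported in this thread.

```python
import re


def version_tuple(text):
    """Return the leading dotted number of a version string as a tuple.

    For example '3.19.0-42-generic' becomes (3, 19, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no leading version number in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, minimum):
    """True if version string `found` is at least `minimum`."""
    a, b = version_tuple(found), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '3.10' and '3.10.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # Kernel release string taken from the Ubuntu thread in this digest.
    print(meets_minimum("3.19.0-42-generic", "3.10"))
    # Assumed example of an older kernel release string.
    print(meets_minimum("3.0.76-0.11-default", "3.10"))
    # g++ 4.7 against the GCC 4.8.1 requirement quoted above.
    print(meets_minimum("4.7", "4.8.1"))
```

On a real host the inputs could come from platform.release() and from the output of "gcc -dumpversion"; comparing numerically rather than as text avoids treating "3.9" as newer than "3.10".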
Re: can mesos run in SUSE Linux 11?
There is no official package for SUSE on the downloads page of mesosphere:
https://open.mesosphere.com/downloads/mesos/#apache-mesos-0.26.0 . So I
guess you have to either compile from source, or run mesos master/slave in
docker containers.
Re: Share GPU resources via attributes or as custom resources (INTERNAL)
Yes, "attributes" is the way for now. But once Marathon supports Mesos'
multiple roles (MESOS-1763), you can use role info to define resource
groups.

Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Fri, Jan 15, 2016 at 4:22 PM, wrote:

> Thanks Haosdent,
>
> If what you say about Marathon is right (i.e., that Marathon’s constraints
> only work with Mesos’ attributes), then I cannot use --resources="gpu(*):4",
> since I have no way in Marathon to specify my job needs a GPU resource (at
> least using the web interface), right?
>
> I guess I will have to experiment with attributes.
>
> Cheers,
> Humberto
>
> *From:* haosdent [mailto:haosd...@gmail.com]
> *Sent:* 14 January 2016 19:07
> *To:* user
> *Subject:* Re: Share GPU resources via attributes or as custom resources
> (INTERNAL)
>
> >Then, if a job is sent to the machine when the 4 GPUs are already busy,
> the job will fail to start, right?
>
> I'm not sure about this. But if the job fails, Marathon would retry, as
> you said.
>
> >a job is sent to the machine, all 4 GPUs will become busy
>
> If you specify in the resources field that your task uses only 1 gpu, I
> think Mesos would continue to provide offers that include gpu. And as I
> remember, Marathon constraints only work with --attributes.
>
> On Fri, Jan 15, 2016 at 1:02 AM, wrote:
>
> I have a machine with 4 GPUs and want to use Mesos+Marathon to schedule
> the jobs to be run in the machine. Each job will use maximum 1 GPU and
> sharing 1 GPU between small jobs would be ok.
> I know Mesos does not directly support GPUs, but it seems I might use
> custom resources or attributes to do what I want. But how exactly should
> this be done?
>
> If I use --attributes="hasGpu:true", would a job be sent to the machine
> when another job is already running in the machine (and only using 1 GPU)?
> I would say all jobs requesting a machine with a hasGpu attribute would be
> sent to the machine (as long as it has free CPU and memory resources).
> Then, if a job is sent to the machine when the 4 GPUs are already busy, the
> job will fail to start, right? Could then Marathon be used to re-send the
> job after some time, until it is accepted by the machine?
>
> If I specify --resources="gpu(*):4", it is my understanding that once a
> job is sent to the machine, all 4 GPUs will become busy to the eyes of
> Mesos (even if this is not really true). If that is right, would this
> work-around work: specify 4 different resources: gpu:A, gpu:B, gpu:C and
> gpu:D; and use constraints in Marathon like this "constraints": [["gpu",
> "LIKE", " [A-D]"]]?
>
> Cheers
>
> --
> Best Regards,
> Haosdent Huang
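For the attribute route discussed in this thread, a Marathon app definition can carry a constraint that matches the agent attribute. The Python sketch below builds such a definition as JSON, assuming the agent was started with --attributes="hasGpu:true" as in the original question. The function name gpu_app, the app id, the command and the cpus/mem figures are placeholders chosen for this example. Note that a constraint of this kind only steers placement towards the machine; it does not count GPUs or stop a fifth job from landing there.

```python
import json


def gpu_app(app_id, cmd, attribute="hasGpu", value="true"):
    """Return a Marathon app definition pinned to agents with an attribute.

    The constraint is a [field, operator, value] triple; LIKE treats the
    value as a regular expression matched against the attribute's value.
    """
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": 1,
        "mem": 512,
        "instances": 1,
        "constraints": [[attribute, "LIKE", value]],
    }


if __name__ == "__main__":
    # Placeholder id and command, for illustration only.
    print(json.dumps(gpu_app("/gpu-job", "./run-job.sh"), indent=2))
```

The resulting JSON is the kind of document that is POSTed to Marathon's /v2/apps endpoint, for example with curl, which sidesteps the limits of the web interface mentioned above.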
RE: can mesos run in SUSE Linux 11?
Thanks for the reply.

I still have two questions:

1. Is there any pitfall if the Linux kernel version is less than 3.10,
   which is recommended in the document?
2. I compiled the source in SUSE 11 SP3 with g++ 4.7, but I encountered
   an error:

   configure: error: *** A compiler with support for C++11 language features is required.

Any suggestion?
Re: Share GPU resources via attributes or as custom resources (INTERNAL)
I think that you can use 'curl' to test; please refer to
https://open.mesosphere.com/advanced-course/advanced-usage-of-marathon/
RE: Share GPU resources via attributes or as custom resources (INTERNAL)
Thanks Haosdent,

If what you say about Marathon is right (i.e., that Marathon’s constraints
only work with Mesos’ attributes), then I cannot use --resources="gpu(*):4",
since I have no way in Marathon to specify my job needs a GPU resource (at
least using the web interface), right?

I guess I will have to experiment with attributes.

Cheers,
Humberto
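On the custom-resource option, --resources="gpu(*):4", the reply in this thread suggests that a task asking for 1 gpu would leave the rest on offer. The Python sketch below is a toy model of that bookkeeping, written only to make the idea concrete: ToyAgent and its methods are invented for this example and are not Mesos code, and the model assumes a scalar resource is simply reduced by whatever each running task requested.

```python
class ToyAgent:
    """Toy model of one machine advertising scalar resources."""

    def __init__(self, resources):
        self.free = dict(resources)

    def offer(self):
        """Resources currently unused, i.e. what could still be offered."""
        return dict(self.free)

    def launch(self, request):
        """Reserve the requested amounts; refuse if any resource is short."""
        if any(self.free.get(name, 0) < amount
               for name, amount in request.items()):
            return False
        for name, amount in request.items():
            self.free[name] -= amount
        return True

    def finish(self, request):
        """Give back the resources of a task that has ended."""
        for name, amount in request.items():
            self.free[name] += amount


if __name__ == "__main__":
    agent = ToyAgent({"gpu": 4, "cpus": 8})
    job = {"gpu": 1, "cpus": 1}
    results = [agent.launch(job) for _ in range(5)]
    print(results, agent.offer())
```

Under this model a fifth one-GPU job is refused until an earlier one finishes, rather than all four GPUs appearing busy after the first launch. Whether a given framework can request such a custom resource is a separate question, as the message above points out for Marathon's web interface.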