Hi Pradeep,

Glad it finally works! I am not sure whether you are using systemd.slice or not; if so, you may be running into this issue: https://issues.apache.org/jira/browse/MESOS-1195
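In case it helps with the "freezer is already attached to another hierarchy" error quoted below, here is a small diagnostic sketch (plain shell, assuming a cgroup v1 host; paths can differ per distro) to see where the freezer controller is mounted before mesos-slave tries to mount it at /sys/fs/cgroup/freezer:

    # Is the freezer controller enabled, and which hierarchy ID is it attached to?
    grep freezer /proc/cgroups
    # Where is freezer currently mounted? Anything other than /sys/fs/cgroup/freezer
    # here is the "another hierarchy" the error message is complaining about.
    grep freezer /proc/mounts
    # Which freezer hierarchy is PID 1 (systemd/init) itself sitting in?
    grep freezer /proc/1/cgroup

If an older Mesos run, LXC, or the init system already attached freezer somewhere else, that existing mount is what the Linux launcher trips over.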
Hope Jie Yu can give you some help on this ;-)

Thanks,
Guangya

On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:

> Hi Guangya,
>
> Thanks for sharing the information.
>
> Now I could launch the tasks. The problem was with permissions: if I start all the slaves and the master as root, it works fine; otherwise I have problems launching the tasks.
>
> But on one of the slaves I could not launch the slave as root; I am facing the following issue:
>
> Failed to create a containerizer: Could not create MesosContainerizer: Failed to create launcher: Failed to create Linux launcher: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already attached to another hierarchy
>
> I have taken that node out of the cluster for now. The tasks are getting scheduled on the other two slave nodes.
>
> Thanks for your timely help.
>
> -Pradeep
>
> On 5 October 2015 at 10:54, Guangya Liu <gyliu...@gmail.com> wrote:
>
>> Hi Pradeep,
>>
>> My steps were pretty simple, just as in
>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>
>> On the master node:
>> root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>
>> On each of the 3 slave nodes:
>> root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>
>> Then schedule a task from any of the nodes (here I was using slave node mesos007); you can see that the two tasks were launched on different hosts.
>>
>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at master@192.168.0.107:5050
>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided. Attempting to register without authentication
>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>> Received status update TASK_RUNNING for task cluster-test
>> ^C
>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at master@192.168.0.107:5050
>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided. Attempting to register without authentication
>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>> Received status update TASK_RUNNING for task cluster-test
>>
>> Thanks,
>>
>> Guangya
>>
>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>>
>>> Hi Guangya,
>>>
>>> Thanks for your reply.
>>>
>>> I just want to know how you launched the tasks.
>>>
>>> 1. What processes have you started on the master?
>>> 2. What processes have you started on the slaves?
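To recap in one place the answer that appears earlier in the thread: the only processes involved are one mesos-master on the master node, one mesos-slave on every slave node, and whichever framework submits the work (here, mesos-execute). A minimal sketch reusing the exact flags quoted above (the IP, work_dir, and resource values are just the examples from this thread):

    # on the master node
    ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos

    # on every slave node
    ./bin/mesos-slave.sh --master=192.168.0.107:5050

    # from any machine that can reach the master: launch a test task
    ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" \
        --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"

Nothing else needs to be started per slave for this test.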
>>>
>>> I am missing something here; all my slaves have enough memory and CPUs to launch the tasks I mentioned. What I am missing must be some configuration step.
>>>
>>> Thanks & Regards,
>>> Pradeep
>>>
>>> On 3 October 2015 at 13:14, Guangya Liu <gyliu...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> I did some tests with your case and found that the task can run on any of the three slave hosts; every run may give a different result. The logic is here:
>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>> The allocator randomly shuffles the slaves every time it allocates resources for offers.
>>>>
>>>> I see that each of your tasks needs at least resources="cpus(*):3;mem(*):2560". Can you check whether all of your slaves have enough resources? If you want your tasks to run on other slaves, those slaves need to have at least 3 CPUs and 2560 MB of memory free.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>>>>
>>>>> Hi Ondrej,
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> I did solve that issue; yes, you are right, there was a problem with the slave IP address setting.
>>>>>
>>>>> Now I am facing an issue with scheduling the tasks. When I try to schedule a task using
>>>>>
>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>
>>>>> the tasks always get scheduled on the same node. The resources of the other nodes are not being used to schedule the tasks.
>>>>>
>>>>> I just start the Mesos slaves like below:
>>>>>
>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
>>>>>
>>>>> If I submit the task using the above mesos-execute command from one of the slaves, it runs on that system.
>>>>>
>>>>> But when I submit the task from a different system, it uses just that system and queues the remaining tasks instead of running them on the other slaves. Sometimes I see the message "Failed to getgid: unknown user".
>>>>>
>>>>> Do I need to start some process to push the tasks to all the slaves equally? Am I missing something here?
>>>>>
>>>>> Regards,
>>>>> Pradeep
>>>>>
>>>>> On 2 October 2015 at 15:07, Ondrej Smola <ondrej.sm...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> The problem is with the IP your slave advertises; Mesos by default resolves your hostname. There are several solutions (let's say your node IP is 192.168.56.128):
>>>>>>
>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>> 2) set the Mesos options ip and hostname
>>>>>>
>>>>>> One way to do this is to create the files:
>>>>>>
>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>
>>>>>> For more configuration options see
>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>
>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pradeepkiruv...@gmail.com>:
>>>>>>
>>>>>>> Hi Guangya,
>>>>>>>
>>>>>>> Thanks for the reply. I found one interesting log message.
>>>>>>>
>>>>>>> 7410 master.cpp:5977] Removed slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave registered at the same address
>>>>>>>
>>>>>>> Mostly because of this issue, the slave nodes keep getting registered and de-registered to make room for the next node. I can even see this in the UI: for some time one node is added, and after some time it is replaced with a new slave node.
>>>>>>>
>>>>>>> The above log is followed by the log messages below.
>>>>>>>
>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 104089ns
>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000]
>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned notice for position 384
>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took 95171ns
>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from leveldb took 20333ns
>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Pradeep
>>>>>>>
>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Pradeep,
>>>>>>>>
>>>>>>>> Please check some of my questions inline.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Guangya
>>>>>>>>
>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 master and 3 slaves.
>>>>>>>>>
>>>>>>>>> One slave runs on the master node itself, and the other slaves run on different nodes. Here, node means a physical box.
>>>>>>>>>
>>>>>>>>> I tried running tasks by configuring a one-node cluster. I tested task scheduling using mesos-execute; it works fine.
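A note on the log excerpt above: the "slave(1)@127.0.1.1:5051" address looks like the hostname-resolution problem Ondrej describes further up. Both physical slaves end up advertising 127.0.1.1, so the master treats each registration as "a new slave registered at the same address" and keeps replacing one with the other. A small sketch of the workaround on each slave node (the IP and master address below are example values from this thread; use the address the master can actually reach):

    # see what the hostname resolves to (on Debian/Ubuntu this is often 127.0.1.1 via /etc/hosts)
    getent hosts $(hostname)

    # option 1: export the address for libprocess before starting the slave
    export LIBPROCESS_IP=192.168.0.116

    # option 2: pass it explicitly on the command line
    ./bin/mesos-slave.sh --master=192.168.0.102:5050 --ip=192.168.0.116 --hostname=192.168.0.116

    # option 3: persist it for package/init-script based installs
    echo "192.168.0.116" > /etc/mesos-slave/ip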
>>>>>>>>>
>>>>>>>>> When I configure a three-node cluster (1 master and 3 slaves) and try to see the resources on the master (in the GUI), only the master node's resources are visible. The other nodes' resources are not visible; sometimes they are visible, but in a deactivated state.
>>>>>>>>>
>>>>>>>> Can you please append some logs from mesos-slave and mesos-master? There should be some logs in either the master or the slave telling you what is wrong.
>>>>>>>>
>>>>>>>>> Please let me know what could be the reason. All the nodes are in the same network.
>>>>>>>>>
>>>>>>>>> When I try to schedule a task using
>>>>>>>>>
>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>
>>>>>>>>> the tasks always get scheduled on the same node. The resources of the other nodes are not being used to schedule the tasks.
>>>>>>>>>
>>>>>>>> Based on your previous question, there is only one node in your cluster; that's why the other nodes are not available. We need to identify what is wrong with the other three nodes first.
>>>>>>>>
>>>>>>>>> Is it required to register the frameworks from every slave node on the master?
>>>>>>>>>
>>>>>>>> It is not required.
>>>>>>>>
>>>>>>>>> I have configured this cluster using the GitHub code.
>>>>>>>>>
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Pradeep
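One more check that may save some guessing when only one node shows up: the master's state endpoint lists every registered slave with its hostname and total resources, so you can confirm whether the other slaves ever registered and whether they really have the 3 CPUs and 2560 MB each task requests. A rough sketch (state.json is the endpoint name as I remember it for builds of this vintage; adjust the host and port to your master):

    curl -s http://192.168.0.102:5050/master/state.json | python -m json.tool | less
    # inspect the "slaves" array: check "hostname", "pid" and "resources" for each entry

If a slave is missing from that list entirely, fixing its registration (for example the 127.0.1.1 address issue above) comes before any scheduling question.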