Hi Guangya,

Thanks for your reply.

I just want to know how you launched the tasks:

1. Which processes did you start on the Master?
2. Which processes did you start on the Slaves?

I must be missing something here, because all my slaves have enough memory
and CPUs to launch the tasks I mentioned. What I am missing is probably some
configuration step.
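For reference, here is how I currently start the daemons. The master flags
are from memory, so treat this as a sketch of my setup rather than the exact
commands:

# on the master node (192.168.0.102)
./bin/mesos-master.sh --ip=192.168.0.102 --work_dir=/var/lib/mesos

# on each slave node
./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1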
Thanks & Regards,
Pradeep

On 3 October 2015 at 13:14, Guangya Liu <gyliu...@gmail.com> wrote:

> Hi Pradeep,
>
> I did some tests with your case and found that the task can run randomly
> on the three slave hosts; each run may give a different placement. The
> logic is here:
> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
> The allocator randomly shuffles the slaves each time it allocates
> resources for offers.
>
> I see that each of your tasks needs at least
> resources="cpus(*):3;mem(*):2560". Can you check whether all of your
> slaves have that much available? If you want your tasks to run on the
> other slaves, those slaves need at least 3 CPUs and 2560 MB of free
> memory.
>
> Thanks
>
> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
> pradeepkiruv...@gmail.com> wrote:
>
>> Hi Ondrej,
>>
>> Thanks for your reply.
>>
>> I did solve that issue; you were right, there was a problem with the
>> slave IP address setting.
>>
>> Now I am facing an issue with scheduling the tasks. When I try to
>> schedule a task using
>>
>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>> --resources="cpus(*):3;mem(*):2560"
>>
>> the tasks always get scheduled on the same node. The resources of the
>> other nodes are not being used to schedule tasks.
>>
>> I start the mesos slaves like this:
>>
>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
>>
>> If I submit the task with the above mesos-execute command from one of
>> the slaves, it runs on that system.
>>
>> But when I submit the task from a different system, it uses just that
>> system and queues the remaining tasks instead of running them on the
>> other slaves. Sometimes I see the message "Failed to getgid: unknown
>> user".
>>
>> Do I need to start some process to push the tasks to all the slaves
>> equally? Am I missing something here?
>>
>> Regards,
>> Pradeep
>>
>> On 2 October 2015 at 15:07, Ondrej Smola <ondrej.sm...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> the problem is with the IP your slave advertises - by default Mesos
>>> resolves your hostname. There are several solutions (let's say your
>>> node IP is 192.168.56.128):
>>>
>>> 1) export LIBPROCESS_IP=192.168.56.128
>>> 2) set the mesos-slave options ip and hostname
>>>
>>> One way to do this is to create the files
>>>
>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>
>>> For more configuration options see
>>> http://mesos.apache.org/documentation/latest/configuration
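>>>
>>> For example, a minimal sketch of option 2, passing the flags directly
>>> (assuming this slave should advertise 192.168.56.128 and your master
>>> is on 192.168.0.102 - adjust both for your network):
>>>
>>> # bind to and advertise an explicit address instead of the
>>> # resolved hostname
>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050 \
>>>     --ip=192.168.56.128 --hostname=slave1
>>>
>>> Each slave must advertise a distinct address; otherwise the master
>>> treats a new registration as the same slave re-registering at the
>>> same address.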
>>>
>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pradeepkiruv...@gmail.com>:
>>>
>>>> Hi Guangya,
>>>>
>>>> Thanks for the reply. I found one interesting log message:
>>>>
>>>> 7410 master.cpp:5977] Removed slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>> registered at the same address
>>>>
>>>> Mostly because of this issue, the slave nodes keep getting registered
>>>> and de-registered, each making room for the next. I can even see this
>>>> in the UI: for some time one node is added, and after a while it is
>>>> replaced by a new slave node.
>>>>
>>>> The above log is followed by the log messages below.
>>>>
>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 104089ns
>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown socket
>>>> with fd 15: Transport endpoint is not connected
>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>>> ports(*):[31000-32000]
>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116) disconnected
>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116)
>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown socket
>>>> with fd 16: Transport endpoint is not connected
>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116)
>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned
>>>> notice for position 384
>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>>> bytes) to leveldb took 95171ns
>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 20333ns
>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
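>>>>
>>>> Note the slave(1)@127.0.1.1:5051 in the registration lines: every
>>>> slave registers from the same 127.0.1.1 address. On Debian/Ubuntu,
>>>> /etc/hosts usually maps the machine's hostname to 127.0.1.1, and that
>>>> is the address the slave resolves and advertises by default. A quick
>>>> check on each slave (a sketch; your hostnames will differ):
>>>>
>>>> # if this prints a line, the slave will advertise 127.0.1.1 unless
>>>> # --ip/--hostname or LIBPROCESS_IP is set
>>>> grep 127.0.1.1 /etc/hosts
>>>>
>>>> That would explain the "a new slave registered at the same address"
>>>> message: to the master, all three slaves look like one slave at
>>>> 127.0.1.1.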
>>>>
>>>> Thanks,
>>>> Pradeep
>>>>
>>>> On 2 October 2015 at 02:35, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> Please check some of my answers to your questions inline.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master and
>>>>>> 3 Slaves.
>>>>>>
>>>>>> One slave runs on the Master node itself and the other slaves run
>>>>>> on different nodes. Here "node" means a physical box.
>>>>>>
>>>>>> I first tried running tasks on a one-node cluster. I tested task
>>>>>> scheduling using mesos-execute, and it works fine.
>>>>>>
>>>>>> When I configure the three-node cluster (1 master and 3 slaves) and
>>>>>> look at the resources on the master (in the GUI), only the Master
>>>>>> node's resources are visible. The other nodes' resources are not
>>>>>> visible - sometimes they are visible, but in a deactivated state.
>>>>>>
>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>> There should be some logs in either the master or the slave telling
>>>>> you what is wrong.
>>>>>
>>>>>> Please let me know what could be the reason. All the nodes are on
>>>>>> the same network.
>>>>>>
>>>>>> When I try to schedule a task using
>>>>>>
>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>>>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>>>>> --resources="cpus(*):3;mem(*):2560"
>>>>>>
>>>>>> the tasks always get scheduled on the same node. The resources of
>>>>>> the other nodes are not being used to schedule tasks.
>>>>>>
>>>>> Based on your description above, only one node is working in your
>>>>> cluster; that's why the other nodes are not available. We first need
>>>>> to identify what is wrong with the other three nodes.
>>>>>
>>>>>> Is it required to register the frameworks from every slave node on
>>>>>> the Master?
>>>>>>
>>>>> It is not required.
>>>>>
>>>>>> I have configured this cluster using the GitHub code.
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Pradeep
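>>>>>
>>>>> Once the other slaves register correctly, the master's state endpoint
>>>>> should list all of them. A quick sketch for checking (adjust the host
>>>>> to your master; the field name is from memory):
>>>>>
>>>>> # list the hostnames of all slaves the master currently knows about
>>>>> curl -s http://192.168.0.102:5050/master/state.json |
>>>>>     python -m json.tool | grep '"hostname"'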