Hi Guangya,

Thanks for your reply.

I just want to know how you launched the tasks:

1. Which processes did you start on the Master?
2. Which processes did you start on the Slaves?

I must be missing something here, because all my slaves have enough memory
and CPUs to launch the tasks I mentioned. What I am missing is probably some
configuration step.
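For reference, here is how I currently start the daemons. The master flags
are from memory, so treat this as a sketch of my setup rather than the exact
commands:

# on the master node (192.168.0.102)
./bin/mesos-master.sh --ip=192.168.0.102 --work_dir=/var/lib/mesos

# on each slave node
./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1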
Thanks & Regards,
Pradeep

On 3 October 2015 at 13:14, Guangya Liu <gyliu...@gmail.com> wrote:

> Hi Pradeep,
>
> I did some tests with your case and found that the task can run randomly
> on the three slave hosts; each run may give a different placement. The
> logic is here:
> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
> The allocator randomly shuffles the slaves each time it allocates
> resources for offers.
>
> I see that each of your tasks needs at least
> resources="cpus(*):3;mem(*):2560". Can you check whether all of your
> slaves have that much available? If you want your tasks to run on the
> other slaves, those slaves need at least 3 CPUs and 2560 MB of free
> memory.
>
> Thanks
>
> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
> pradeepkiruv...@gmail.com> wrote:
>
>> Hi Ondrej,
>>
>> Thanks for your reply.
>>
>> I did solve that issue; you were right, there was a problem with the
>> slave IP address setting.
>>
>> Now I am facing an issue with scheduling the tasks. When I try to
>> schedule a task using
>>
>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>> --resources="cpus(*):3;mem(*):2560"
>>
>> the tasks always get scheduled on the same node. The resources of the
>> other nodes are not being used to schedule tasks.
>>
>> I start the mesos slaves like this:
>>
>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
>>
>> If I submit the task with the above mesos-execute command from one of
>> the slaves, it runs on that system.
>>
>> But when I submit the task from a different system, it uses just that
>> system and queues the remaining tasks instead of running them on the
>> other slaves. Sometimes I see the message "Failed to getgid: unknown
>> user".
>>
>> Do I need to start some process to push the tasks to all the slaves
>> equally? Am I missing something here?
>>
>> Regards,
>> Pradeep
>>
>> On 2 October 2015 at 15:07, Ondrej Smola <ondrej.sm...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> the problem is with the IP your slave advertises - by default Mesos
>>> resolves your hostname. There are several solutions (let's say your
>>> node IP is 192.168.56.128):
>>>
>>> 1) export LIBPROCESS_IP=192.168.56.128
>>> 2) set the mesos-slave options ip and hostname
>>>
>>> One way to do this is to create the files
>>>
>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>
>>> For more configuration options see
>>> http://mesos.apache.org/documentation/latest/configuration
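>>>
>>> For example, a minimal sketch of option 2, passing the flags directly
>>> (assuming this slave should advertise 192.168.56.128 and your master
>>> is on 192.168.0.102 - adjust both for your network):
>>>
>>> # bind to and advertise an explicit address instead of the
>>> # resolved hostname
>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050 \
>>>     --ip=192.168.56.128 --hostname=slave1
>>>
>>> Each slave must advertise a distinct address; otherwise the master
>>> treats a new registration as the same slave re-registering at the
>>> same address.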
>>>
>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pradeepkiruv...@gmail.com>:
>>>
>>>> Hi Guangya,
>>>>
>>>> Thanks for the reply. I found one interesting log message:
>>>>
>>>> 7410 master.cpp:5977] Removed slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>> registered at the same address
>>>>
>>>> Mostly because of this issue, the slave nodes keep getting registered
>>>> and de-registered, each making room for the next. I can even see this
>>>> in the UI: for some time one node is added, and after a while it is
>>>> replaced by a new slave node.
>>>>
>>>> The above log is followed by the log messages below.
>>>>
>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 104089ns
>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown socket
>>>> with fd 15: Transport endpoint is not connected
>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>>> ports(*):[31000-32000]
>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116) disconnected
>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116)
>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown socket
>>>> with fd 16: Transport endpoint is not connected
>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116)
>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned
>>>> notice for position 384
>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>>> bytes) to leveldb took 95171ns
>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 20333ns
>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
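>>>>
>>>> Note the slave(1)@127.0.1.1:5051 in the registration lines: every
>>>> slave registers from the same 127.0.1.1 address. On Debian/Ubuntu,
>>>> /etc/hosts usually maps the machine's hostname to 127.0.1.1, and that
>>>> is the address the slave resolves and advertises by default. A quick
>>>> check on each slave (a sketch; your hostnames will differ):
>>>>
>>>> # if this prints a line, the slave will advertise 127.0.1.1 unless
>>>> # --ip/--hostname or LIBPROCESS_IP is set
>>>> grep 127.0.1.1 /etc/hosts
>>>>
>>>> That would explain the "a new slave registered at the same address"
>>>> message: to the master, all three slaves look like one slave at
>>>> 127.0.1.1.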
>>>>
>>>> Thanks,
>>>> Pradeep
>>>>
>>>> On 2 October 2015 at 02:35, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> Please check some of my answers to your questions inline.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master and
>>>>>> 3 Slaves.
>>>>>>
>>>>>> One slave runs on the Master node itself and the other slaves run
>>>>>> on different nodes. Here "node" means a physical box.
>>>>>>
>>>>>> I first tried running tasks on a one-node cluster. I tested task
>>>>>> scheduling using mesos-execute, and it works fine.
>>>>>>
>>>>>> When I configure the three-node cluster (1 master and 3 slaves) and
>>>>>> look at the resources on the master (in the GUI), only the Master
>>>>>> node's resources are visible. The other nodes' resources are not
>>>>>> visible - sometimes they are visible, but in a deactivated state.
>>>>>>
>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>> There should be some logs in either the master or the slave telling
>>>>> you what is wrong.
>>>>>
>>>>>> Please let me know what could be the reason. All the nodes are on
>>>>>> the same network.
>>>>>>
>>>>>> When I try to schedule a task using
>>>>>>
>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>>>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>>>>> --resources="cpus(*):3;mem(*):2560"
>>>>>>
>>>>>> the tasks always get scheduled on the same node. The resources of
>>>>>> the other nodes are not being used to schedule tasks.
>>>>>>
>>>>> Based on your description above, only one node is working in your
>>>>> cluster; that's why the other nodes are not available. We first need
>>>>> to identify what is wrong with the other three nodes.
>>>>>
>>>>>> Is it required to register the frameworks from every slave node on
>>>>>> the Master?
>>>>>>
>>>>> It is not required.
>>>>>
>>>>>> I have configured this cluster using the GitHub code.
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Pradeep
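>>>>>
>>>>> Once the other slaves register correctly, the master's state endpoint
>>>>> should list all of them. A quick sketch for checking (adjust the host
>>>>> to your master; the field name is from memory):
>>>>>
>>>>> # list the hostnames of all slaves the master currently knows about
>>>>> curl -s http://192.168.0.102:5050/master/state.json |
>>>>>     python -m json.tool | grep '"hostname"'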