Re: Files not being copied to all slaves from hdfs w/spark-submit

2015-10-03 Thread haosdent
Yes, it would only be downloaded onto the one slave that executes your task. Why
do you need to download the files to slaves that do not execute your task?
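
As a sketch of the usual pattern (the master address, HDFS path, file name, and
application below are made up for illustration): files passed to spark-submit
via --files are fetched into the working directory of each executor on
whichever slaves actually run tasks, and the application reads them there via
SparkFiles.get.

  spark-submit \
    --master mesos://<master-host>:5050 \
    --files hdfs://namenode:8020/data/lookup.csv \
    my_job.py
  # my_job.py would then open the local copy with SparkFiles.get("lookup.csv")
  # inside its tasks; slaves that never run a task never fetch the file.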

On Sun, Oct 4, 2015 at 6:27 AM, Rodrick Brown 
wrote:

> It downloads, but only onto one slave. How do others get around this? I would
> think the files would be placed on all slaves.
>
> Sent from Outlook 
>
>
>
>
> On Sat, Oct 3, 2015 at 4:43 AM -0700, "haosdent" 
> wrote:
>
>> Hi @Rodrick, sorry, I could not understand your problem here. Do you mean
>> your job depends on files on HDFS and they are not downloaded to the slaves
>> after you execute spark-submit?
>>
>> On Sat, Oct 3, 2015 at 5:07 AM, Rodrick Brown 
>> wrote:
>>
>>> For some reason my jobs are not being copied to all the slaves when they are
>>> downloaded from HDFS. Am I missing something obvious?
>>> They only seem to be copied to the node where the job is submitted.
>>>
>>> --
>>>
>>>
>>> Rodrick Brown / DevOps Engineer
>>> +1 917 445 6839 / rodr...@orchardplatform.com
>>> 
>>>
>>> Orchard Platform
>>> 101 5th Avenue, 4th Floor, New York, NY 10003
>>> http://www.orchardplatform.com
>>>
>>> Orchard Blog  | Marketplace
>>> Lending Meetup 
>>>
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>



-- 
Best Regards,
Haosdent Huang


Re: Files not being copied to all slaves from hdfs w/spark-submit

2015-10-03 Thread Rodrick Brown
It downloads, but only onto one slave. How do others get around this? I would
think the files would be placed on all slaves.

Sent from Outlook

On Sat, Oct 3, 2015 at 4:43 AM -0700, "haosdent"  wrote:

Hi @Rodrick, sorry, I could not understand your problem here. Do you mean your
job depends on files on HDFS and they are not downloaded to the slaves after you
execute spark-submit?
On Sat, Oct 3, 2015 at 5:07 AM, Rodrick Brown  wrote:
For some reason my jobs are not being copied to all the slaves when they are
downloaded from HDFS. Am I missing something obvious? They only seem to be
copied to the node where the job is submitted.
-- 
Rodrick Brown / DevOps Engineer
+1 917 445 6839 / rodr...@orchardplatform.com

Orchard Platform 
101 5th Avenue, 4th Floor, New York, NY 10003 
http://www.orchardplatform.com

Orchard Blog | Marketplace Lending Meetup


-- 
Best Regards,
Haosdent Huang


Re: Files not being copied to all slaves from hdfs w/spark-submit

2015-10-03 Thread haosdent
Hi @Rodrick, sorry, I could not understand your problem here. Do you mean
your job depends on files on HDFS and they are not downloaded to the slaves
after you execute spark-submit?

On Sat, Oct 3, 2015 at 5:07 AM, Rodrick Brown 
wrote:

> For some reason my jobs are not being copied to all the slaves when they are
> downloaded from HDFS. Am I missing something obvious?
> They only seem to be copied to the node where the job is submitted.
>
> --
>
>
> Rodrick Brown / DevOps Engineer
> +1 917 445 6839 / rodr...@orchardplatform.com
> 
>
> Orchard Platform
> 101 5th Avenue, 4th Floor, New York, NY 10003
> http://www.orchardplatform.com
>
> Orchard Blog  | Marketplace Lending
> Meetup 
>
>
>



-- 
Best Regards,
Haosdent Huang


Re: Running a task in Mesos cluster

2015-10-03 Thread Guangya Liu
Hi Pradeep,

I did some tests with your case and found that the task can run on any of the
three slave hosts; each run may give a different result. The logic is here:
https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
The allocator randomly shuffles the slaves every time it allocates resources
for offers.

I see that each of your tasks needs at least
resources="cpus(*):3;mem(*):2560". Can you check whether all of your slaves
have enough free resources? If you want your tasks to run on other slaves, then
those slaves need to have at least 3 CPUs and 2560 MB of memory available.
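
A quick way to check (a sketch only; it assumes jq is installed and that the
master's state endpoint reports per-slave totals and used resources, as it does
in this Mesos version) is to ask the master directly:

  curl -s http://192.168.0.102:5050/state.json \
    | jq '.slaves[] | {hostname, resources, used_resources}'

Any slave whose free cpus/mem fall below the requested amounts will never
receive a matching offer for this task.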

Thanks

On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale 
wrote:

> Hi Ondrej,
>
> Thanks for your reply
>
> I did solve that issue; yes, you are right, there was an issue with the slave
> IP address setting.
>
> Now I am facing an issue with scheduling the tasks. When I try to schedule a
> task using
>
> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
> --resources="cpus(*):3;mem(*):2560"
>
> The tasks always get scheduled on the same node. The resources from the
> other nodes are not getting used to schedule the tasks.
>
>  I just start the mesos slaves like below
>
> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos  --hostname=slave1
>
> If I submit the task using the above (mesos-execute) command from one of the
> slaves, it runs on that system.
>
> But when I submit the task from a different system, it uses just that system
> and queues the tasks instead of running them on the other slaves.
> Sometimes I see the message "Failed to getgid: unknown user".
>
> Do I need to start some process so that the tasks are distributed across all
> the slaves equally? Am I missing something here?
>
> Regards,
> Pradeep
>
>
>
> On 2 October 2015 at 15:07, Ondrej Smola  wrote:
>
>> Hi Pradeep,
>>
>> the problem is with the IP your slave advertises; Mesos by default resolves
>> your hostname. There are several solutions (let's say your node IP is
>> 192.168.56.128):
>>
>> 1)  export LIBPROCESS_IP=192.168.56.128
>> 2)  set mesos options - ip, hostname
>>
>> one way to do this is to create files
>>
>> echo "192.168.56.128" > /etc/mesos-slave/ip
>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>
>> for more configuration options see
>> http://mesos.apache.org/documentation/latest/configuration
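
(For completeness, a sketch of the equivalent command-line form when starting
the slave by hand; the /etc/mesos-slave/* files above are the convention read by
the packaged init scripts, while the same settings can be passed as flags, with
example values here:)

  ./bin/mesos-slave.sh --master=192.168.0.102:5050 \
      --ip=192.168.56.128 \
      --hostname=abc.mesos.com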
>>
>>
>>
>>
>>
>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale :
>>
>>> Hi Guangya,
>>>
>>> Thanks for the reply. I found one interesting log message.
>>>
>>>  7410 master.cpp:5977] Removed slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>> registered at the same address
>>>
>>> Mostly because of this issue, the slave nodes keep getting registered and
>>> de-registered to make room for the next node. I can even see this in the UI:
>>> for some time one node is added, and after a while it is replaced by the new
>>> slave node.
>>>
>>> The above log is followed by the below log messages.
>>>
>>>
>>> I1002 10:01:12.753865  7416 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 104089ns
>>> I1002 10:01:12.753885  7416 replica.cpp:679] Persisted action at 384
>>> E1002 10:01:12.753891  7417 process.cpp:1912] Failed to shutdown socket
>>> with fd 15: Transport endpoint is not connected
>>> I1002 10:01:12.753988  7413 master.cpp:3930] Registered slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>> ports(*):[31000-32000]
>>> I1002 10:01:12.754065  7413 master.cpp:1080] Slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116) disconnected
>>> I1002 10:01:12.754072  7416 hierarchical.hpp:675] Added slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>> I1002 10:01:12.754084  7413 master.cpp:2534] Disconnecting slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116)
>>> E1002 10:01:12.754118  7417 process.cpp:1912] Failed to shutdown socket
>>> with fd 16: Transport endpoint is not connected
>>> I1002 10:01:12.754132  7413 master.cpp:2553] Deactivating slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116)
>>> I1002 10:01:12.754237  7416 hierarchical.hpp:768] Slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>> I1002 10:01:12.754240  7413 replica.cpp:658] Replica received learned
>>> notice for position 384
>>> I1002 10:01:12.754360  7413 leveldb.cpp:343] Persisting action (20
>>> bytes) to leveldb took 95171ns
>>> I1002 10:01:12.754395  7413 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 20333ns
>>> I1002 10:01:12.754406  7413 replica.cpp:679] Persisted action at 384
>>>
>>>
>>> Thanks,
>>> Pradeep

[RESULT][VOTE] Release Apache Mesos 0.21.2 (rc1)

2015-10-03 Thread Adam Bordelon
Hi all,

The vote for Mesos 0.21.2 (rc1) has passed with the following votes.

+1 (Binding)
--
Adam B
Vinod Kone
Michael Park
Niklas Nielsen

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/0.21.2

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.21.2

The mesos-0.21.2.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) already has the archives download
link, which should be active soon.

Thanks,
-Adam-


Re: Running a task in Mesos cluster

2015-10-03 Thread Ondrej Smola
Yes, there should be configuration options for this in the Mesos configuration;
see the documentation. I am leaving now, so I won't be able to respond until
Sunday.
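
A minimal sketch of one such option (flag name as documented for the slave in
this release line; please verify against your version): disabling switch_user
makes every task run as the user the slave daemon itself runs as, so the login
names used on the client no longer have to exist on the slaves.

  ./bin/mesos-slave.sh --master=192.168.0.102:5050 --no-switch_user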

2015-10-03 11:18 GMT+02:00 Pradeep Kiruvale :

> I have different login names on different systems. I have a client system
> from which I launch the tasks, but these tasks are not getting any resources,
> so they are not getting scheduled.
>
> My cluster arrangement is 1 client, 1 master, and 3 slaves; all are different
> physical systems.
>
> Is there any way to run the tasks under one unified user?
>
> Regards,
> Pradeep
>
> On 3 October 2015 at 10:43, Ondrej Smola  wrote:
>
>>
>> A Mesos framework receives offers, and based on those offers it decides
>> where to run tasks.
>>
>>
>> mesos-execute is a little framework that executes your task (hackbench);
>> see here: https://github.com/apache/mesos/blob/master/src/cli/execute.cpp
>>
>> At https://github.com/apache/mesos/blob/master/src/cli/execute.cpp#L320 you
>> can see that it uses the user that runs the mesos-execute command.
>>
>> The error you see should come from here (the su command):
>>
>> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/posix/os.hpp#L520
>>
>> Under which user do you run mesos-execute and the mesos daemons?
>>
>> 2015-10-02 15:26 GMT+02:00 Pradeep Kiruvale :
>>
>>> Hi Ondrej,
>>>
>>> Thanks for your reply
>>>
>>> I did solve that issue; yes, you are right, there was an issue with the
>>> slave IP address setting.
>>>
>>> Now I am facing an issue with scheduling the tasks. When I try to schedule
>>> a task using
>>>
>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>> --resources="cpus(*):3;mem(*):2560"
>>>
>>> The tasks always get scheduled on the same node. The resources from the
>>> other nodes are not getting used to schedule the tasks.
>>>
>>>  I just start the mesos slaves like below
>>>
>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>  --hostname=slave1
>>>
>>> If I submit the task using the above (mesos-execute) command from one of
>>> the slaves, it runs on that system.
>>>
>>> But when I submit the task from a different system, it uses just that
>>> system and queues the tasks instead of running them on the other slaves.
>>> Sometimes I see the message "Failed to getgid: unknown user".
>>>
>>> Do I need to start some process so that the tasks are distributed across
>>> all the slaves equally? Am I missing something here?
>>>
>>> Regards,
>>> Pradeep
>>>
>>>
>>>
>>> On 2 October 2015 at 15:07, Ondrej Smola  wrote:
>>>
 Hi Pradeep,

 the problem is with the IP your slave advertises; Mesos by default resolves
 your hostname. There are several solutions (let's say your node IP is
 192.168.56.128):

 1)  export LIBPROCESS_IP=192.168.56.128
 2)  set mesos options - ip, hostname

 one way to do this is to create files

 echo "192.168.56.128" > /etc/mesos-slave/ip
 echo "abc.mesos.com" > /etc/mesos-slave/hostname

 for more configuration options see
 http://mesos.apache.org/documentation/latest/configuration





 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale 
 :

> Hi Guangya,
>
> Thanks for the reply. I found one interesting log message.
>
>  7410 master.cpp:5977] Removed slave
> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
> registered at the same address
>
> Mostly because of this issue, the slave nodes keep getting registered and
> de-registered to make room for the next node. I can even see this in the UI:
> for some time one node is added, and after a while it is replaced by the new
> slave node.
>
> The above log is followed by the below log messages.
>
>
> I1002 10:01:12.753865  7416 leveldb.cpp:343] Persisting action (18
> bytes) to leveldb took 104089ns
> I1002 10:01:12.753885  7416 replica.cpp:679] Persisted action at 384
> E1002 10:01:12.753891  7417 process.cpp:1912] Failed to shutdown
> socket with fd 15: Transport endpoint is not connected
> I1002 10:01:12.753988  7413 master.cpp:3930] Registered slave
> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
> ports(*):[31000-32000]
> I1002 10:01:12.754065  7413 master.cpp:1080] Slave
> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
> (192.168.0.116) disconnected
> I1002 10:01:12.754072  7416 hierarchical.hpp:675] Added slave
> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
> I1002 10:01:12.754084  7413 master.cpp:2534] Disconnecting slave
> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
> (192.168.0.116)
> E1002 10:01:12.

Re: Running a task in Mesos cluster

2015-10-03 Thread Pradeep Kiruvale
I have different login names on different systems. I have a client system from
which I launch the tasks, but these tasks are not getting any resources, so they
are not getting scheduled.

My cluster arrangement is 1 client, 1 master, and 3 slaves; all are different
physical systems.

Is there any way to run the tasks under one unified user?

Regards,
Pradeep

On 3 October 2015 at 10:43, Ondrej Smola  wrote:

>
> A Mesos framework receives offers, and based on those offers it decides where
> to run tasks.
>
>
> mesos-execute is a little framework that executes your task (hackbench); see
> here: https://github.com/apache/mesos/blob/master/src/cli/execute.cpp
>
> At https://github.com/apache/mesos/blob/master/src/cli/execute.cpp#L320 you
> can see that it uses the user that runs the mesos-execute command.
>
> The error you see should come from here (the su command):
>
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/posix/os.hpp#L520
>
> Under which user do you run mesos-execute and the mesos daemons?
>
> 2015-10-02 15:26 GMT+02:00 Pradeep Kiruvale :
>
>> Hi Ondrej,
>>
>> Thanks for your reply
>>
>> I did solve that issue; yes, you are right, there was an issue with the
>> slave IP address setting.
>>
>> Now I am facing an issue with scheduling the tasks. When I try to schedule a
>> task using
>>
>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>> --resources="cpus(*):3;mem(*):2560"
>>
>> The tasks always get scheduled on the same node. The resources from the
>> other nodes are not getting used to schedule the tasks.
>>
>>  I just start the mesos slaves like below
>>
>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos  --hostname=slave1
>>
>> If I submit the task using the above (mesos-execute) command from one of the
>> slaves, it runs on that system.
>>
>> But when I submit the task from a different system, it uses just that system
>> and queues the tasks instead of running them on the other slaves.
>> Sometimes I see the message "Failed to getgid: unknown user".
>>
>> Do I need to start some process so that the tasks are distributed across all
>> the slaves equally? Am I missing something here?
>>
>> Regards,
>> Pradeep
>>
>>
>>
>> On 2 October 2015 at 15:07, Ondrej Smola  wrote:
>>
>>> Hi Pradeep,
>>>
>>> the problem is with the IP your slave advertises; Mesos by default resolves
>>> your hostname. There are several solutions (let's say your node IP is
>>> 192.168.56.128):
>>>
>>> 1)  export LIBPROCESS_IP=192.168.56.128
>>> 2)  set mesos options - ip, hostname
>>>
>>> one way to do this is to create files
>>>
>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>
>>> for more configuration options see
>>> http://mesos.apache.org/documentation/latest/configuration
>>>
>>>
>>>
>>>
>>>
>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale :
>>>
 Hi Guangya,

 Thanks for the reply. I found one interesting log message.

  7410 master.cpp:5977] Removed slave
 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
 registered at the same address

 Mostly because of this issue, the slave nodes keep getting registered and
 de-registered to make room for the next node. I can even see this in the UI:
 for some time one node is added, and after a while it is replaced by the new
 slave node.

 The above log is followed by the below log messages.


 I1002 10:01:12.753865  7416 leveldb.cpp:343] Persisting action (18
 bytes) to leveldb took 104089ns
 I1002 10:01:12.753885  7416 replica.cpp:679] Persisted action at 384
 E1002 10:01:12.753891  7417 process.cpp:1912] Failed to shutdown socket
 with fd 15: Transport endpoint is not connected
 I1002 10:01:12.753988  7413 master.cpp:3930] Registered slave
 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
 ports(*):[31000-32000]
 I1002 10:01:12.754065  7413 master.cpp:1080] Slave
 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
 (192.168.0.116) disconnected
 I1002 10:01:12.754072  7416 hierarchical.hpp:675] Added slave
 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
 mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
 I1002 10:01:12.754084  7413 master.cpp:2534] Disconnecting slave
 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
 (192.168.0.116)
 E1002 10:01:12.754118  7417 process.cpp:1912] Failed to shutdown socket
 with fd 16: Transport endpoint is not connected
 I1002 10:01:12.754132  7413 master.cpp:2553] Deactivating slave
 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
 (192.168.0.116)
 I1002 10:01:12.754237  7416 hierarchical.hpp:768] Slave
 6a11063e-b8

Re: Running a task in Mesos cluster

2015-10-03 Thread Ondrej Smola
A Mesos framework receives offers, and based on those offers it decides where
to run tasks.


mesos-execute is a little framework that executes your task (hackbench); see
here: https://github.com/apache/mesos/blob/master/src/cli/execute.cpp

At https://github.com/apache/mesos/blob/master/src/cli/execute.cpp#L320 you
can see that it uses the user that runs the mesos-execute command.

The error you see should come from here (the su command):
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/posix/os.hpp#L520

Under which user do you run mesos-execute and the mesos daemons?
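
A minimal sketch of the usual workaround (the account name "mesosjobs" and the
host names below are made up): create the same account on every slave so the
executor's su succeeds, then submit as that user from the client.

  # create a shared account on each slave (hypothetical names)
  for host in slave1 slave2 slave3; do
      ssh "$host" 'sudo useradd --create-home mesosjobs'
  done

  # submit from the client as that user
  sudo -u mesosjobs /src/mesos-execute --master=192.168.0.102:5050 \
      --name="cluster-test" \
      --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" \
      --resources="cpus(*):3;mem(*):2560"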

2015-10-02 15:26 GMT+02:00 Pradeep Kiruvale :

> Hi Ondrej,
>
> Thanks for your reply
>
> I did solve that issue; yes, you are right, there was an issue with the slave
> IP address setting.
>
> Now I am facing an issue with scheduling the tasks. When I try to schedule a
> task using
>
> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
> --resources="cpus(*):3;mem(*):2560"
>
> The tasks always get scheduled on the same node. The resources from the
> other nodes are not getting used to schedule the tasks.
>
>  I just start the mesos slaves like below
>
> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos  --hostname=slave1
>
> If I submit the task using the above (mesos-execute) command from one of the
> slaves, it runs on that system.
>
> But when I submit the task from a different system, it uses just that system
> and queues the tasks instead of running them on the other slaves.
> Sometimes I see the message "Failed to getgid: unknown user".
>
> Do I need to start some process so that the tasks are distributed across all
> the slaves equally? Am I missing something here?
>
> Regards,
> Pradeep
>
>
>
> On 2 October 2015 at 15:07, Ondrej Smola  wrote:
>
>> Hi Pradeep,
>>
>> the problem is with the IP your slave advertises; Mesos by default resolves
>> your hostname. There are several solutions (let's say your node IP is
>> 192.168.56.128):
>>
>> 1)  export LIBPROCESS_IP=192.168.56.128
>> 2)  set mesos options - ip, hostname
>>
>> one way to do this is to create files
>>
>> echo "192.168.56.128" > /etc/mesos-slave/ip
>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>
>> for more configuration options see
>> http://mesos.apache.org/documentation/latest/configuration
>>
>>
>>
>>
>>
>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale :
>>
>>> Hi Guangya,
>>>
>>> Thanks for the reply. I found one interesting log message.
>>>
>>>  7410 master.cpp:5977] Removed slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>> registered at the same address
>>>
>>> Mostly because of this issue, the slave nodes keep getting registered and
>>> de-registered to make room for the next node. I can even see this in the UI:
>>> for some time one node is added, and after a while it is replaced by the new
>>> slave node.
>>>
>>> The above log is followed by the below log messages.
>>>
>>>
>>> I1002 10:01:12.753865  7416 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 104089ns
>>> I1002 10:01:12.753885  7416 replica.cpp:679] Persisted action at 384
>>> E1002 10:01:12.753891  7417 process.cpp:1912] Failed to shutdown socket
>>> with fd 15: Transport endpoint is not connected
>>> I1002 10:01:12.753988  7413 master.cpp:3930] Registered slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>> ports(*):[31000-32000]
>>> I1002 10:01:12.754065  7413 master.cpp:1080] Slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116) disconnected
>>> I1002 10:01:12.754072  7416 hierarchical.hpp:675] Added slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>> I1002 10:01:12.754084  7413 master.cpp:2534] Disconnecting slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116)
>>> E1002 10:01:12.754118  7417 process.cpp:1912] Failed to shutdown socket
>>> with fd 16: Transport endpoint is not connected
>>> I1002 10:01:12.754132  7413 master.cpp:2553] Deactivating slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116)
>>> I1002 10:01:12.754237  7416 hierarchical.hpp:768] Slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>> I1002 10:01:12.754240  7413 replica.cpp:658] Replica received learned
>>> notice for position 384
>>> I1002 10:01:12.754360  7413 leveldb.cpp:343] Persisting action (20
>>> bytes) to leveldb took 95171ns
>>> I1002 10:01:12.754395  7413 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 20333ns
>>> I1002 10:01:12.754406  7413 replica.cpp:679] Persisted action at 384
>>>
>>>
>>> Thanks,
>>> Pradeep
>>>
>>> On 2 October 2015 at 02:35,