Below are the logs from Master. -Pradeep
I1007 12:16:28.257853 8005 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took 119428ns
I1007 12:16:28.257884 8005 leveldb.cpp:401] Deleting ~2 keys from leveldb took 18847ns
I1007 12:16:28.257891 8005 replica.cpp:679] Persisted action at 1440
I1007 12:16:28.257912 8005 replica.cpp:664] Replica learned TRUNCATE action at position 1440
I1007 12:16:36.666616 8002 http.cpp:336] HTTP GET for /master/state.json from 192.168.0.102:40721 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'
I1007 12:16:39.126030 8001 master.cpp:2179] Received SUBSCRIBE call for framework 'Balloon Framework (C++)' at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.126428 8001 master.cpp:2250] Subscribing framework Balloon Framework (C++) with checkpointing disabled and capabilities [ ]
E1007 12:16:39.127459 8007 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
I1007 12:16:39.127535 8000 hierarchical.hpp:515] Added framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
I1007 12:16:39.127734 8001 master.cpp:1119] Framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:39.127765 8001 master.cpp:2475] Disconnecting framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:39.127768 8007 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I1007 12:16:39.127789 8001 master.cpp:2499] Deactivating framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.127879 8006 hierarchical.hpp:599] Deactivated framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
I1007 12:16:39.127913 8001 master.cpp:1143] Giving framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to failover
I1007 12:16:39.129273 8005 master.cpp:4815] Framework failover timeout, removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.129312 8005 master.cpp:5571] Removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.129858 8003 hierarchical.hpp:552] Removed framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
I1007 12:16:40.676519 8000 master.cpp:2179] Received SUBSCRIBE call for framework 'Balloon Framework (C++)' at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.676678 8000 master.cpp:2250] Subscribing framework Balloon Framework (C++) with checkpointing disabled and capabilities [ ]
I1007 12:16:40.677178 8006 hierarchical.hpp:515] Added framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
E1007 12:16:40.677217 8007 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
I1007 12:16:40.677409 8000 master.cpp:1119] Framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:40.677441 8000 master.cpp:2475] Disconnecting framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.677453 8000 master.cpp:2499] Deactivating framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:40.677459 8007 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
I1007 12:16:40.677501 8000 master.cpp:1143] Giving framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to failover
I1007 12:16:40.677520 8005 hierarchical.hpp:599] Deactivated framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
I1007 12:16:40.678864 8004 master.cpp:4815] Framework failover timeout, removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.678906 8004 master.cpp:5571] Removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.679147 8001 hierarchical.hpp:552] Removed framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
I1007 12:16:41.853121 8002 master.cpp:2179] Received SUBSCRIBE call for framework 'Balloon Framework (C++)' at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.853281 8002 master.cpp:2250] Subscribing framework Balloon Framework (C++) with checkpointing disabled and capabilities [ ]
E1007 12:16:41.853806 8007 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
I1007 12:16:41.853833 8004 hierarchical.hpp:515] Added framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
I1007 12:16:41.854032 8002 master.cpp:1119] Framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:41.854063 8002 master.cpp:2475] Disconnecting framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.854076 8002 master.cpp:2499] Deactivating framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:41.854080 8007 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
I1007 12:16:41.854126 8005 hierarchical.hpp:599] Deactivated framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
I1007 12:16:41.854121 8002 master.cpp:1143] Giving framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to failover
I1007 12:16:41.855482 8006 master.cpp:4815] Framework failover timeout, removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.855515 8006 master.cpp:5571] Removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.855692 8001 hierarchical.hpp:552] Removed framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
I1007 12:16:42.772830 8000 master.cpp:2179] Received SUBSCRIBE call for framework 'Balloon Framework (C++)' at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.772974 8000 master.cpp:2250] Subscribing framework Balloon Framework (C++) with checkpointing disabled and capabilities [ ]
I1007 12:16:42.773470 8004 hierarchical.hpp:515] Added framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
E1007 12:16:42.773495 8007 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
I1007 12:16:42.773679 8000 master.cpp:1119] Framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:42.773697 8000 master.cpp:2475] Disconnecting framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.773708 8000 master.cpp:2499] Deactivating framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:42.773710 8007 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I1007 12:16:42.773761 8000 master.cpp:1143] Giving framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to failover
I1007 12:16:42.773779 8001 hierarchical.hpp:599] Deactivated framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
I1007 12:16:42.775089 8005 master.cpp:4815] Framework failover timeout, removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.775126 8005 master.cpp:5571] Removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.775324 8005 hierarchical.hpp:552] Removed framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
I1007 12:16:47.665941 8001 http.cpp:336] HTTP GET for /master/state.json from 192.168.0.102:40722 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'

On 7 October 2015 at 12:12, Guangya Liu <gyliu...@gmail.com> wrote:

> Hi Pradeep,
>
> Can you please append more log for your master node? Just want to see what
> is wrong with your master, and why the framework starts to fail over.
>
> Thanks,
>
> Guangya
>
> On Wed, Oct 7, 2015 at 5:27 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>
>> Hi Guangya,
>>
>> I am running a framework from some other physical node, which is part of
>> the same network. Still I am getting the below messages and the framework
>> is not getting registered.
>>
>> Any idea what is the reason?
>>
>> I1007 11:24:58.781914 32392 master.cpp:4815] Framework failover timeout, removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>> I1007 11:24:58.781968 32392 master.cpp:5571] Removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>> I1007 11:24:58.782352 32392 hierarchical.hpp:552] Removed framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019
>> E1007 11:24:58.782577 32399 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
>> I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for framework 'Balloon Framework (C++)' at scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>> I1007 11:24:59.699717 32396 master.cpp:2250] Subscribing framework Balloon Framework (C++) with checkpointing disabled and capabilities [ ]
>> I1007 11:24:59.700251 32393 hierarchical.hpp:515] Added framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0020
>> E1007 11:24:59.700253 32399 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
>>
>> Regards,
>> Pradeep
>>
>> On 5 October 2015 at 13:51, Guangya Liu <gyliu...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> I think the problem might be caused by the fact that you are running the lxc
>>> container on the master node; I am not sure if there is a port conflict or
>>> something else wrong.
>>>
>>> In my case, I was running the client on a new node, not on the master
>>> node. Perhaps you can try putting your client on a new node rather than
>>> on the master node.
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>> Hmm! That is strange in my case!
>>>>
>>>> If I run mesos-execute from one of the slave/master nodes, then
>>>> the tasks get their resources and they get scheduled well.
>>>> But if I start mesos-execute on another node which is neither
>>>> slave nor master, then I have this issue.
>>>>
>>>> I am using an lxc container on the master as a client to launch the tasks.
>>>> This is also in the same network as the master/slaves,
>>>> and I just launch the task as you did. But the tasks are not getting
>>>> scheduled.
>>>>
>>>> On the master the logs are the same as I sent you before:
>>>>
>>>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>
>>>> On both of the slaves I can see the below logs:
>>>>
>>>> I1005 13:23:32.547987 4831 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0060 by master@192.168.0.102:5050
>>>> W1005 13:23:32.548135 4831 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>>>> I1005 13:23:33.697707 4833 slave.cpp:3926] Current disk usage 3.60%. Max allowed age: 6.047984349521910days
>>>> I1005 13:23:34.098599 4829 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0061 by master@192.168.0.102:5050
>>>> W1005 13:23:34.098740 4829 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>>>> I1005 13:23:35.274569 4831 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0062 by master@192.168.0.102:5050
>>>> W1005 13:23:35.274683 4831 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>>>> I1005 13:23:36.193964 4829 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0063 by master@192.168.0.102:5050
>>>> W1005 13:23:36.194090 4829 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>>>> I1005 13:24:01.914788 4827 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0064 by master@192.168.0.102:5050
>>>> W1005 13:24:01.914937 4827 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>>>> I1005 13:24:03.469974 4833 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0065 by master@192.168.0.102:5050
>>>> W1005 13:24:03.470118 4833 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>>>> I1005 13:24:04.642654 4826 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0066 by master@192.168.0.102:5050
>>>> W1005 13:24:04.642812 4826 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>
>>>> On 5 October 2015 at 13:09, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> From your log, it seems that the master process is exiting, and this
>>>>> caused the framework to fail over to another mesos master. Can you please
>>>>> show more detail on the steps to reproduce your issue?
>>>>>
>>>>> I did some testing by running mesos-execute on a client host which does
>>>>> not have any mesos service, and the task scheduled well.
>>>>>
>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"
>>>>> I1005 18:59:47.974123 1233 sched.cpp:164] Version: 0.26.0
>>>>> I1005 18:59:47.990890 1248 sched.cpp:262] New master detected at master@192.168.0.107:5050
>>>>> I1005 18:59:47.993074 1248 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>>>> I1005 18:59:48.001194 1249 sched.cpp:641] Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>> Received status update TASK_FINISHED for task cluster-test
>>>>> I1005 18:59:58.431144 1249 sched.cpp:1771] Asked to stop the driver
>>>>> I1005 18:59:58.431591 1249 sched.cpp:1040] Stopping framework '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>>>> root 1259 1159 0 19:06 pts/0 00:00:00 grep --color=auto mesos
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>>>>>
>>>>>> Hi Guangya,
>>>>>>
>>>>>> I am facing one more issue. If I try to schedule tasks from some
>>>>>> external client system running the same cli mesos-execute,
>>>>>> the tasks are not getting launched.
>>>>>> The tasks reach the Master and it just drops the requests; below are the related logs:
>>>>>>
>>>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework with checkpointing disabled and capabilities [ ]
>>>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
>>>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 disconnected
>>>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
>>>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns to failover
>>>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the framework has terminated or is inactive
>>>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>
>>>>>> Can you please tell me what the reason is? The client is in the same
>>>>>> network as well, but it does not run any master or slave processes.
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Pradeep
>>>>>>
>>>>>> On 5 October 2015 at 12:13, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> Glad it finally works! Not sure if you are using systemd.slice or
>>>>>>> not; are you running into this issue:
>>>>>>> https://issues.apache.org/jira/browse/MESOS-1195
>>>>>>>
>>>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Guangya
>>>>>>>
>>>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Guangya,
>>>>>>>>
>>>>>>>> Thanks for sharing the information.
>>>>>>>>
>>>>>>>> Now I could launch the tasks. The problem was with permissions:
>>>>>>>> if I start all the slaves and the Master as root, it works fine;
>>>>>>>> otherwise I have problems launching the tasks.
>>>>>>>>
>>>>>>>> But on one of the slaves I could not launch the slave as root; I am
>>>>>>>> facing the following issue:
>>>>>>>>
>>>>>>>> Failed to create a containerizer: Could not create MesosContainerizer: Failed to create launcher: Failed to create Linux launcher: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already attached to another hierarchy
>>>>>>>>
>>>>>>>> I took that node out of the cluster for now. The tasks are getting
>>>>>>>> scheduled on the other two slave nodes.
>>>>>>>>
>>>>>>>> Thanks for your timely help.
>>>>>>>>
>>>>>>>> -Pradeep
>>>>>>>>
>>>>>>>> On 5 October 2015 at 10:54, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> My steps were pretty simple, just as in
>>>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>>>
>>>>>>>>> On the Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>>>> On the 3 slave nodes: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>>>
>>>>>>>>> Then schedule a task on any of the nodes. Here I was using slave
>>>>>>>>> node mesos007; you can see that the two tasks were launched on different
>>>>>>>>> hosts.
>>>>>>>>>
>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>>>>>>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at master@192.168.0.107:5050
>>>>>>>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>>>>>>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>> ^C
>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>>>>>>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at master@192.168.0.107:5050
>>>>>>>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>>>>>>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Guangya
>>>>>>>>>
>>>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guangya,
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply.
>>>>>>>>>>
>>>>>>>>>> I just want to know how you launched the tasks.
>>>>>>>>>>
>>>>>>>>>> 1. What processes have you started on the Master?
>>>>>>>>>> 2. What processes have you started on the Slaves?
>>>>>>>>>>
>>>>>>>>>> I am missing something here; otherwise all my slaves have enough
>>>>>>>>>> memory and cpus to launch the tasks I mentioned.
>>>>>>>>>> What I am missing is some configuration steps.
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Pradeep
>>>>>>>>>>
>>>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>
>>>>>>>>>>> I did some testing with your case and found that the task can run
>>>>>>>>>>> randomly on the three slave hosts; every run may have a different result.
>>>>>>>>>>> The logic is here:
>>>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>>>> The allocator randomly shuffles the slaves every time it
>>>>>>>>>>> allocates resources for offers.
>>>>>>>>>>>
>>>>>>>>>>> I see that every one of your tasks needs at minimum the resources
>>>>>>>>>>> "cpus(*):3;mem(*):2560"; can you check whether all of
>>>>>>>>>>> your slaves have enough resources? If you want your tasks to run on other
>>>>>>>>>>> slaves, then those slaves need to have at least 3 cpus and 2560M of memory free.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your reply.
>>>>>>>>>>>>
>>>>>>>>>>>> I did solve that issue; yes, you are right, there was an issue
>>>>>>>>>>>> with the slave IP address setting.
>>>>>>>>>>>>
>>>>>>>>>>>> Now I am facing an issue with scheduling the tasks. When I try
>>>>>>>>>>>> to schedule a task using
>>>>>>>>>>>>
>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>
>>>>>>>>>>>> the tasks always get scheduled on the same node. The resources
>>>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>
>>>>>>>>>>>> I just start the mesos slaves like below:
>>>>>>>>>>>>
>>>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
>>>>>>>>>>>>
>>>>>>>>>>>> If I submit the task using the above (mesos-execute) command
>>>>>>>>>>>> from one of the slaves, it runs on that system.
>>>>>>>>>>>>
>>>>>>>>>>>> But when I submit the task from some different system, it uses
>>>>>>>>>>>> just that system and queues the tasks; it does not run them on the other slaves.
>>>>>>>>>>>> Sometimes I see the message "Failed to getgid: unknown user".
>>>>>>>>>>>>
>>>>>>>>>>>> Do I need to start some process to push the tasks onto all the
>>>>>>>>>>>> slaves equally? Am I missing something here?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>
>>>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <ondrej.sm...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>
>>>>>>>>>>>>> the problem is with the IP your slave advertises - mesos by default
>>>>>>>>>>>>> resolves your hostname - there are several solutions (let's say your node IP
>>>>>>>>>>>>> is 192.168.56.128):
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>>>> 2) set the mesos options - ip, hostname
>>>>>>>>>>>>>
>>>>>>>>>>>>> one way to do this is to create files
>>>>>>>>>>>>>
>>>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>>>
>>>>>>>>>>>>> for more configuration options see
>>>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pradeepkiruv...@gmail.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the reply. I found one interesting log message.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 7410 master.cpp:5977] Removed slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave registered at the same address
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mostly because of this issue, the systems/slave nodes are
>>>>>>>>>>>>>> getting registered and de-registered to make room for the next node. I
>>>>>>>>>>>>>> can even see this on the UI interface: for some time one node gets added, and after
>>>>>>>>>>>>>> some time it is replaced with the new slave node.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>>>>>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned notice for position 384
>>>>>>>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from leveldb took 20333ns
>>>>>>>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please check some of my questions inline.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <pradeepkiruv...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master and 3 Slaves.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One slave runs on the Master node itself, and the other slaves
>>>>>>>>>>>>>>>> run on different nodes. Here "node" means a physical box.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried running tasks after configuring a one-node cluster, and
>>>>>>>>>>>>>>>> tested task scheduling using mesos-execute; that works fine.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I configure a three-node cluster (1 master and 3 slaves)
>>>>>>>>>>>>>>>> and try to see the resources on the master (in the GUI), only the Master node's
>>>>>>>>>>>>>>>> resources are visible.
>>>>>>>>>>>>>>>> The other nodes' resources are not visible; sometimes they are
>>>>>>>>>>>>>>>> visible but in a deactivated state.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>>>>> mesos-master? There should be some logs in either the master or the slave telling
>>>>>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes
>>>>>>>>>>>>>>>> are in the same network.*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the tasks always get scheduled on the same node. The
>>>>>>>>>>>>>>>> resources from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Based on your previous question, there is only one node in
>>>>>>>>>>>>>>> your cluster; that's why the other nodes are not available. We need to first
>>>>>>>>>>>>>>> identify what is wrong with the other three nodes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Is it required to register the frameworks from every
>>>>>>>>>>>>>>>> slave node on the Master?*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>>> Pradeep
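A pattern that runs through the whole thread: every scheduler registers from 127.0.1.1 (the Debian/Ubuntu hosts-file alias for the local hostname), the master accepts the SUBSCRIBE but cannot connect back ("Failed to shutdown socket ... Transport endpoint is not connected"), and the framework is torn down with a 0ns failover. Ondrej's first suggestion can be sketched as below; 192.168.0.103 is an assumed address for the client box, not one taken from the logs.

```shell
# Make libprocess bind and advertise the routable interface instead of the
# 127.0.1.1 address that the hostname resolves to via /etc/hosts.
# 192.168.0.103 is a placeholder for the client node's real LAN address.
export LIBPROCESS_IP=192.168.0.103
echo "libprocess will advertise: $LIBPROCESS_IP"
# Launch the framework (or mesos-execute) from this same shell afterwards.
```

The same variable applies to mesos-slave; the flag-based equivalent is shown in the next sketch.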
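Ondrej's second option (files under /etc/mesos-slave/) relies on the packaged init wrapper, which, as far as I understand it, turns each file's name/content pair into a command-line flag for mesos-slave. A self-contained sketch of that translation, using a temporary directory as a stand-in for /etc/mesos-slave:

```shell
# Sketch (assumption: packaged init scripts build flags this way).
# Each file <name> containing <value> becomes --<name>=<value>.
confdir=$(mktemp -d)
echo "192.168.56.128" > "$confdir/ip"
echo "abc.mesos.com"  > "$confdir/hostname"

flags=""
for f in "$confdir"/*; do
  flags="$flags --$(basename "$f")=$(cat "$f")"
done
# Globbing is lexical, so hostname comes before ip:
echo "mesos-slave$flags"
# prints: mesos-slave --hostname=abc.mesos.com --ip=192.168.56.128
rm -rf "$confdir"
```

With the real directory in place, the resulting `--ip`/`--hostname` flags keep the slave from advertising the 127.0.1.1 address seen in the master logs above.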
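On the "'freezer' is already attached to another hierarchy" error from one slave: the Linux launcher tries to mount the freezer cgroup subsystem at /sys/fs/cgroup/freezer and fails when something else (LXC, systemd, or an older mount) already holds that subsystem elsewhere. A quick, Mesos-independent check of who owns freezer:

```shell
# Show any existing mount of the freezer cgroup subsystem and its hierarchy
# entry; a freezer mount at a different path is what blocks mesos-slave's
# Linux launcher with the error quoted in the thread.
grep -w freezer /proc/mounts /proc/cgroups 2>/dev/null || echo "freezer not mounted"
```

If freezer is mounted under an LXC- or systemd-managed path, unmounting it (or running the slave outside that container) lets the launcher claim the hierarchy.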