We had problems with Docker version >= 1.9 (yours is even newer), as noted in https://singa.incubator.apache.org/docs/docker.html#launch_pseudo
Basically, newer versions of Docker changed the DNS resolution mechanism: the Docker daemon no longer updates the /etc/hosts file of existing containers when a new one is launched.

One suggestion is to downgrade Docker to 1.8:

    sudo apt-get install docker-engine=1.8.3-0~trusty

Another option is to enter the IP addresses manually into the /etc/hosts files. But we have not tried it with Weave, so there is a high chance that it won't work with Weave.

On 22 June 2016 at 14:39, Venkat Katta <ska...@adobe.com> wrote:

> docker version : 1.11.2
>
> regards,
> venkat satish katta
> ------------------------------
> *From:* Anh Dinh <dinh...@comp.nus.edu.sg>
> *Sent:* Wednesday, June 22, 2016 12:04:56 PM
> *To:* Wang Wei; Venkat Katta
>
> *Cc:* dev@singa.incubator.apache.org
> *Subject:* Re: Error while running singa on mesos
>
> what version of Docker are you running?
>
> Anh.
>
>
> On 22 June 2016 at 14:26, Wang Wei <wang...@apache.org> wrote:
>
>>
>> ---------- Forwarded message ----------
>> From: Venkat Katta <ska...@adobe.com>
>> Date: Wed, Jun 22, 2016 at 1:31 PM
>> Subject: Re: Error while running singa on mesos
>> To: Wang Wei <wang...@apache.org>
>>
>>
>> It works fine if I replace the node0 and node2 with their IP address. I
>> am using weave for transparent communication between the containers. In
>> singa.conf to connect to zookeeper i used node0 but not the ipaddress of
>> node0 it is able to connect why can't singa resolve the hostname. And while
>> running singa with mesos it is using localhost rather ip address node1 and
>> node2, also we are not giving any arguement while running the singa
>> regarding ip address of the slaves.
>>
>>
>> F0622 05:18:28.932391 1513 socket.cc:98] Check failed: port != -1 (-1
>> vs. -1) tcp://localhost:*
>>
>>
>> Thanks,
>>
>> Venkat satish katta
>> ------------------------------
>> *From:* Wang Wei <wang...@apache.org>
>> *Sent:* Wednesday, June 22, 2016 8:46:36 AM
>> *To:* Venkat Katta
>>
>> *Subject:* Re: Error while running singa on mesos
>>
>> If you are using Docker (withou mesos), it could be the problem of
>> network routing. May need to configure the Docker to setup the network then
>> node0 and node2 can be accessed from node1.
>> We are trying your configuration.
>>
>> regards,
>> wang wei
>>
>>
>> On Wed, Jun 22, 2016 at 10:32 AM, Wang Wei <wang...@apache.org> wrote:
>>
>>> Hi Venkat,
>>>
>>> It should be the problem of the node address.
>>> Pls replace node0 and node2 with their IP addresses.
>>>
>>> regards,
>>> wei
>>>
>>> On Wed, Jun 22, 2016 at 2:40 AM, Venkat Katta <ska...@adobe.com> wrote:
>>>
>>>> i tried running without mesos i got the same error
>>>>
>>>>
>>>> root@node0:~/incubator-singa# ./bin/singa-run.sh -conf
>>>> examples/cifar10/hybrid.conf
>>>> Unique JOB_ID is 4
>>>> Record job information to /tmp/singa-log/job-info/job-4-20160621-183305
>>>> Executing @ node2 : cd /root/incubator-singa; source
>>>> /root/incubator-singa/conf/profile; ./singa -singa_conf
>>>> /root/incubator-singa/conf/singa.conf -singa_job 4 -conf
>>>> /root/incubator-singa/examples/cifar10/hybrid.conf
>>>> Executing @ node0 : cd /root/incubator-singa; source
>>>> /root/incubator-singa/conf/profile; ./singa -singa_conf
>>>> /root/incubator-singa/conf/singa.conf -singa_job 4 -conf
>>>> /root/incubator-singa/examples/cifar10/hybrid.conf
>>>> F0621 18:33:24.171468 725 socket.cc:98] Check failed: port != -1 (-1
>>>> vs. -1) tcp://node2:*
>>>> *** Check failure stack trace: ***
>>>> @ 0x7f10d0a6b9fd google::LogMessage::Fail()
>>>> @ 0x7f10d0a6d89d google::LogMessage::SendToLog()
>>>> @ 0x7f10d0a6b5ec google::LogMessage::Flush()
>>>> @ 0x7f10d0a6e1be google::LogMessageFatal::~LogMessageFatal()
>>>> @ 0x7f10d0e05d79 singa::Router::Bind()
>>>> @ 0x7f10d0d7a8bc singa::Driver::Train()
>>>> @ 0x7f10d0d7f48b singa::Driver::Train()
>>>> @ 0x40c915 main
>>>> @ 0x7f10c5f13f45 (unknown)
>>>> @ 0x40cb7e (unknown)
>>>> F0621 18:33:06.244278 1042 socket.cc:98] Check failed: port != -1 (-1
>>>> vs. -1) tcp://node0:*
>>>> *** Check failure stack trace: ***
>>>> @ 0x7f6d4516d9fd google::LogMessage::Fail()
>>>> @ 0x7f6d4516f89d google::LogMessage::SendToLog()
>>>> @ 0x7f6d4516d5ec google::LogMessage::Flush()
>>>> @ 0x7f6d451701be google::LogMessageFatal::~LogMessageFatal()
>>>> @ 0x7f6d45507d79 singa::Router::Bind()
>>>> @ 0x7f6d4547c8bc singa::Driver::Train()
>>>> @ 0x7f6d4548148b singa::Driver::Train()
>>>> @ 0x40c915 main
>>>> @ 0x7f6d3a615f45 (unknown)
>>>> @ 0x40cb7e (unknown)
>>>> bash: line 1: 725 Aborted (core dumped) ./singa
>>>> -singa_conf /root/incubator-singa/conf/singa.conf -singa_job 4 -conf
>>>> /root/incubator-singa/examples/cifar10/hybrid.conf -host node2
>>>> bash: line 1: 1042 Aborted (core dumped) ./singa
>>>> -singa_conf /root/incubator-singa/conf/singa.conf -singa_job 4 -conf
>>>> /root/incubator-singa/examples/cifar10/hybrid.conf -host node0
>>>> E0621 18:33:07.467438 1067 job_manager.cc:156] job 4 not exists
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* Wang Wei <wang...@apache.org>
>>>> *Sent:* Tuesday, June 21, 2016 7:09:46 PM
>>>> *To:* Venkat Katta
>>>> *Cc:* dev@singa.incubator.apache.org
>>>> *Subject:* Re: Error while running singa on mesos
>>>>
>>>> Hi,
>>>>
>>>> Can you try to run it without Mesos?
>>>> 1. Compile singa with enable-dist
>>>> 2. change conf/singa.conf to set the zookeeper host
>>>> 3. update the conf/hostfile one line per machine
>>>> 4. update the conf/profile to export LD_LIBRARY_PATH
>>>>
>>>> regards,
>>>> Wei
>>>>
>>>> On Tue, Jun 21, 2016 at 8:52 PM, Venkat Katta <ska...@adobe.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> I am actually trying to run singa on mesos in fully distributed
>>>>> architecture. I built the docker images as given in the documentation. I
>>>>> am using mesos 0.28.2 and singa 0.3-rc3.I am running each docker container
>>>>> using --net=host flag so that they take the ip of the system. Singa works
>>>>> as long as the workers are all in one machine .
>>>>> When I try to use two machines for training it shows error
>>>>>
>>>>>
>>>>> F0617 10:00:43.862246 2742 socket.cc:98] Check failed: port != -1 (-1
>>>>> vs. -1) tcp://localhost:*
>>>>>
>>>>>
>>>>> so while running the scheduler do we need to give it hostfile
>>>>> containing all the hosts. How does it know the remaining hosts in cluster.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> Venkat Satish Katta.
>>>>>
>>>>
>>>>
>>>
>>
>>
>
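
P.S. For the manual /etc/hosts option mentioned at the top of this message, a minimal sketch could look like the following. We have not tested it with this setup. The 192.0.2.x addresses are placeholders (use each container's real IP), and add_host_entry is a helper name made up for this example; it is not part of SINGA or Docker. It would have to be run as root inside every container.

```shell
#!/bin/sh
# Hypothetical helper: append "IP<TAB>hostname" to a hosts file, unless the
# hostname already appears as a name on some non-comment line.
# Usage: add_host_entry <hosts_file> <ip> <hostname>
add_host_entry() {
    hosts_file=$1
    ip=$2
    name=$3
    # awk exits 0 if the hostname is already mapped, non-zero otherwise.
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file" 2>/dev/null; then
        return 0
    fi
    printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
}

# Example calls (placeholder addresses, left commented out on purpose):
# add_host_entry /etc/hosts 192.0.2.10 node0
# add_host_entry /etc/hosts 192.0.2.11 node1
# add_host_entry /etc/hosts 192.0.2.12 node2
```

Running it twice is harmless, since an existing entry for the same hostname is left alone. If a container's IP changes after a restart, the stale line has to be fixed by hand.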