Hi Pradeep,

And thank you for your reply!
That, too, is very interesting. I think I need to synthesize what you and
Greg are telling me and come up with a clean solution.

Agent nodes can crash. Moreover, I can stop the mesos-slave service and
start it later with a reboot in between. So I am interested in fully
understanding the causal chain here before I try to fix anything.

-Paul

On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell <arach...@gmail.com> wrote:

> Whoa...interesting!
>
> The node *may* have been rebooted. Uptime says 2 days. I'll need to check
> my notes.
>
> Can you point me to a reference re the Ubuntu behavior?
>
> Based on what you've told me so far, it sounds as if the sequence:
>
>     stop service
>     reboot agent node
>     start service
>
> could lead to trouble - or do I misunderstand?
>
> Thank you again for your help.
>
> -Paul
>
> On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann <g...@mesosphere.io> wrote:
>
>> Paul,
>> This would be relevant for any system which is automatically deleting
>> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp
>> to be completely nuked at boot time. Was the agent node rebooted prior
>> to this problem?
>>
>> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell <arach...@gmail.com> wrote:
>>
>>> Hi Greg,
>>>
>>> Thanks very much for your quick reply.
>>>
>>> I simply forgot to mention the platform. It's Ubuntu 14.04 LTS and it's
>>> not systemd. I will look at the link you provided.
>>>
>>> Is there any chance that it might apply to non-systemd platforms?
>>>
>>> Cordially,
>>>
>>> Paul
>>>
>>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote:
>>>
>>>> Hi Paul,
>>>> Noticing the logging output, "Failed to find resources file
>>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>>>> may be related to the location of your agent's work_dir. See this
>>>> ticket: https://issues.apache.org/jira/browse/MESOS-4541
>>>>
>>>> Some users have reported issues resulting from the systemd-tmpfiles
>>>> service garbage-collecting files in /tmp; perhaps this is related?
>>>> What platform is your agent running on?
>>>>
>>>> You could try specifying a different agent work directory outside of
>>>> /tmp/ via the `--work_dir` command-line flag.
>>>>
>>>> Cheers,
>>>> Greg
>>>>
>>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am hoping someone can shed some light on this.
>>>>>
>>>>> An agent node failed to start; that is, when I did "service
>>>>> mesos-slave start" the service came up briefly and then stopped.
>>>>> Before stopping, it produced the log shown below. The last thing it
>>>>> wrote is "Trying to create path '/mesos' in ZooKeeper".
>>>>>
>>>>> This mention of the mesos znode prompted me to go for a clean slate
>>>>> by removing the mesos znode from ZooKeeper.
>>>>>
>>>>> After doing this, the mesos-slave service started perfectly.
>>>>>
>>>>> What might be happening here, and what's the right way to
>>>>> troubleshoot such a problem? Mesos is version 0.23.0.
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>> -Paul
>>>>>
>>>>> Log file created at: 2016/03/29 14:19:39
>>>>> Running on machine: 71.100.202.193
>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started!
>>>>> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by root
>>>>> I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0
>>>>> I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0
>>>>> I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>>>>> I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: posix/cpu,posix/mem
>>>>> I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave
>>>>> I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@71.100.202.193:5051
>>>>> I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: --attributes="hostType:shard1"
>>>>> --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false"
>>>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false"
>>>>> --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
>>>>> --default_role="*" --disk_watch_interval="1mins" --docker="/usr/local/ecxmcc/weaveShim"
>>>>> --docker_kill_orphans="true" --docker_remove_delay="6hrs"
>>>>> --docker_sandbox_directory="/mnt/mesos/sandbox" --docker_socket="/var/run/docker.sock"
>>>>> --docker_stop_timeout="15secs" --enforce_container_disk_quota="false"
>>>>> --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs"
>>>>> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home=""
>>>>> --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false"
>>>>> --hostname="71.100.202.193" --initialize_driver_logging="true" --ip="71.100.202.193"
>>>>> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
>>>>> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
>>>>> --master="zk://71.100.202.191:2181/mesos" --oversubscribed_resources_interval="15secs"
>>>>> --perf_duration="10secs" --perf_interval="1mins" --port="5051"
>>>>> --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect"
>>>>> --recovery_timeout="15mins" --registration_backoff_factor="1secs"
>>>>> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" --strict="true"
>>>>> --switch_user="true" --version="false" --work_dir="/tmp/mesos"
>>>>> I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; mem(*):23089; disk(*):122517; ports(*):[31000-32000]
>>>>> I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193
>>>>> I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true
>>>>> I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from '/tmp/mesos/meta'
>>>>> I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file '/tmp/mesos/meta/resources/resources.info'
>>>>> I0329 14:19:39.619730 5898 group.cpp:313] Group process (group(1)@71.100.202.193:5051) connected to ZooKeeper
>>>>> I0329 14:19:39.619760 5898 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>> I0329 14:19:39.619773 5898 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
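
For completeness, here is a minimal sketch of the work_dir change Greg
suggests, assuming the Mesosphere-style Debian packaging whose init wrapper
reads one flag value per file under /etc/mesos-slave/ (a source build would
instead pass --work_dir directly on the mesos-slave command line, and
/var/lib/mesos is only an example location):

    # Keep the agent's checkpointed state out of /tmp so it survives reboots.
    sudo mkdir -p /var/lib/mesos
    echo "/var/lib/mesos" | sudo tee /etc/mesos-slave/work_dir
    sudo service mesos-slave restart

On Ubuntu 14.04 (upstart, not systemd), the boot-time cleanup of /tmp is
controlled by the TMPTIME setting in /etc/default/rcS; if I remember the
default correctly, TMPTIME=0 empties the directory on every boot, which would
delete /tmp/mesos/meta and leave the agent with nothing to recover after a
reboot.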
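
The clean-slate step Paul describes (removing the /mesos znode) can be done
with the zkCli.sh shell that ships with ZooKeeper; the path below assumes the
Ubuntu zookeeper package layout:

    # Open a shell against the ensemble the agent points at:
    /usr/share/zookeeper/bin/zkCli.sh -server 71.100.202.191:2181
    # ...then, at the zkCli prompt, remove the znode and its children:
    rmr /mesos

Keep in mind that /mesos also holds the masters' leader-election state, so
deleting it is a blunt instrument, reasonable only when a genuinely clean
slate is wanted.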