Hi Pradeep,

And thank you for your reply!

That, too, is very interesting. I think I need to synthesize what you and
Greg are telling me and come up with a clean solution. Agent nodes can
crash. Moreover, I can stop the mesos-slave service and start it later
with a reboot in between.

So I am interested in fully understanding the causal chain here before I
try to fix anything.

-Paul



On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell <arach...@gmail.com> wrote:

> Whoa...interesting!
>
> The node *may* have been rebooted. Uptime says 2 days. I'll need to check
> my notes.
>
> Can you point me to a reference on the Ubuntu behavior?
>
> Based on what you've told me so far, it sounds as if the sequence:
>
> stop service
> reboot agent node
> start service
>
>
> could lead to trouble - or do I misunderstand?
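>
> Spelled out, I mean roughly this on the agent node:
>
>     sudo service mesos-slave stop
>     sudo reboot
>     # ...once the node is back up:
>     sudo service mesos-slave start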
>
>
> Thank you again for your help.
>
> -Paul
>
> On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann <g...@mesosphere.io> wrote:
>
>> Paul,
>> This would be relevant for any system that automatically deletes files
>> in /tmp. It looks like on Ubuntu, the default behavior is for /tmp to
>> be completely nuked at boot time. Was the agent node rebooted prior to
>> this problem?
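>>
>> A quick way to check this on 14.04 (assuming the stock sysvinit/upstart
>> cleanup via /etc/default/rcS, rather than systemd-tmpfiles) is:
>>
>>     # TMPTIME controls how old files in /tmp must be before they are
>>     # deleted at boot; 0 (the default, I believe) means delete everything
>>     grep TMPTIME /etc/default/rcS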
>>
>> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell <arach...@gmail.com> wrote:
>>
>>> Hi Greg,
>>>
>>> Thanks very much for your quick reply.
>>>
>>> I simply forgot to mention the platform. It's Ubuntu 14.04 LTS and it's
>>> not systemd. I will look at the link you provided.
>>>
>>> Is there any chance that it might apply to non-systemd platforms?
>>>
>>> Cordially,
>>>
>>> Paul
>>>
>>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote:
>>>
>>>> Hi Paul,
>>>> Noticing the logging output, "Failed to find resources file
>>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>>>> may be related to the location of your agent's work_dir. See this ticket:
>>>> https://issues.apache.org/jira/browse/MESOS-4541
>>>>
>>>> Some users have reported issues resulting from the systemd-tmpfiles
>>>> service garbage-collecting files in /tmp; perhaps this is related? What
>>>> platform is your agent running on?
>>>>
>>>> You could try specifying a different agent work directory outside of
>>>> /tmp/ via the `--work_dir` command-line flag.
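>>>>
>>>> For example, assuming the Mesosphere .deb packaging (where each file
>>>> under /etc/mesos-slave becomes a command-line flag for the slave),
>>>> something like:
>>>>
>>>>     # /var/lib/mesos is just an example; any location outside /tmp works
>>>>     echo /var/lib/mesos | sudo tee /etc/mesos-slave/work_dir
>>>>     sudo service mesos-slave restart
>>>>
>>>> Otherwise, pass --work_dir=/var/lib/mesos directly on the mesos-slave
>>>> command line.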
>>>>
>>>> Cheers,
>>>> Greg
>>>>
>>>>
>>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am hoping someone can shed some light on this.
>>>>>
>>>>> An agent node failed to start; that is, when I ran "service
>>>>> mesos-slave start", the service came up briefly and then stopped. Before
>>>>> stopping, it produced the log shown below. The last thing it wrote was
>>>>> "Trying to create path '/mesos' in ZooKeeper".
>>>>>
>>>>> This mention of the /mesos znode prompted me to go for a clean slate
>>>>> by removing that znode from ZooKeeper.
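>>>>>
>>>>> In case the details matter, removing the znode can be done with the
>>>>> stock ZooKeeper CLI, roughly:
>>>>>
>>>>>     # connect to the ensemble named in --master (see the flags below)
>>>>>     zkCli.sh -server 71.100.202.191:2181
>>>>>     # then, inside the ZooKeeper shell, recursively delete the znode
>>>>>     rmr /mesos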
>>>>>
>>>>> After doing this, the mesos-slave service started perfectly.
>>>>>
>>>>> What might be happening here, and what's the right way to
>>>>> troubleshoot such a problem? Mesos is version 0.23.0.
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>> -Paul
>>>>>
>>>>>
>>>>> Log file created at: 2016/03/29 14:19:39
>>>>> Running on machine: 71.100.202.193
>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging
>>>>> started!
>>>>> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39
>>>>> by root
>>>>> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
>>>>> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
>>>>> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
>>>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>>>>> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
>>>>> posix/cpu,posix/mem
>>>>> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
>>>>> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
>>>>> 71.100.202.193:5051
>>>>> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
>>>>> --attributes="hostType:shard1" --authenticatee="crammd5"
>>>>> --cgroups_cpu_enable_pids_and_tids_count="false"
>>>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>>>>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>>>>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
>>>>> --default_role="*" --disk_watch_interval="1mins"
>>>>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
>>>>> --docker_remove_delay="6hrs"
>>>>> --docker_sandbox_directory="/mnt/mesos/sandbox"
>>>>> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
>>>>> --enforce_container_disk_quota="false"
>>>>> --executor_registration_timeout="5mins"
>>>>> --executor_shutdown_grace_period="5secs"
>>>>> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
>>>>> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
>>>>> --hadoop_home="" --help="false" --hostname="71.100.202.193"
>>>>> --initialize_driver_logging="true" --ip="71.100.202.193"
>>>>> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
>>>>> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
>>>>> --master="zk://71.100.202.191:2181/mesos"
>>>>> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
>>>>> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
>>>>> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
>>>>> --registration_backoff_factor="1secs"
>>>>> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
>>>>> --strict="true" --switch_user="true" --version="false"
>>>>> --work_dir="/tmp/mesos"
>>>>> I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
>>>>> mem(*):23089; disk(*):122517; ports(*):[31000-32000]
>>>>> I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname:
>>>>> 71.100.202.193
>>>>> I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
>>>>> I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
>>>>> '/tmp/mesos/meta'
>>>>> I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources
>>>>> file '/tmp/mesos/meta/resources/resources.info'
>>>>> I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
>>>>> 71.100.202.193:5051) connected to ZooKeeper
>>>>> I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations:
>>>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>>>> I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path
>>>>> '/mesos' in ZooKeeper
>>>>>
>>>>>
>>>>
>>>
>>
>
