Stephan Erb created AURORA-1801:
-----------------------------------

             Summary: TaskObserver thread stops refreshing after filesystem race condition
                 Key: AURORA-1801
                 URL: https://issues.apache.org/jira/browse/AURORA-1801
             Project: Aurora
          Issue Type: Bug
          Components: Observer
            Reporter: Stephan Erb

It seems that a race condition when accessing the Mesos filesystem layout can bubble up and terminate the {{TaskObserver}} thread responsible for refreshing the internal data structure of available tasks. Restarting the observer fixes the problem.

Exception triggering the issue:
{code}
Traceback (most recent call last):
  File "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.bce9e54ac7cded79a75603fb4e6bcef2c7d1e6bc/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/thermos/observer/task_observer.py", line 135, in run
  File "apache/thermos/observer/detector.py", line 74, in refresh
  File "apache/thermos/observer/detector.py", line 58, in _refresh_detectors
  File "apache/aurora/executor/common/path_detector.py", line 34, in get_paths
  File "apache/aurora/executor/common/path_detector.py", line 34, in <genexpr>
  File "apache/aurora/executor/common/path_detector.py", line 33, in iterate
  File "/usr/lib/python2.7/posixpath.py", line 376, in realpath
    resolved = _resolve_link(component)
  File "/usr/lib/python2.7/posixpath.py", line 399, in _resolve_link
    resolved = os.readlink(path)
OSError: [Errno 2] No such file or directory: '/var/lib/mesos/slaves/0768bcb3-205d-4409-a726-3001ad3ef902-S10/frameworks/20151001-085346-58917130-5050-37976-0000/executors/thermos-role-env-myname-0-f9fe0318-d39f-49d3-bdf8-e954d5879b33/runs/latest'
{code}

Solution space:
* terminate the observer process if the {{TaskObserver}} thread fails
* prevent unknown exceptions from aborting the {{TaskObserver}} run loop
* prevent the observed race condition in {{detector.py}} or {{path_detector.py}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
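
For illustration, the last two options in the solution space might look roughly like the sketch below. This is not the actual Aurora code: {{resolve_existing}} and {{refresh_safely}} are hypothetical helper names, and the sketch assumes the race is a sandbox path disappearing between being listed and being resolved, as the {{OSError: [Errno 2]}} in the traceback suggests.

```python
import errno
import os


def resolve_existing(paths, resolve=os.path.realpath):
    """Yield resolved paths, skipping any that vanish while being resolved.

    Hypothetical helper: `resolve` is injectable so the race can be
    simulated without touching the real filesystem.
    """
    for path in paths:
        try:
            yield resolve(path)
        except OSError as exc:
            # ENOENT means the path was removed underneath us (the race
            # from the traceback). Any other OS error is still raised.
            if exc.errno != errno.ENOENT:
                raise


def refresh_safely(refresh, log=print):
    """Run one refresh step; report a failure instead of propagating it.

    Hypothetical helper: returns True on success and False on failure,
    so a calling run loop can carry on with its next iteration.
    """
    try:
        refresh()
        return True
    except Exception as exc:
        log("refresh failed, will retry: %r" % (exc,))
        return False
```

A run loop would call {{refresh_safely}} once per iteration, so a single failed refresh is logged and retried rather than ending the thread. Catching every {{Exception}} is a deliberate trade-off in this sketch; the first option (terminating the whole process) would instead make the failure visible to whatever supervises the observer.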