great. would be awesome if this can be included in 0.18.1. On Mon, Oct 30, 2017 at 3:18 PM, Erb, Stephan <stephan....@blue-yonder.com> wrote:
> This problem reminds me of a patch I have added for the observer to commit > suicide on unhandled errors in threads. See https://github.com/apache/ > aurora/commit/d56f8c64466a94a990db3308a3130d3fce0584af > > > > I will prepare a similar patch for the executor. > > > > *From: *Bill Farner <wfar...@apache.org> > *Reply-To: *"user@aurora.apache.org" <user@aurora.apache.org> > *Date: *Friday, 27. October 2017 at 05:34 > *To: *"user@aurora.apache.org" <user@aurora.apache.org> > *Subject: *Re: orphaned thermos > > > > If the executor runs out of memory, i think it should be assumed that it > will no longer be well-behaved. It seems most sensible for the mesos agent > to clean up in this case. > > > > On Thu, Oct 26, 2017 at 11:56 AM, Mohit Jaggi <mohit.ja...@uber.com> > wrote: > > We found several zombie executors on a cluster. Thermos logs indicate > reaching system limits while trying to shutdown(?). Mesos agent is unable > to get status of this container from docker daemon (docker inspect fails). > Shouldn't thermos exit in such a case? > > > > > > 22 WARNING: Your kernel does not support swap limit capabilities, memory > limited without swap. > > 23 twitter.common.app debug: Initializing: twitter.common.log (Logging > subsystem.) > > 24 Writing log files to disk in /mnt/mesos/sandbox > > 25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0 > > 26 I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent > b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295 > > 27 Writing log files to disk in /mnt/mesos/sandbox > > 28 Traceback (most recent call last): > > 29 File > "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", > line 1 26, in _excepting_run > > 30 self.__real_run(*args, **kw) > > 31 File "apache/thermos/monitoring/resource.py", line 243, in run > > 32 File > "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", > lin e 79, in wait > > 33 thread.start() > > 34 File "/usr/lib/python2.7/threading.py", line 745, in start > > 35 _start_new_thread(self.__bootstrap, ()) > > 36 thread.error: can't start new thread > > 37 ERROR] *Failed to stop health checkers:* > > 38 ERROR] Traceback (most recent call last): > > 39 File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown > > 40 propagate_deadline(self._chained_checker.stop, > timeout=self.STOP_TIMEOUT) > > 41 File "apache/aurora/executor/aurora_executor.py", line 35, in > propagate_deadline > > 42 return deadline(*args, daemon=True, propagate=True, **kw) > > 43 File > "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", > line 6 1, in deadline > > 44 AnonymousThread().start() > > 45 File "/usr/lib/python2.7/threading.py", line 745, in start > > 46 _start_new_thread(self.__bootstrap, ()) > > 47 *error: can't start new thread* > > 48 > > 49 ERROR]* Failed to stop runner:* > > 50 ERROR] Traceback (most recent call last): > > 51 File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown > > 52 propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT) > > 53 File "apache/aurora/executor/aurora_executor.py", line 35, in > propagate_deadline > > 54 return deadline(*args, daemon=True, propagate=True, **kw) > > 55 File > "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", > line 6 1, in deadline > > 56 AnonymousThread().start() > > 57 File "/usr/lib/python2.7/threading.py", line 745, in start > > 58 _start_new_thread(self.__bootstrap, ()) > > 59 *error: can't start new thread* > > 60 > > 61 Traceback (most recent call last): > > 62 File > "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", > line 1 26, in _excepting_run > > 63 self.__real_run(*args, **kw) > > 64 File "apache/aurora/executor/status_manager.py", line 62, in run > > 65 File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown > > 66 File > "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", > line 5 6, in defer > > 67 deferred.start() > > 68 File "/usr/lib/python2.7/threading.py", line 745, in start > > 69 _start_new_thread(self.__bootstrap, ()) > > 70* thread.error: can't start new thread* > > >