Great. It would be awesome if this could be included in 0.18.1.

On Mon, Oct 30, 2017 at 3:18 PM, Erb, Stephan <stephan....@blue-yonder.com>
wrote:

> This problem reminds me of a patch I have added for the observer to commit
> suicide on unhandled errors in threads. See
> https://github.com/apache/aurora/commit/d56f8c64466a94a990db3308a3130d3fce0584af
>
>
>
> I will prepare a similar patch for the executor.
>
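For readers unfamiliar with the idea of exiting on unhandled errors in threads, here is a minimal sketch in Python 3. It is not the code from the linked commit: the class name `ExitOnErrorThread` and the `exit_fn` and `exit_code` parameters are assumptions made up for illustration. The point is only that a background thread which dies with an exception takes the whole process down instead of leaving it half-alive.

```python
import os
import threading
import traceback


class ExitOnErrorThread(threading.Thread):
    """Daemon thread that terminates the process if its target raises.

    Hypothetical helper for illustration; not the class used in Aurora.
    """

    def __init__(self, target, exit_fn=os._exit, exit_code=1):
        super().__init__(daemon=True)
        self._work = target          # callable to run in the thread
        self._exit_fn = exit_fn      # injectable so the behavior can be tested
        self._exit_code = exit_code

    def run(self):
        try:
            self._work()
        except Exception:
            # Log the failure, then end the process immediately.
            # os._exit skips cleanup handlers, so nothing can block the exit.
            traceback.print_exc()
            self._exit_fn(self._exit_code)
```

With the default `exit_fn`, `ExitOnErrorThread(some_callable).start()` ends the process with status 1 as soon as `some_callable` raises. A supervisor such as the Mesos agent can then notice the exit and clean up.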
>
>
> *From: *Bill Farner <wfar...@apache.org>
> *Reply-To: *"user@aurora.apache.org" <user@aurora.apache.org>
> *Date: *Friday, 27. October 2017 at 05:34
> *To: *"user@aurora.apache.org" <user@aurora.apache.org>
> *Subject: *Re: orphaned thermos
>
>
>
> If the executor runs out of memory, I think it should be assumed that it
> will no longer be well-behaved.  It seems most sensible for the mesos agent
> to clean up in this case.
>
>
>
> On Thu, Oct 26, 2017 at 11:56 AM, Mohit Jaggi <mohit.ja...@uber.com>
> wrote:
>
> We found several zombie executors on a cluster. The Thermos logs indicate
> that it hit system limits while trying to shut down (?). The Mesos agent is
> unable to get the status of this container from the Docker daemon (docker
> inspect fails). Shouldn't Thermos exit in such a case?
>
>
>
>
>
> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
> Writing log files to disk in /mnt/mesos/sandbox
> I1023 19:04:32.261165     7 exec.cpp:162] Version: 1.2.0
> I1023 19:04:32.264870    42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
> Writing log files to disk in /mnt/mesos/sandbox
> Traceback (most recent call last):
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/thermos/monitoring/resource.py", line 243, in run
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>     thread.start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> thread.error: can't start new thread
> ERROR] *Failed to stop health checkers:*
> ERROR] Traceback (most recent call last):
>   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>     propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>     return deadline(*args, daemon=True, propagate=True, **kw)
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>     AnonymousThread().start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> *error: can't start new thread*
>
> ERROR] *Failed to stop runner:*
> ERROR] Traceback (most recent call last):
>   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>     propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>     return deadline(*args, daemon=True, propagate=True, **kw)
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>     AnonymousThread().start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> *error: can't start new thread*
>
> Traceback (most recent call last):
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/aurora/executor/status_manager.py", line 62, in run
>   File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
>     deferred.start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> *thread.error: can't start new thread*
>
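In the log above, every shutdown step fails because the step itself needs a new thread. One way to avoid a zombie in that situation might be to treat a failed thread start as fatal. The sketch below is in Python 3 and is not executor code: `run_or_die` and its `exit_fn`, `exit_code` and `thread_factory` parameters are hypothetical names chosen for illustration.

```python
import os
import threading


def run_or_die(fn, exit_fn=os._exit, exit_code=1, thread_factory=threading.Thread):
    """Run fn on a helper thread; hard-exit if the thread cannot be started.

    Hypothetical helper for illustration only.
    """
    try:
        worker = thread_factory(target=fn, daemon=True)
        worker.start()
    except RuntimeError:
        # Python 3 raises RuntimeError("can't start new thread") here.
        # The Python 2.7 log above shows the older thread.error instead.
        # No thread means no orderly shutdown, so exit immediately.
        exit_fn(exit_code)
        return None
    return worker
```

The design choice here is to give up on a graceful stop once the process can no longer create threads, on the assumption that whatever supervises the process will reclaim its resources after it exits.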
>
>
