Hi again,
An update on one of the issues: the lost database connection.
This seems to affect the GitPoller instance. Other database activity, such
as forced_scheduler and triggered jobs, works as normal.
It seems the GitPoller (maybe all pollers) is unable to recover from a
lost database connection; a full shutdown and restart is needed. This
looks similar to the worker reconnect failures I've mentioned: that code
is also unable to recover from a failed worker subscription, and the
connection ends up as a zombie, a connection that is still open but
effectively dead.
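To make the suspected race concrete, here is a minimal sketch (not Buildbot code, just an illustration): a service's reference to its master is cleared during deactivation while an in-flight poll still dereferences it, so instead of any reconnect logic running, the poll dies with an AttributeError.

```python
# Hypothetical illustration of the suspected race. A sighup/reconfigure
# deactivates the service and clears its "master" reference, but a poll
# that is already scheduled still runs afterwards and dereferences it.

class Poller:
    def __init__(self, master):
        self.master = master  # cleared when the service is deactivated

    def deactivate(self):
        # reconfigure/sighup tears the service down...
        self.master = None

    def poll(self):
        # ...but an in-flight poll still dereferences the master
        return self.master.db  # AttributeError when master is None


poller = Poller(master=object())
poller.deactivate()
try:
    poller.poll()
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'db'
```

The error message matches the traceback from today's incident.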
In the case earlier today, I got an exception during the sighup operation:
2019-03-16 13:02:31+0000 [-] while polling for changes
        Traceback (most recent call last):
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
            yield self.setState('lastRev', self.lastRev)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
            return _cancellableInlineCallbacks(gen)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
            _inlineCallbacks(None, g, status)
        --- <exception caught here> ---
          File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
            yield self.setState('lastRev', self.lastRev)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "sandbox/lib/python3.6/site-packages/buildbot/util/state.py", line 43, in setState
            yield self.master.db.state.setState(self._objectid, key, value)
        builtins.AttributeError: 'NoneType' object has no attribute 'db'

2019-03-16 13:02:31+0000 [-] Caught exception while deactivating ClusteredService(...)
        Traceback (most recent call last):
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
            _inlineCallbacks(r, g, status)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 341, in stopService
            log.err(e, _why="Caught exception while deactivating ClusteredService(%s)" % self.name)
        --- <exception caught here> ---
          File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 339, in stopService
            yield self._unclaimService()
          File "sandbox/lib/python3.6/site-packages/buildbot/changes/base.py", line 51, in _unclaimService
            return self.master.data.updates.trySetChangeSourceMaster(self.serviceid,
        builtins.AttributeError: 'NoneType' object has no attribute 'data'
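Both tracebacks fail the same way: self.master is already None by the time the callback runs, so the sighup has apparently detached the service while work was still in flight. A hedged sketch of the kind of guard that would avoid the crash (the names mirror the traceback; this is not actual Buildbot code, and set_state_safely is a hypothetical helper):

```python
# Hypothetical defensive guard: skip the state update when the service
# has already been detached from its master, instead of raising
# AttributeError from deep inside the deferred chain.

from types import SimpleNamespace

def set_state_safely(service, key, value):
    if service.master is None:
        # deactivated mid-poll (e.g. during sighup); nothing to persist
        return False
    service.master.db.state.setState(service._objectid, key, value)
    return True

# attached service: the state write goes through
store = {}
db = SimpleNamespace(state=SimpleNamespace(
    setState=lambda oid, k, v: store.__setitem__((oid, k), v)))
svc = SimpleNamespace(master=SimpleNamespace(db=db), _objectid=7)
print(set_state_safely(svc, 'lastRev', 'abc123'))  # True

# detached service (master is None): no crash, no write
svc.master = None
print(set_state_safely(svc, 'lastRev', 'def456'))  # False
```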
On Fri, 15 Mar 2019 02:29:30 +0100, Yngve N. Pettersen <yn...@vivaldi.com>
wrote:
Hi,
About a month ago we transferred our build system from the old
Chromium-developed buildbot system to one based on Buildbot 2.0. In that
period we have had a couple of major issues that I thought I'd summarize:
* We have had two crashes of the buildbot master process. I do not know
what caused the crashes, and the twisted.log does not contain any
information about what happened, so my guess is that either the
Ubuntu 18 Python 3.6 interpreter crashed, or the Twisted/buildbot code
died without logging anything.
* We have had at least two cases where the master lost its connection to
the database server and did not recover; restarting the master was the
only option. The probable common factor is that it seems to have
happened when using the reconfigure/sighup option to update the buildbot
configuration. In at least one case the log included an exception
regarding the database connection (which is to a remote PostgreSQL
server).
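For what it's worth, the database layer Buildbot is built on (SQLAlchemy) can validate pooled connections before handing them out, so a dropped PostgreSQL connection gets replaced rather than surfacing as a dead handle. Whether Buildbot exposes these knobs through its configuration is an assumption I have not verified; the fragment below is plain SQLAlchemy, not Buildbot config:

```python
# Plain SQLAlchemy illustration (not a Buildbot fix): pre-ping and
# recycle settings that make the pool recover from dropped connections.

from sqlalchemy import create_engine, text

engine = create_engine(
    "sqlite://",          # stand-in here for the real postgresql:// URL
    pool_pre_ping=True,   # issue a cheap ping before reusing a connection
    pool_recycle=3600,    # proactively recycle hour-old connections
)

with engine.connect() as conn:
    print(conn.execute(text("select 1")).scalar())
```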
* We have had a couple of cases where the network connection between the
master and some of the workers was interrupted. In the major case, this
led to having to restart the worker instances on all the affected
workers; this was the topic of an email to this list a few weeks ago.
The logs show that the workers reconnected correctly, but that the
master then failed (due to an exception) to register the worker. The
master did not cut the connection to the worker (so that it could try to
reconnect again), either when the registration failed or later when
checking open connections (if it does such a check), and it apparently
still responded to pings from the worker. It also did not detect that
the worker was not really connected when it pinged the worker while
trying to assign it a job.
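The behaviour I would have expected, sketched below with hypothetical names (this is not Buildbot code): if registering a newly connected worker fails, drop the transport so the worker's own reconnect logic can retry from a clean state, rather than leaving a half-registered zombie connection that still answers pings.

```python
# Hypothetical sketch: close the connection on registration failure so
# the worker retries, instead of keeping a half-registered connection.

class FakeConn:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def on_worker_connected(conn, register):
    """Register a worker; on any failure, cut the connection so the
    worker's reconnect logic can kick in."""
    try:
        register(conn)
    except Exception:
        conn.close()  # force a clean reconnect attempt
        return False
    return True

good = FakeConn()
print(on_worker_connected(good, lambda c: None))  # True

bad = FakeConn()
print(on_worker_connected(bad, lambda c: 1 / 0))  # False
print(bad.closed)                                 # True
```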
This reconnect issue is such a major problem and hassle that, when we
later did a restart of that network connection, we shut down the
*master* instance while taking the network connection down, and
restarted it afterwards.
--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS
_______________________________________________
users mailing list
users@buildbot.net
https://lists.buildbot.net/mailman/listinfo/users