Hi again,
A further update on that DB issue.
It seems that the last such incident caused two of the new branch
schedulers I was adding to not get properly registered. The schedulers
are listed, but as far as I can tell the git polling does not work, and
the restart I did to fix the DB issue did not fix this problem either.
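For anyone who wants to check for the same symptom: the data API records
which master, if any, has claimed each scheduler and change source, so
the claim status can be cross-checked over the REST interface. A minimal
sketch, assuming the default web UI address (adjust BASE to your setup):

    import json
    from urllib.request import urlopen

    BASE = "http://localhost:8010/api/v2"  # assumed web UI address

    for endpoint in ("schedulers", "changesources"):
        data = json.load(urlopen("%s/%s" % (BASE, endpoint)))
        for entry in data[endpoint]:
            # "master" is null for entries that are listed but not
            # claimed by any running master, i.e. not registered.
            status = "claimed" if entry.get("master") else "UNCLAIMED"
            print("%s %s: %s" % (endpoint, entry["name"], status))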
On Sat, 16 Mar 2019 19:22:14 +0100, Yngve N. Pettersen <yn...@vivaldi.com>
wrote:
Hi again,
An update about one of the issues, the lost database connection.
This seems to affect the GitPoller instance; other database activity,
such as forced_scheduler and triggered jobs, works as normal.
It seems that the GitPoller (maybe all pollers) is not able to recover
from a lost database connection, and a full shutdown and start is
needed to recover. This looks similar to the worker reconnect failures
I've mentioned: that code is also unable to recover from a failed
worker subscription, and the connection ends up as a zombie, nominally
alive but effectively dead.
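For what it's worth, the usual SQLAlchemy-level answer to dropped
database connections is pre-ping/recycle on the connection pool. I do
not know whether buildbot exposes these engine options through db_url,
so the following is only a standalone sketch of the mechanism, with a
made-up connection string:

    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql://buildbot:secret@db.example.com/buildbot",  # hypothetical
        pool_pre_ping=True,   # test each pooled connection before use
        pool_recycle=3600,    # retire connections older than an hour
    )

    # A pooled connection that died since its last use is replaced
    # transparently instead of failing the next query.
    with engine.connect() as conn:
        print(conn.execute(text("SELECT 1")).scalar())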
In the case earlier today, I got an exception during the sighup
operation:
2019-03-16 13:02:31+0000 [-] while polling for changes
    Traceback (most recent call last):
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
        yield self.setState('lastRev', self.lastRev)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
        return _cancellableInlineCallbacks(gen)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
        _inlineCallbacks(None, g, status)
    --- <exception caught here> ---
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
        yield self.setState('lastRev', self.lastRev)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/util/state.py", line 43, in setState
        yield self.master.db.state.setState(self._objectid, key, value)
    builtins.AttributeError: 'NoneType' object has no attribute 'db'
2019-03-16 13:02:31+0000 [-] Caught exception while deactivating ClusteredService(...)
    Traceback (most recent call last):
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
        _inlineCallbacks(r, g, status)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 341, in stopService
        log.err(e, _why="Caught exception while deactivating ClusteredService(%s)" % self.name)
    --- <exception caught here> ---
      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 339, in stopService
        yield self._unclaimService()
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/base.py", line 51, in _unclaimService
        return self.master.data.updates.trySetChangeSourceMaster(self.serviceid,
    builtins.AttributeError: 'NoneType' object has no attribute 'data'
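Both AttributeErrors point the same way: by the time the in-flight
poll() and the stopService() cleanup ran, the poller had apparently
already been detached from the master during the sighup, so self.master
was None. A toy model of that race, my illustration rather than actual
buildbot code:

    class TinyPoller:
        """Illustrates the suspected race; not buildbot code."""

        def __init__(self, master):
            self.master = master

        def detach(self):
            # During reconfig the service is disowned and loses its
            # reference to the master.
            self.master = None

        def poll(self):
            # A poll still in flight then dereferences None.
            return self.master.db

    poller = TinyPoller(master=object())
    poller.detach()
    try:
        poller.poll()
    except AttributeError as err:
        print(err)  # 'NoneType' object has no attribute 'db'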
On Fri, 15 Mar 2019 02:29:30 +0100, Yngve N. Pettersen
<yn...@vivaldi.com> wrote:
Hi,
About a month ago we transferred our build system from the old
Chromium-developed buildbot system to one based on Buildbot 2.0. In
that period we have had a couple of major issues that I thought I'd
summarize:
* We have had two crashes of the buildbot master process. I do not know
what caused the crashes, and twisted.log does not contain any
information about what happened, so my guess is that either the Ubuntu
18 Python 3.6 interpreter crashed, or the Twisted/buildbot code died
without logging anything (see the faulthandler sketch after this list).
* We have had at least two cases where the master lost its connection
to the database server and did not recover; restarting the master was
the only option. The probable common factor is that it seems to have
happened when using the reconfigure/sighup option to update the
buildbot configuration. In at least one case the log seemed to include
an exception regarding the database connection (which is to a remote
PostgreSQL server).
* We have had a couple of cases where the network connection between
the master and some of the workers was interrupted. In the major case
this led to having to restart the worker instances on all the affected
workers; this was the topic of an email to this list a few weeks ago.
The logs show that the workers correctly reconnected, but the master
then failed (due to an exception) to register them, and it failed to
cut the connection to the worker (so that it could try to reconnect
again) either when the registration failed or later when checking open
connections (if it does such a check), and it apparently also kept
responding to pings from the worker. Nor did it detect that a worker
was not really connected when it pinged the worker while trying to
assign it a job.
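On the crash point above: one low-cost way to get at least a traceback
the next time the interpreter dies on a fatal signal is the stdlib
faulthandler module. A minimal sketch, assuming it can be added near
the top of master.cfg; the log path is made up:

    import faulthandler

    # Keep the file object referenced so it stays open for the
    # lifetime of the process.
    crash_log = open("/var/log/buildbot/fatal.log", "a")  # hypothetical path
    faulthandler.enable(file=crash_log, all_threads=True)

Note this only catches signals like SIGSEGV and SIGABRT inside the
interpreter; it would not explain a clean exit or an OOM kill.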
This reconnect issue is such a major problem and hassle that, when we
later restarted that network connection, we shut down the *master*
instance while the network connection was down and restarted it
afterwards.
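Since the master's own idea of a worker connection can evidently be
wrong, an external cross-check may be worth automating. A sketch
against the REST API, with the same assumption as earlier about the
web UI address:

    import json
    from urllib.request import urlopen

    BASE = "http://localhost:8010/api/v2"  # assumed web UI address

    data = json.load(urlopen(BASE + "/workers"))
    for worker in data["workers"]:
        # "connected_to" lists the masters a worker is attached to;
        # an empty list means this master considers it disconnected.
        state = "connected" if worker["connected_to"] else "disconnected"
        print("%-30s %s" % (worker["name"], state))

Comparing this against what the worker side itself reports (e.g. its
twistd.log) should flag the zombie case where the two ends disagree.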
--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS