Hi again,
An update on one of the issues: the lost database connection.
This seems to affect the GitPoller instance. Other database activity, such
as forced_scheduler and triggered jobs, works as normal.
It seems the GitPoller (maybe all pollers) is unable to recover from a
lost database connection; a full shutdown and restart is needed. This
looks similar to the worker reconnect failures I've mentioned: that code
is also unable to recover from a failed worker subscription, and the
connection ends up as a zombie, a connection that is still open but
effectively dead.
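To make the suspected race concrete, here is a minimal sketch (not Buildbot code, just an illustration): a service's reference to its master is cleared during deactivation while an in-flight poll still dereferences it, so instead of any reconnect logic running, the poll dies with an AttributeError.

```python
# Hypothetical illustration of the suspected race. A sighup/reconfigure
# deactivates the service and clears its "master" reference, but a poll
# that is already scheduled still runs afterwards and dereferences it.

class Poller:
    def __init__(self, master):
        self.master = master  # cleared when the service is deactivated

    def deactivate(self):
        # reconfigure/sighup tears the service down...
        self.master = None

    def poll(self):
        # ...but an in-flight poll still dereferences the master
        return self.master.db  # AttributeError when master is None


poller = Poller(master=object())
poller.deactivate()
try:
    poller.poll()
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'db'
```

The error message matches the traceback from today's incident.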
In the case earlier today, I got an exception during the sighup operation:
2019-03-16 13:02:31+0000 [-] while polling for changes
        Traceback (most recent call last):
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
            yield self.setState('lastRev', self.lastRev)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
            return _cancellableInlineCallbacks(gen)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
            _inlineCallbacks(None, g, status)
        --- <exception caught here> ---
          File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
            yield self.setState('lastRev', self.lastRev)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "sandbox/lib/python3.6/site-packages/buildbot/util/state.py", line 43, in setState
            yield self.master.db.state.setState(self._objectid, key, value)
        builtins.AttributeError: 'NoneType' object has no attribute 'db'

2019-03-16 13:02:31+0000 [-] Caught exception while deactivating ClusteredService(...)
        Traceback (most recent call last):
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
            _inlineCallbacks(r, g, status)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 341, in stopService
            log.err(e, _why="Caught exception while deactivating ClusteredService(%s)" % self.name)
        --- <exception caught here> ---
          File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 339, in stopService
            yield self._unclaimService()
          File "sandbox/lib/python3.6/site-packages/buildbot/changes/base.py", line 51, in _unclaimService
            return self.master.data.updates.trySetChangeSourceMaster(self.serviceid,
        builtins.AttributeError: 'NoneType' object has no attribute 'data'
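Both tracebacks fail the same way: self.master is already None by the time the callback runs, so the sighup has apparently detached the service while work was still in flight. A hedged sketch of the kind of guard that would avoid the crash (the names mirror the traceback; this is not actual Buildbot code, and set_state_safely is a hypothetical helper):

```python
# Hypothetical defensive guard: skip the state update when the service
# has already been detached from its master, instead of raising
# AttributeError from deep inside the deferred chain.

from types import SimpleNamespace

def set_state_safely(service, key, value):
    if service.master is None:
        # deactivated mid-poll (e.g. during sighup); nothing to persist
        return False
    service.master.db.state.setState(service._objectid, key, value)
    return True

# attached service: the state write goes through
store = {}
db = SimpleNamespace(state=SimpleNamespace(
    setState=lambda oid, k, v: store.__setitem__((oid, k), v)))
svc = SimpleNamespace(master=SimpleNamespace(db=db), _objectid=7)
print(set_state_safely(svc, 'lastRev', 'abc123'))  # True

# detached service (master is None): no crash, no write
svc.master = None
print(set_state_safely(svc, 'lastRev', 'def456'))  # False
```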
On Fri, 15 Mar 2019 02:29:30 +0100, Yngve N. Pettersen <yn...@vivaldi.com>
wrote:
Hi,
About a month ago we transferred our build system from the old
Chromium-developed buildbot system to one based on Buildbot 2.0. In that
period we have had a couple of major issues that I thought I'd summarize:
* We have had two crashes of the buildbot master process. I do not know
what caused the crashes, and the twisted.log does not contain any
information about what happened, so my guess is that either the
Ubuntu 18 Python 3.6 interpreter crashed, or the Twisted/buildbot code
died without logging anything.
* We have had at least two cases where the master lost its connection to
the database server and did not recover; restarting the master was the
only option. The probable common factor is that it seems to have
happened when using the reconfigure/sighup option to update the buildbot
configuration. In at least one case the log included an exception
regarding the database connection (which is to a remote PostgreSQL
server).
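For what it's worth, the database layer Buildbot is built on (SQLAlchemy) can validate pooled connections before handing them out, so a dropped PostgreSQL connection gets replaced rather than surfacing as a dead handle. Whether Buildbot exposes these knobs through its configuration is an assumption I have not verified; the fragment below is plain SQLAlchemy, not Buildbot config:

```python
# Plain SQLAlchemy illustration (not a Buildbot fix): pre-ping and
# recycle settings that make the pool recover from dropped connections.

from sqlalchemy import create_engine, text

engine = create_engine(
    "sqlite://",          # stand-in here for the real postgresql:// URL
    pool_pre_ping=True,   # issue a cheap ping before reusing a connection
    pool_recycle=3600,    # proactively recycle hour-old connections
)

with engine.connect() as conn:
    print(conn.execute(text("select 1")).scalar())
```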
* We have had a couple of cases where the network connection between the
master and some of the workers was interrupted. In the major case, this
led to having to restart the worker instances on all the affected
workers; this was the topic of an email to this list a few weeks ago.
The logs show that the workers reconnected correctly, but that the
master then failed (due to an exception) to register the worker. The
master did not cut the connection to the worker (so that it could try to
reconnect again), either when the registration failed or later when
checking open connections (if it does such a check), and it apparently
still responded to pings from the worker. It also did not detect that
the worker was not really connected when it pinged the worker while
trying to assign it a job.
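The behaviour I would have expected, sketched below with hypothetical names (this is not Buildbot code): if registering a newly connected worker fails, drop the transport so the worker's own reconnect logic can retry from a clean state, rather than leaving a half-registered zombie connection that still answers pings.

```python
# Hypothetical sketch: close the connection on registration failure so
# the worker retries, instead of keeping a half-registered connection.

class FakeConn:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def on_worker_connected(conn, register):
    """Register a worker; on any failure, cut the connection so the
    worker's reconnect logic can kick in."""
    try:
        register(conn)
    except Exception:
        conn.close()  # force a clean reconnect attempt
        return False
    return True

good = FakeConn()
print(on_worker_connected(good, lambda c: None))  # True

bad = FakeConn()
print(on_worker_connected(bad, lambda c: 1 / 0))  # False
print(bad.closed)                                 # True
```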
This reconnect issue is such a major problem and hassle that, when we
later did a restart of that network connection, we shut down the
*master* instance while taking the network connection down, and
restarted it afterwards.
--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS
_______________________________________________
users mailing list
users@buildbot.net
https://lists.buildbot.net/mailman/listinfo/users