STINNER Victor <vstin...@python.org> added the comment:

Charris, Pablo and I identified that TCP connections are closed by the load 
balancer on some buildbot workers.

When the "buildbot.python.org" host name is used, TCP connections (TCP port 
9020) go through a load balancer.

Ernest exposed the TCP port 9020 directly to the Internet (without the load 
balancer) using a new host name: "buildbot-api.python.org".

Buildbot workers should be updated to use "buildbot-api.python.org". I also 
suggest using a keepalive of 60 seconds, rather than 600 seconds.

If your worker was affected by this issue, I strongly advise you to manually 
clean up the temporary directory (/tmp). When a worker was disconnected, the 
build was interrupted without removing temporary files. On some workers, we 
found around 20 GB of temporary files in /tmp: "ccXXXX" files and "tmpXXXX" 
files. I guess that some of these files come from the compiler and others from 
the Python test suite.
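
Before deleting anything, it can help to see which files would match. The 
following is only a minimal sketch: the function names list_leftovers and 
count_leftovers are hypothetical (not part of buildbot), and it assumes a find 
that supports -mindepth/-maxdepth, as GNU and BSD find do.

```shell
# Hypothetical helpers, not part of buildbot.
# List top-level entries named "cc*" or "tmp*" in a directory (default: /tmp).
list_leftovers() {
    dir="${1:-/tmp}"
    # -mindepth 1 skips the directory itself: the name "tmp" would otherwise
    # match the 'tmp*' pattern. -maxdepth 1 avoids descending into subdirectories.
    find "$dir" -mindepth 1 -maxdepth 1 \( -name 'cc*' -o -name 'tmp*' \) -print
}

# Count the matching entries (whitespace stripped for portability of wc output).
count_leftovers() {
    list_leftovers "$1" | wc -l | tr -d ' '
}
```

For example, "count_leftovers /tmp" prints the number of matching entries, and 
"list_leftovers /tmp" prints their paths so they can be reviewed first.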

I updated the buildbot client configuration of the 9 workers operated by Red 
Hat:

Fedora Rawhide x86-64
Fedora Stable x86-64
RHEL8 x86-64
RHEL7 x86-64
RHEL8 FIPS x86-64
Fedora Rawhide AArch64
Fedora Stable AArch64
RHEL 8 ppc64le
RHEL 7 ppc64le

On our workers, I used the following commands:

systemctl stop buildbot-worker.service
du -sh /tmp; rm -f /tmp/{cc,tmp}*; du -sh /tmp
sed -i \
    -e "s/buildmaster_host = 'buildbot.python.org'/buildmaster_host = 'buildbot-api.python.org'/" \
    -e "s/keepalive = .*/keepalive = 60/" \
    /home/buildbot/buildarea/buildbot.tac
grep -E '(host|keepalive) =' /home/buildbot/buildarea/buildbot.tac
systemctl start buildbot-worker.service
systemctl status buildbot-worker.service
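
The configuration edit from the commands above could also be wrapped in a small 
function that keeps a backup of the file. This is a sketch under assumptions, 
not what I ran: the function name update_tac and the ".bak" backup suffix are 
hypothetical, and the path to buildbot.tac differs between workers.

```shell
# Hypothetical helper, not part of buildbot.
# update_tac FILE: switch the worker to buildbot-api.python.org, keepalive 60.
update_tac() {
    tac="$1"
    [ -f "$tac" ] || { echo "no such file: $tac" >&2; return 1; }
    # Keep a copy of the current configuration (an addition to the original commands).
    cp -p "$tac" "$tac.bak" || return 1
    # Read from the backup and write back into the original file, which keeps
    # the file's owner and permissions and avoids relying on "sed -i".
    sed \
        -e "s/buildmaster_host = 'buildbot.python.org'/buildmaster_host = 'buildbot-api.python.org'/" \
        -e "s/keepalive = .*/keepalive = 60/" \
        "$tac.bak" > "$tac" || return 1
    # Show the resulting settings for a quick visual check.
    grep -E '(host|keepalive) =' "$tac"
}
```

It would be called between the "systemctl stop" and "systemctl start" steps, for 
example: update_tac /home/buildbot/buildarea/buildbot.tac. Running it a second 
time leaves the settings unchanged but overwrites the backup.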

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41642>
_______________________________________