Re: help needed to manage s390x host for ci.debian.net
Hello Paul,

Thank you for the update on the status of the system and the Munin graphs. I think it's a good idea to wait until after the bookworm release to investigate further.

However, I would like to take you up on the offer to schedule and/or arrange access to an LXC container on your host for investigations. I believe it could be beneficial to see the system from a different perspective, and understand/predict issues with SSH timeouts or other performance issues.

Please let me know if this is possible and what steps I should take to get access to the LXC container.

Thank you for your time and assistance.

Best regards,
Dipak
Re: help needed to manage s390x host for ci.debian.net
Hi Elizabeth,

On 18-04-2023 22:46, Elizabeth K. Joseph wrote:
> I noticed that the Munin graphs are showing that the queue problems from
> earlier this year seem to have been reduced now, is that correct, or has
> the VM just not been restarted lately? It would be helpful to have a
> starting point.

I was suspecting that after the last communications there were some changes on your side, because I rebooted the VM maybe two or three times [1] and noticed that I wasn't seeing the slowdown I was used to. I also had the impression that some of my other pain points improved a bit.

At this moment Debian is in freeze [2] in preparation for the release of Debian bookworm, hence it's rather dull on the CI side. That makes interpreting the Munin graphs a bit difficult. In the meantime I have also upgraded the VM to run Debian bookworm (around 2023-03-14); maybe that also made a difference.

I propose we wait with further investigations until after the bookworm release. Once our CI hosts are loaded normally again, I think we're in a better position to judge real performance. Having said that, I'm happy to schedule and/or arrange access to an LXC container on our host for investigations before that time if you want to poke around.

Paul

[1] https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/uptime.html
[2] https://release.debian.org/testing/freeze_policy.html#summary

OpenPGP_signature
Description: OpenPGP digital signature
Re: help needed to manage s390x host for ci.debian.net
On Sun, Feb 12, 2023 at 1:39 PM Paul Gevers wrote:
> I can provide logging from the host, but I'll need detailed instructions
> of what people find useful to look at. Recently Antonio taught me a
> trick to provide temporary access to a lxc container on any of our
> hosts, so if it helps to be on the host (but inside lxc) we can provide
> for that.

I had a call with Dipak this morning to discuss some of how we can help; between my Linux admin and networking experience, and what we can otherwise pull together on the Z side at IBM, we're eager to see what we can do to make these systems a bit quicker.

I think access to the host in an LXC container would be valuable, especially if we're still having any issues with SSH timeouts and could see about replicating them.

I noticed that the Munin graphs are showing that the queue problems from earlier this year seem to have been reduced now; is that correct, or has the VM just not been restarted lately? It would be helpful to have a starting point.

Many thanks.

--
Elizabeth K. Joseph || Lyz || pleia2
Re: help needed to manage s390x host for ci.debian.net
Hi,

On 28-02-2023 01:39, James Addison wrote:
> Attempting to sum together what look, to me, like a pair of 2s:
>
> * The s390x Debian CI queue size[1] is growing again.

Yes, but this time it's because some test seems to be misbehaving (only on s390x, or big endian, or ...) and fills the disk (and gets killed by a cron job and restarts). I'm suspecting dolfin (and python-oslo.db) at the moment.

> * A recent bug report[2] by Dipak describes userspace processes getting
>   stuck on an s390 Linux kernel version that Debian's CI infra has been
>   using

We reverted to the previous kernel:

root@ci-worker-s390x-01:~# uname -a
Linux ci-worker-s390x-01 5.10.0-20-s390x #1 SMP Debian 5.10.158-2 (2022-12-13) s390x GNU/Linux

Paul
Re: help needed to manage s390x host for ci.debian.net
Attempting to sum together what look, to me, like a pair of 2s:

* The s390x Debian CI queue size[1] is growing again.
* A recent bug report[2] by Dipak describes userspace processes getting stuck on an s390 Linux kernel version that Debian's CI infra has been using

The bug does seem to have caused CI package build timeouts, as Paul and others have discussed[3]. I was skeptical about the kernel-as-cause theory, but now agree with it. Perhaps the timeouts explain the queue backlog?

Also note: Sumanth has offered a fix as an s390 kernel patch[4], and it is pending -- that is, the fix has been uploaded and is awaiting general availability after a delay for people to review the relevant changes -- for distribution in Debian stable.

I'm puzzled by some conflicting data, though: the ppc64 queue _isn't_ growing currently. Why did it follow the s390x trend so closely during the previous queue buildup, and yet doesn't appear to be doing so this time?

[1] - https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_queue_size.html
[2] - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1031753
[3] - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030545
[4] - https://lists.debian.org/debian-kernel/2023/02/msg00124.html

On Sat, 18 Feb 2023 at 14:23, James Addison wrote:
> > James Addison suggested in [3] to increase a prefetch counter in amqp
> > (although it's the same on all hosts); I have done so on the s390x host and
> > at least initially it seems to help keeping the host busier.
>
> Thanks for applying that - I was hoping that the change might also
> result in reductions in the debci queue size for s390x, but that
> doesn't appear to have happened, going by
> https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_queue_size.html
>
> [3] https://salsa.debian.org/ci-team/debci/-/issues/92#note_381306
Re: help needed to manage s390x host for ci.debian.net
Hi,

On 21-02-2023 17:46, Dipak Zope1 wrote:
> I am wondering whether we have downgraded the machines to 5.10.0-20
> kernel to get rid of the kernel bug.

I think I mentioned it before; we downgraded indeed:

root@ci-worker-s390x-01:~# uname -a
Linux ci-worker-s390x-01 5.10.0-20-s390x #1 SMP Debian 5.10.158-2 (2022-12-13) s390x GNU/Linux

> Can we check if the patch solves any of the CI issues?

If there's a package available somewhere, I can install it, but I currently don't have the time (nor the will, sorry) to learn how to build Debian s390x kernel packages.

Paul
RE: help needed to manage s390x host for ci.debian.net
I am wondering whether we have downgraded the machines to the 5.10.0-20 kernel to get rid of the kernel bug which is known to cause issues in user processes at random - described in the cover letter here:
https://lists.debian.org/debian-s390/2023/02/msg00019.html

The following patch fixes this issue:
https://lists.debian.org/debian-s390/2023/02/msg00019.html

The patch fixes the problem introduced in commit 75309018a24d ("s390: add support for TIF_NOTIFY_SIGNAL").

Can we check if the patch solves any of the CI issues?

-Dipak
Re: help needed to manage s390x host for ci.debian.net
On 16.02.23 17:49, Paul Gevers wrote:
> As you can see e.g. here [1,2] it comes and goes (albeit sometimes the
> queue was empty). I don't think it's very different, I just never got out
> of the s390x host what I was expecting. For a long time I blamed it on the
> "stealing" that happens on a shared host, but I think there's more.
>
> https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/debci_packages_processed.html

So a pet peeve of mine is unitless graphs. Can we please annotate what the unit is? If we're looking at "Packages processed by architecture"[1] (which, again, isn't a workable unit), then we see that s390x does not have a bad average, nor an overly bad max - given that it's with one worker instead of something like 14 for amd64 and 10 for arm64. The average/max for the week is double for amd64 vs. s390x, so what does the queue size mean? Is there still obsolete work in the queue as well, or does every item have the same value? (There's no way right now it can catch up with that many items in the queue. Although that again depends on the unit...)

Kind regards
Philipp Kern

[1] https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_total_packages_processed.html
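[Editor's note: the normalization Philipp is asking for can be sketched in one awk pipeline. The worker counts (14 amd64, 10 arm64, 1 s390x) come from the message above; the weekly package totals are invented placeholders, not real Munin numbers.]

```shell
# Per-worker throughput: raw "packages processed" totals are only
# comparable once divided by the number of workers per architecture.
# Totals below are hypothetical; worker counts are from the thread.
rates="$(printf '%s %s %s\n' \
  amd64 28000 14 \
  arm64 15000 10 \
  s390x  1800  1 |
  awk '{ printf "%s %.0f packages/worker/week\n", $1, $2/$3 }')"
echo "$rates"
```

With these placeholder totals, s390x's single worker would actually outpace each arm64 worker despite the far smaller absolute total, which is exactly why the unit matters.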
Re: help needed to manage s390x host for ci.debian.net
Hi,

On 17.02.23 17:04, Antonio Terceiro wrote:
> So there is for sure something wrong with the client-server connection
> there. Reworking the client for robustness has been on my TODO list for a
> while.

There's a lot of these:

Feb 14 08:56:25 ci-worker-s390x-01 debci[1155941]: waiting for header frame: a SSL error occurred

But alas, the worker will fail and immediately restart. What's more concerning is the context:

Feb 14 08:39:50 ci-worker-s390x-01 debci[1355790]: bacula testing/s390x tmpfail
Feb 14 08:56:25 ci-worker-s390x-01 debci[1155941]: waiting for header frame: a SSL error occurred

This looks pretty common:

Feb 14 00:45:12 ci-worker-s390x-01 debci[2652291]: libgd2 testing/s390x fail
Feb 14 01:01:48 ci-worker-s390x-01 debci[546227]: waiting for header frame: a SSL error occurred
Feb 14 02:45:30 ci-worker-s390x-01 debci[1209706]: mmdebstrap testing/s390x pass
Feb 14 03:02:05 ci-worker-s390x-01 debci[3642098]: waiting for header frame: a SSL error occurred
Feb 14 04:40:10 ci-worker-s390x-01 debci[12655]: cacti testing/s390x tmpfail
Feb 14 04:56:51 ci-worker-s390x-01 debci[3015158]: waiting for header frame: a SSL error occurred

So we seem to lose at least 15 minutes of worker time when that happens. The failures are sometimes, but not necessarily, correlated:

Feb 17 01:07:17 ci-worker-s390x-01 debci[1149352]: waiting for header frame: a SSL error occurred
Feb 17 01:13:46 ci-worker-s390x-01 debci[552417]: waiting for header frame: a SSL error occurred
Feb 17 01:16:19 ci-worker-s390x-01 debci[1261598]: waiting for header frame: a SSL error occurred
Feb 17 01:21:02 ci-worker-s390x-01 debci[1487252]: waiting for header frame: a SSL error occurred
Feb 17 01:53:30 ci-worker-s390x-01 debci[3589185]: waiting for header frame: a SSL error occurred
Feb 17 02:03:24 ci-worker-s390x-01 debci[4184831]: waiting for header frame: a SSL error occurred
Feb 17 02:18:31 ci-worker-s390x-01 debci[3986861]: waiting for header frame: a SSL error occurred
Feb 17 02:41:11 ci-worker-s390x-01 debci[4167140]: waiting for header frame: a SSL error occurred
Feb 17 05:44:55 ci-worker-s390x-01 debci[1543385]: waiting for header frame: a SSL error occurred
Feb 17 05:47:10 ci-worker-s390x-01 debci[2598734]: waiting for header frame: a SSL error occurred
Feb 17 06:24:39 ci-worker-s390x-01 debci[1275755]: waiting for header frame: a SSL error occurred
Feb 17 06:50:05 ci-worker-s390x-01 debci[3680449]: waiting for header frame: a SSL error occurred
Feb 17 07:33:09 ci-worker-s390x-01 debci[107515]: waiting for header frame: a SSL error occurred
Feb 17 07:48:04 ci-worker-s390x-01 debci[2816244]: waiting for header frame: a SSL error occurred
Feb 17 07:54:07 ci-worker-s390x-01 debci[2284573]: waiting for header frame: a SSL error occurred
Feb 17 12:40:38 ci-worker-s390x-01 debci[4069122]: waiting for header frame: a SSL error occurred
Feb 17 15:39:40 ci-worker-s390x-01 debci[3343838]: waiting for header frame: a SSL error occurred
Feb 17 20:23:33 ci-worker-s390x-01 debci[3531969]: waiting for header frame: a SSL error occurred
Feb 17 21:21:28 ci-worker-s390x-01 debci[1815008]: waiting for header frame: a SSL error occurred
Feb 17 23:28:02 ci-worker-s390x-01 debci[2830093]: waiting for header frame: a SSL error occurred
Feb 18 01:38:13 ci-worker-s390x-01 debci[376]: waiting for header frame: a SSL error occurred
Feb 18 04:21:49 ci-worker-s390x-01 debci[1774710]: waiting for header frame: a SSL error occurred
Feb 18 04:21:53 ci-worker-s390x-01 debci[1530267]: waiting for header frame: a SSL error occurred
Feb 18 04:43:09 ci-worker-s390x-01 debci[2484158]: waiting for header frame: a SSL error occurred
Feb 18 04:54:21 ci-worker-s390x-01 debci[3870455]: waiting for header frame: a SSL error occurred
Feb 18 06:46:27 ci-worker-s390x-01 debci[632005]: waiting for header frame: a SSL error occurred
Feb 18 06:52:56 ci-worker-s390x-01 debci[516286]: waiting for header frame: a SSL error occurred
Feb 18 09:41:23 ci-worker-s390x-01 debci[57375]: waiting for header frame: a SSL error occurred

It doesn't look like amqp-consume has a lot of options in this space. I do wonder if a WireGuard tunnel would help, if only to move this from a firewall-mediated TCP stream to a couple of UDP packets that are less likely to be filtered. But I don't know how amenable the firewall is to these either.

I'm personally not a friend of Munin because it makes math on the graphs hard. Do you have an idea how many packages the s390x host manages to process per day and how that compares to the other workers? Pub/sub queues are not the easiest to introspect, and I'd like to know how far we are off in intake into the queue per day vs. what we can process.

Kind regards and thanks
Philipp Kern
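[Editor's note: a quick way to see how the SSL failures cluster is to tally them per day. A minimal sketch, fed three sample lines copied from the excerpt above; on the worker one would instead pipe in the real journal (the exact `journalctl` flags used elsewhere in this thread, e.g. `-t debci`, are an assumption here).]

```shell
# Count "waiting for header frame" SSL errors per day. Sample input only;
# replace the printf with e.g.
#   journalctl -t debci --since='-7 days'
# on the worker itself.
counts="$(printf '%s\n' \
  'Feb 17 01:07:17 ci-worker-s390x-01 debci[1149352]: waiting for header frame: a SSL error occurred' \
  'Feb 17 02:03:24 ci-worker-s390x-01 debci[4184831]: waiting for header frame: a SSL error occurred' \
  'Feb 18 01:38:13 ci-worker-s390x-01 debci[376]: waiting for header frame: a SSL error occurred' |
  awk '/waiting for header frame/ { n[$1 " " $2]++ }
       END { for (d in n) print d, n[d] }' | sort)"
echo "$counts"
```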
Re: help needed to manage s390x host for ci.debian.net
> James Addison suggested in [3] to increase a prefetch counter in amqp
> (although it's the same on all hosts); I have done so on the s390x host and at
> least initially it seems to help keeping the host busier.

Thanks for applying that - I was hoping that the change might also result in reductions in the debci queue size for s390x, but that doesn't appear to have happened, going by
https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_queue_size.html

[3] https://salsa.debian.org/ci-team/debci/-/issues/92#note_381306
Re: help needed to manage s390x host for ci.debian.net
On Tue, Feb 14, 2023 at 09:42:09PM +0100, Paul Gevers wrote:
> Hi Phil,
>
> On 13-02-2023 08:57, Philipp Kern wrote:
> > On 12.02.23 22:38, Paul Gevers wrote:
> > > I have munin [1], but as said, I'm not a trained sysadmin. I don't
> > > know what I'm looking for if you ask "statistics on the network".
> >
> > This is more of a software development / devops question than a sysadmin
> > question, but alas.
>
> I acknowledge that my reach-out was broad and didn't only cover s390x.
>
> > What I am interested in is *application-level* logging on reconnects.
> > Presumably the connection to RabbitMQ is outbound?
>
> Our configuration can be seen here:
> https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/rabbitmq/templates/rabbitmq.conf.erb
>
> > Is it tunneled? Does your application log somewhere when a reconnect
> > happens? Does it say when it successfully connected?
> >
> > I'd expect good software to log something like this:
> >
> > [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> > [10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".
> >
> > And also:
> >
> > [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> > [10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345"
> > failed: Connection refused
>
> @terceiro; I haven't seen these kinds of logs on the worker hosts. Do you
> know if they exist or if we can generate them?

The worker does log its initial connection, see below.

> I think I'm seeing something on the main host.
>
> admin@ci-master:/var/log/rabbitmq$ sudo grep 148.100.88.163 rab...@ci-master.log | grep -v '\[info\]' | grep -v '\[warning\]'
> 2023-02-14 00:00:37.522 [error] <0.30951.85> closing AMQP connection <0.30951.85> (148.100.88.163:49540 -> 10.1.14.198:5671):
> 2023-02-14 02:27:56.050 [error] <0.15184.87> closing AMQP connection <0.15184.87> (148.100.88.163:49988 -> 10.1.14.198:5671):
> 2023-02-14 02:36:05.496 [error] <0.17479.87> closing AMQP connection <0.17479.87> (148.100.88.163:57098 -> 10.1.14.198:5671):
> 2023-02-14 04:06:13.869 [error] <0.16105.88> closing AMQP connection <0.16105.88> (148.100.88.163:42984 -> 10.1.14.198:5671):
> 2023-02-14 04:15:27.696 [error] <0.19038.88> closing AMQP connection <0.19038.88> (148.100.88.163:56650 -> 10.1.14.198:5671):
> 2023-02-14 20:05:38.702 [error] <0.23586.97> closing AMQP connection <0.23586.97> (148.100.88.163:34278 -> 10.1.14.198:5671):
>
> and a lot more warnings (220 times in 20 hours) as well; like:
> 2023-02-14 20:05:09.011 [warning] <0.20860.97> closing AMQP connection <0.20860.97> (148.100.88.163:45624 -> 10.1.14.198:5671, vhost: '/', user: 'guest'):
>
> And a lot (around 544) (obviously I don't know if that's only or even
> includes the s390x host):
> client unexpectedly closed TCP connection

root@ci-worker-s390x-01:~# journalctl -u debci-worker@1.service --since='2 days ago' -t debci -b 0 | grep amqp
Feb 15 15:10:21 ci-worker-s390x-01 debci[663]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 17:58:53 ci-worker-s390x-01 debci[2740543]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 19:23:40 ci-worker-s390x-01 debci[1855652]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 19:28:12 ci-worker-s390x-01 debci[1939916]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again...
Feb 15 20:50:51 ci-worker-s390x-01 debci[783145]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 21:36:25 ci-worker-s390x-01 debci[1966510]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again...
Feb 15 22:21:43 ci-worker-s390x-01 debci[3243793]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again...
Feb 16 00:29:41 ci-worker-s390x-01 debci[4119188]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 01:21:26 ci-worker-s390x-01 debci[2097411]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 03:02:20 ci-worker-s390x-01 debci[1133799]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 06:40:46 ci-worker-s390x-01 debci[953820]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 08:00:24 ci-worker-s390x-01 debci[2875496]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 09:59:09 ci-worker-s390x-01 debci[3864527]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 11:47:09 ci-worker-s390x-01 debci[2310984]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 14:08:01 ci-worker-s390x-01 debci[1968077]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 16:58:24 ci-worker-s390x-01 debci[2496027]: I:
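[Editor's note: the reconnect frequency Philipp is after can be estimated directly from the "Connecting to AMQP queue" timestamps above. A minimal sketch over five of those timestamps; the simple day-based arithmetic assumes all samples fall within the same month, which holds for this Feb 15-16 excerpt.]

```shell
# Gaps between consecutive AMQP reconnects, from journal timestamps.
# Converts "Mon DD HH:MM:SS" to seconds (same-month assumption), then
# reports reconnect count plus mean and max gap in hours.
times="Feb 15 15:10:21
Feb 15 17:58:53
Feb 15 19:23:40
Feb 15 20:50:51
Feb 16 00:29:41"
stats="$(echo "$times" | awk -F'[ :]+' '
  { t = $2*86400 + $3*3600 + $4*60 + $5   # seconds since start of month
    if (NR > 1) { g = t - prev; if (g > max) max = g; sum += g; n++ }
    prev = t }
  END { printf "%d reconnects, mean gap %.1f h, max gap %.1f h\n",
               NR, sum/n/3600, max/3600 }')"
echo "$stats"
```

A reconnect every couple of hours, as here, would line up with the "closing AMQP connection" error rate seen on the master side.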
Re: help needed to manage s390x host for ci.debian.net
Hi,

On 13-02-2023 15:59, Dipak Zope1 wrote:
> There is some issue with the 5.10.0-21 kernel and we are working on it.
> This can cause a performance impact on CI servers.

I rebooted to the old kernel yesterday. That helps a bit indeed, although most of the issues I reported predate that kernel upgrade.

> As Paul mentioned, we upgraded the CI servers to better capacity in May
> last year. Is today's performance worse than what we observed right after
> the upgrade?

As you can see e.g. here [1,2] it comes and goes (albeit sometimes the queue was empty). I don't think it's very different, I just never got out of the s390x host what I was expecting. For a long time I blamed it on the "stealing" that happens on a shared host, but I think there's more.

[1] https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/debci_packages_processed.html
[2] https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/cpu.html

> Has the performance deteriorated consistently over a period of time, or
> was it observed suddenly? Is there any incident - like a change/upgrade
> in a software or hardware component - coinciding with it, if it is a
> sudden change?

James Addison suggested in [3] to increase a prefetch counter in amqp (although it's the same on all hosts); I have done so on the s390x host and at least initially it seems to help keeping the host busier.

[3] https://salsa.debian.org/ci-team/debci/-/issues/92#note_381306

Paul
Re: help needed to manage s390x host for ci.debian.net
Hi Phil,

On 13-02-2023 08:57, Philipp Kern wrote:
> On 12.02.23 22:38, Paul Gevers wrote:
> > I have munin [1], but as said, I'm not a trained sysadmin. I don't
> > know what I'm looking for if you ask "statistics on the network".
>
> This is more of a software development / devops question than a sysadmin
> question, but alas.

I acknowledge that my reach-out was broad and didn't only cover s390x.

> What I am interested in is *application-level* logging on reconnects.
> Presumably the connection to RabbitMQ is outbound?

Our configuration can be seen here:
https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/rabbitmq/templates/rabbitmq.conf.erb

> Is it tunneled? Does your application log somewhere when a reconnect
> happens? Does it say when it successfully connected?
>
> I'd expect good software to log something like this:
>
> [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> [10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".
>
> And also:
>
> [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> [10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345"
> failed: Connection refused

@terceiro; I haven't seen these kinds of logs on the worker hosts. Do you know if they exist or if we can generate them?

I think I'm seeing something on the main host.

admin@ci-master:/var/log/rabbitmq$ sudo grep 148.100.88.163 rab...@ci-master.log | grep -v '\[info\]' | grep -v '\[warning\]'
2023-02-14 00:00:37.522 [error] <0.30951.85> closing AMQP connection <0.30951.85> (148.100.88.163:49540 -> 10.1.14.198:5671):
2023-02-14 02:27:56.050 [error] <0.15184.87> closing AMQP connection <0.15184.87> (148.100.88.163:49988 -> 10.1.14.198:5671):
2023-02-14 02:36:05.496 [error] <0.17479.87> closing AMQP connection <0.17479.87> (148.100.88.163:57098 -> 10.1.14.198:5671):
2023-02-14 04:06:13.869 [error] <0.16105.88> closing AMQP connection <0.16105.88> (148.100.88.163:42984 -> 10.1.14.198:5671):
2023-02-14 04:15:27.696 [error] <0.19038.88> closing AMQP connection <0.19038.88> (148.100.88.163:56650 -> 10.1.14.198:5671):
2023-02-14 20:05:38.702 [error] <0.23586.97> closing AMQP connection <0.23586.97> (148.100.88.163:34278 -> 10.1.14.198:5671):

and a lot more warnings (220 times in 20 hours) as well; like:
2023-02-14 20:05:09.011 [warning] <0.20860.97> closing AMQP connection <0.20860.97> (148.100.88.163:45624 -> 10.1.14.198:5671, vhost: '/', user: 'guest'):

And a lot (around 544) (obviously I don't know if that's only or even includes the s390x host):
client unexpectedly closed TCP connection

Paul
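[Editor's note: the grep above can be extended into a per-hour tally by severity, which makes clustering (like the 20:05 error/warning pair) easier to spot. A minimal sketch over four sample lines copied from the message; on ci-master one would feed the real RabbitMQ log through the same awk.]

```shell
# Tally "closing AMQP connection" events by date, hour, and severity.
# Sample input only; pipe the real /var/log/rabbitmq log in instead.
summary="$(printf '%s\n' \
  '2023-02-14 00:00:37.522 [error] <0.30951.85> closing AMQP connection' \
  '2023-02-14 02:27:56.050 [error] <0.15184.87> closing AMQP connection' \
  '2023-02-14 02:36:05.496 [error] <0.17479.87> closing AMQP connection' \
  '2023-02-14 20:05:09.011 [warning] <0.20860.97> closing AMQP connection' |
  awk '/closing AMQP connection/ {
         hour = substr($2, 1, 2)          # HH from the timestamp
         count[$1 " " hour "h " $3]++     # key: date, hour, [severity]
       }
       END { for (k in count) print k, count[k] }' | sort)"
echo "$summary"
```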
RE: help needed to manage s390x host for ci.debian.net
Hello Paul, Philipp and team -

There is some issue with the 5.10.0-21 kernel and we are working on it. This can cause a performance impact on CI servers. So I request downgrading all our CI machines which are on 5.10.0-21 to 5.10.0-20. There are pretty good chances that we get better output if we are currently running with 5.10.0-21.

Besides this, I would like to understand more about the performance:

* As Paul mentioned, we upgraded the CI servers to better capacity in May last year. Is today's performance worse than what we observed right after the upgrade?
* Has the performance deteriorated consistently over a period of time, or was it observed suddenly?
* Is there any incident - like a change/upgrade in a software or hardware component - coinciding with it, if it is a sudden change?

Meanwhile we are working on the kernel fix. I will keep this mail thread updated.

Thanks,
-Dipak Zope
Debian s390 porting team

From: Philipp Kern
Date: Monday, 13 February 2023 at 1:28 PM
To: Paul Gevers , debian-s390 , bar...@velocitysoftware.com , Paul Flint
Cc: Debian CI team
Subject: [EXTERNAL] Re: help needed to manage s390x host for ci.debian.net

> Hi,
>
> On 12.02.23 22:38, Paul Gevers wrote:
> > I have munin [1], but as said, I'm not a trained sysadmin. I don't know
> > what I'm looking for if you ask "statistics on the network".
>
> This is more of a software development / devops question than a sysadmin
> question, but alas.
>
> What I am interested in is *application-level* logging on reconnects.
> Presumably the connection to RabbitMQ is outbound? Is it tunneled? Does
> your application log somewhere when a reconnect happens? Does it say when
> it successfully connected?
>
> I'd expect good software to log something like this:
>
> [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> [10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".
>
> And also:
>
> [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> [10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345"
> failed: Connection refused
>
> Kind regards
> Philipp Kern
Re: help needed to manage s390x host for ci.debian.net
Hi,

On 12.02.23 22:38, Paul Gevers wrote:
> I have munin [1], but as said, I'm not a trained sysadmin. I don't know
> what I'm looking for if you ask "statistics on the network".

This is more of a software development / devops question than a sysadmin question, but alas.

What I am interested in is *application-level* logging on reconnects. Presumably the connection to RabbitMQ is outbound? Is it tunneled? Does your application log somewhere when a reconnect happens? Does it say when it successfully connected?

I'd expect good software to log something like this:

[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".

And also:

[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345" failed: Connection refused

Kind regards
Philipp Kern
RE: help needed to manage s390x host for ci.debian.net
I am not a CI/networking expert, but I will be more than happy to assist. I am at +0530 hrs, available widely.

Thanks,
-Dipak Zope
Debian s390 porting team

On 13/02/23, 3:09 AM, "Paul Gevers" wrote:
> Hi Phil and all others offering help,
>
> On 12-02-2023 20:32, Philipp Kern wrote:
> > On 11.02.23 18:18, Paul Gevers wrote:
> > > * [suspect 1] network issues between the s390x and the main ci.d.n
> > >   server (the results (log files) of the autopkgtests are transferred to
> > >   the main server). Our ppc64el hosts are also located at Marist, so I
> > >   would expect commonality here, but also ppc64el isn't performing
> > >   great, so maybe part of the problem is common.
> >
> > Do you have any kind of statistics on the network connections? I.e. how
> > often it reconnects and how long it takes to reconnect? The Marist
> > network has a very weird firewall inbound (e.g. if I do too many SSH
> > requests in a row, I'm blackholed) - so I would not be surprised if there
> > is some weirdness there.
>
> I have munin [1], but as said, I'm not a trained sysadmin. I don't know
> what I'm looking for if you ask "statistics on the network". Also, I have
> no experience with s390x except for deploying the Debian software on the
> server set up by Phil. All the quirks of s390x are beyond me.
>
> I can provide logging from the host, but I'll need detailed instructions
> of what people find useful to look at. Recently Antonio taught me a trick
> to provide temporary access to a lxc container on any of our hosts, so if
> it helps to be on the host (but inside lxc) we can provide for that.
>
> Paul
>
> [1] https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/index.html
Re: help needed to manage s390x host for ci.debian.net
Hi Phil and all others offering help,

On 12-02-2023 20:32, Philipp Kern wrote:
> On 11.02.23 18:18, Paul Gevers wrote:
> > * [suspect 1] network issues between the s390x and the main ci.d.n
> >   server (the results (log files) of the autopkgtests are transferred to
> >   the main server). Our ppc64el hosts are also located at Marist, so I
> >   would expect commonality here, but also ppc64el isn't performing
> >   great, so maybe part of the problem is common.
>
> Do you have any kind of statistics on the network connections? I.e. how
> often it reconnects and how long it takes to reconnect? The Marist
> network has a very weird firewall inbound (e.g. if I do too many SSH
> requests in a row, I'm blackholed) - so I would not be surprised if there
> is some weirdness there.

I have munin [1], but as said, I'm not a trained sysadmin. I don't know what I'm looking for if you ask "statistics on the network". Also, I have no experience with s390x except for deploying the Debian software on the server set up by Phil. All the quirks of s390x are beyond me.

I can provide logging from the host, but I'll need detailed instructions of what people find useful to look at. Recently Antonio taught me a trick to provide temporary access to a lxc container on any of our hosts, so if it helps to be on the host (but inside lxc) we can provide for that.

Paul

[1] https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/index.html
Re: help needed to manage s390x host for ci.debian.net
On 11.02.23 18:18, Paul Gevers wrote:
> * [suspect 1] network issues between the s390x and the main ci.d.n
>   server (the results (log files) of the autopkgtests are transferred to
>   the main server). Our ppc64el hosts are also located at Marist, so I
>   would expect commonality here, but also ppc64el isn't performing
>   great, so maybe part of the problem is common.

Do you have any kind of statistics on the network connections? I.e. how often it reconnects and how long it takes to reconnect? The Marist network has a very weird firewall inbound (e.g. if I do too many SSH requests in a row, I'm blackholed) - so I would not be surprised if there is some weirdness there.

Kind regards
Philipp Kern