Re: help needed to manage s390x host for ci.debian.net

2023-05-09 Thread Dipak Zope
>
> Hello Paul,
>
> Thank you for the update on the status of the system and the Munin graphs.
> I think it's a good idea to wait until after the bookworm release to
> investigate further.
>
> However, I would like to take you up on the offer to schedule and/or
> arrange access to an LXC container on your host for investigations. I
> believe it could be beneficial to see the system from a different
> perspective, and to understand or predict the SSH timeouts and other
> performance issues.
>
> Please let me know if this is possible and what steps I should take to get
> access to the LXC container.
>
> Thank you for your time and assistance.
>
> Best regards,
> Dipak
>



Re: help needed to manage s390x host for ci.debian.net

2023-04-20 Thread Paul Gevers

Hi Elizabeth,

On 18-04-2023 22:46, Elizabeth K. Joseph wrote:

I noticed that the Munin graphs are showing that the queue problems
from earlier this year seem to have been reduced now. Is that correct,
or has the VM just not been restarted lately? It would be helpful to
have a starting point.


I suspected that after our last communications there were some 
changes on your side, because I rebooted the VM maybe two or three times 
[1] and noticed that I wasn't seeing the slowdown I was used to. I also 
had the impression that some of my other pain points improved a bit.


At this moment Debian is in freeze [2] in preparation for the release of 
Debian bookworm, hence it's rather dull on the CI side. That makes 
interpreting the Munin graphs a bit difficult. In the meantime I have 
also upgraded the VM to run Debian bookworm (around 2023-03-14); maybe 
that also makes a difference.


I propose we hold off on further investigations until after the bookworm 
release. Once our CI hosts are loaded normally again, I think we're in a 
better position to judge real performance. Having said that, I'm happy 
to schedule and/or arrange access to an LXC container on our host for 
investigations before then, if you want to poke around.


Paul

[1] 
https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/uptime.html

[2] https://release.debian.org/testing/freeze_policy.html#summary




Re: help needed to manage s390x host for ci.debian.net

2023-04-18 Thread Elizabeth K. Joseph
On Sun, Feb 12, 2023 at 1:39 PM Paul Gevers  wrote:
> I can provide logging from the host, but I'll need detailed instructions
> of what people find useful to look at. Recently Antonio taught me a
> trick to provide temporary access to a lxc container on any of our
> hosts, so if it helps to be on the host (but inside lxc) we can provide
> for that.

I had a call with Dipak this morning to discuss how we can help.
Between my Linux admin and networking experience and what we can
otherwise pull together on the Z side at IBM, we're eager to see what
we can do to make these systems a bit quicker. I think access to the
host in an LXC container would be valuable, especially if we're still
having any issues with SSH timeouts and could see about replicating
them.

I noticed that the Munin graphs are showing that the queue problems
from earlier this year seem to have been reduced now. Is that correct,
or has the VM just not been restarted lately? It would be helpful to
have a starting point.

Many thanks.

-- 
Elizabeth K. Joseph || Lyz || pleia2



Re: help needed to manage s390x host for ci.debian.net

2023-02-28 Thread Paul Gevers

Hi,

On 28-02-2023 01:39, James Addison wrote:

Attempting to sum together what look, to me, like a pair of 2s:

   * The s390x Debian CI queue size[1] is growing again.


Yes, but this time it's because some test seems to be misbehaving (only 
on s390x, or on big endian, or ...) and fills the disk (and then gets 
killed by a cron job and restarts). I'm suspecting dolfin (and 
python-oslo.db) at the moment.
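
As an illustration only (no debci-specific paths assumed), something like 
this can show what is actually filling the disk on the worker:

# overall usage of the root filesystem
df -h /
# largest directory trees on that filesystem, two levels deep
sudo du -x -d2 -h / 2>/dev/null | sort -h | tail -n 20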



   * A recent bug report[2] by Dipak describes userspace processes
getting stuck on an s390 Linux kernel version that Debian's CI infra
has been using


We reverted to the previous kernel:
root@ci-worker-s390x-01:~# uname -a
Linux ci-worker-s390x-01 5.10.0-20-s390x #1 SMP Debian 5.10.158-2 
(2022-12-13) s390x GNU/Linux


Paul




Re: help needed to manage s390x host for ci.debian.net

2023-02-27 Thread James Addison
Attempting to sum together what look, to me, like a pair of 2s:

  * The s390x Debian CI queue size[1] is growing again.

  * A recent bug report[2] by Dipak describes userspace processes
getting stuck on an s390 Linux kernel version that Debian's CI infra
has been using

The bug does seem to have caused CI package build timeouts, as Paul
and others have discussed[3].  I was skeptical about the
kernel-as-cause theory, but now agree with it.

Perhaps the timeouts explain the queue backlog?


Also note: Sumanth has offered a fix as an s390 kernel patch[4]. It is
pending for distribution in Debian stable -- that is, the fix has been
uploaded and is awaiting general availability after a delay for people
to review the relevant changes.


I'm puzzled by some conflicting data, though: the ppc64 queue _isn't_
growing currently.  Why did it follow the s390x trend so closely
during the previous queue buildup, and yet doesn't appear to be doing
so this time?


[1] - 
https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_queue_size.html

[2] - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1031753

[3] - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030545

[4] - https://lists.debian.org/debian-kernel/2023/02/msg00124.html

On Sat, 18 Feb 2023 at 14:23, James Addison  wrote:
>
> > James Addison suggested in [3] increasing a prefetch counter in amqp 
> > (although it's the same on all hosts); I have done so on the s390x host and 
> > at least initially it seems to help keep the host busier.
>
> Thanks for applying that - I was hoping that the change might also
> result in reductions in the debci queue size for s390x, but that
> doesn't appear to have happened, going by
> https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_queue_size.html
>
> [3] https://salsa.debian.org/ci-team/debci/-/issues/92#note_381306



Re: help needed to manage s390x host for ci.debian.net

2023-02-21 Thread Paul Gevers

Hi,

On 21-02-2023 17:46, Dipak Zope1 wrote:
I am wondering whether we have downgraded the machines to 5.10.0-20 
kernel to get rid of the kernel bug.


I think I mentioned it before: we did indeed downgrade:
root@ci-worker-s390x-01:~# uname -a
Linux ci-worker-s390x-01 5.10.0-20-s390x #1 SMP Debian 5.10.158-2 
(2022-12-13) s390x GNU/Linux



Can we check if the patch solves any of the CI issues?


If there's a package available somewhere, I can install it, but I 
currently don't have the time (nor the will, sorry) to learn how to 
build Debian s390x kernel packages.


Paul




RE: help needed to manage s390x host for ci.debian.net

2023-02-21 Thread Dipak Zope1
I am wondering whether we have downgraded the machines to the 5.10.0-20 kernel 
to get rid of the kernel bug, which is known to cause issues in user processes 
at random, as described in the cover letter here: 
https://lists.debian.org/debian-s390/2023/02/msg00019.html

The following patch fixes this issue:
https://lists.debian.org/debian-s390/2023/02/msg00019.html
It addresses the problem introduced in commit 75309018a24d ("s390: add 
support for TIF_NOTIFY_SIGNAL").

Can we check if the patch solves any of the CI issues?

-Dipak



Re: help needed to manage s390x host for ci.debian.net

2023-02-20 Thread Philipp Kern

On 16.02.23 17:49, Paul Gevers wrote:
As you can see e.g. here [1,2] it comes and goes (albeit sometimes the 
queue was empty). I don't think it's very different; I just never got out 
of the s390x host what I was expecting. For a long time I blamed it on the 
"stealing" that happens on a shared host, but I think there's more.


https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/debci_packages_processed.html


So, a pet peeve of mine is unitless graphs. Can we please annotate what 
the unit is? If we're looking at "Packages processed by architecture"[1] 
(which, again, isn't a workable unit), then we see that s390x does not 
have a bad average, nor an overly bad max - given that it runs with one 
worker instead of something like 14 for amd64 and 10 for arm64?


The average/max for the week is double for amd64 vs. s390x, so what does 
the queue size mean? Is there still obsolete work in the queue as well 
or does every item have the same value? (There's no way right now it can 
catch up with that many items in the queue. Although that again depends 
on the unit...)


Kind regards
Philipp Kern

[1] 
https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_total_packages_processed.html




Re: help needed to manage s390x host for ci.debian.net

2023-02-18 Thread Philipp Kern

Hi,

On 17.02.23 17:04, Antonio Terceiro wrote:

So there is for sure something wrong with the client-server connection
there. Reworking the client for robustness has been on my TODO list for
a while.


There's a lot of these:


Feb 14 08:56:25 ci-worker-s390x-01 debci[1155941]: waiting for header frame: a 
SSL error occurred


Alas, the worker will fail and immediately restart. But what's more 
concerning is the context:



Feb 14 08:39:50 ci-worker-s390x-01 debci[1355790]: bacula testing/s390x tmpfail
Feb 14 08:56:25 ci-worker-s390x-01 debci[1155941]: waiting for header frame: a 
SSL error occurred


This looks pretty common:


Feb 14 00:45:12 ci-worker-s390x-01 debci[2652291]: libgd2 testing/s390x fail
Feb 14 01:01:48 ci-worker-s390x-01 debci[546227]: waiting for header frame: a 
SSL error occurred



Feb 14 02:45:30 ci-worker-s390x-01 debci[1209706]: mmdebstrap testing/s390x pass
Feb 14 03:02:05 ci-worker-s390x-01 debci[3642098]: waiting for header frame: a 
SSL error occurred



Feb 14 04:40:10 ci-worker-s390x-01 debci[12655]: cacti testing/s390x tmpfail
Feb 14 04:56:51 ci-worker-s390x-01 debci[3015158]: waiting for header frame: a 
SSL error occurred


So we seem to lose at least 15 minutes of worker time when that happens. 
The failures are sometimes but not necessarily correlated:



Feb 17 01:07:17 ci-worker-s390x-01 debci[1149352]: waiting for header frame: a 
SSL error occurred
Feb 17 01:13:46 ci-worker-s390x-01 debci[552417]: waiting for header frame: a 
SSL error occurred
Feb 17 01:16:19 ci-worker-s390x-01 debci[1261598]: waiting for header frame: a 
SSL error occurred
Feb 17 01:21:02 ci-worker-s390x-01 debci[1487252]: waiting for header frame: a 
SSL error occurred
Feb 17 01:53:30 ci-worker-s390x-01 debci[3589185]: waiting for header frame: a 
SSL error occurred
Feb 17 02:03:24 ci-worker-s390x-01 debci[4184831]: waiting for header frame: a 
SSL error occurred
Feb 17 02:18:31 ci-worker-s390x-01 debci[3986861]: waiting for header frame: a 
SSL error occurred
Feb 17 02:41:11 ci-worker-s390x-01 debci[4167140]: waiting for header frame: a 
SSL error occurred
Feb 17 05:44:55 ci-worker-s390x-01 debci[1543385]: waiting for header frame: a 
SSL error occurred
Feb 17 05:47:10 ci-worker-s390x-01 debci[2598734]: waiting for header frame: a 
SSL error occurred
Feb 17 06:24:39 ci-worker-s390x-01 debci[1275755]: waiting for header frame: a 
SSL error occurred
Feb 17 06:50:05 ci-worker-s390x-01 debci[3680449]: waiting for header frame: a 
SSL error occurred
Feb 17 07:33:09 ci-worker-s390x-01 debci[107515]: waiting for header frame: a 
SSL error occurred
Feb 17 07:48:04 ci-worker-s390x-01 debci[2816244]: waiting for header frame: a 
SSL error occurred
Feb 17 07:54:07 ci-worker-s390x-01 debci[2284573]: waiting for header frame: a 
SSL error occurred
Feb 17 12:40:38 ci-worker-s390x-01 debci[4069122]: waiting for header frame: a 
SSL error occurred
Feb 17 15:39:40 ci-worker-s390x-01 debci[3343838]: waiting for header frame: a 
SSL error occurred
Feb 17 20:23:33 ci-worker-s390x-01 debci[3531969]: waiting for header frame: a 
SSL error occurred
Feb 17 21:21:28 ci-worker-s390x-01 debci[1815008]: waiting for header frame: a 
SSL error occurred
Feb 17 23:28:02 ci-worker-s390x-01 debci[2830093]: waiting for header frame: a 
SSL error occurred
Feb 18 01:38:13 ci-worker-s390x-01 debci[376]: waiting for header frame: a 
SSL error occurred
Feb 18 04:21:49 ci-worker-s390x-01 debci[1774710]: waiting for header frame: a 
SSL error occurred
Feb 18 04:21:53 ci-worker-s390x-01 debci[1530267]: waiting for header frame: a 
SSL error occurred
Feb 18 04:43:09 ci-worker-s390x-01 debci[2484158]: waiting for header frame: a 
SSL error occurred
Feb 18 04:54:21 ci-worker-s390x-01 debci[3870455]: waiting for header frame: a 
SSL error occurred
Feb 18 06:46:27 ci-worker-s390x-01 debci[632005]: waiting for header frame: a 
SSL error occurred
Feb 18 06:52:56 ci-worker-s390x-01 debci[516286]: waiting for header frame: a 
SSL error occurred
Feb 18 09:41:23 ci-worker-s390x-01 debci[57375]: waiting for header frame: a 
SSL error occurred
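
As a rough illustration of the cost (a sketch only, assuming each of these 
errors wastes roughly the 15 minutes seen above):

journalctl -t debci --since='7 days ago' \
  | grep -c 'waiting for header frame' \
  | awk '{ printf "%d SSL errors, roughly %.1f hours of idle worker time\n", $1, $1 * 15 / 60 }'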


It doesn't look like amqp-consume has a lot of options in this space. I 
do wonder if a Wireguard tunnel would help, if only to move this from a 
firewall-mediated TCP stream to a couple of UDP packets that are less 
likely to be filtered. But I don't know how amenable the firewall is to 
these either.
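
If someone wants to experiment with that, a minimal WireGuard sketch could 
look like the following; every key, address and port below is a placeholder, 
the AMQP client would then have to be pointed at the tunnel address, and 
whether the Marist firewall passes UDP at all needs checking first:

# on both ends
apt install wireguard
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key

# on the worker (placeholder addresses and keys throughout)
cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
Address = 10.99.0.2/24
PrivateKey = <worker-private-key>

[Peer]
PublicKey = <ci-master-public-key>
Endpoint = ci-master.debian.net:51820
AllowedIPs = 10.99.0.1/32
PersistentKeepalive = 25
EOF

systemctl enable --now wg-quick@wg0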


I'm personally not a friend of munin because it makes math on the graphs 
hard. Do you have an idea how many packages the s390x manages to process 
per day and how that compares to the other workers? PubSub queues are 
not the easiest to introspect and I'd like to know how far we are off in 
intake into the queue per day vs. what we can process.
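
For a rough answer, the per-day throughput can be estimated from the debci 
journal lines quoted above, and the backlog can be read on the master with 
rabbitmqctl (a sketch; the outcome keywords are taken from the log excerpts 
and may not be exhaustive):

# on the worker: tests finished per day
journalctl -t debci --since='7 days ago' -o short-iso \
  | grep -E ' testing/s390x (pass|fail|tmpfail)$' \
  | awk '{ split($1, d, "T"); n[d[1]]++ } END { for (day in n) print day, n[day] }' \
  | sort

# on ci-master: current depth of the queue seen in the worker logs
sudo rabbitmqctl list_queues name messages | grep s390x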


Kind regards and thanks
Philipp Kern



Re: help needed to manage s390x host for ci.debian.net

2023-02-18 Thread James Addison
> James Addison suggested in [3] increasing a prefetch counter in amqp 
> (although it's the same on all hosts); I have done so on the s390x host and at 
> least initially it seems to help keep the host busier.

Thanks for applying that - I was hoping that the change might also
result in reductions in the debci queue size for s390x, but that
doesn't appear to have happened, going by
https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_queue_size.html

[3] https://salsa.debian.org/ci-team/debci/-/issues/92#note_381306



Re: help needed to manage s390x host for ci.debian.net

2023-02-17 Thread Antonio Terceiro
On Tue, Feb 14, 2023 at 09:42:09PM +0100, Paul Gevers wrote:
> Hi Phil,
> 
> On 13-02-2023 08:57, Philipp Kern wrote:
> > On 12.02.23 22:38, Paul Gevers wrote:
> > > I have munin [1], but as said, I'm not a trained sysadmin. I don't
> > > know what I'm looking for if you ask "statistics on the network".
> > 
> > This is more of a software development / devops question than a sysadmin
> > question, but alas.
> 
> I acknowledge that my reach out was broad and didn't only cover s390x.
> 
> > What I am interested in is *application-level* logging on reconnects.
> > Presumably the connection to RabbitMQ is outbound?
> 
> Our configuration can be seen here:
> https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/rabbitmq/templates/rabbitmq.conf.erb
> 
> > Is it tunneled? Does your application log somewhere when a reconnect
> > happens? Does it say when it successfully connected?
> > 
> > I'd expect good software to log something like this:
> > 
> > [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> > [10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".
> > 
> > And also:
> > 
> > [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> > [10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345"
> > failed: Connection refused
> 
> @terceiro: I haven't seen these kinds of logs on the worker hosts. Do you
> know if they exist or if we can generate them?

The worker does log its initial connection, see below.

> I think I'm seeing something on the main host.
> admin@ci-master:/var/log/rabbitmq$ sudo grep 148.100.88.163
> rab...@ci-master.log | grep -v '\[info\]' |  grep -v '\[warning\]'
> 2023-02-14 00:00:37.522 [error] <0.30951.85> closing AMQP connection
> <0.30951.85> (148.100.88.163:49540 -> 10.1.14.198:5671):
> 2023-02-14 02:27:56.050 [error] <0.15184.87> closing AMQP connection
> <0.15184.87> (148.100.88.163:49988 -> 10.1.14.198:5671):
> 2023-02-14 02:36:05.496 [error] <0.17479.87> closing AMQP connection
> <0.17479.87> (148.100.88.163:57098 -> 10.1.14.198:5671):
> 2023-02-14 04:06:13.869 [error] <0.16105.88> closing AMQP connection
> <0.16105.88> (148.100.88.163:42984 -> 10.1.14.198:5671):
> 2023-02-14 04:15:27.696 [error] <0.19038.88> closing AMQP connection
> <0.19038.88> (148.100.88.163:56650 -> 10.1.14.198:5671):
> 2023-02-14 20:05:38.702 [error] <0.23586.97> closing AMQP connection
> <0.23586.97> (148.100.88.163:34278 -> 10.1.14.198:5671):
> 
> and a lot more warnings (220 times in 20 hours) as well; like:
> 2023-02-14 20:05:09.011 [warning] <0.20860.97> closing AMQP connection
> <0.20860.97> (148.100.88.163:45624 -> 10.1.14.198:5671, vhost: '/', user:
> 'guest'):
> 
> And a lot (around 544) (obviously I don't know if that's only or even
> includes the s390x host):
> client unexpectedly closed TCP connection

root@ci-worker-s390x-01:~# journalctl -u debci-worker@1.service --since='2 days 
ago' -t debci -b 0 | grep amqp
Feb 15 15:10:21 ci-worker-s390x-01 debci[663]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 17:58:53 ci-worker-s390x-01 debci[2740543]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 19:23:40 ci-worker-s390x-01 debci[1855652]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 19:28:12 ci-worker-s390x-01 debci[1939916]: retry: amqp-publish returned 
1, backing off for 10 seconds and trying again...
Feb 15 20:50:51 ci-worker-s390x-01 debci[783145]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 21:36:25 ci-worker-s390x-01 debci[1966510]: retry: amqp-publish returned 
1, backing off for 10 seconds and trying again...
Feb 15 22:21:43 ci-worker-s390x-01 debci[3243793]: retry: amqp-publish returned 
1, backing off for 10 seconds and trying again...
Feb 16 00:29:41 ci-worker-s390x-01 debci[4119188]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 01:21:26 ci-worker-s390x-01 debci[2097411]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 03:02:20 ci-worker-s390x-01 debci[1133799]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 06:40:46 ci-worker-s390x-01 debci[953820]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 08:00:24 ci-worker-s390x-01 debci[2875496]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 09:59:09 ci-worker-s390x-01 debci[3864527]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 11:47:09 ci-worker-s390x-01 debci[2310984]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 14:08:01 ci-worker-s390x-01 debci[1968077]: I: Connecting to AMQP queue 
debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 16:58:24 ci-worker-s390x-01 debci[2496027]: I: 

Re: help needed to manage s390x host for ci.debian.net

2023-02-16 Thread Paul Gevers

Hi,

On 13-02-2023 15:59, Dipak Zope1 wrote:
There is an issue with the 5.10.0-21 kernel and we are working on it. It 
can cause a performance impact on the CI servers.


I rebooted into the old kernel yesterday. That helps a bit indeed, 
although most of the issues I reported predate that kernel upgrade.


As Paul mentioned, we upgraded the CI servers to a better capacity in May 
last year. Is today's performance worse than what we observed right 
after the upgrade?


As you can see e.g. here [1,2] it comes and goes (albeit sometimes the 
queue was empty). I don't think it's very different; I just never got out 
of the s390x host what I was expecting. For a long time I blamed it on the 
"stealing" that happens on a shared host, but I think there's more.


[1] 
https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/debci_packages_processed.html


[2] 
https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/cpu.html


Has the performance deteriorated gradually over a period 
of time, or was the drop sudden? If it was sudden, is there any incident, 
such as a change or upgrade in a software or hardware component, that 
coincides with it?


James Addison suggested in [3] increasing a prefetch counter in amqp 
(although it's the same on all hosts); I have done so on the s390x host 
and at least initially it seems to help keep the host busier.


[3] https://salsa.debian.org/ci-team/debci/-/issues/92#note_381306
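
For anyone wanting to reproduce this on another worker: the setting is 
presumably applied through debci's consumer invocation, and the flag name is 
an assumption here, so first check whether the installed amqp-consume exposes 
a prefetch option at all:

amqp-consume --help 2>&1 | grep -i prefetch
dpkg -l amqp-tools | grep '^ii'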

Paul




Re: help needed to manage s390x host for ci.debian.net

2023-02-14 Thread Paul Gevers

Hi Phil,

On 13-02-2023 08:57, Philipp Kern wrote:

On 12.02.23 22:38, Paul Gevers wrote:
I have munin [1], but as said, I'm not a trained sysadmin. I don't 
know what I'm looking for if you ask "statistics on the network".


This is more of a software development / devops question than a sysadmin 
question, but alas.


I acknowledge that my reach out was broad and didn't only cover s390x.

What I am interested in is *application-level* 
logging on reconnects. Presumably the connection to RabbitMQ is 
outbound?


Our configuration can be seen here:
https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/rabbitmq/templates/rabbitmq.conf.erb

Is it tunneled? Does your application log somewhere when a 
reconnect happens? Does it say when it successfully connected?


I'd expect good software to log something like this:

[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".

And also:

[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345" 
failed: Connection refused


@terceiro: I haven't seen these kinds of logs on the worker hosts. Do you 
know if they exist or if we can generate them?


I think I'm seeing something on the main host.
admin@ci-master:/var/log/rabbitmq$ sudo grep 148.100.88.163 
rab...@ci-master.log | grep -v '\[info\]' |  grep -v '\[warning\]'
2023-02-14 00:00:37.522 [error] <0.30951.85> closing AMQP connection 
<0.30951.85> (148.100.88.163:49540 -> 10.1.14.198:5671):
2023-02-14 02:27:56.050 [error] <0.15184.87> closing AMQP connection 
<0.15184.87> (148.100.88.163:49988 -> 10.1.14.198:5671):
2023-02-14 02:36:05.496 [error] <0.17479.87> closing AMQP connection 
<0.17479.87> (148.100.88.163:57098 -> 10.1.14.198:5671):
2023-02-14 04:06:13.869 [error] <0.16105.88> closing AMQP connection 
<0.16105.88> (148.100.88.163:42984 -> 10.1.14.198:5671):
2023-02-14 04:15:27.696 [error] <0.19038.88> closing AMQP connection 
<0.19038.88> (148.100.88.163:56650 -> 10.1.14.198:5671):
2023-02-14 20:05:38.702 [error] <0.23586.97> closing AMQP connection 
<0.23586.97> (148.100.88.163:34278 -> 10.1.14.198:5671):


and a lot more warnings (220 times in 20 hours) as well; like:
2023-02-14 20:05:09.011 [warning] <0.20860.97> closing AMQP connection 
<0.20860.97> (148.100.88.163:45624 -> 10.1.14.198:5671, vhost: '/', 
user: 'guest'):


And a lot (around 544) (obviously I don't know if that's only or even 
includes the s390x host):

client unexpectedly closed TCP connection
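
To see whether those closes cluster in time, something like this groups them 
per hour (the log path below is a placeholder for the broker log quoted above):

LOG=/var/log/rabbitmq/rabbit@ci-master.log   # placeholder path
sudo grep '148.100.88.163' "$LOG" | grep 'closing AMQP connection' \
  | awk '{ split($2, t, ":"); print $1, t[1] ":00" }' \
  | sort | uniq -c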

Paul




RE: help needed to manage s390x host for ci.debian.net

2023-02-13 Thread Dipak Zope1
Hello Paul, Philipp and team -

There is an issue with the 5.10.0-21 kernel and we are working on it. It can 
cause a performance impact on the CI servers.
So I request that we downgrade all our CI machines that are on 5.10.0-21 to 
5.10.0-20.
There is a pretty good chance that we will see better throughput after the 
downgrade if we are currently running 5.10.0-21.
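
A generic check of what each guest is actually running and which kernel images 
are installed (selecting the older kernel at boot is then done through the 
platform's boot loader):

uname -r
dpkg -l 'linux-image-5.10*' | awk '/^ii/ { print $2, $3 }'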

Besides this, I would like to understand more about the performance:
As Paul mentioned, we upgraded the CI servers to a better capacity in May last 
year. Is today's performance worse than what we observed right after the 
upgrade? Has the performance deteriorated gradually over a period of time, or 
was the drop sudden? If it was sudden, is there any incident, such as a change 
or upgrade in a software or hardware component, that coincides with it?

Meanwhile we are working on the kernel fix. I will keep this mail thread 
updated.

Thanks,
-Dipak Zope
Debian s390 porting team

From: Philipp Kern 
Date: Monday, 13 February 2023 at 1:28 PM
To: Paul Gevers , debian-s390 
, bar...@velocitysoftware.com 
, Paul Flint 
Cc: Debian CI team 
Subject: [EXTERNAL] Re: help needed to manage s390x host for ci.debian.net
Hi,

On 12.02.23 22:38, Paul Gevers wrote:
> I have munin [1], but as said, I'm not a trained sysadmin. I don't know
> what I'm looking for if you ask "statistics on the network".

This is more of a software development / devops question than a sysadmin
question, but alas. What I am interested in is *application-level*
logging on reconnects. Presumably the connection to RabbitMQ is
outbound? Is it tunneled? Does your application log somewhere when a
reconnect happens? Does it say when it successfully connected?

I'd expect good software to log something like this:

[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".

And also:

[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345"
failed: Connection refused

Kind regards
Philipp Kern


Re: help needed to manage s390x host for ci.debian.net

2023-02-12 Thread Philipp Kern

Hi,

On 12.02.23 22:38, Paul Gevers wrote:
I have munin [1], but as said, I'm not a trained sysadmin. I don't know 
what I'm looking for if you ask "statistics on the network".


This is more of a software development / devops question than a sysadmin 
question, but alas. What I am interested in is *application-level* 
logging on reconnects. Presumably the connection to RabbitMQ is 
outbound? Is it tunneled? Does your application log somewhere when a 
reconnect happens? Does it say when it successfully connected?


I'd expect good software to log something like this:

[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".

And also:

[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345" 
failed: Connection refused
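
A minimal sketch of a wrapper that would produce this kind of logging around 
an amqp-consume based worker; the broker URL, queue name and handler command 
are placeholders, not debci's real invocation:

BROKER=amqps://ci-master.debian.net      # placeholder
QUEUE=debci-tests-s390x-lxc              # placeholder
while true; do
    echo "$(date -Is) Connecting to broker \"$BROKER\" (queue $QUEUE)..."
    if amqp-consume --url "$BROKER" -q "$QUEUE" ./handle-test-request; then
        echo "$(date -Is) Consumer exited cleanly."
    else
        echo "$(date -Is) Connection to broker \"$BROKER\" failed or dropped, retrying in 10s."
    fi
    sleep 10
done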


Kind regards
Philipp Kern



RE: help needed to manage s390x host for ci.debian.net

2023-02-12 Thread Dipak Zope1
I am not a CI/networking expert, but I will be more than happy to assist.
I am widely available in the +0530 timezone.

Thanks,
-Dipak Zope
Debian s390 porting team

On 13/02/23, 3:09 AM, "Paul Gevers"  wrote:
Hi Phil and all others offering help,

On 12-02-2023 20:32, Philipp Kern wrote:
> On 11.02.23 18:18, Paul Gevers wrote:
>>   * [suspect 1] network issues between the s390x and the main ci.d.n
>> server (the results (log files) of the autopkgtests are transferred to
>> the main server). Our ppc64el hosts are also located at Marist, so I
>> would expect commonality here, but also ppc64el isn't performing
>> great, so maybe part of the problem is common.
>
> Do you have any kind of statistics on the network connections? I.e. how
> often it reconnects and how long it takes to reconnect? The Marist
> network has a very weird firewall inbound (e.g. if I do too many SSH
> requests in a row, I'm blackholed) - so I would not be surprised if there
> is some weirdness there.

I have munin [1], but as said, I'm not a trained sysadmin. I don't know
what I'm looking for if you ask "statistics on the network".

Also, I have no experience with s390x except for deploying the Debian
software on the server set up by Phil. All the quirks of s390x are beyond me.

I can provide logging from the host, but I'll need detailed instructions
of what people find useful to look at. Recently Antonio taught me a
trick to provide temporary access to a lxc container on any of our
hosts, so if it helps to be on the host (but inside lxc) we can provide
for that.

Paul

[1]
https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/index.html



Re: help needed to manage s390x host for ci.debian.net

2023-02-12 Thread Paul Gevers

Hi Phil and all others offering help,

On 12-02-2023 20:32, Philipp Kern wrote:

On 11.02.23 18:18, Paul Gevers wrote:
  * [suspect 1] network issues between the s390x and the main ci.d.n
server (the results (log files) of the autopkgtests are transferred to 
the main server). Our ppc64el hosts are also located at Marist, so I 
would expect commonality here, but also ppc64el isn't performing 
great, so maybe part of the problem is common.


Do you have any kind of statistics on the network connections? I.e. how 
often it reconnects and how long it takes to reconnect? The Marist 
network has a very weird firewall inbound (e.g. if I do too many SSH 
requests in a row, I'm blackholed) - so I would not be surprised if there 
is some weirdness there.


I have munin [1], but as said, I'm not a trained sysadmin. I don't know 
what I'm looking for if you ask "statistics on the network".


Also, I have no experience with s390x except for deploying the Debian 
software on the server set up by Phil. All the quirks of s390x are beyond me.


I can provide logging from the host, but I'll need detailed instructions 
of what people find useful to look at. Recently Antonio taught me a 
trick to provide temporary access to a lxc container on any of our 
hosts, so if it helps to be on the host (but inside lxc) we can provide 
for that.
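
For reference, a generic sketch of such a throwaway container (not necessarily 
the trick Antonio showed; container name, release and user are placeholders):

lxc-create -n debug-ibm -t download -- -d debian -r bookworm -a s390x
lxc-start -n debug-ibm
lxc-attach -n debug-ibm -- adduser --disabled-password --gecos '' ibmdebug
# then place the visitor's SSH public key in ~ibmdebug/.ssh/authorized_keys inside the container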


Paul

[1] 
https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/index.html




Re: help needed to manage s390x host for ci.debian.net

2023-02-12 Thread Philipp Kern

On 11.02.23 18:18, Paul Gevers wrote:
 * [suspect 1] network issues between the s390x and the main ci.d.n
server (the results (log files) of the autopkgtests are transferred to 
the main server). Our ppc64el hosts are also located at Marist, so I 
would expect commonality here, but also ppc64el isn't performing great, 
so maybe part of the problem is common.


Do you have any kind of statistics on the network connections? I.e. how 
often it reconnects and how long it takes to reconnect? The Marist 
network has a very weird firewall inbound (e.g. if I do too many SSH 
requests in a row, I'm blackholed) - so I would not be surprised if there 
is some weirdness there.


Kind regards
Philipp Kern