Re: relayd crashing some times

2023-07-04 Thread Kapetanakis Giannis
Hello,

I will try your diff, but since I have to completely turn off mail service it 
might take a while.

Meanwhile, just a wild guess from my side, although I'm not a dev:

It seems to me that a table is being removed, specifically the table that has 
the hosts for the redirect.
It's as if, after some active sessions expire (1-2min delay), the table is being 
removed like it's not persistent. Why was the table removed in the first 
place? Maybe because there was no active host inside that table (table empty).

Then some statistics call is made on that table and it exits since the table 
is not there.

If that's the case then it should indeed not call statistics on the disabled 
table.

regards,

Giannis
PS: without the new diff, I cannot replicate it if the load balancer does not 
have active sessions on the redirect when I disable the hosts. It also does not 
happen on the backup load balancer.

On 03/07/2023 19:18, Alexandr Nedvedicky wrote:
> Hello,
>
> I took a closer look at relayd logs you obtained and tried to locate
> the origin of those messages in source code. The exact mechanism is
> still a mystery to me.
>
> I'll start with brief summary:
>
>> Jun 30 01:47:46 ll1 relayd[61766]: pfe: check_table: cannot get table stats: 
>> No such file or directory
> line above is what pfe process reports right before it exits:
>
>     if (ioctl(env->sc_pf->dev, DIOCRGETTSTATS, &io) == -1)
>         fatal("%s: cannot get table stats for %s@%s", __func__,
>             io.pfrio_table.pfrt_name, io.pfrio_table.pfrt_anchor);
>
> snippet above comes from pfe_filter.c:check_table() function,
> which is called from pfe.c:pfe_statistics(). The pfe_statistics()
> function is called periodically from a timer. I've noticed
> there is a function pfe_disable_events() which should disable the
> timer (stop the calls to pfe_statistics()). However pfe_disable_events()
> is unused; we never call it.
>
>> Jun 30 01:45:59 ll1 relayd[52103]: incremented the demote state of group 
>> '0relay'
> line above comes from carp_demote_ioctl() here:
>
>     else
>         log_info("%s the demote state of group '%s'",
>             (demote > 0) ? "incremented" : "decremented", group);
>
> carp_demote_ioctl() is called by carp_demote_shutdown() which
> itself is called from parent_shutdown():
>
> void
> parent_shutdown(struct relayd *env)
> {
>     config_purge(env, CONFIG_ALL);
>
>     proc_kill(env->sc_ps);
>     control_cleanup(&env->sc_ps->ps_csock);
>     carp_demote_shutdown();
>
>     free(env->sc_ps);
>     free(env);
>
>     log_info("parent terminating, pid %d", getpid());
>
>     exit(0);
> }
>
> the relayd parent process is going to peacefully exit anyway. The parent
> exit is confirmed by these lines in the log:
>> Jun 30 01:47:46 ll1 relayd[52103]: decremented the demote state of group 
>> '0relay'
>> Jun 30 01:47:46 ll1 relayd[52103]: parent terminating, pid 52103
> the only way we could arrive at the parent_shutdown() function
> is after receiving IMSG_CTL_SHUTDOWN, which is sent on behalf
> of the command 'relayctl stop'; I have no idea where it got
> called from.
>
> anyway I suspect we must disable the periodic event in the `pfe`
> process to avoid an unexpected exit via a call to fatal().
>
> can you give the diff below a try?
>
> thanks and
> regards
> sashan
>
> 8<---8<---8<--8<
> diff --git a/usr.sbin/relayd/pfe.c b/usr.sbin/relayd/pfe.c
> index 3a97b749c4b..ad9c9cdc0cc 100644
> --- a/usr.sbin/relayd/pfe.c
> +++ b/usr.sbin/relayd/pfe.c
> @@ -93,6 +93,7 @@ pfe_init(struct privsep *ps, struct privsep_proc *p, void *arg)
>  void
>  pfe_shutdown(void)
>  {
> +	pfe_disable_events();
>  	flush_rulesets(env);
>  	config_purge(env, CONFIG_ALL);
>  }
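
To illustrate the ordering concern behind the diff above, here is a minimal
sketch. It does not use relayd's real structures or libevent: struct fake_pfe,
stats_tick() and both shutdown functions are hypothetical names made up for
this example. It only models the idea that a periodic statistics callback
which still fires after the tables are gone ends up on the fatal() path. It
is not claimed to be the complete fix; a later report in this thread shows the
error could still be triggered with the patch applied.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the pfe process state; not relayd's real struct. */
struct fake_pfe {
	bool timer_armed;    /* periodic statistics timer is scheduled */
	bool tables_present; /* pf tables still exist */
	int  fatal_hits;     /* times the callback would have called fatal() */
};

/* Models one timer expiry: it only does work while the timer is armed. */
void
stats_tick(struct fake_pfe *p)
{
	if (!p->timer_armed)
		return;
	if (!p->tables_present)
		p->fatal_hits++; /* the real code would exit via fatal() here */
}

/* Shutdown without disabling the timer: tables vanish, timer keeps firing. */
void
shutdown_unpatched(struct fake_pfe *p)
{
	p->tables_present = false;
}

/* Shutdown in the order the diff proposes: stop the timer, then flush. */
void
shutdown_patched(struct fake_pfe *p)
{
	p->timer_armed = false;
	p->tables_present = false;
}

/* Runs one shutdown variant followed by a late timer expiry. */
int
fatal_hits_after(void (*do_shutdown)(struct fake_pfe *))
{
	struct fake_pfe p = { true, true, 0 };

	stats_tick(&p);  /* normal operation: tables exist, no error */
	do_shutdown(&p);
	stats_tick(&p);  /* expiry that arrives while shutting down */
	return p.fatal_hits;
}
```

Calling fatal_hits_after(shutdown_unpatched) counts one fatal hit, while
fatal_hits_after(shutdown_patched) counts none.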



Re: relayd crashing some times

2023-07-04 Thread Alexandr Nedvedicky
Hello,

On Tue, Jul 04, 2023 at 11:22:34AM +0300, Kapetanakis Giannis wrote:
> Hello,
> 
> I will try your diff, but since I have to completely turn off mail service it 
> might take a while.
> 
> Meanwhile, just a wild guess from my side, although I'm not a dev:
> 
> It seems to me that a table is being removed, specifically the table that has
> the hosts for the redirect.  It's like after some active sessions expire
> (1-2min delay), the table is being removed like it's not persistent. Why was
> the table removed in the first place? Maybe because there was no active
> host inside that table (table empty).

'Why did the table get removed?' is the right question. The tables are
removed by the kill_tables() function in pfe_filter.c. The function itself is
called on behalf of flush_rulesets(), which is called by pfe_shutdown().
Also remember that the logs you've captured clearly indicate we are on the
shutdown path. So there is a next question: how did the relayd process get
onto its shutdown path?

Also, the relayd which exits: does it run on the primary firewall or on the 
secondary one?

thanks and
regards
sashan



Re: relayd crashing some times

2023-07-04 Thread Kapetanakis Giannis
On 04/07/2023 11:45, Alexandr Nedvedicky wrote:
> Hello,
>
> On Tue, Jul 04, 2023 at 11:22:34AM +0300, Kapetanakis Giannis wrote:
>> Hello,
>>
>> I will try your diff, but since I have to completely turn off mail service 
>> it might take a while.
>>
>> Meanwhile, just a wild guess from my side, although I'm not a dev:
>>
>> It seems to me that a table is being removed, specifically the table that has
>> the hosts for the redirect.  It's like after some active sessions expire
>> (1-2min delay), the table is being removed like it's not persistent. Why was
>> the table removed in the first place? Maybe because there was no active
>> host inside that table (table empty).
> 'why table got removed?' is the right question. the tables are being
> removed by kill_tables() function in pfe_filter.c. The function itself is
> being called on behalf of flush_rulesets(), which is called by 
> pfe_shutdown().
> also remember logs you've captured cleanly indicate we are on shutdown 
> road.
> so there is a next question: how relayd process got to its shutdown path?
>
> also the relayd which exits: does it run on primary firewall or on secondary 
> one?

It's always on the primary ONLY.

I see the tables have 3 statuses:
- active
- disabled, when manually disabled by relayctl table disable
- "empty", when all hosts inside have been disabled. This is our case.
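
A rough sketch of how these three states could be derived, assuming a table
counts as "empty" when it is not manually disabled but none of its hosts is
enabled. The enum, the struct and the function name are made up for
illustration and are not relayd's actual definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states mirroring the three statuses listed above. */
enum table_status { TABLE_ACTIVE, TABLE_DISABLED, TABLE_EMPTY };

/* Hypothetical host record: only tracks whether the host was disabled. */
struct host {
	bool disabled;
};

enum table_status
table_status_of(bool table_disabled, const struct host *hosts, size_t nhosts)
{
	size_t i;

	/* A manual "table disable" wins over whatever the hosts are doing. */
	if (table_disabled)
		return TABLE_DISABLED;

	/* One enabled host is enough to keep the table active. */
	for (i = 0; i < nhosts; i++)
		if (!hosts[i].disabled)
			return TABLE_ACTIVE;

	/* Table itself is enabled, but every host in it is disabled. */
	return TABLE_EMPTY;
}
```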

I'm trying to find a way to replicate the issue without disrupting the 
production service.

I've made a copy of the tables and redirects to another IP but I cannot 
replicate the issue.

If I have no active sessions from the redirect, nothing happens.
Even open openssl s_client connections to 993/995 didn't trigger it.

13  redirect  dir2-imap          down
13  table     dir2:993           empty
26  host      dir12              disabled
27  host      dir22              disabled
14  redirect  dir2-pop           down
14  table     dir2_:995          empty
28  host      dir12 parent 26    disabled
29  host      dir22 parent 27    disabled
15  redirect  dir2-lmtp          down
15  table     dir2_:24           empty
30  host      dir12 parent 26    disabled
31  host      dir22 parent 27    disabled
16  redirect  dir2-sieve         down
16  table     dir2_:4190         empty
32  host      dir12 parent 26    disabled
33  host      dir22 parent 27    disabled

all anchors are there.

G
this is without your last diff


Re: ARM64 installation with new snapshots not possible any longer

2023-07-04 Thread Patrick Wildt
On Sat, Jun 24, 2023 at 05:48:15PM +0200, Volker Schlecht wrote:
> > > Date: Tue, 20 Jun 2023 09:31:58 +0200
> > > From: develo...@robert-palm.de
> > > 
> > > Hi,
> > > 
> > > I noticed that an ARM64 installation with latest snapshots is not
> > > possible any longer in hetzner cloud arm64 instances (ampere altra).
> > > 
> > > Last snapshot working is  
> > > https://ftp.hostserver.de/archive/2023-06-18-0105/snapshots/arm64/miniroot73.img
> [...]
> > The most likely candidate is:
> > 
> >   CVSROOT:      /cvs
> >   Module name:  src
> >   Changes by:   kette...@cvs.openbsd.org 2023/06/18 10:25:21
> >   Modified files:
> >   sys/arch/arm64/dev: agintc.c
> >   Log message:
> >   Remove spurious comment.
> >   ok patrick@
> > 
> > Can you try reverting that change and see if the resulting kernel boots?
> 
> I hope that commit didn't have any functional consequences, so I built a
> -current kernel with the previous diff reverted, and it boots fine now:
> 
> CVSROOT:  /cvs
> Module name:  src
> Changes by:   kette...@cvs.openbsd.org 2023/06/17 16:10:19
> 
> Modified files:
>   sys/arch/arm64/dev: agintc.c
> 
> Log message:
> Flush the ITS after we disestablish an MSI.  Fixes an issue seen on Ampere
> eMAG with an AMD GPU with an HD audio function where azalia(4) doesn't
> fully attach.
> 
> ok patrick@

It should not have a functional consequence on the VM, but obviously it
does make a difference because it stops booting.  I have tried to come
up with a diff that replaces INVALL with targeted INV, but those still
lead to a hang.

Since backing this diff out breaks another machine, a solution needs to
be found, not a backout.  I'll have a look.



opensmtpd-filter-dkimsign 's pkg-readme: Modify command line to fix permission error

2023-07-04 Thread @nabbisen



Hello, the great team and the wonderful community,


I found that the following instructions

in /usr/local/share/doc/pkg-readmes/opensmtpd-filter-dkimsign

> To generate the public key ready for dns:
>
>   openssl rsa -in /etc/mail/dkim/private.rsa.key -pubout | \
>     sed '1s/.*/v=DKIM1;p=/;:nl;${s/-.*//;q;};N;s/\n//g;b nl;'

should start the command with `doas -u _dkimsign `, as below:

```
To generate the public key ready for dns:

  doas -u _dkimsign openssl rsa -in /etc/mail/dkim/private.rsa.key -pubout | \
    sed '1s/.*/v=DKIM1;p=/;:nl;${s/-.*//;q;};N;s/\n//g;b nl;'
```

because only _dkimsign can read /etc/mail/dkim/ by default.


Thank you for reading.


--
Kind regards,
@nabbisen



Minor defect in OpenBSD install program ...

2023-07-04 Thread Why 42? The lists account.


Hi All,

FYI, I think there is a minor defect in the OpenBSD installation
program.

I noticed what looks like the use of an unset / uninitialised variable in
the text output:
> ...
> Let's install the sets!
> Location of sets? (disk http nfs or 'done') [http] 
> HTTP proxy URL? (e.g. 'http://proxy:8080', or 'none') [none] 
> (Unable to get list from openbsd.org, but that is OK)
> HTTP Server? (hostname or 'done') ftp.fau.de
> Server directory? [pub/OpenBSD/snapshots/armv7] 
> Unable to connect using HTTPS; using HTTP instead.
> Unable to get a verified list of distribution sets.
> Looked at  and found no OpenBSD/armv7 7.3 sets.  The set names looked for 
> were:
> bsd xbase73.tgz
> bsd.rd  xshare73.tgz
> base73.tgz  xfont73.tgz
> comp73.tgz  xserv73.tgz
> man73.tgz   site73.tgz
> game73.tgz  site73-novaya-zemlya.tgz

Notice the "Looked at  and found no" with double space.

I'm providing a valid host and (I believe) path.

This is a bit off the "main path" ... I think the root cause of the issue
here is that the Network (Ethernet) driver is not functioning correctly.

E.g.  if I drop out of the install I see ping statistics like this:
> ...
> Type 'exit' to return to install.
> novaya-zemlya# ping -v 192.168.178.85
> PING 192.168.178.85 (192.168.178.14 --> 192.168.178.85): 56 data bytes
> 64 bytes from 192.168.178.85: icmp_seq=0 ttl=64 time=1008.556 ms
> 64 bytes from 192.168.178.85: icmp_seq=1 ttl=64 time=2.239 ms
> 64 bytes from 192.168.178.85: icmp_seq=5 ttl=64 time=1.156 ms
> 64 bytes from 192.168.178.85: icmp_seq=7 ttl=64 time=0.939 ms
> 64 bytes from 192.168.178.85: icmp_seq=10 ttl=64 time=1.192 ms
> 64 bytes from 192.168.178.85: icmp_seq=19 ttl=64 time=1.131 ms
> 64 bytes from 192.168.178.85: icmp_seq=23 ttl=64 time=1.106 ms
> ^C
> --- 192.168.178.85 ping statistics ---
> 25 packets transmitted, 7 packets received, 72.0% packet loss
> round-trip min/avg/max/std-dev = 0.939/145.188/1008.556/352.469 ms
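
As a quick check of the loss figure above, a small sketch of the arithmetic:
25 packets sent and 7 received means 18 were lost, and 18 / 25 = 72%. The
function name is made up for this example and is not taken from ping(8).

```c
/*
 * Packet loss in percent. Returns 0.0 when nothing was sent or when the
 * counters are inconsistent (more received than sent).
 */
double
packet_loss_percent(unsigned int sent, unsigned int received)
{
	if (sent == 0 || received > sent)
		return 0.0;
	return 100.0 * (double)(sent - received) / (double)sent;
}
```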

Presumably the same issue affects the install program's attempts to reach
the HTTP server, leading to some name variable not being set ...

This is with a 7.3 snapshot on a 32-bit ARM platform:
> novaya-zemlya# uname -a
> ksh: uname: not found

> novaya-zemlya# sysctl
> kern.osrelease=7.3
> hw.machine=armv7
> hw.model=ARM Cortex-A9 r2p10
> hw.product=Kosagi Novena Dual/Quad
> hw.disknames=sd0:60443d11093dd341,rd0:b66dc1c5a063c2b5,sd1:b4cca6f4102ee145,sd2:
> hw.ncpufound=1
> machdep.compatible=kosagi,imx6q-novena



Re: relayd crashing some times

2023-07-04 Thread Kapetanakis Giannis

On 03/07/2023 19:18, Alexandr Nedvedicky wrote:

> 8<---8<---8<--8<
> diff --git a/usr.sbin/relayd/pfe.c b/usr.sbin/relayd/pfe.c
> index 3a97b749c4b..ad9c9cdc0cc 100644
> --- a/usr.sbin/relayd/pfe.c
> +++ b/usr.sbin/relayd/pfe.c
> @@ -93,6 +93,7 @@ pfe_init(struct privsep *ps, struct privsep_proc *p, void *arg)
>  void
>  pfe_shutdown(void)
>  {
> +	pfe_disable_events();
>  	flush_rulesets(env);
>  	config_purge(env, CONFIG_ALL);
>  }



After adding this I got:

Jul  4 18:39:20 ll1 relayd[44353]: pfe: sync_table: cannot set address 
list: No such process
Jul  4 18:39:20 ll1 relayd[89408]: parent: proc_dispatch: msgbuf_write: 
Broken pipe


This happened only the first time I did the restart.

I didn't get it again; I don't know if it's related to this change 
or some other circumstance.


As for the diff:

I was able to trigger it again, but this time when the patched relayd 
was in BACKUP state (demoted).

I was trying to trigger it on the backup firewall...
I disabled dir1/dir2 hosts in both firewalls. I was expecting fw2 to 
stop, but I saw fw1 stopping (the patched one).


Jul  4 19:07:51 ll1 relayd[17501]: pfe: check_table: cannot get table 
stats for dir-lmtp@relayd/dir-lmtp: No such file or directory
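
For reference, the two error strings in these logs are the standard errno
messages: "No such file or directory" is ENOENT and "No such process" is
ESRCH. The sketch below is only an illustration, not relayd code: it shows
one way a caller could treat a table that is already gone as non-fatal while
the daemon is shutting down. table_error_is_fatal() and its shutting_down
parameter are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical policy helper: decide whether a failed table ioctl should
 * terminate the process. Not part of relayd.
 */
bool
table_error_is_fatal(int err, bool shutting_down)
{
	/* A table or anchor that has vanished is expected during shutdown. */
	if (shutting_down && (err == ENOENT || err == ESRCH))
		return false;

	/* Anything else, or the same error in normal operation, stays fatal. */
	return true;
}
```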


G



Re: relayd crashing some times

2023-07-04 Thread Kapetanakis Giannis

additional note:

dir-lmtp is the only redirect that has 2 listen directives. Don't know 
if this is related.


redirect dir-lmtp {
   listen on $dir_addr port 24
   listen on $imap_vip port 24 interface $imap_if
   pftag RELAYD_dir
   sticky-address
   forward to  port 24 mode least-states check icmp demote 0relay
   session timeout 4200
}

One is for the actual job and the other is for external checks (nagios/zabbix).

G