Hello,

I'm not familiar enough with relayd, so perhaps other folks
here can suggest a better way to troubleshoot the issue.

On Fri, Jun 30, 2023 at 11:10:44AM +0300, Kapetanakis Giannis wrote:
> Hello,
>
> This happened to me twice.
> OpenBSD 7.3 with syspatches.
>
> I have a pair of carp/pfsync/pf/relayd firewall-load balancers with many 
> redirects (only) on them.
>
> I wanted to do maintenance of some hosts below load balancers.
> After a while relayd crashed on Master firewall only.

    when you say crash: does it mean relayd was terminated
    by the system because of a memory/stack/program violation?
    if that is the case, is there any chance to collect a core file?

    or was it rather a voluntary exit, where relayd called its function fatal()?

    the 'No such file or directory' error, which is returned by the
    DIOCRGETTSTATS ioctl(), comes from line 1746 in sys/net/pf_table.c:

1731 int
1732 pfr_get_tstats(struct pfr_table *filter, struct pfr_tstats *tbl, int *size,
1733         int flags)
1734 {
1735         struct pfr_ktable       *p;
1736         struct pfr_ktableworkq   workq;
1737         int                      n, nn;
1738         time_t                   tzero = gettime();
1739
1740         /* XXX PFR_FLAG_CLSTATS disabled */
1741         ACCEPT_FLAGS(flags, PFR_FLAG_ALLRSETS);
1742         if (pfr_fix_anchor(filter->pfrt_anchor))
1743                 return (EINVAL);
1744         n = nn = pfr_table_count(filter, flags);
1745         if (n < 0)
1746                 return (ENOENT);


    the pfr_table_count() function fails if and only if the desired ruleset
    does not exist, that is, when the anchor named in the request cannot
    be found.

2177 int
2178 pfr_table_count(struct pfr_table *filter, int flags)
2179 {
2180         struct pf_ruleset *rs;
2181
2182         if (flags & PFR_FLAG_ALLRSETS)
2183                 return (pfr_ktable_cnt);
2184         if (filter->pfrt_anchor[0]) {
2185                 rs = pf_find_ruleset(filter->pfrt_anchor);
2186                 return ((rs != NULL) ? rs->tables : -1);
2187         }
2188         return (pf_main_ruleset.tables);
2189 }
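
    the two excerpts above can be modelled in userland to see when ENOENT
    comes back. this is only a simplified sketch, not the kernel code: the
    model_* names and the anchor list are made up for illustration, and the
    PFR_FLAG_ALLRSETS branch is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a pf ruleset: anchor path plus table count. */
struct model_ruleset {
	const char	*anchor;
	int		 tables;
};

/* Hypothetical anchors; real pf keeps its rulesets in a tree, not an array. */
static const struct model_ruleset model_anchors[] = {
	{ "relayd/www", 1 },
	{ "relayd/mail", 2 },
};

/* Linear search standing in for pf_find_ruleset(). */
static const struct model_ruleset *
model_find_ruleset(const char *anchor)
{
	size_t	i;

	for (i = 0; i < sizeof(model_anchors) / sizeof(model_anchors[0]); i++)
		if (strcmp(model_anchors[i].anchor, anchor) == 0)
			return (&model_anchors[i]);
	return (NULL);
}

/* Same shape as pfr_table_count(): -1 only when the anchor is missing. */
static int
model_table_count(const char *anchor, int main_tables)
{
	const struct model_ruleset	*rs;

	if (anchor[0] != '\0') {
		rs = model_find_ruleset(anchor);
		return ((rs != NULL) ? rs->tables : -1);
	}
	return (main_tables);
}

/* Same shape as the start of pfr_get_tstats(): negative count is ENOENT. */
static int
model_get_tstats(const char *anchor, int main_tables)
{
	if (model_table_count(anchor, main_tables) < 0)
		return (ENOENT);
	return (0);
}
```

    with these made-up anchors, model_get_tstats("relayd/gone", 0) returns
    ENOENT, while an existing anchor or an empty anchor string returns 0.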

    I wonder if it would help to adjust the fatal() call in relayd
    so it also reports the table name and anchor it is trying to find.
    a diff which adjusts the call to fatal() is below.

    if you don't want to build the whole tree and prefer an in-place
    build, you will need to adjust CFLAGS and LDFLAGS. something
    like this will be needed:

        cd /path/to/your/src/usr.sbin/relayd
        export CFLAGS='-I/path/to/your/src/sys -I/path/to/your/src/lib/libutil'
        export LDFLAGS='-L /path/to/your/src/lib/libutil'
        make


</snip>

>
> same logs on Backup firewall so far, but after a minute or so:
>
> Jun 30 01:47:46 ll1 relayd[61766]: pfe: check_table: cannot get table stats: 
> No such file or directory
    this is where I'd like to see which table relayd is trying
    to look up. process 61766 then exits via `exit(1)`,
    called from function fatal()
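
    for reference, here is a rough sketch of what the log line should look
    like once the adjusted fatal() call from the diff at the end of this
    mail is in place. I'm assuming the 'pfe: ' prefix and the errno text
    are added the same way as in the log line quoted above.
    model_fatal_msg() is a made-up helper, not a relayd function, and the
    table/anchor names in the usage note are invented.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of relayd: formats the message the
 * patched fatal() call is expected to produce. The layout
 * "<proc>: <message>: <strerror>" is an assumption taken from the
 * quoted log line.
 */
static int
model_fatal_msg(char *buf, size_t len, const char *proc, const char *func,
    const char *table, const char *anchor, int code)
{
	return (snprintf(buf, len,
	    "%s: %s: cannot get table stats for %s@%s: %s",
	    proc, func, table, anchor, strerror(code)));
}
```

    calling it with "pfe", "check_table", "www", "relayd/www" and ENOENT
    gives a line starting with
    'pfe: check_table: cannot get table stats for www@relayd/www: '
    followed by the system's text for ENOENT.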

> Jun 30 01:47:46 ll1 relayd[94434]: ca exiting, pid 94434
> Jun 30 01:47:46 ll1 relayd[83189]: ca exiting, pid 83189
> Jun 30 01:47:46 ll1 relayd[9023]: ca exiting, pid 9023
> Jun 30 01:47:46 ll1 relayd[89820]: ca exiting, pid 89820
> Jun 30 01:47:46 ll1 relayd[94676]: ca exiting, pid 94676
> Jun 30 01:47:46 ll1 relayd[1820]: hce exiting, pid 1820
> Jun 30 01:47:46 ll1 relayd[52103]: lost child: pid 61766 exited abnormally
    the parent relayd process noticed the child called exit(1)
    because it could not find the table.

    once you are able to run the patched relayd, can you try to reproduce
    the issue?

    it will also help if you collect additional data.

        pfctl -vsA > anchors-before
        # reproduce the issue, wait for relayd to exit/crash
        pfctl -vsA > anchors-after

    that data, together with the output from the adjusted call
    to fatal(), should help us better understand
    what's going on.

thanks for your help
regards
sashan

--------8<---------------8<---------------8<------------------8<--------
diff --git a/usr.sbin/relayd/pfe_filter.c b/usr.sbin/relayd/pfe_filter.c
index 347048ece56..e1ae050b768 100644
--- a/usr.sbin/relayd/pfe_filter.c
+++ b/usr.sbin/relayd/pfe_filter.c
@@ -632,7 +632,8 @@ check_table(struct relayd *env, struct rdr *rdr, struct table *table)
                goto toolong;
 
        if (ioctl(env->sc_pf->dev, DIOCRGETTSTATS, &io) == -1)
-               fatal("%s: cannot get table stats", __func__);
+               fatal("%s: cannot get table stats for %s@%s", __func__,
+                   io.pfrio_table.pfrt_name, io.pfrio_table.pfrt_anchor);
 
        return (tstats.pfrts_match);
 
>
