[Keep me in Cc:]
On Thu, Jan 24, 2019 at 05:49:46PM +0100, Andreas Longwitz wrote:
> after some more long term research I have an educated guess whats going
> on in this problem.
> 
> The problem only occurs on i386.
> 
> If I replace the counter_u64_fetch() call in pf_state_expires() by
> the value of V_pf_status.states, then pf works without problems, the
> expire time zero problem is gone:
> 
> 
> --- pf.c.1st    2018-08-14 10:17:41.000000000 +0200
> +++ pf.c        2019-01-19 17:49:18.000000000 +0100
> @@ -1542,7 +1542,7 @@
>         start = state->rule.ptr->timeout[PFTM_ADAPTIVE_START];
>         if (start) {
>                 end = state->rule.ptr->timeout[PFTM_ADAPTIVE_END];
> -               states = counter_u64_fetch(state->rule.ptr->states_cur);
> +               states = V_pf_status.states;
>         } else {
>                 start = V_pf_default_rule.timeout[PFTM_ADAPTIVE_START];
>                 end = V_pf_default_rule.timeout[PFTM_ADAPTIVE_END];
> 
> The use of counter_u64_fetch() looks a little bit at random for me. For
> all states not associated with the pf_default_rule the value of
> pf_status.states is used and for me this value is ok for all rules.
> 
> Further the counter(9) framework was created for quick and lockless
> write of counters, but fetching is more expansive. So I suggest to let
> pf_state_expires() work without a counter fetch.
> 
> Further I can confirm that the counter states_cur of the pf_default_rule
> remains correct, when the patch given above is active. Without the patch
> the counter on my main firewall machine gets slowly negative. I have
> verified this with a lot of live DTrace and kgdb script debugging.
> 
> >> OK, in the meantime I did some more research and I am now quite sure the
> >> problem with the bogus pf_default_rule->states_cur counter is not a
> >> problem in pf. I am convinced it is a problem in counter(9) on i386
> >> server. The critical code is the machine instruction cmpxchg8b used in
> >> /sys/i386/include/counter.h.
> >> 
> >> From intel instruction set reference manual:
> >> Zhis instruction can be used with a LOCK prefix allow the instruction to
> >> be executed atomically.
> >> 
> >> We have two other sources in kernel using cmpxchg8b:
> >>   /sys/i386/include/atomic.h   and
> >>   /sys/cddl/contrib/opensolaris/common/atomic/i386/opensolaris_atomic.S
> > 
> > A single CPU instruction is atomic by definition, with regards to the CPU.
> > A preemption can not happen in a middle of instruction. What the "lock"
> > prefix does is memory locking to avoid unlocked parallel access to the
> > same address by different CPUs.
> > 
> > What is special about counter(9) is that %fs:%esi always points to a
> > per-CPU address, because %fs is unique for every CPU and is constant,
> > so no other CPU may write to this address, so lock prefix isn't needed.
> > 
> > Of course a true SMP i386 isn't a well tested arch, so I won't assert
> > that counter(9) doesn't have bugs on this arch. However, I don't see
> > lock prefix necessary here.
> 
> I think the problem is the cmpxchg8b instruction used in
> counter_u64_fetch(), because this machine instruction always writes to
> memory, also when we only want to read and have (EDX:EAX) = (ECX:EBX):
> 
>     TEMP64 <- DEST
>     IF (EDX:EAX = TEMP64)
>        THEN
>           ZF <- 1
>           DEST <- ECX:EBX
>        ELSE
>           ZF <- 0
>           EDX:EAX <- TEMP64
>           DEST <- TEMP64
>     FI
> 
> If one CPU increments the counter in pf_create_state() and another does
> the fetch, then both CPU's may run the xmpxschg8b at once with the
> chance that both read the same memory value in TEMP64 and the fetching
> CPU is the second CPU that writes and so the increment is lossed. Thats
> what I see without the above patch two or three times a week.

Please try the following patch.  The idea is to make the value to compare
with unlikely to be equal to the memory content, for fetch_one().  Also
it fixes a silly bug in zero_one().

diff --git a/sys/i386/include/counter.h b/sys/i386/include/counter.h
index 7fd26d2a960..aa20831ba18 100644
--- a/sys/i386/include/counter.h
+++ b/sys/i386/include/counter.h
@@ -78,6 +78,9 @@ counter_u64_read_one_8b(uint64_t *p)
        uint32_t res_lo, res_high;
 
        __asm __volatile(
+       "movl   (%0),%%eax\n\t"
+       "movl   4(%0),%%edx\n\t"
+       "addl   $0x10000000,%%edx\n\t"  /* avoid write */
        "movl   %%eax,%%ebx\n\t"
        "movl   %%edx,%%ecx\n\t"
        "cmpxchg8b      (%2)"
@@ -120,11 +123,11 @@ counter_u64_zero_one_8b(uint64_t *p)
 {
 
        __asm __volatile(
+"\n1:\n\t"
        "movl   (%0),%%eax\n\t"
-       "movl   4(%0),%%edx\n"
+       "movl   4(%0),%%edx\n\t"
        "xorl   %%ebx,%%ebx\n\t"
        "xorl   %%ecx,%%ecx\n\t"
-"1:\n\t"
        "cmpxchg8b      (%0)\n\t"
        "jnz    1b"
        :
_______________________________________________
freebsd-pf@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-pf
To unsubscribe, send any mail to "freebsd-pf-unsubscr...@freebsd.org"

Reply via email to