Hi,

Recently, a problem with OpenBSD has popped up here that manifests
itself as "random" connection failures after some time. Network
diagram:

 workstation (1) --- (3b) firewall (3a) --- Internet --- www.example.com (2)

You surf from your workstation to www.example.com. On the firewall's
exterior interface, you can see the packets flowing:

 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)

and so on. Everything works just fine. Now, with nothing changed except
that the firewall has been up for some days (currently: 13 days) and has
already pushed some traffic, connections start to fail:

On (3a), you see "almost" the same packet sequence as shown above,
shortened for brevity:

 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)    <- point where the connection fails
 (2) -> (1)
 (2) -> (1)
 (2) -> (1)
 (2) -> (1)

but on (3b), you see:

 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)
 (2) -> (1)
 (1) -> (2)

and then nothing more, as if the web server on the other side had
stopped sending packets. I can't see the packets on pflog0, either, and
when using slightly different networking to "bypass" the firewall,
everything still works fine. "Fixing" the problem involves powering
down the firewall. Simply rebooting it without powering it down does
not fix the problem.
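
To pin down exactly where the two captures diverge, the comparison
above can be automated. The following is only a minimal sketch: it
assumes each capture has already been reduced to an ordered list of
(source, destination) labels as in the listings above, and the name
first_divergence is made up for illustration. With NAT in place, the
addresses on (3a) and (3b) would first have to be mapped to common
labels.

```python
from typing import List, Optional, Tuple

# A packet is reduced to (source, destination) labels, e.g. ("1", "2").
Packet = Tuple[str, str]


def first_divergence(exterior: List[Packet],
                     interior: List[Packet]) -> Optional[int]:
    """Return the index of the first exterior packet missing on the interior side.

    Packets are matched strictly in order. None means every packet seen
    on the exterior interface was also seen on the interior interface.
    """
    pos = 0  # next unmatched position in the interior capture
    for index, packet in enumerate(exterior):
        if pos < len(interior) and interior[pos] == packet:
            pos += 1
        else:
            return index
    return None


# The two failing traces from above, written out as label pairs.
exterior_3a = [("1", "2"), ("2", "1")] * 6 + [("2", "1")] * 4
interior_3b = [("1", "2"), ("2", "1")] * 5 + [("1", "2")]

print(first_divergence(exterior_3a, interior_3b))
```

For the traces shown here, the first unmatched packet is the (2) -> (1)
packet marked as the point where the connection fails.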

It doesn't really matter which site "www.example.com" is (the problem
starts for several sites at once, anyway), and, over time, it affects
ever more sites until the firewall is hardly usable at all. But
s1.wp.com is usually among the first sites to fail.

This problem first occurred for us with 4.6-stable on both i386 and
amd64, and has now also occurred on -current with kernel 448 on i386.
I'm currently trying to get an even more recent version installed to
see whether the problem is fixed.

The fact that the problem is only "fixed" by a thorough power cycle
suggests that there may be some underlying memory corruption problem.


I'd very much appreciate hints on how to go about debugging this, and
I can probably be remote-controlled to do some testing.

TIA!


Kind regards,
--Toni++
