alignment fault on armv7 when using carp(4)

2016-02-06 Thread Anthony Eden
>Synopsis:  
>Category:  arm
>Environment:
System  : OpenBSD 5.9
Details : OpenBSD 5.9 (DBGGENERIC) #0: Sat Feb  6 12:22:27 EST 2016
 r...@beagle2.mit.edu:/usr/src/sys/arch/armv7/compile/DBGGENERIC

Architecture: OpenBSD.armv7
Machine : armv7
>Description:
With two BeagleBone Blacks running -current, an alignment fault is
encountered at ip_input.c:262 in ipv4_input() when they are
configured to use carp(4) to share the same IP address.

Source context from ip_input.c (alignment fault occurs when
ip->ip_dst.s_addr is loaded at line 262):

258:            ip = mtod(m, struct ip *);
259:    }
260:
261:    /* 127/8 must not appear on wire - RFC1122 */
262:    if ((ntohl(ip->ip_dst.s_addr) >> IN_CLASSA_NSHIFT) == IN_LOOPBACKNET ||
263:        (ntohl(ip->ip_src.s_addr) >> IN_CLASSA_NSHIFT) == IN_LOOPBACKNET) {
264:            if ((ifp->if_flags & IFF_LOOPBACK) == 0) {
265:                    ipstat.ips_badaddr++;
266:                    goto bad;

ddb(4) output:

$ Fatal kernel mode data abort: 'Alignment Fault 1'
trapframe: 0xcb2d8e40
DFSR=0001, DFAR=c4cb401e, spsr=8013
r0 =c924d400, r1 =0003, r2 =0045, r3 =0038
r4 =c4cb400e, r5 =c06f2ca4, r6 =0014, r7 =c4d65800
r8 =c0710e50, r9 =c069294c, r10=c0692918, r11=cb2d8eb8
r12=6093, ssp=cb2d8e8c, slr=c040bc88, pc =c04616ec

Stopped at  ipv4_input+0x9c:ldrls   r3, [r4, #0x010]
ddb> trace
ipv4_input+0xc
scp=0xc046165c rlv=0xc0461ab4 (ipintr+0x24)
rsp=0xcb2d8ebc rfp=0xcb2d8ecc
r10=0xc0692918 r8=0xc0710e50 r7=0xc06edd88 r6=0xc06edd88
r5=0x r4=0x0004
ipintr+0xc
scp=0xc0461a9c rlv=0xc041b290 (netintr+0xa0)
rsp=0xcb2d8ed0 rfp=0xcb2d8ef0
netintr+0xc
scp=0xc041b1fc rlv=0xc053f3d0 (softintr_dispatch+0x84)
rsp=0xcb2d8ef4 rfp=0xcb2d8f10
r7=0x r6=0xc0710eb4 r5=0xc0710ec0 r4=0xc89e13a0
softintr_dispatch+0x18
scp=0xc053f364 rlv=0xc053eef8 (arm_do_pending_intr+0x110)
rsp=0xcb2d8f14 rfp=0xcb2d8f40
r6=0xc0710190 r5=0x2013 r4=0x0004
arm_do_pending_intr+0x10
scp=0xc053edf8 rlv=0xc040d9a8 (if_input_process+0xcc)
rsp=0xcb2d8f44 rfp=0xcb2d8f78
r10=0xc0692918 r9=0x r8=0x r7=0xcb2d8f44
r6=0x r5=0xc4d65800 r4=0xc4d57480
if_input_process+0xc
scp=0xc040d8e8 rlv=0xc03b5c2c (taskq_thread+0x90)
rsp=0xcb2d8f7c rfp=0xcb2d8fb0
r10=0xc06e643c r8=0xc06e65d8 r7=0xcb2d8f7c r6=0x0001
r5=0xc89e2040 r4=0xc03b5b04
taskq_thread+0xc
scp=0xc03b5ba8 rlv=0xc0536c10 (proc_trampoline+0x18)
rsp=0xcb2d8fb4 rfp=0xc07f3edc
r7=0x r6=0x r5=0xc89e2040 r4=0xc03b5b9c
Bad frame pointer: 0xc07f3edc
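
The register dump above can be checked by hand: the faulting
instruction loads from r4 + 0x10, which matches the reported DFAR and
is not a multiple of 4. A minimal sketch of that arithmetic follows.
The helper names are hypothetical and do not come from the kernel
source. The low byte of r4 (0x0e = 14) would be consistent with a
14-byte Ethernet header sitting in front of the IP header in a
4-byte-aligned buffer, but that is an inference from the register
values, not something the trace proves.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the kernel source): address touched by
 * "ldr r3, [r4, #0x010]", i.e. base register plus immediate offset.
 */
uint32_t load_address(uint32_t base, uint32_t offset)
{
	return base + offset;
}

/* Hypothetical helper: nonzero when addr is a multiple of 4. */
int is_aligned32(uint32_t addr)
{
	return (addr & 3u) == 0;
}
```

Calling load_address(0xc4cb400e, 0x10) gives the DFAR value from the
dump, and is_aligned32() on that result returns 0.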

This problem has also been encountered with both BeagleBones running -stable.

>How-To-Repeat:
Install either -current or -stable on two BeagleBone Blacks, named
beagle1 and beagle2. On a LAN 192.168.123.0/24 with default
gateway 192.168.123.2, set /etc/mygate to 192.168.123.2 on beagle1 and
beagle2, then set /etc/hostname.cpsw0 on beagle1 to

inet 192.168.123.201 255.255.255.0 NONE

and on beagle2

inet 192.168.123.202 255.255.255.0 NONE

Then run the following commands on both to use carp(4):

doas ifconfig carp0 create
doas ifconfig carp0 vhid 1 pass tyrell carpdev cpsw0 192.168.123.222 netmask 255.255.255.0

Shortly thereafter, one of the BeagleBones will encounter an alignment fault.

>Fix:
The cause of this problem is unknown to me. I would speculate that the
issue lies in m_pullup mishandling alignment, given that networking on
the BeagleBone Black usually functions normally, and that there are
branches prior to the crash in which m_pullup is used in deriving the
pointer to ip, which is apparently misaligned when carp(4) is in use.

In investigating this issue further, I replaced offending 32-bit loads
in the kernel with calls to get_unaligned_le32(), defined as (from
linux/unaligned/packed_struct.h):

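/* Packed wrapper: tells the compiler the member may be unaligned. */
struct __una_u32 { u32 x; } __packed;
/* No byte swap is done, so the value is returned in host byte order. */
static inline u32 get_unaligned_le32(const void *p) {
	const struct __una_u32 *ptr = (const struct __una_u32 *)p;
	return ptr->x;
}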
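
For comparison, the same kind of read can be written in portable C
with memcpy instead of a packed struct. This is only a sketch to
illustrate the technique, not what the kernel does, and the function
name is hypothetical. Whether the compiler turns the copy into byte
loads or a single word load depends on the target and its flags.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read 32 bits in host byte order from an address
 * that may not be 4-byte aligned. Copying into a local variable avoids
 * dereferencing a misaligned uint32_t pointer.
 */
uint32_t load_u32_unaligned(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

A caller would pass the address of the field directly, for example
the address of ip->ip_dst.s_addr, and apply ntohl() to the result as
before.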

Besides the replacements in ip_input.c, I also changed udp_usrreq.c
and the macros IN6_IS_ADDR_UNSPECIFIED, IN6_IS_ADDR_LOOPBACK,
IN6_IS_ADDR_V4COMPAT, and IN6_IS_ADDR_V4MAPPED in in6.h.
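
As an illustration of what an alignment-independent version of such
address checks can look like, the sketch below tests an IPv6 address
one byte at a time, so no 32-bit load is ever issued. The function
names are hypothetical; this is not the in6.h implementation nor the
change I made, only a user-space example of the idea.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero when all 16 address bytes are zero (::). */
int in6_is_unspecified_bytes(const uint8_t addr[16])
{
	uint8_t acc = 0;

	for (int i = 0; i < 16; i++)
		acc |= addr[i];
	return acc == 0;
}

/* Hypothetical helper: nonzero for ::1 (fifteen zero bytes, then 0x01). */
int in6_is_loopback_bytes(const uint8_t addr[16])
{
	uint8_t acc = 0;

	for (int i = 0; i < 15; i++)
		acc |= addr[i];
	return acc == 0 && addr[15] == 1;
}
```

Byte accesses have no alignment requirement, at the cost of more
instructions per check than a word-wise comparison.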

This resulted in carp(4) appearing to function normally, but beagle1
and beagle2 repeatedly lost networking temporarily, and recurrent
'device timeout' messages appeared in dmesg (along with carp(4)
messages reporting state transitions from MASTER to BACKUP and vice
versa).

To me, that behavior suggests the problem may run deeper than a
bookkeeping mistake in aligning mbuf data.

dmesg:
OpenBSD 5.9 (DBGGENERIC) #0: Sat Feb  6 12:22:27 EST 2016
r...@beagle2.mit.edu:/usr/src/sys/arch/armv7/compile/DBGGENERIC
real mem  = 536870912 (512

Re: alignment fault on armv7 when using carp(4)

2016-02-08 Thread Anthony Eden
Thanks for the quick response. Indeed, the patch makes the alignment
faults go away. But the 'device timeout' messages coming from cpsw(4)
remain.

To elaborate a bit, I set up three terminals pinging 192.168.123.201,
192.168.123.202, and 192.168.123.222 (the shared IP). After ~1min I
get no answers from 192.168.123.201 and 192.168.123.202 (although the
times differ). For a few minutes the hosts remain unreachable. dmesg
output looks like:

carp0: state transition: BACKUP -> MASTER
carp0: state transition: MASTER -> BACKUP
carp0: state transition: BACKUP -> MASTER
cpsw0: device timeout
carp0: state transition: MASTER -> BACKUP
carp0: state transition: BACKUP -> MASTER
cpsw0: device timeout
...

This bug seems unrelated to the alignment fault issue. I would be
glad to investigate, given some pointers in the right direction.

If this is under the purview of cpsw(4), would it be advisable to
submit a new bug report?