Yesterday, I was attempting to install 5.8 on a VAX architecture (a DEC
VXT2000 with 18MB RAM and SPX graphics to be specific) by net booting using
the VAX boot.mop file. I was having issues during the process, and I
eventually used a network sniffer attempting to troubleshoot issues. I
noticed that loading over NFS seemed to stop shortly after my VXT received
an ARP request from my laptop, and also that the ARP reply seemed to be of
the wrong length. This issue was also a highly intermittent issue, as well
as variable in the ways it seemed to fail. Sometimes a full reboot would be
required before an NFS download worked again, other times I could retry
without rebooting, and rarely I would be able to boot without issues.
I know VAX is no longer supported, but I discovered issues in
arch-independent sections of the source, so please keep reading.
I did find, I think, the source of the issue with ARP reply packet length.
In libsa, the function arp_reply is what handles arp replies. Some comments
in this section read:
/*
* Convert an ARP request into a reply and send it.
* Notes: Re-uses buffer. Pad to length = 46.
*/
The length I see in the network sniffer is actually 64 bytes, which is 18
bytes longer than the 46 mentioned here. A normal ARP reply or request
(including ethernet frame) is actually 46 bytes, so this appears to be
correct in that the length should ultimately be 46. This function accepts a
pointer to an ethernet packet of type void, and casts it to a structure
devised to hold an ARP request or reply, called ether_arp:
struct ether_arp *arp = pkt;
This structure should be the size of the ethernet frame's 'payload' for an
ARP packet, which comes out to be 28 bytes. This is 18 less than 46. The
additional 18 is for the layer 2 ethernet overhead.
Near the end of the function, we find a call to sendether which 'pads' with
18 bytes:
(void) sendether(d, pkt, sizeof(*arp) + 18,
arp->arp_tha, ETHERTYPE_ARP);
However, this is where we go wrong. In ether.c, we find the function
sendether. Sendether has the following notes:
/* Caller must leave room for ethernet header in front!! */
And later:
eh = (struct ether_header *)pkt - 1;
len += sizeof(*eh);
which is actually what does the padding here, meaning the '+ 18' in the
call to sendether is redundant and wrong.
Looking at this reveals another potential issue. arp_reply passes *pkt to
sendether, and sendether will overwrite some number of bytes before the
buffer, which could be out the bounds of that buffer. Of course, it is also
possible that the buffer that *pkt points inside actually has enough space
within it before *pkt that this is a nonissue, but I need to investigate
this further (starting with finding the caller of arp_reply) to determine
whether it's an issue or not. Although weak evidence, data corruption
caused by ARP reply would seem to explain the issues I had with netbooting
my VXT2000.
To conclude, I believe that line 291 in /sys/lib/libsa/arp.c should be
changed from:
(void) sendether(d, pkt, sizeof(*arp) + 18,
to
(void) sendether(d, pkt, sizeof(*arp),
in order to solve the packet length issue. Aside from this, I will continue
looking into potential issues with exceeding buffer bounds when sendether
is called from arp_reply.
Best Regards,
Joe Zatarski