AHCI - NCQ Errors

2022-02-26 Thread Simon Baker
Hi All,

Just had to recover one of my boxes.  

Looking for a steer or any advice on what might be going on.  I returned to 
a console that had dropped into ddb; the host itself is running -current and 
was last updated on the 23rd Feb.  The logs I've managed to pull from it 
that seem pertinent are:

Feb 24 04:05:00 fw0 vnstatd[58770]: Error: Commit transaction to database failed (10): disk I/O error
Feb 24 04:08:06 fw0 /bsd: ahci0: NCQ errored slot 0 is idle (70003000 active)
Feb 24 04:09:10 fw0 /bsd: ahci0: attempting to idle device
Feb 24 04:09:10 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:09:10 fw0 /bsd: ahci0: failed to soft reset device
Feb 24 04:09:10 fw0 /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.
Feb 24 04:09:10 fw0 /bsd: ahci0: log page read failed, slot 31 was still active.
Feb 24 04:09:10 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:09:10 fw0 /bsd: ahci0: attempting to idle device
Feb 24 04:09:10 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:09:10 fw0 /bsd: ahci0: failed to soft reset device
Feb 24 04:09:10 fw0 /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.
Feb 24 04:09:10 fw0 pflogd[90658]: Logging suspended: fwrite: Input/output error
Feb 24 04:10:00 fw0 vnstatd[58770]: Error: Exec step failed (11: database disk image is malformed): "update hour set rx=rx+0, tx=tx+1500 where interface=4 and date=strftime('%Y-%m-%d %H:00:00', datetime(1645675500, 'unixepoch'), 'localtime')"
Feb 24 04:10:00 fw0 vnstatd[58770]: Error: Fatal database error detected, exiting.
Feb 24 04:10:10 fw0 /bsd: ahci0: NCQ errored slot 29 is idle (2000 active)
Feb 24 04:10:10 fw0 /bsd: ahci0: attempting to idle device
Feb 24 04:10:10 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:10:10 fw0 /bsd: ahci0: failed to soft reset device
Feb 24 04:10:10 fw0 /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.
Feb 24 04:11:12 fw0 /bsd: ahci0: attempting to idle device
Feb 24 04:11:12 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:11:12 fw0 /bsd: ahci0: failed to soft reset device
Feb 24 04:11:12 fw0 /bsd: ahci0: NCQ errored slot 8 is idle (0200 active)


The box should have been quiet at this time; no heavy load was expected.

Running fsck on the filesystem didn't end well for me - it triggered a slew 
of NCQ error messages and resulted in lost data.  The partitions that I 
didn't run fsck against kept all their data.  I've since wiped and restored 
all the filesystem partitions.

I've also replaced the SATA cable, but I'm wondering if anyone can shed some 
light on what might have happened - the disk (an SSD) is only 30 days old 
and seems to be OK after restoring a backup onto it.

Disk Info:
Model Family: Phison Driven SSDs
Device Model: KINGSTON SA400S37240G

Thanks,

Simon.




Re: login.conf daemon datasize limit effects on VMs with 4GB+ RAM

2022-02-26 Thread Dave Voutila


"Ted Unangst"  writes:

> On 2022-02-25, Robert Nagy wrote:
>> Maybe we need a default vmd class? What do you guys think?
>
> Regardless of what the limit is, this seems like a daemon where people
> will bump into the limit. Perhaps a reminder is in order too?
>

The reminder is good, but we still need to fix the problem that the vmm
process can abort because the child dies so quickly. On my machine, the
call to read(2) returns a zero-byte read, tripping the existing fatal
path.
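
To make the failure mode concrete, here is a small standalone sketch of it
(my own illustration, not part of the diff below): when the child end of the
socketpair goes away before writing anything, the parent's read(2) returns 0
rather than -1, so a check that only compares the return value against
sizeof() ends up in fatal() with a meaningless errno.

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int fds[2];
	uint32_t id;
	pid_t pid;
	ssize_t sz;

	if (socketpair(AF_UNIX, SOCK_STREAM, PF_UNSPEC, fds) == -1)
		err(1, "socketpair");

	switch (pid = fork()) {
	case -1:
		err(1, "fork");
	case 0:
		/* Child exits before writing its id, like vm.c does when
		 * guest memory allocation fails. */
		_exit(1);
	default:
		close(fds[1]);
		/* Peer closed: this is EOF, so sz is 0 and errno is unset. */
		sz = read(fds[0], &id, sizeof(id));
		printf("read returned %zd\n", sz);
		close(fds[0]);
		waitpid(pid, NULL, 0);
	}
	return 0;
}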


diff ff838b72f50de699ee43d3dac58ff7e8435669ee /usr/src
blob - 4c6c99f1133cec7cb1e38dfd22e595e4d2023842
file + usr.sbin/vmd/vm.c
--- usr.sbin/vmd/vm.c
+++ usr.sbin/vmd/vm.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <sys/resource.h>

 #include 
 #include 
@@ -292,8 +293,12 @@ start_vm(struct vmd_vm *vm, int fd)
ret = alloc_guest_mem(vcp);

if (ret) {
+   struct rlimit lim;
+   const char *msg = "could not allocate guest memory - exiting";
+   if (getrlimit(RLIMIT_DATA, &lim) == 0)
+   msg = "could not allocate guest memory (data limit is %llu) - exiting";
errno = ret;
-   fatal("could not allocate guest memory - exiting");
+   fatal(msg, lim.rlim_cur);
}

ret = vmm_create_vm(vcp);
blob - eb75b4c587884ec43704420ef4172386a5b39bd9
file + usr.sbin/vmd/vmm.c
--- usr.sbin/vmd/vmm.c
+++ usr.sbin/vmd/vmm.c
@@ -616,6 +616,7 @@ vmm_start_vm(struct imsg *imsg, uint32_t *id, pid_t *p
int  ret = EINVAL;
int  fds[2];
size_t   i, j;
+   ssize_t  sz;

if ((vm = vm_getbyvmid(imsg->hdr.peerid)) == NULL) {
log_warnx("%s: can't find vm", __func__);
@@ -674,9 +675,13 @@ vmm_start_vm(struct imsg *imsg, uint32_t *id, pid_t *p
}

/* Read back the kernel-generated vm id from the child */
-   if (read(fds[0], &vcp->vcp_id, sizeof(vcp->vcp_id)) !=
-   sizeof(vcp->vcp_id))
+   sz = read(fds[0], &vcp->vcp_id, sizeof(vcp->vcp_id));
+   if (sz < 0)
fatal("read vcp id");
+   else if (sz != sizeof(vcp->vcp_id)) {
+   log_warn("failed to read vcp id");
+   goto err;
+   }

if (vcp->vcp_id == 0)
goto err;
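
For reference, the default vmd class Robert mentions might look roughly like
this in /etc/login.conf (purely illustrative values on my part, nothing along
these lines has been committed, and it assumes vmd ends up being started
under that class):

vmd:\
	:datasize=infinity:\
	:tc=daemon: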