Kernel locks when configuring motherboard ethernet

2024-01-27 Thread John Klos
...



Since using the LOCKDEBUG kernel, this system can't use the network at all 
without locking up, even after a hardware reset. It's colocated, so while 
I can have someone physically power cycle the machine, I figured I'd leave 
it in case more information can be gained from it as it is.


The serial console can be reached from another system with cu, and that 
other system can also trigger a hardware reset. The locked system obviously 
can't reach the Internet, but it has the netbsd-10 sources and can compile a 
kernel for itself.


The previous kernel, which had been running for a couple of weeks, locked 
up twice. I don't know whether that's directly related to this, because 
those lockups had nothing to do with configuring network ports. 
Interestingly, I've seen the same lockups on the machine that this one 
replaced (8 gig Raspberry Pi 4, netbsd-10). These machines are public facing 
and route parts of a class C over tinc tunnels.


Here's one lockup:

[ 495715.4076245] fatal breakpoint trap in supervisor mode
[ 495715.4076245] trap type 1 code 0 rip 0x80235385 cs 0x8 rflags 0x202 
cr2 0x76f4a20740
00 ilevel 0x8 rsp 0xa80839aac8c8
[ 495715.4076245] curlwp 0xa0ed91107480 pid 0.3 lowest kstack 
0xa80839aa82c0
Stopped in pid 0.3 (system) at  netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
comintr() at netbsd:comintr+0x7e0
intr_kdtrace_wrapper() at netbsd:intr_kdtrace_wrapper+0x26
Xhandle_ioapic_edge1() at netbsd:Xhandle_ioapic_edge1+0x75
--- interrupt ---
npf_tcpsaw() at netbsd:npf_tcpsaw+0x1d
npf_conn_inspect() at netbsd:npf_conn_inspect+0x86
npfk_packet_handler() at netbsd:npfk_packet_handler+0x18e
pfil_run_hooks() at netbsd:pfil_run_hooks+0x128
ip_output() at netbsd:ip_output+0x4c0
ip_forward() at netbsd:ip_forward+0x138
ipintr() at netbsd:ipintr+0xa80
softint_dispatch() at netbsd:softint_dispatch+0x95
DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xa80839aad0f0
Xsoftintr() at netbsd:Xsoftintr+0x4c
--- interrupt ---
b31c059c10208e97:
ds  c9a0
es  ddb3
fs  1
gs  e8d9
rdi 81845120 x86_io
rsi 800
rbp a80839aac8c8
rbx a8003df8c01c
rdx 7f
rcx 22
rax 1
r8  a80839aaca94
r9  0
r10 5ed7b6ca02a0
r11 a8003df91008
r12 a0e6944a1790
r13 800
r14 cc
r15 a0e6944a16c0
rip 80235385 breakpoint+0x5
cs  8
rflags  202
rsp a80839aac8c8
ss  0
netbsd:breakpoint+0x5:  leave


Does anyone have any suggestions about what to try next? Does anyone want 
to have a look around themselves?


Thanks,
John Klos


Re: mfii0 disappears on warm reboot

2023-09-07 Thread John Klos

Hi,

Here are the pcidumps from both the working mfii0 and mfii0 after warm reboot:

https://www.klos.com/~john/pcidump_raid.txt
https://www.klos.com/~john/pcidump_noraid.txt

This is what's different:

3c3
< 0x00: 0x005b1000 0x0010 0x01040003 0x
---
> 0x00: 0x005b1000 0x0017 0x01040003 0x0010
7,10c7,10
< Command register: 0x
<   I/O space accesses: off
<   Memory space accesses: off
<   Bus mastering: off
---
> Command register: 0x0007
>   I/O space accesses: on
>   Memory space accesses: on
>   Bus mastering: on
40c40
< Cache Line Size: 0bytes (0x00)
---
> Cache Line Size: 64bytes (0x10)
43c43
< 0x10: 0x0001 0x0004 0x 0x0004
---
> 0x10: 0xe001 0xfea60004 0x 0xfea4
45c45
< 0x30: 0x 0x0050 0x 0x0100
---
> 0x30: 0xfea4 0x0050 0x 0x0105
49c49
<   base: 0x, disabled
---
>   base: 0xe000
52c52
<   base: 0x, disabled
---
>   base: 0xfea6
55c55
<   base: 0x, disabled
---
>   base: 0xfea0
61,62c61,62
< Expansion ROM Base Address Register: 0x
<   base: 0x
---
> Expansion ROM Base Address Register: 0xfea4
>   base: 0xfea4
71c71
< Interrupt line: 0x00
---
> Interrupt line: 0x05
141c141
< Device Control Register: 0x2810
---
> Device Control Register: 0x2840
146,147c146,147
<   Enable Relaxed Ordering: on
<   Max Payload Size: 128 byte
---
>   Enable Relaxed Ordering: off
>   Max Payload Size: 512 byte
173c173
< Link Control Register: 0x0080
---
> Link Control Register: 0x00c0
178c178
<   Common Clock Configuration: off
---
>   Common Clock Configuration: on
257c257
<   MSI-X Enable: off
---
>   MSI-X Enable: on
269c269
< 0x70: 0x00092810 0x00407482 0x10420080 0x
---
> 0x70: 0x00092840 0x00407482 0x104200c0 0x
274c274
< 0xc0: 0x000f0011 0x2001 0x3001 0x
---
> 0xc0: 0x800f0011 0x2001 0x3001 0x
383c383
< 0x120: 0x000f 0x0101004c 0x73cd647b
---
> 0x120: 0x000f 0x0101004c 0x2d3319c8
471c471
< 0x120: 0x000f 0x0101004c 0x73cd647b 0x
---
> 0x120: 0x000f 0x0101004c 0x2d3319c8 0x
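
(Editorial aside: the register values in the diff can be decoded by hand to 
check the textual lines next to them. The following is a minimal Python 
sketch, assuming the standard PCI/PCIe bit layouts: command register bits 
0-2, Device Control bit 4 and bits 7:5, Link Control bit 6, and the MSI-X 
enable bit at the top of the capability's first dword. The function names are 
made up for illustration and are not part of pcictl or any NetBSD tool.)

```python
# Decode a few of the PCI/PCIe register values that differ between the dumps.
# Bit positions follow the standard PCI and PCI Express register layouts.

def decode_command(reg):
    """PCI command register: bit 0 I/O space, bit 1 memory space, bit 2 bus master."""
    return {
        "io_space": bool(reg & 0x0001),
        "memory_space": bool(reg & 0x0002),
        "bus_mastering": bool(reg & 0x0004),
    }

def decode_device_control(reg):
    """PCIe Device Control: bit 4 relaxed ordering, bits 7:5 max payload size."""
    return {
        "relaxed_ordering": bool(reg & 0x0010),
        "max_payload_bytes": 128 << ((reg >> 5) & 0x7),
    }

def decode_link_control(reg):
    """PCIe Link Control: bit 6 is Common Clock Configuration."""
    return {"common_clock": bool(reg & 0x0040)}

def msix_enabled(cap_dword):
    """First dword of an MSI-X capability: Message Control is the upper 16 bits,
    and its bit 15 (bit 31 of the dword) is MSI-X Enable."""
    return bool(cap_dword & 0x80000000)

print("command 0x0007:", decode_command(0x0007))
for devctl in (0x2810, 0x2840):
    print("device control", hex(devctl), decode_device_control(devctl))
for linkctl in (0x0080, 0x00C0):
    print("link control", hex(linkctl), decode_link_control(linkctl))
for cap in (0x000F0011, 0x800F0011):
    print("MSI-X dword", hex(cap), "enabled:", msix_enabled(cap))
```

The decoded fields agree with the on/off and payload-size lines in the diff, 
so the textual and raw parts of the two dumps are at least self-consistent.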



When the RAID card is working, detaching with "drvctl -d mfii0" and trying 
to reattach caused a panic:


[   256.157518] dk5 at sd0 (doozerroot) deleted
[   256.157518] dk4 at sd0 (doozerswap) deleted
[   256.157518] sd0: detached
[   256.157518] scsibus0: detached
[   256.287530] mfii0: detached
[   269.317565] mfii0 at pci1 dev 0 function 0panic: kernel diagnostic assertion 
"msipic_find_msi_pic_locked(msipic->mp_devid) == NULL" failed: file 
"/usr/src/sys/arch/x86/pci/msipic.c", line 262
[   269.327575] cpu0: Begin traceback...
[   269.327575] vpanic() at netbsd:vpanic+0x183
[   269.327575] kern_assert() at netbsd:kern_assert+0x4b
[   269.327575] msipic_construct_common_msi_pic() at 
netbsd:msipic_construct_common_msi_pic+0x325
[   269.327575] msipic_construct_msix_pic() at 
netbsd:msipic_construct_msix_pic+0x6e
[   269.337510] pci_msix_alloc_common.part.0() at 
netbsd:pci_msix_alloc_common.part.0+0x26
[   269.337510] pci_msix_alloc_exact() at netbsd:pci_msix_alloc_exact+0x5c
[   269.337510] pci_intr_alloc() at netbsd:pci_intr_alloc+0x57
[   269.337510] mfii_attach() at netbsd:mfii_attach+0x2c0
[   269.337510] config_attach_internal() at netbsd:config_attach_internal+0x19f
[   269.347509] config_found() at netbsd:config_found+0xc3
[   269.347509] pci_probe_device() at netbsd:pci_probe_device+0x661
[   269.347509] pci_enumerate_bus() at netbsd:pci_enumerate_bus+0x1a4
[   269.347509] pcirescan() at netbsd:pcirescan+0x4e
[   269.347509] rescanbus() at netbsd:rescanbus+0x16d
[   269.347509] drvctl_ioctl() at netbsd:drvctl_ioctl+0x534
[   269.357510] sys_ioctl() at netbsd:sys_ioctl+0x56d
[   269.357510] syscall() at netbsd:syscall+0x196
[   269.357510] --- syscall (number 54) ---
[   269.357510] netbsd:syscall+0x196:
[   269.357510] cpu0: End traceback...

[   269.357510] dumping to dev 168,1 (offset=8, size=6146568):
[   269.357510] dump [   1.000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 
2001, 2002, 2003,



As per some excellent recommendations, I'd like to try to reset the card, 
but I'm not sure how to use pcictl to do this. Does anyone have an idea?


John


Re: mfii0 disappears on warm reboot

2023-09-06 Thread John Klos

Hi,

I don't know exactly how I sent an email with absolutely no "From:" (I 
have an idea, but I probably couldn't do it again if I tried), but that 
was me, if anyone wants to reply.


Thanks,
John


Lock of NetBSD-current with ifconfig down / up

2022-09-17 Thread John Klos

Hi,

Here's a nice issue :)

Plug a ure* USB ethernet adapter into an amd64 machine running 
NetBSD-current (9.99.99, 22-August-2022):


[ 1791670.446266] ure0 at uhub8 port 4
[ 1791670.446266] ure0: Realtek (0x0bda) USB 10/100/1000 LAN (0x8153), rev 
2.10/30.00, addr 6
[ 1791670.446266] ure0: RTL8153 ver 5c30
[ 1791670.566267] rgephy0 at ure0 phy 0: RTL8251 1000BASE-T media interface, 
rev. 0
[ 1791670.586267] rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 
1000baseT-FDX, auto
[ 1791670.586267] ure0: Ethernet address a0:ce:c8:e7:88:5f
[ 1791673.256299] ugen1 at uhub8 port 5
[ 1791673.256299] ugen1: VIA Labs, Inc. (0x2109) PD3.0 USB-C Device (0x), 
rev 2.01/0.01, addr 7

ifconfig ure0 up

No problem:

ure0: flags=0x8943 mtu 1500
capabilities=0x3ff00
capabilities=0x3ff00
capabilities=0x3ff00
enabled=0
ec_capabilities=0x1
ec_enabled=0
address: a0:ce:c8:e7:88:5f
media: Ethernet autoselect (100baseTX full-duplex)
status: active
inet6 fe80::a2ce:c8ff:fee7:885f%ure0/64 flags 0 scopeid 0x9

ifconfig ure0 down

Locks the machine. I couldn't get more information because it's 3000 miles 
away. There's nothing in dmesg because the machine was power cycled.


Initially I imagined it might be due to the ure* driver, but then it 
happened locally.


On an amd64 system running 9.99.98 from 16-July-2022, I ran "ifconfig re1 
down" and the machine locked - no ICMP, nothing for SIGINFO, no response 
to keyboard cnmagic.


It doesn't appear to be hardware-specific, but here's the dmesg for re1 for 
completeness:

[ 1.044097] re1 at pci2 dev 0 function 0: RealTek 8168/8111 PCIe Gigabit 
Ethernet (rev. 0x0c)
[ 1.044097] re1: interrupting at msix2 vec 0
[ 1.044097] re1: RTL8168G/8111G (0x4c00)
[ 1.044097] re1: Ethernet address 4c:cc:6a:01:a5:e0
[ 1.044097] re1: using 256 tx descriptors
[ 1.044097] rgephy1 at re1 phy 7: RTL8251 1000BASE-T media interface, rev. 0

I've ordered some PS/2 keyboards, because I take it that's the only way to 
reliably get into the kernel debugger on amd64, unless someone knows a 
trick to make USB keyboards usable.


send-pr?

Thanks,
John


Scanning floppy devices with assumed density

2022-07-02 Thread John Klos

Hi, all,

It has been ages since I've used floppies in NetBSD very much, but 
checking the man page, different slice letters are used to indicate disk 
densities. From fdc(4):


 The driver supports the following floppy diskette formats by using
 particular partitions:
   1.44MB 3.5-inch (b)
   1.2MB 5.25-inch (c)
   360KB 5.25-inch (1.2MB drive) (d)
   360KB 5.25-inch (IBM-PC drive) (e)
   720KB 3.5-inch (f)
   720KB 5.25-inch (g)
   360KB 3.5-inch (h)

A user on Reddit pointed out this error on booting a NetBSD 9 kernel on an 
i80486 system:


boot device: fd0 [ 5.121888] fd0d: hard error reading fsbn 0 of 0-2 (st0 
0x40 st1 0x1 st2 0x0 cyl 0 head 0 sec 1)


https://www.reddit.com/r/NetBSD/comments/vh4wgc/a_little_bit_of_fun_booting_the_netbsd_162/idyrf95/

She wondered why fd0d is being used here. I can't imagine this is due to 
scanning for a disklabel, since they've been around forever, so is this 
perhaps due to dkwedge_discover?
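
(Editorial aside: for reference, the fdc(4) excerpt quoted above maps 
partition letters to formats as below. This is a small Python sketch; the 
helper name describe_fd is hypothetical, and letters that the excerpt does 
not list are simply reported as unknown. It only shows which format fd0d 
selects, not why the kernel tried it.)

```python
import re

# Partition letter -> diskette format, copied from the fdc(4) excerpt above.
FDC_FORMATS = {
    "b": "1.44MB 3.5-inch",
    "c": "1.2MB 5.25-inch",
    "d": "360KB 5.25-inch (1.2MB drive)",
    "e": "360KB 5.25-inch (IBM-PC drive)",
    "f": "720KB 3.5-inch",
    "g": "720KB 5.25-inch",
    "h": "360KB 3.5-inch",
}

def describe_fd(name):
    """Split a name such as 'fd0d' into (unit, format description).

    The description is None for letters not present in the excerpt.
    """
    match = re.fullmatch(r"fd(\d+)([a-z])", name)
    if match is None:
        raise ValueError("not a floppy partition name: %r" % name)
    unit = int(match.group(1))
    return unit, FDC_FORMATS.get(match.group(2))

print(describe_fd("fd0d"))
```

So the read that failed in the boot message was attempted with the 360KB 
5.25-inch (1.2MB drive) format on unit 0.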


John


Re: mfii hanging on boot

2022-06-23 Thread John Klos
> I saw almost the same situation. To recover from the error, I had to 
> power-down the machine, unplug the battery, keep a few minutes, plug the 
> battery and power-up again.
>
> I committed the change yesterday. I guess that it fixes kern/55192 and 
> kern/56669.


My machine is only 2500 km away, but it'd still be hard to test. I have a 
spare card I will dig out to test, though.


Thanks,
John


Re: Complete lock-up from using pkgsrc/net/darkstat

2022-05-30 Thread John Klos

Hi,


> It also might be relevant to note which port you're running.  It must
> be capable of having re and wm interfaces, since you name them, but
> that still includes a fair bit.

> I still see no statement which port(s) you're running...


Here my brain equated port with ethernet, not with NetBSD port :P These 
are all NetBSD/amd64 systems.


> Maybe the problem is related to the number of DNS lookups the child 
> process is doing rather than the number of TB the parent process is 
> counting?


I know DNS can be a big issue when there're a lot of attempts to look up 
reverse DNS, which is often broken, so I run darkstat with DNS lookup 
disabled, like so:


darkstat_flags="-i re0 -b 127.0.0.1 --no-dns"

Where I am I don't have all the things I normally have, so I'm still 
waiting for a null modem adapter so I can get a serial console on the 
machine that's physically local. Once that's here, I'm going to try very 
hard to get a lockup.



> I'm guessing it's a memory issue with darkstat -- specifically, it has a 
> memory leak that runs the system it runs on out of RAM.  I bet if you add 
> a ton of swap to a system on which you run darkstat, you'll find it runs 
> longer before it hangs, and, I'm guessing you'll notice there is a lot of 
> swap in use before it hangs


darkstat is run as user "nobody" and shows 23 megabytes after a week. It 
has no special unlimiting of any resources. The systems where issues were 
seen range from 4 to 16 gigabytes of memory, and I'd have noticed if any 
of them were in to swap at all (none were).


> Is it possible the machine is not, strictly, hung, just doing something 
> that renders it unresponsive for a human-perceptible time?  You wrote of 
> having to get remote hands to poke an unresponsive machine; how long did 
> that take?  Did your remote hands notice whether the disk light was lit 
> (if there is such a light)?


When a system was in this specific state, I had someone plug in a USB 
keyboard and tell me if any new green text appeared on the screen. It did 
not (he sent a photo of the screen, too). I then asked him to press (and 
not hold) the power button. He did, and he said nothing happened - nothing 
on the screen, no disk lighting, et cetera (I told him to look for that). 
The systems normally power themselves down relatively nicely from a simple 
press.


I was communicating with him in the morning, hours after attempting to 
stop darkstat, so it had plenty of time to recover. I had also logged in 
to the backup machine and saw that I couldn't reach the internal interface 
of the frozen machine.



> I've had machines appear to lock up hard when what's actually going on
> is that a large process is dumping core.


If it didn't finish in hours, then that would be a problem :)

I'll post more when I've got a serial console set up.

Thanks,
John


Re: Complete lock-up from using pkgsrc/net/darkstat

2022-05-26 Thread John Klos

>> On NetBSD 8, 9, current, [...] Stop darkstat.  Machine locks.

>> [...] in case anyone can imagine how and why a complete system lockup
>> could happen as the result of an interface being used in promiscuous
>> mode for long periods of time (and not when used that way for short
>> periods of time.

> Don't forget, it may _not_ be "as the result of an interface being used
> in promiscuous mode for long periods of time".  That's merely a
> correlate (and possibly not a perfect correlate - your sample size is
> small); the causality may be more complicated.  (For example, maybe
> it's actually as a result of receiving certain traffic which is on that
> segment but which it wouldn't normally receive.  Maybe it's got nothing
> to do with network interfaces and instead is related to something else
> darkstat does - I know nothing about what darkstat does or doesn't do,
> except for your implication that it runs interfaces promiscuous.)


You're absolutely right - I don't know this for sure, but I can add some 
additional information.


I've seen occasional lockups (once or twice per year) on a number of 
systems - at least five different systems - which are all running as NAT 
routers and firewalls for various heavily used networks. Two systems were 
running NetBSD 8 with ipfilter, one with wm* as the public interface, the 
other with re*. Two systems were running NetBSD 9 with npf, one with wm*, 
one with re*. The fifth was running 9.99.93 with re0 as the public 
interface and npf.


It was on this last one that I ran "/etc/rc.d/darkstat stop" and saw that 
it completely locked up, and I had to have someone physically go and power 
cycle it. I know that when interfaces switch from promiscuous to 
non-promiscuous, they can lose link for a moment, but the machine wasn't 
reachable from the internal network, either, and didn't respond when a USB 
keyboard was connected to it (no green lines from the kernel). Also, 
pressing the power button didn't trigger a poweroff event, so I know it 
was completely locked.


Random lockups are one thing, and a specific lockup when stopping darkstat 
is another, but to add to this, one location has two identical machines, 
one which occasionally locked under exceptionally high network load, and 
the other that never did. To ascertain whether it was a hardware fault, 
the drives were swapped between them. The problem continued and moved with 
the drive, so then the OSes were reinstalled, and one still kept 
occasionally locking up. Only after seeing the lockup when stopping 
darkstat did I realize that the one that continuously had occasional 
lockups was running darkstat on boot.


These lockups have bugged the heck out of me for many years - at least 
five - and I'm kicking myself that I only realize now that all the 
machines that were 100% stable for multiple years weren't running 
darkstat, and the ones that were problematic were running darkstat. I 
should've realized this ages ago.



> It also might be relevant to note which port you're running.  It must
> be capable of having re and wm interfaces, since you name them, but
> that still includes a fair bit.


With five different machines, I don't think it's likely an issue with all 
five interfaces, but it's always better to have too much information than 
too little:


NetBSD 8.2 (1-May-2020):
wm0 at pci2 dev 0 function 0: Intel i82574L (rev. 0x00)
wm0: for TX and RX interrupting at msix2 vec 0 affinity to 1
wm0: for TX and RX interrupting at msix2 vec 1 affinity to 2
wm0: for LINK interrupting at msix2 vec 2
wm0: PCI-Express bus
wm0: 2048 words FLASH, version 1.8.0, Image Unique ID 
wm0: ASPM L0s and L1 are disabled to workaround the errata.
wm0: Ethernet address 00:1b:21:b5:51:e7
wm0: 0x224480
makphy0 at wm0 phy 1: Marvell 88E1149 Gigabit PHY, rev. 1

NetBSD 8.2 (1-May-2020):
re0 at pci2 dev 0 function 0: RealTek 8168/8111 PCIe Gigabit Ethernet (rev. 
0x0c)
re0: interrupting at msi2 vec 0
re0: Ethernet address 4c:cc:6a:0b:ee:1a
re0: using 256 tx descriptors
rgephy0 at re0 phy 7: RTL8251 1000BASE-T media interface, rev. 0

NetBSD 9.0 (12-June-2020):
[ 1.004517] wm0 at pci3 dev 0 function 0: Intel PRO/1000 PT Quad Port 
Server Adapter (rev. 0x06)
[ 1.004517] wm0: interrupting at ioapic1 pin 3
[ 1.004517] wm0: PCI-Express bus
[ 1.004517] wm0: 4096 words (16 address bits) SPI EEPROM, version 5.10.2, 
Image Unique ID 
[ 1.004517] wm0: ASPM L1 is disabled to workaround the errata.
[ 1.004517] wm0: Ethernet address 00:15:17:73:0d:15
[ 1.004517] wm0: 0x24440
[ 1.004517] igphy0 at wm0 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
...

NetBSD 9.1 (24-April-2021):
[ 1.008819] re0 at pci6 dev 0 function 0: RealTek 8168/8111 PCIe Gigabit 
Ethernet (rev. 0x0c)
[ 1.008819] re0: interrupting at msix2 vec 0
[ 1.008819] re0: Ethernet address e0:d5:5e:48:2c:58
[ 1.008819] re0: using 256 tx descriptors
[ 1.008819] rgephy0 at re0 phy 7: RTL8251 1000BASE-T media interface, rev. 0

NetBSD 9.99.93 (26-F

Complete lock-up from using pkgsrc/net/darkstat

2022-05-26 Thread John Klos

So here's an interesting problem:

On NetBSD 8, 9, current, with both ipfilter and with npf, with different 
kinds of ethernet interfaces (re*, wm*), run pkgsrc/net/darkstat. Pass a 
lot of traffic (like a week's worth of Internet traffic). Stop darkstat. 
Machine locks.


I've only recently been able to ascertain that it's directly related to 
darkstat, but because it hasn't happened locally, I don't have any more 
information than that, so not enough for a PR.


I'm going to try to reproduce the lockup on a physically local machine 
with a USB keyboard already attached in hopes that I can actually get 
into the kernel debugger. I figured it's worth mentioning here in case 
anyone can imagine how and why a complete system lockup could happen as 
the result of an interface being used in promiscuous mode for long periods 
of time (and not when used that way for short periods of time).


Thanks,
John


Issues with older wd* / IDE on various platforms

2020-11-13 Thread John Klos

Hi, all,

I've noticed problems in three places and only recently did it occur to me 
that they may all be related.


1) the last time I tried to install NetBSD on an hpcarm Jornada, I got 
stuck because once the kernel booted from Windows CE, it couldn't access 
the CompactFlash card.


2) recently I tried running NetBSD-9 and -current on an m68040 Mac Quadra 
630. After booting, the kernel would show some errors but would never be 
able to talk to the IDE disk. The exact hardware and disk work fine with 
NetBSD-7 kernels, though.


3) my colocated Amiga 1200 lost the ability to write to its CompactFlash 
card, which is attached to IDE and is used for booting the kernel. I 
thought the card had gone bad, so I mailed a new one to the datacenter, but 
the issue continued. This happened, I think, around the beginning of the 
year when I upgraded to NetBSD-9. Now I have no way to upgrade the kernel 
remotely :(


All three of these machines have much older IDE, so I'm wondering what in 
NetBSD changed that may have caused this.


Soon I'll be back and will have access to the Quadra and Jornada, and I'm 
open to suggestions about which commits to examine and to test.


Thanks,
John Klos


Re: /dev/crypto missing

2020-07-27 Thread John Klos

>> I erroneously thought that if pseudo-device crypto wasn't in the kernel,
>> crypto would be done in userland. That's not the case:

> What makes you think crypto isn't being done in userland?


Just a bad guess that the reason for pseudo-device crypto was to do some 
things in the kernel.



> The problem looks to me like the server returns garbage on a TLS
> connection, which gets mixed up with an OpenSSL debugging message --
> or possibly it is garbage _because_ it got mixed up with the OpenSSL
> debugging message.  Maybe OpenSSL should handle ENXIO quietly like it
> handles ENOENT there, but it looks like there's a deeper problem if
> crap that OpenSSL printed got included in the TLS stream!

>> If this is the case, then why isn't crypto in every kernel configuration
>> by default, except perhaps special cases?

> /dev/crypto is totally obsolete as it exists today.  Really the only
> reason it continues to exist is to test opencrypto drivers from
> userland before using them in the kernel.


Hmmm... Then I wonder what's really going on. This is from trying to use 
bozohttpd with TLS on an Amiga with exactly the same configuration as used 
on ARM and amd64. I'll have to look in to this a bit more and perhaps open 
a PR.


Thanks,
John


/dev/crypto missing

2020-07-27 Thread John Klos

Hi,

I erroneously thought that if pseudo-device crypto wasn't in the kernel, 
crypto would be done in userland. That's not the case:


openssl s_client -debug -connect 192.80.49.7:443
Could not open /dev/crypto: Device not configured
CONNECTED(0003)
write to 0xe4f02d0 [0xe546000] (293 bytes => 293 (0x125))
0000 - 16 03 01 01 20 01 00 01-1c 03 03 40 b2 73 a3 d5   .... ......@.s..
0010 - 13 f4 91 bb ad cf 6b 49-f1 33 6f 86 ae 5b 1e 1e   ......kI.3o..[..
0020 - f5 cb db 10 5e 27 a5 07-10 97 8d 20 f6 9b 7c 26   ....^'..... ..|&
0030 - f3 52 e6 e5 19 1e 57 24-c2 ff c7 07 6d 34 23 74   .R....W$....m4#t
0040 - 6c 36 da 86 f8 39 f9 a8-7e 24 1b 6c 00 3e 13 02   l6...9..~$.l.>..
0050 - 13 03 13 01 c0 2c c0 30-00 9f cc a9 cc a8 cc aa   .....,.0........
0060 - c0 2b c0 2f 00 9e c0 24-c0 28 00 6b c0 23 c0 27   .+./...$.(.k.#.'
0070 - 00 67 c0 0a c0 14 00 39-c0 09 c0 13 00 33 00 9d   .g.....9.....3..
0080 - 00 9c 00 3d 00 3c 00 35-00 2f 00 ff 01 00 00 95   ...=.<.5./......
0090 - 00 0b 00 04 03 00 01 02-00 0a 00 0c 00 0a 00 1d   ................
00a0 - 00 17 00 1e 00 19 00 18-00 23 00 00 00 16 00 00   .........#......
00b0 - 00 17 00 00 00 0d 00 30-00 2e 04 03 05 03 06 03   .......0........
00c0 - 08 07 08 08 08 09 08 0a-08 0b 08 04 08 05 08 06   ................
00d0 - 04 01 05 01 06 01 03 03-02 03 03 01 02 01 03 02   ................
00e0 - 02 02 04 02 05 02 06 02-00 2b 00 09 08 03 04 03   .........+......
00f0 - 03 03 02 03 01 00 2d 00-02 01 01 00 33 00 26 00   ......-.....3.&.
0100 - 24 00 1d 00 20 74 f9 da-78 03 7e ab f9 52 6d da   $... t..x.~..Rm.
0110 - cf 19 9b 11 0d 3c 24 c2-00 44 f1 bf 4b e8 92 33   .....<$..D..K..3
0120 - dd 79 33 d7 1e                                    .y3..
read from 0xe4f02d0 [0xe4e7003] (5 bytes => 5 (0x5))
0000 - 43 6f 75 6c 64                                    Could
4294967295:error:1408F10B:SSL routines:ssl3_get_record:wrong version 
number:/usr/src/crypto/external/bsd/openssl/dist/ssl/record/ssl3_record.c:332:

---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 5 bytes and written 293 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
read from 0xe4f02d0 [0xe558000] (8192 bytes => 45 (0x2D))
0000 - 20 6e 6f 74 20 6f 70 65-6e 20 2f 64 65 76 2f 63    not open /dev/c
0010 - 72 79 70 74 6f 3a 20 44-65 76 69 63 65 20 6e 6f   rypto: Device no
0020 - 74 20 63 6f 6e 66 69 67-75 72 65 64 0a            t configured.
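
(Editorial aside: decoding the two "read from" buffers shows that what 
came back over the socket was not a TLS record but the text of the error 
message itself. A minimal Python sketch with the hex bytes copied from the 
dump above; the variable names are arbitrary.)

```python
# Bytes copied from the two "read from" dumps above (5 bytes, then 45 bytes).
first_read = bytes.fromhex("43 6f 75 6c 64")
second_read = bytes.fromhex(
    "20 6e 6f 74 20 6f 70 65 6e 20 2f 64 65 76 2f 63 "
    "72 79 70 74 6f 3a 20 44 65 76 69 63 65 20 6e 6f "
    "74 20 63 6f 6e 66 69 67 75 72 65 64 0a"
)

# Joined together, the "TLS" data is plain ASCII text.
message = (first_read + second_read).decode("ascii")
print(repr(message))

# A TLS record header is: content type (1 byte), version (2 bytes), length
# (2 bytes). Here those five bytes are the letters "Could", so the version
# field does not start with 0x03 as it would for SSL 3.0 / TLS 1.x.
content_type = first_read[0]
version_major, version_minor = first_read[1], first_read[2]
print("content type: 0x%02x, version: 0x%02x%02x"
      % (content_type, version_major, version_minor))
```

That is consistent with the "wrong version number" error: the client 
tried to parse "Could" as a record header.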



If this is the case, then why isn't crypto in every kernel configuration 
by default, except perhaps special cases?


John Klos


Re: Horrendous RAIDframe reconstruction performance

2020-06-28 Thread John Klos
> Any thoughts about what's going on here? Is this because the drives are 
> 512e drives? Three weeks is a LONG time to reconstruct.


So this turns out to be a failing drive. SMART doesn't show it's failing, 
but the one that's failing defaults to having the write cache off, and 
turning it on doesn't change the speed.


I guess it's still usable, in a limited way - I can only write at 5 or 6 
MB/sec, but I can read at 200 MB/sec. Maybe I'll use it in an m68k Mac.


Also, the autoconfigure works, but the forcing of root FS status didn't 
because I was testing it on a system that already had a RAIDframe with 
forced root. However, it still doesn't work on aarch64, but I'll recheck 
this after trying Jared's boot.cfg support.


Thanks, Greg, Michael and Edgar. I learned something :) I am still curious 
about whether I was seeing both good read and write speeds because writes 
weren't going to both drives. I suppose I assumed that all writes would go 
to both drives even while reconstructing, but I suppose that only happens 
when the writes are inside of the area which has already been 
reconstructed, yes?


John


Horrendous RAIDframe reconstruction performance

2020-06-28 Thread John Klos

Hello,

I'm setting up two helium, non-SMR, 512e 8 TB disks (HGST HUH728080ALE604) 
in a RAIDframe mirror:


[ 2.829768] wd2 at atabus2 drive 0
[ 2.829768] wd2: 
[ 2.829768] wd2: drive supports 16-sector PIO transfers, LBA48 addressing
[ 2.829768] wd2: 7452 GB, 15504021 cyl, 16 head, 63 sec, 512 bytes/sect x 
15628053168 sectors (0 bytes/physsect; first aligned sector: 8)
[ 2.859768] wd2: GPT GUID: 4086e8f6-0ddd-4689-a942-80bf1b598539
[ 2.859768] dk0 at wd2: "raid8tb0", 15611274240 blocks at 1024, type: 
raidframe
[ 2.859768] dk1 at wd2: "swap8tb0", 16777216 blocks at 15611275264, type: 
swap
[ 2.869768] wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 
(Ultra/133), WRITE DMA FUA, NCQ (32 tags) w/PRIO
[ 2.869768] wd2(ahcisata0:4:0): using PIO mode 4, DMA mode 2, Ultra-DMA 
mode 6 (Ultra/133) (using DMA), NCQ (31 tags) w/PRIO

(Strange that it says "0 bytes/physsect")


First, it seems that autoconfigure doesn't allow forcing the root 
filesystem. I'm guessing because this is using GPT:


   Autoconfig: Yes
   Root partition: Force

Next, raidctl doesn't handle NAME= for device yet:

raidctl -v -a NAME=raid8tb1 raid0
raidctl: ioctl (RAIDFRAME_ADD_HOT_SPARE) failed: No such file or directory

Finally, even though these are absolutely not SMR drives, rebuild is 
running at 3.5 to 4 MB/sec, whether attached via USB-3 or directly 
attached via SATA:


# raidctl -v -S raid0
Reconstruction is 0% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
Reconstruction status:
  0% |   | ETA: 485:13:54 -

Interestingly, a bonnie++ run shows 80+ megabytes per second in block 
writes and 160+ megabytes per second on block read, and 185 random seeks 
per second, while reconstructing.


Any thoughts about what's going on here? Is this because the drives are 
512e drives? Three weeks is a LONG time to reconstruct.
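
(Editorial aside: a rough check of the "three weeks" figure. This Python 
sketch assumes the whole 15611274240-block RAID component from the dmesg 
above has to be copied, that blocks are 512 bytes, and that MB means 10^6 
bytes; the function names are made up for illustration.)

```python
BLOCKS = 15611274240        # size of the "raid8tb0" wedge, from dmesg
BLOCK_SIZE = 512            # bytes per block
TOTAL_BYTES = BLOCKS * BLOCK_SIZE

def rebuild_days(rate_mb_per_s):
    """Days needed to copy the whole component at a steady rate."""
    seconds = TOTAL_BYTES / (rate_mb_per_s * 1e6)
    return seconds / 86400

def implied_rate_mb_per_s(hours, minutes, seconds):
    """Average rate implied by an ETA for the whole component."""
    eta_seconds = hours * 3600 + minutes * 60 + seconds
    return TOTAL_BYTES / eta_seconds / 1e6

for rate in (3.5, 4.0):
    print("%.1f MB/sec -> %.1f days" % (rate, rebuild_days(rate)))

# ETA reported by raidctl -S above: 485:13:54
print("ETA implies about %.2f MB/sec" % implied_rate_mb_per_s(485, 13, 54))
```

Under those assumptions the observed rates and the raidctl ETA all land in 
the range of roughly three to four weeks.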


This is observed on NetBSD 9.99.68 on both i386 and on aarch64.

Thanks,
John Klos


Reproducible NetBSD 9.0/amd64 panic, but no crash dump

2020-03-13 Thread John Klos
I've had the same NetBSD 9.0/amd64 system, compiled from 22-Feb sources, 
panic twice while trying to run the same workload, which is just fetching 
many static files via command-line ftp continuously. This appears to be 
reproducible since both panics happened within an hour of starting the 
workload.


However, no crash dump was recorded after the panics even though 
kern.dump_on_panic is set. savecore says there's no core dump, there's 
plenty of disk space... Not sure how to diagnose this. Suggestions?


Thanks,
John


[ 236045.981322] uvm_fault(0x8cf49ee4fa20, 0x0, 2) -> e
[ 236045.981322] fatal page fault in supervisor mode
[ 236045.981322] trap type 6 code 0x2 rip 0x8098bb73 cs 0x8 rflags 
0x10286 cr2 0 ilevel 0x4 rsp 0xc200c0b827f0
[ 236045.981322] curlwp 0x8cf4e470e8c0 pid 1070.1 lowest kstack 
0xc200c0b802c0
[ 236045.981322] panic: trap
[ 236045.981322] cpu3: Begin traceback...
[ 236045.981322] vpanic() at netbsd:vpanic+0x160
[ 236045.981322] snprintf() at netbsd:snprintf
[ 236045.981322] startlwp() at netbsd:startlwp
[ 236045.981322] alltraps() at netbsd:alltraps+0xbb
[ 236045.981322] thmap_del() at netbsd:thmap_del+0x218
[ 236045.981322] npf_conndb_remove() at netbsd:npf_conndb_remove+0x32
[ 236045.981322] npf_conn_establish() at netbsd:npf_conn_establish+0x1b6
[ 236045.981322] npfk_packet_handler() at netbsd:npfk_packet_handler+0x318
[ 236045.981322] pfil_run_hooks() at netbsd:pfil_run_hooks+0x122
[ 236045.991327] ip_output() at netbsd:ip_output+0x49e
[ 236045.991327] tcp_output() at netbsd:tcp_output+0x1970
[ 236045.991327] tcp_connect_wrapper() at netbsd:tcp_connect_wrapper+0x22b
[ 236045.991327] do_sys_connect() at netbsd:do_sys_connect+0x90
[ 236045.991327] sys_connect() at netbsd:sys_connect+0x49
[ 236045.991327] syscall() at netbsd:syscall+0x157
[ 236045.991327] --- syscall (number 98) ---
[ 236045.991327] 759b6e042bfa:
[ 236045.991327] cpu3: End traceback...

[ 236045.991327] dumping to dev 0,1 (offset=16877935, size=2084794):
[ 236045.991327] dump



[  2159.007866] uvm_fault(0x8413f09b9458, 0x0, 2) -> e
[  2159.007866] fatal page fault in supervisor mode
[  2159.007866] trap type 6 code 0x2 rip 0x8098bb73 cs 0x8 rflags 
0x10286 cr2 0 ilevel 0x4 rsp 0xb900b05fe7f0
[  2159.007866] curlwp 0x8414201602c0 pid 6556.1 lowest kstack 
0xb900b05fc2c0
[  2159.007866] panic: trap
[  2159.007866] cpu3: Begin traceback...
[  2159.007866] vpanic() at netbsd:vpanic+0x160
[  2159.007866] snprintf() at netbsd:snprintf
[  2159.007866] startlwp() at netbsd:startlwp
[  2159.007866] alltraps() at netbsd:alltraps+0xbb
[  2159.007866] thmap_del() at netbsd:thmap_del+0x218
[  2159.007866] npf_conndb_remove() at netbsd:npf_conndb_remove+0x32
[  2159.017872] npf_conn_establish() at netbsd:npf_conn_establish+0x1b6
[  2159.017872] npfk_packet_handler() at netbsd:npfk_packet_handler+0x318
[  2159.017872] pfil_run_hooks() at netbsd:pfil_run_hooks+0x122
[  2159.017872] ip_output() at netbsd:ip_output+0x49e
[  2159.017872] tcp_output() at netbsd:tcp_output+0x1970
[  2159.017872] tcp_connect_wrapper() at netbsd:tcp_connect_wrapper+0x22b
[  2159.017872] do_sys_connect() at netbsd:do_sys_connect+0x90
[  2159.017872] sys_connect() at netbsd:sys_connect+0x49
[  2159.017872] syscall() at netbsd:syscall+0x157
[  2159.017872] --- syscall (number 98) ---
[  2159.017872] 7f2580c42bfa:
[  2159.017872] cpu3: End traceback...

[  2159.017872] dumping to dev 0,1 (offset=16877935, size=2084794):
[  2159.017872] dump
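
(Editorial aside: both tracebacks above go through the same call chain. One 
way to confirm that mechanically is to strip the timestamps and compare the 
frame lists. A hedged Python sketch follows; the regular expression and the 
helper name are illustrative only, and just three frames from each panic are 
pasted in for brevity.)

```python
import re

# A few frames copied from each of the two tracebacks above.
PANIC_1 = """\
[ 236045.981322] thmap_del() at netbsd:thmap_del+0x218
[ 236045.981322] npf_conndb_remove() at netbsd:npf_conndb_remove+0x32
[ 236045.981322] npf_conn_establish() at netbsd:npf_conn_establish+0x1b6
"""

PANIC_2 = """\
[  2159.007866] thmap_del() at netbsd:thmap_del+0x218
[  2159.007866] npf_conndb_remove() at netbsd:npf_conndb_remove+0x32
[  2159.017872] npf_conn_establish() at netbsd:npf_conn_establish+0x1b6
"""

# "[ timestamp] function() at netbsd:function+offset"
FRAME_RE = re.compile(
    r"^\[\s*[\d.]+\]\s+([\w.]+)\(\) at netbsd:(\S+)$", re.MULTILINE
)

def frames(log_text):
    """Return (function, location) pairs with the timestamps removed."""
    return FRAME_RE.findall(log_text)

print(frames(PANIC_1) == frames(PANIC_2))
for function, location in frames(PANIC_1):
    print(function, location)
```

With the timestamps removed, the pasted frames from the two panics compare 
equal, including the offsets.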


Re: amd64: svs

2018-01-12 Thread John Klos
It looks like two of the Dell machines I use have BIOS updates that 
include the new microcode, but I don't know when it'll be available for 
the other machines I use. The standalone microcode update at 
https://downloadcenter.intel.com/download/27337/Linux-Processor-Microcode-Data-File 
is dated 20171117, and from what I've read, does not contain the latest 
changes.


FWIW, Intel updated their microcode on 8-Jan-2018:

https://downloadcenter.intel.com/download/27431/

I can't find any documentation which shows what exactly is fixed for each 
family of processors.


I'm not updating my BIOSes for now because using the intel-microcode-netbsd 
package lets me easily make before-and-after comparisons. Plus, many of 
the BIOS updates I've seen don't say what Intel updates they use - you 
have to run cpuctl yourself afterwards to see...


John


USB fixes hopefully getting in to -8

2017-10-02 Thread John Klos

Hi,

I have a 2 TB drive connected via USB to a Raspberry Pi 2 which is running 
netbsd-8. I tried five times, unsuccessfully, to copy a 200 gigabyte file 
via scp to the drive. Each time the Pi locked up, either while just 
copying or while doing things in other ssh sessions. It seemed to coincide 
at least twice with either logging out or trying to log in.


Since ethernet is on USB on the Pi along with the disk, I decided to try 
an 8.99.3 kernel compiled from yesterday's sources. The scp worked without 
any issue at all, even with other things going on like cvs updates and 
compiling.


Does anyone who might know what's been fixed know if the changes are going 
to be pulled in to -8?


Thanks,
John


NetBSD 7 on amd64 panics

2015-11-07 Thread John Klos

Hi,

I have an amd64 system (Intel Core i3) which has been stable for a couple 
of years doing NAT, IPv6 routing, email, web and so on. Since updating to 
NetBSD 7 in May, it's panicked six times now. The latest:


fatal page fault in supervisor mode
trap type 6 code 0 rip 809cc927 cs 8 rflags 10207 cr2 1004f 
ilevel 4 rsp fe811cfd4ba0

curlwp 0xfe842df3b420 pid 0.3 lowest kstack 0xfe811cfd22c0
panic: trap
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
ipf_frag_natknown() at netbsd:ipf_frag_natknown+0x3a
ipf_nat6_checkin() at netbsd:ipf_nat6_checkin+0xe6
ipf_check() at netbsd:ipf_check+0x82b
pfil_run_hooks() at netbsd:pfil_run_hooks+0xc4
ip6_input() at netbsd:ip6_input+0x307
ip6intr() at netbsd:ip6intr+0x4b
softint_dispatch() at netbsd:softint_dispatch+0x79
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfe811cfd4ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
cpu0: End traceback...
uvm_fault(0xfe842690b480, 0x0,
2) -> e
dumpifnagt atlo  dpeavg e 0f,a1u lt(o fifns ets=up2e1r2v6i7s1or, 
msoidzee=

4167952):
trap type 6 code 2 ridpu mpf fff805dc489 cs 8 rflags 10202 cr2 84 
ilevel 8 rsp fe811dd97e10

curlwp 0xfe8406cb7480 pid 2044.5 lowest kstack 0xfe811dd952c0
Skipping crash dump on recursive panic
panic: wddump: polled command has been queued
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
wddump() at netbsd:wddump+0x282
dumpsys_seg() at netbsd:dumpsys_seg+0xc7
dump_seg_iter() at netbsd:dump_seg_iter+0xce
dodumpsys() at netbsd:dodumpsys+0x24c
dumpsys() at netbsd:dumpsys+0x1d
vpanic() at netbsd:vpanic+0x145
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
ipf_frag_natknown() at netbsd:ipf_frag_natknown+0x3a
ipf_nat6_checkin() at netbsd:ipf_nat6_checkin+0xe6
ipf_check() at netbsd:ipf_check+0x82b
pfil_run_hooks() at netbsd:pfil_run_hooks+0xc4
ip6_input() at netbsd:ip6_input+0x307
ip6intr() at netbsd:ip6intr+0x4b
softint_dispatch() at netbsd:softint_dispatch+0x79
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfe811cfd4ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
cpu0: End traceback...
rebooting...
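
The "code" field on the two "trap type 6" (page fault) lines above can be read with the architectural x86 page-fault error-code bits. This is only a sketch to help read the log, not something from the original report; the helper name decode_pf_code is made up for illustration.

```python
# Hypothetical helper: decode the low bits of the x86 page-fault error code
# that the kernel prints after "code" on a "trap type 6" line.
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor mode"),
]

def decode_pf_code(code):
    """Return one description per bit, chosen by whether the bit is set."""
    return [set_msg if code & mask else clear_msg
            for mask, set_msg, clear_msg in PF_BITS]

# (code, cr2) pairs as printed in the two trap lines above
for code, cr2 in ((0x0, 0x1004f), (0x2, 0x84)):
    print(f"code {code:#x}, cr2 {cr2:#x}: " + ", ".join(decode_pf_code(code)))
```

cr2 holds the faulting address. Both values here are low addresses, which could point to a NULL or near-NULL pointer plus an offset, but that is a guess until the crash dump is examined.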

Ideas? I've kept the crashdumps.

John


netbsd-7 panic

2015-08-12 Thread John Klos
I have an amd64 machine running netbsd-7 from 6-August-2015 which 
does common hosting (email, web, DNS), IPv6 tunnels and NAT, 
amongst other things. It panicked like this a few weeks ago when it had 
been running netbsd-7 from March, but otherwise it's been problem free.


Any thoughts or ideas about what could be causing this?


fatal page fault in supervisor mode
trap type 6 code 0 rip 80722fb2 cs 8 rflags 10297 cr2 36 ilevel 2 
rsp fffffe811cfebec0
curlwp 0xfe842df3b860 pid 0.5 lowest kstack 0xfe811cfe92c0
panic: trap
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
ipf_frag_delete() at netbsd:ipf_frag_delete+0x74
ipf_frag_expire() at netbsd:ipf_frag_expire+0x152
ipf_slowtimer() at netbsd:ipf_slowtimer+0x15
ipf_timer_func() at netbsd:ipf_timer_func+0x2d
callout_softclock() at netbsd:callout_softclock+0x248
softint_dispatch() at netbsd:softint_dispatch+0x79
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfe811cfebff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
cpu0: End traceback...
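
When comparing several panics like the ones in these threads, it can help to reduce each traceback to its function names. The following is a minimal sketch under that assumption; the regular expression and the list of frames to skip (the ones that show up in every traceback quoted here) are my own choices, not anything the kernel or DDB provides.

```python
import re

# Matches lines such as "ipf_frag_delete() at netbsd:ipf_frag_delete+0x74".
FRAME = re.compile(r"(\w+)\(\) at netbsd:\w+(?:\+0x[0-9a-f]+)?")
# Frames that appear at the top of every panic traceback quoted here.
COMMON = {"vpanic", "snprintf", "startlwp", "alltraps"}

def frames(traceback_text):
    """Return the function names in a traceback, minus the common frames."""
    return [name for name in FRAME.findall(traceback_text) if name not in COMMON]

sample = """\
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
ipf_frag_delete() at netbsd:ipf_frag_delete+0x74
ipf_frag_expire() at netbsd:ipf_frag_expire+0x152
ipf_slowtimer() at netbsd:ipf_slowtimer+0x15
"""
print(frames(sample))
```

Run over each saved traceback, this makes it quick to see which ones end in the same ipf_frag_* path and which do not.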


if_wm between netbsd-6 and netbsd-7 issue

2014-12-26 Thread John Klos

Hi,

I have a machine which is currently running netbsd-6 with six wm* ethernet 
interfaces. The first four are on a PCIe card which shows up like so:


pci4: i/o space, memory space enabled, rd/line, wr/inv ok
wm0 at pci4 dev 0 function 0: Intel PRO/1000 PT Quad Port Server Adapter (rev. 
0x06)
wm0: interrupting at ioapic0 pin 17
wm0: PCI-Express bus
wm0: 65536 word (16 address bits) SPI EEPROM
wm0: Ethernet address 00:15:17:73:0d:15
igphy0 at wm0 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
igphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-FDX, auto
wm1 at pci4 dev 0 function 1: Intel PRO/1000 PT Quad Port Server Adapter (rev. 
0x06)
wm1: interrupting at ioapic0 pin 16
wm1: PCI-Express bus
wm1: 65536 word (16 address bits) SPI EEPROM
wm1: Ethernet address 00:15:17:73:0d:14
igphy1 at wm1 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
igphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-FDX, auto
ppb4 at pci3 dev 4 function 0: vendor 0x111d product 0x8018 (rev. 0x0e)
ppb4: PCI Express 1.0 
pci5 at ppb4 bus 5
pci5: i/o space, memory space enabled, rd/line, wr/inv ok
wm2 at pci5 dev 0 function 0: Intel PRO/1000 PT Quad Port Server Adapter (rev. 
0x06)
wm2: interrupting at ioapic0 pin 19
wm2: PCI-Express bus
wm2: 65536 word (16 address bits) SPI EEPROM
wm2: Ethernet address 00:15:17:73:0d:17
igphy2 at wm2 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
igphy2: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-FDX, auto
wm3 at pci5 dev 0 function 1: Intel PRO/1000 PT Quad Port Server Adapter (rev. 
0x06)
wm3: interrupting at ioapic0 pin 18
wm3: PCI-Express bus
wm3: 65536 word (16 address bits) SPI EEPROM
wm3: Ethernet address 00:15:17:73:0d:16
igphy3 at wm3 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
igphy3: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-FDX, auto


I was going to upgrade this system to netbsd-7 (built from the sources 
from three hours ago), but when booting the kernel, this is what I get for 
wm0 through wm3:


pci4: i/o space, memory space enabled, rd/line, wr/inv ok
wm0 at pci4 dev 0 function 0: Intel PRO/1000 PT Quad Port Server Adapter (rev. 
0x06)
wm0: interrupting at ioapic0 pin 17
wm0: PCI-Express bus
wm0: could not acquire SWSM SMBI
wm0: wm_nvm_acquire: failed to get semaphore
wm0: could not acquire SWSM SMBI
wm0: wm_nvm_acquire: failed to get semaphore
wm0: No EEPROM
wm0: unable to read Ethernet address
wm1 at pci4 dev 0 function 1: Intel PRO/1000 PT Quad Port Server Adapter (rev. 
0x06)
wm1: interrupting at ioapic0 pin 16
wm1: PCI-Express bus
wm1: could not acquire SWSM SMBI
wm1: wm_nvm_acquire: failed to get semaphore
wm1: could not acquire SWSM SMBI
wm1: wm_nvm_acquire: failed to get semaphore
wm1: No EEPROM
wm1: unable to read Ethernet address
ppb4 at pci3 dev 4 function 0: vendor 0x111d product 0x8018 (rev. 0x0e)
ppb4: PCI Express capability version 1  x4 @ 
2.5GT/s
pci5 at ppb4 bus 5
pci5: i/o space, memory space enabled, rd/line, wr/inv ok
wm2 at pci5 dev 0 function 0: Intel PRO/1000 PT Quad Port Server Adapter (rev. 
0x06)
wm2: interrupting at ioapic0 pin 19
wm2: PCI-Express bus
wm2: 4096 words (16 address bits) SPI EEPROM
wm2: Ethernet address 00:15:17:73:0d:17
igphy0 at wm2 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
igphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-FDX, auto
wm3 at pci5 dev 0 function 1: Intel PRO/1000 PT Quad Port Server Adapter (rev. 
0x06)
wm3: interrupting at ioapic0 pin 18
wm3: PCI-Express bus
wm3: 4096 words (16 address bits) SPI EEPROM
wm3: Ethernet address 00:15:17:73:0d:16
igphy1 at wm3 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
igphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-FDX, auto
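
To make the difference between the two boots easier to see, the per-device lines can be diffed. This is just a sketch using the wm0 lines quoted above; in practice the two lists would come from saved dmesg output filtered by device name.

```python
import difflib

netbsd6 = """\
wm0: interrupting at ioapic0 pin 17
wm0: PCI-Express bus
wm0: 65536 word (16 address bits) SPI EEPROM
wm0: Ethernet address 00:15:17:73:0d:15
""".splitlines()

netbsd7 = """\
wm0: interrupting at ioapic0 pin 17
wm0: PCI-Express bus
wm0: could not acquire SWSM SMBI
wm0: wm_nvm_acquire: failed to get semaphore
wm0: could not acquire SWSM SMBI
wm0: wm_nvm_acquire: failed to get semaphore
wm0: No EEPROM
wm0: unable to read Ethernet address
""".splitlines()

# Unified diff of the wm0 attach messages from the two kernels.
diff = list(difflib.unified_diff(netbsd6, netbsd7, "netbsd-6", "netbsd-7", lineterm=""))
for line in diff:
    print(line)
```

The same comparison on wm2 and wm3 would show that they still attach under netbsd-7, but with the EEPROM reported as 4096 words instead of 65536.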


Does anyone know what's not right now with support for these Intel quad 
gigabit cards?


Thanks,
John Klos


Re: sysctl weirdness on m68k

2014-07-26 Thread John Klos

On NetBSD/atari, luna68k and x68k, kernels print the following
sysctl_createv errors during boot:


I just noticed that evbarm machines also have those errors:

NetBSD 6.99.49 (PI) #1: Fri Jul 25 19:42:40 UTC 2014
j...@chi.ziaspace.com:/usr/current/obj-evbarm/sys/arch/evbarm/compile/PI
total memory = 448 MB
avail memory = 436 MB
sysctl_createv: sysctl_create(machine_arch) returned 17
sysctl_createv: sysctl_locate(multicast) returned 2
sysctl_createv: sysctl_locate(multicast_kludge) returned 2
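
For reading the numbers in those messages: assuming the values are plain errno codes (and that the machine running this uses the usual POSIX numbering for them, which holds for 2 and 17), they can be decoded like this. It is only a convenience sketch.

```python
import errno
import os

# (call as printed by the kernel, returned value)
messages = (
    ("sysctl_create(machine_arch)", 17),
    ("sysctl_locate(multicast)", 2),
    ("sysctl_locate(multicast_kludge)", 2),
)
for call, code in messages:
    print(f"{call}: {errno.errorcode[code]} ({os.strerror(code)})")
```

So 17 reads as "node already exists" and 2 as "node not found".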

John


Re: sysctl weirdness on m68k

2014-07-26 Thread John Klos

On NetBSD/atari, luna68k and x68k, kernels print the following
sysctl_createv errors during boot:

...

Is there anyone who also sees these
"sysctl_createv: sysctl_locate(multicast) returned 2"
messages on other ports?


Yes. I don't have current on an Amiga at the moment, but mac68k does:

NetBSD 6.99.46 (BRIGGS-$Revision: 6.999 $) #0: Tue Jul  8 22:01:50 UTC 2014

j...@chi.ziaspace.com:/usr/current/obj-mac68k/sys/arch/mac68k/compile/BRIGGS
Apple Macintosh Quadra 610  (68040)
cpu: delay factor 1601
fpu: mc68040
total memory = 260 MB
avail memory = 249 MB
sysctl_createv: sysctl_locate(multicast) returned 2
sysctl_createv: sysctl_locate(multicast_kludge) returned 2

John Klos


Re: Unexpected RAIDframe behavior

2013-09-03 Thread John Klos

If the state of the RAID is not being maintained, then that's a bug,
and needs to be fixed right away.  To my knowledge, however, it does
maintain things correctly.  Feel free to file a PR with the specifics
of any failures in this regard...


I will test this via a clean install and create a PR if I observe the same 
thing. I have a bunch of PRs to go through...


John


re: Unexpected RAIDframe behavior

2013-09-03 Thread John Klos

Parity Re-write is 79% complete.


OK, so this is really more about how parity checking works than
anything else (i guess.)

for RAID1, it reads both disks and compares them, and if one
fails it will write the "master" data.  (more generally, it
reads all disks and if anything fails parity check it writes
corrected parity back to it.)


Ah, so a reboot caused RAIDframe to switch from reconstruction to parity 
creation. That explains what was going on. However, it makes me wonder if 
the state of the RAID is not properly being maintained through reboot. I 
didn't really need all that non-zero data.


Thanks,
John


re: Unexpected RAIDframe behavior

2013-09-02 Thread John Klos

what does raidctl -s and -m (separate commands) say?


raidctl -s raid0
Components:
   /dev/wd0a: optimal
   /dev/wd1a: optimal
No spares.
Component label for /dev/wd0a:
   Row: 0, Column: 0, Num Rows: 1, Num Columns: 2
   Version: 2, Serial Number: 2013090100, Mod Counter: 75
   Clean: No, Status: 0
   sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 3873470720
   RAID Level: 1
   Autoconfig: Yes
   Root partition: Yes
   Last configured as: raid0
Component label for /dev/wd1a:
   Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
   Version: 2, Serial Number: 2013090100, Mod Counter: 75
   Clean: No, Status: 0
   sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 3873470720
   RAID Level: 1
   Autoconfig: Yes
   Root partition: Yes
   Last configured as: raid0
Parity status: DIRTY
Reconstruction is 100% complete.
Parity Re-write is 79% complete.
Copyback is 100% complete.



raidctl -m raid0
raid0: parity map enabled with 4096 regions of 462MB
raid0: regions marked clean after 8 intervals of 40.000s
raid0: write/sync/clean counters 43509/2112/1712
raid0: 871 dirty regions
raid0: parity map will remain enabled on next configure
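
As a sanity check on the two outputs above, the numbers line up with each other. This is only arithmetic on the quoted values; whether raidctl actually derives the "Parity Re-write" percentage from the parity map is an assumption on my part.

```python
num_blocks = 3873470720   # numBlocks from the component label
block_size = 512          # blocksize from the component label
regions = 4096            # parity map regions (raidctl -m)
dirty = 871               # dirty regions (raidctl -m)

# Size of one parity map region, in MiB.
region_mib = num_blocks * block_size / regions / 2**20
# Share of regions that are not marked dirty.
clean_pct = 100 * (regions - dirty) / regions

print(f"region size: {region_mib:.0f} MiB")
print(f"clean regions: {clean_pct:.1f}%")
```

The region size matches the "462MB" that raidctl -m reports, and the share of clean regions comes out close to the 79% shown for the parity re-write.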


Unexpected RAIDframe behavior

2013-09-02 Thread John Klos

Hi,

After setting up a RAID-1 mirror with RAIDframe in netbsd-6 (compiled 
from yesterday's tree), I'm seeing strange disk issues. The initial mirror 
was set up and was initializing while the system was booted off of the 
install kernel and both drives (wd0 and wd1) were writing at around 110 
MB/sec each. Some time after a reboot onto the installed system, I started 
seeing this from iostat:


      tty               wd0               wd1             raid0              CPU
 tin tout   KB/t  t/s  MB/s   KB/t  t/s  MB/s   KB/t  t/s  MB/s  us ni sy in  id
   0   79  64.00  231 14.42  64.00  462 28.90   0.00    0  0.00   0  0  0  0 100
   0  239  64.00  253 15.84  64.00  506 31.62   0.00    0  0.00   0  0  1  0  99
   0   79  64.00  200 12.50  64.00  400 25.00   0.00    0  0.00   0  0  0  1  99
   0   79  64.00  194 12.13  64.00  388 24.26   0.00    0  0.00   0  0  1  0  99
   0   79  64.00  218 13.61  64.00  436 27.23   0.00    0  0.00   0  0  0  0  99

raid0 is completely idle - swap isn't even enabled - yet wd1 is doing 
twice the I/O of wd0. Does anyone know why this is the case?
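
To put a number on "twice": a quick sketch that computes the wd1/wd0 ratio for each iostat sample, with the values copied by hand from the rows above.

```python
# (wd0 t/s, wd0 MB/s, wd1 t/s, wd1 MB/s) for each iostat sample
rows = [
    (231, 14.42, 462, 28.90),
    (253, 15.84, 506, 31.62),
    (200, 12.50, 400, 25.00),
    (194, 12.13, 388, 24.26),
    (218, 13.61, 436, 27.23),
]
for t0, m0, t1, m1 in rows:
    print(f"t/s ratio {t1 / t0:.2f}, MB/s ratio {m1 / m0:.2f}")
```

The transfer count on wd1 is exactly double that of wd0 in every sample, with the same 64 KB transfer size on both, so this looks systematic rather than random background I/O.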


John


Booting with dk0 root

2013-07-25 Thread John Klos

Hi,

Apparently we can't compile a kernel with netbsd root on dk0 type ffs 
because dk0 isn't something the kernel knows about until later. How does 
one use a gpt wedge as a root filesystem? I'm loading the kernel from a 
CompactFlash but would like the root filesystem to be on a 4 TB drive.


Thanks,
John Klos


Can't get 100% CPU on all cores

2012-04-25 Thread John Klos

Is this a NetBSD issue, or something else?

load averages:  7.99,  7.94,  7.91;   up 0+21:43:07    21:25:56
34 processes: 2 runnable, 24 sleeping, 8 on CPU
CPU0 states:  0.0% user,  100% nice,  0.0% system,  0.0% interrupt,  0.0% idle
CPU1 states:  0.0% user,  100% nice,  0.0% system,  0.0% interrupt,  0.0% idle
CPU2 states:  0.0% user, 39.7% nice,  0.0% system,  0.0% interrupt, 60.3% idle
CPU3 states:  0.0% user,  100% nice,  0.0% system,  0.0% interrupt,  0.0% idle
CPU4 states:  0.0% user,  100% nice,  0.0% system,  0.0% interrupt,  0.0% idle
CPU5 states:  0.0% user,  100% nice,  0.0% system,  0.0% interrupt,  0.0% idle
CPU6 states:  0.0% user,  100% nice,  0.0% system,  0.0% interrupt,  0.0% idle
CPU7 states:  0.0% user, 99.6% nice,  0.4% system,  0.0% interrupt,  0.0% idle
Memory: 1732M Act, 10M Exec, 1698M File, 27G Free
Swap: 8193M Total, 8193M Free

  PID USERNAME PRI NICE   SIZE   RES STATE  TIME   WCPU    CPU COMMAND
 6102 john   5   20  8256K 8124K RUN/3 25:56 99.02% 99.02% dnetc
10545 john   5   20  8212K 4540K CPU/4 24:45 99.02% 99.02% dnetc
24632 john   5   20  8200K 4544K CPU/5 25:01 94.58% 94.58% dnetc
14493 john   5   20  8216K 4552K RUN/6 24:34 93.65% 93.65% dnetc
10715 john   6   20  8204K 4540K CPU/6 25:21 92.72% 92.72% dnetc
13965 john   5   20  8224K 4540K CPU/0 25:14 89.75% 89.75% dnetc
 8117 john   5   20  8220K 2508K CPU/7 25:52 82.03% 82.03% dnetc
28320 john   5   20  8208K 4552K CPU/1 24:20 81.84% 81.84% dnetc
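
One way to look at the top output above is to count processes per CPU. The sketch below assumes that the number after the slash in the STATE column is the CPU a process is running on or last ran on; the values are copied by hand from the listing.

```python
from collections import Counter

# (pid, STATE, WCPU %) copied from the top output above
procs = [
    (6102, "RUN/3", 99.02), (10545, "CPU/4", 99.02), (24632, "CPU/5", 94.58),
    (14493, "RUN/6", 93.65), (10715, "CPU/6", 92.72), (13965, "CPU/0", 89.75),
    (8117, "CPU/7", 82.03), (28320, "CPU/1", 81.84),
]

# Number of dnetc processes associated with each CPU.
per_cpu = Counter(int(state.split("/")[1]) for _, state, _ in procs)
unused = [cpu for cpu in range(8) if cpu not in per_cpu]
doubled = [cpu for cpu, count in per_cpu.items() if count > 1]
total_wcpu = sum(wcpu for _, _, wcpu in procs)

print("CPUs with no dnetc process:", unused)
print("CPUs with more than one:", doubled)
print(f"total WCPU: {total_wcpu:.2f}% of 800%")
```

In this snapshot two processes are tied to CPU 6 and none to CPU 2, which would fit with CPU2 showing about 60% idle.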


For some reason I can't get 100% CPU on all eight cores. This is with 
netbsd-6 compiled from two days ago on an eight core AMD Zambezi:


cpu0 at mainbus0 apid 16: AMD FX(tm)-8150 Eight-Core Processor   , 
id 0x600f12

...


Nothing else is running on the machine. Strange...

Ideas?

John


Areca 1880?

2011-08-29 Thread John Klos

Hi, all,

Is there any support for the Areca 1880, be it in -current or someone's 
not-yet-checked-in tree?


Thanks,
John Klos


Interesting USB observation

2010-06-30 Thread John Klos

Hi,

While this may be known and not that unexpected, I thought I'd share a 
little observation.


I have a quad core amd64 system running i386 NetBSD 5 (I haven't figured 
out how to get the code to run properly under amd64). I'm running some CPU 
intensive applications which will take almost a week and are generating a 
couple of terabytes of intermediate data. I was running low on space on 
one of the SATA drives, so I connected a 1 TB USB drive.


Observation 1: when scp'ing files from another machine to one of the SATA 
connected drives at 35 MB/sec, I started another scp to the same drive; 
the total rate went up to 50 MB/sec (which I think is all the drive can 
do). While just one of the scps was running, I started an scp to the USB 
drive. The total rate for both scps dropped to 10 MB/sec!


Observation 2: even when reading a modest amount of data from the USB 
drive (about 3 MB/sec), the CPU intensive tasks get slowed down 
significantly. Even though at least two of them are getting their data 
from the SATA drives, they're anywhere from 50% to 10% idle while using 
USB for anything. When I stop talking to the USB drives, they then run at 
96% to 98% CPU each.


While transferring over USB, system overhead for all four CPUs is less 
than 5% per CPU, and interrupt overhead is less than 2% for one CPU and 0% 
for the other three. With no USB transfers and two tasks running off of a 
SATA drive transferring about 5 MB/sec, the total number of interrupts is 
around 200 to 250 a second. When running tasks or transferring over USB, 
that jumps to about 2000 to 2300!


I always knew that USB kinda sucked, but I had no idea it is this bad! 
Does anyone know if other OSes have this much of a performance impact when 
using USB? Is this just due to really horrible interrupt handling and 
overhead?


ohci0 at pci0 dev 18 function 0: vendor 0x1002 product 0x4397 (rev. 0x00)
ohci0: interrupting at ioapic0 pin 16
ohci0: OHCI version 1.0, legacy support
usb0 at ohci0: USB revision 1.0
...
ehci0 at pci0 dev 18 function 2: vendor 0x1002 product 0x4396 (rev. 0x00)
ehci0: interrupting at ioapic0 pin 17
ehci0: dropped intr workaround enabled
ehci0: BIOS has given up ownership
ehci0: EHCI version 1.0
ehci0: companion controllers, 3 ports each: ohci0 ohci1
...

Thanks,
John Klos


WAPBL and IDE mac68k

2010-05-31 Thread John Klos

Hi,

I've been having a problem with my Quadra 630 system panicking on boot. It 
was hard to figure out what was going on - a netbsd-5 kernel panics, a 
-current kernel boots, but a -current kernel from 31-December-2009 works.


It happens even when I try to boot to single user mode because I see the 
message saying "/: replaying log to memory" right before it panics. Not 
sure why the journaling stuff happens when booting in single user mode 
without mounting any filesystems, but that's what it is.


When I moved the drive to the same machine's SCSI bus, it works fine with 
any kernel, so this is specific to the IDE bus of the Quadra 630-type 
machines.


(1) How does one start up in single user mode WITHOUT filesystems getting 
read?


(2) Who knows enough about WAPBL and IDE busses to guess where to look for 
a possible solution to this problem?


Thanks,
John Klos


Re: Hardware RAID problem with NetBSD 5?

2010-03-30 Thread John Klos

How old is your kernel ? I've had strange issues with ahci and
interrupts on some machines, and a fix (or rather a workaround) for this
has just been pulled up to netbsd-5. It could be your problem as well ...


I had been trying 5.0.2 because I had to keep reinstalling, but I'll try a 
new kernel.


Thanks,
John


Hardware RAID problem with NetBSD 5?

2010-03-29 Thread John Klos

Hi,

I'm not sure if this is a problem with the motherboard I'm using or with 
NetBSD, but here goes anyway. I have an MSI MS-7511 amd64 motherboard 
which has a form of hardware RAID on the motherboard. However, after using 
it for a few minutes, the machine locks up HARD. I can't get into the 
debugger, I can't get any sort of activity, and even the reset button 
doesn't work!


It makes me think that perhaps the system disables the reset so that 
writes which need to go out don't get interrupted the very instant that 
the reset button is pushed, but if that's the case this would be the first 
time I've seen anything like this on an x86 system.


If I set the SATA controller to AHCI mode in the BIOS, the system can run 
for days. If set to RAID (I'm simply mirroring two 1 TB drives), it might 
finish a newfs and untargzip a set, but never all of them.


The drives show up as wd* even when they're configured in a mirror:

ahcisata0 port 2: device present, speed: 1.5Gb/s
ahcisata0 port 4: device present, speed: 3.0Gb/s
ahcisata0 port 5: device present, speed: 3.0Gb/s
wd0 at atabus3 drive 0: 
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 78167 MB, 158816 cyl, 16 head, 63 sec, 512 bytes/sect x 160086528 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) 
(using DMA)
wd1 at atabus5 drive 0: 
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 931 GB, 1938021 cyl, 16 head, 63 sec, 512 bytes/sect x 1953525168 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(ahcisata0:4:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) 
(using DMA)
wd2 at atabus6 drive 0: 
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 931 GB, 1938021 cyl, 16 head, 63 sec, 512 bytes/sect x 1953525168 sectors
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd2(ahcisata0:5:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) 
(using DMA)
...
ataraid0: found 1 RAID volume
ld0 at ataraid0 vendtype 3 unit 0: nVidia ATA RAID-1 array
ld0: 931 GB, 121601 cyl, 255 head, 63 sec, 512 bytes/sect x 1953525120 
sectors

opendisk: can't open dev wd1 (16)
opendisk: can't open dev wd2 (16)
opendisk: can't open dev wd1 (16)
opendisk: can't open dev wd2 (16)
opendisk: can't open dev wd1 (16)
opendisk: can't open dev wd2 (16)
...

wd0 in this boot is another disk which isn't configured as part of any 
RAID.


Strange... Does anyone have any ideas? Has anyone seen behaviour like 
this, particularly the reset button getting disabled?


Thanks,
John


4k sector sizes on new disks?

2010-02-14 Thread John Klos

Hi, all,

Does newfs automatically choose 4k fragment sizes for new drives with 4k 
sectors?


I'm wondering how much fallout there will be when these drives become more 
common. There's a story about it on Slashdot:


http://hardware.slashdot.org/story/10/02/14/1541244/Linux-Not-Quite-Ready-For-New-4K-Sector-Drives
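
For anyone wanting to check their own layout, the arithmetic behind the concern is simple. The sketch below assumes a drive with 4096-byte physical sectors that still reports 512-byte logical sectors; the start sectors and fragment sizes are hypothetical examples, and the function name is made up. It says nothing about what newfs actually chooses.

```python
PHYS = 4096     # physical sector size in bytes (assumed)
LOGICAL = 512   # logical sector size the drive reports (assumed)

def aligned(start_sector, frag_size):
    """True if the partition start and the fragment size are both
    multiples of the physical sector size."""
    return (start_sector * LOGICAL) % PHYS == 0 and frag_size % PHYS == 0

# Hypothetical (partition start sector, fragment size in bytes) pairs
for start, frag in ((63, 2048), (63, 4096), (2048, 2048), (2048, 4096)):
    print(f"start {start:5d}, fragment {frag}: aligned={aligned(start, frag)}")
```

A partition starting at sector 63 is misaligned no matter what the fragment size is, and a 2048-byte fragment can still force the drive to read-modify-write a 4096-byte sector even when the partition start is aligned.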

John