Re: Maxphys on -current?
On Fri, Aug 4, 2023 at 17:27, Jason Thorpe wrote:
> If someone does pick this up, I think it would be a good idea to start from
> scratch, because MAXPHYS, as it stands, is used for multiple things.
> Thankfully, I think it would be relatively straightforward to do the work
> that I am suggesting incrementally.

I believe I've been the last one to look at the tls maxphys branch and at least update it. I agree that it's likely more useful to start from scratch and do incremental updates. Certainly physio before the block interface, and also by controller type, e.g. ATA first and then SCSI or USB. For the branch, I particularly disliked that there were quite a few changes which looked either unrelated or avoidable. As one of the first steps I've planned to reduce diffs against HEAD. I've not gotten to it yet, however.

On Fri, Aug 4, 2023 at 08:04, Brian Buhrow wrote:
> speed of the transfers on either system. Interestingly enough, however, the FreeBSD
> performance is markedly worse on this test.
> ...
> NetBSD-99.77/amd64 with SATA3 disk
> # dd if=/dev/rwd0a of=/dev/null bs=1m count=5
> 5242880 bytes transferred in 292.067 secs (179509496 bytes/sec)
>
> FreeBSD-13.1/AMD64 with SATA3 disk
> # dd if=/dev/da4 of=/dev/null bs=1m count=5
> 5242880 bytes transferred in 322.433936 secs (162603232 bytes/sec)

Interesting. FreeBSD da(4) is a character device, since FreeBSD has no block devices anymore, so it's not a raw-vs-block device difference. Is the hardware really similar enough to be a fair comparison?

More broadly, I have doubts whether there is any practical reason to support block sizes bigger than 64KiB at all. For HDDs over SATA, maybe - bigger blocks mean potentially more sequential I/O and hence higher total throughput. You can also queue more I/O and hence avoid seeks - the current NetBSD maximum on SATA is 32 x 64KiB = 2048KiB of queued I/O.
For SCSI, you can queue way more I/O than the usual disk cache can hold even with 64KiB blocks, so bigger block size is not very important. Still, I doubt >64KiB blocks on HDD would achieve more than a couple of percent increase over 64KiB ones. For SSDs over SATA, there is no seek to worry about, but the command latency is a concern. But even there, according to Linux benchmarks, the total transfer rate tops out with 64KiB blocks already. For NVMe, the command latency is close to irrelevant - it has a very low latency command interface and very deep command queues. I don't see bigger blocks helping much, if at all. Jaromir
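A quick way to sanity-check the block-size claims above is a dd sweep. A rough sketch, not a rigorous benchmark; DEV and the count are placeholders you need to adapt:

```shell
#!/bin/sh
# Rough sequential-read probe at several block sizes.  DEV defaults to
# /dev/zero so the script runs anywhere; for a real measurement, point
# it at a raw disk device (e.g. DEV=/dev/rwd0d on NetBSD) and use a
# count large enough to defeat the drive cache.
DEV=${DEV:-/dev/zero}
for bs in 16k 64k 256k 1m; do
	printf 'bs=%s: ' "$bs"
	# dd prints its throughput summary on stderr; keep the last line.
	dd if="$DEV" of=/dev/null bs="$bs" count=64 2>&1 | tail -1
done
</dev/null
```

On a real HDD, expect the curve to flatten well before 1m if the claim above holds; on /dev/zero it only measures syscall overhead.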
Re: Module autounload proposal: opt-in, not opt-out
On Mon, Aug 8, 2022 at 07:23, Taylor R Campbell wrote:
> I propose we remove the assumption that ENOTTY means no objection, and
> switch from opt-out to opt-in: if a module _has_ been audited to
> verify MODULE_CMD_AUTOUNLOAD is safe, it can return 0; otherwise it
> will never be autounloaded, even if the author forgot to deny
> MODULE_CMD_AUTOUNLOAD.

Very reasonable. Please do this. Jaromir
Re: maximum limit of files open with O_EXLOCK
On Sat, Jun 19, 2021 at 10:12, nia wrote:
> The Zig developer found the kernel limit:
> https://nxr.netbsd.org/xref/src/sys/kern/vfs_lockf.c#116
>
> but it doesn't seem to be adjustable through sysctl.
> I wonder if it should be.

Yes, it should be a sysctl. The default should probably also be bumped higher. Jaromir
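A sketch of what the hook could look like. The node name, variable name, and default here are made up for illustration; the real limit lives in vfs_lockf.c and would need the appropriate node placement:

```c
/*
 * Hypothetical: export the advisory-lock limit as a writable
 * kern.maxlocksperuser node.  Names are illustrative only.
 */
static int maxlocksperuser = 1024;

SYSCTL_SETUP(sysctl_lockf_setup, "lockf sysctl setup")
{
	sysctl_createv(clog, 0, NULL, NULL,
	    CTLFLAG_PERMANENT | CTLFLAG_READWRITE,
	    CTLTYPE_INT, "maxlocksperuser",
	    SYSCTL_DESCR("Maximum number of advisory locks per user"),
	    NULL, 0, &maxlocksperuser, 0,
	    CTL_KERN, CTL_CREATE, CTL_EOL);
}
```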
Re: Ext4 support
On Thu, Apr 29, 2021 at 22:06, Vincent DEFERT <20@defert.com> wrote:
> I'd like to have full ext4 support so an ext4-formatted disk could be
> used to exchange data between Linux and NetBSD, for instance.

Most of the commonly used features needed for this should actually be implemented already. The big one missing on the NetBSD side is journal support, but that would be Hard (TM). If you want just an interoperable disk partition, you can switch the journal off on the Linux side and you should be good. Jaromir
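For reference, switching the journal off from the Linux side is a one-liner with tune2fs. The device name is a placeholder, and the filesystem must be unmounted first:

```
# /dev/sdXN is a placeholder for the ext4 partition.
umount /dev/sdXN
tune2fs -O ^has_journal /dev/sdXN    # drop the has_journal feature
e2fsck -f /dev/sdXN                  # then force a consistency check
```

The same command with `-O has_journal` turns the journal back on later.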
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
On Wed, Apr 14, 2021 at 03:21, Greg A. Woods wrote:
> However their front-end code does detect it and seems to make use of it,
> and has done for some 6 years now according to "git blame" (with no
> recent fixes beyond fixing a memory leak on their end). Here we see it
> live from FreeBSD's sysctl output, thus my concern that this feature may
> be the source of the problem:

You can test whether this is the problem by disabling the feature negotiation in NetBSD's xbdback.c - comment out the code which sets feature-max-indirect-segments in xbdback_backend_changed(). With the feature disabled, the FreeBSD DomU should not use indirect segments. Jaromir
Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
On Sun, Apr 11, 2021 at 17:51, Robert Elz wrote:
> Date: Sun, 11 Apr 2021 14:25:40 - (UTC)
> From: mlel...@serpens.de (Michael van Elst)
> Message-ID:
>
> | + dg->dg_secperunit = vnd->sc_size / DEV_BSIZE;
>
> While it shouldn't make any difference for any properly created image
> file, make it be
>
> (vnd->sc_size + DEV_BSIZE - 1) / DEV_BSIZE;
>
> so that any trailing partial sector remains in the image.

I don't think it's a good idea to include the sector which can't be written to. Jaromir
Re: Bounties for xhci features: scatter-gather, suspend/resume
On Thu, Mar 25, 2021 at 21:36, wrote:
> xHCI scatter-gather support
>
> The infamous "failed to create xfers". xhci wants contiguous
> allocations. With too much memory fragmentation, they're hard to do.
> This shows up as USB drivers randomly failing on machines that have been
> up for a while.
> (chuq attempted to handle this UVM-side recently).

I'll take care of this part. Jaromir
Re: Issues with older wd* / IDE on various platforms
On Fri, Nov 13, 2020 at 22:31, Robert Swindells wrote:
>
> John Klos wrote:
> > I've noticed problems in three places and only recently did it occur to me
> > that they may all be related.
> [snip]
> > All three of these machines have much older IDE, so I'm wondering what in
> > NetBSD changed that may have caused this.
>
> I would guess that one thing all have in common is lack of DMA.
>
> I have been using a local patch for an ofppc machine that can only do
> PIO with the IDE controller, can try latest -current on it over the
> weekend.

That could very well be. There is no automatic mode downgrade from DMA to PIO any more, so if the controller advertises DMA support but it doesn't actually work, I/O will just fail. Jaromir
Re: notes from running will-it-scale
Very interesting, particularly the outrageous assembly for pmap_{zero,copy}_page(). Is there some way to tell the compiler that the address is already 4096-aligned and avoid the conditionals? Failing that, we could just adopt the FreeBSD assembly for this.

Does anyone see a problem with introducing a vfs.timestamp_precision to avoid the rdtscp?

Jaromir

On Sun, Jul 19, 2020 at 13:21, Mateusz Guzik wrote:
>
> Hello,
>
> I recently took an opportunity to run cross-systems microbenchmarks
> with will-it-scale and included NetBSD (amd64).
>
> https://people.freebsd.org/~mjg/freebsd-dragonflybsd-netbsd-v2.txt
> [no linux in this doc, I will probably create a new one soon(tm)]
>
> The system has a lot of problems in the vfs layer, vm is a mixed bag
> with multithreaded cases lagging behind and some singlethreaded being
> pretty good (and at least one winning against the other systems).
>
> Notes:
> - rdtscp is very expensive in vms, yet the kernel unconditionally
> performs it by calling vfs_timestamp. Both FreeBSD and DragonflyBSD have
> a knob to change the resolution (and consequently avoid the
> instruction), I think you should introduce it and default to less
> accuracy on vms. Sample results:
> stock pipe1: 2413901
> patched pipe1: 3147312
> stock vfsmix: 13889
> patched vfsmix: 73477
> - sched_yield is apparently a nop when the binary is not linked with
> pthread. this does not match other systems and is probably a bug.
> - pmap_zero_page/pmap_copy_page compile to atrocious code which keeps
> checking for alignment. The compiler does not know what values can be
> assigned to pmap_direct_base and improvises.
> 0x805200c3 <+0>:  add    0xf93b46(%rip),%rdi        # 0x814b3c10
> 0x805200ca <+7>:  mov    $0x1000,%edx
> 0x805200cf <+12>: xor    %eax,%eax
> 0x805200d1 <+14>: test   $0x1,%dil
> 0x805200d5 <+18>: jne    0x805200ff
> 0x805200d7 <+20>: test   $0x2,%dil
> 0x805200db <+24>: jne    0x8052010b
> 0x805200dd <+26>: test   $0x4,%dil
> 0x805200e1 <+30>: jne    0x80520116
> 0x805200e3 <+32>: mov    %edx,%ecx
> 0x805200e5 <+34>: shr    $0x3,%ecx
> 0x805200e8 <+37>: rep stos %rax,%es:(%rdi)
> 0x805200eb <+40>: test   $0x4,%dl
> 0x805200ee <+43>: je     0x805200f1
> 0x805200f0 <+45>: stos   %eax,%es:(%rdi)
> 0x805200f1 <+46>: test   $0x2,%dl
> 0x805200f4 <+49>: je     0x805200f8
> 0x805200f6 <+51>: stos   %ax,%es:(%rdi)
> 0x805200f8 <+53>: and    $0x1,%edx
> 0x805200fb <+56>: je     0x805200fe
> 0x805200fd <+58>: stos   %al,%es:(%rdi)
> 0x805200fe <+59>: retq
> 0x805200ff <+60>: stos   %al,%es:(%rdi)
> 0x80520100 <+61>: mov    $0xfff,%edx
> 0x80520105 <+66>: test   $0x2,%dil
> 0x80520109 <+70>: je     0x805200dd
> 0x8052010b <+72>: stos   %ax,%es:(%rdi)
> 0x8052010d <+74>: sub    $0x2,%edx
> 0x80520110 <+77>: test   $0x4,%dil
> 0x80520114 <+81>: je     0x805200e3
> 0x80520116 <+83>: stos   %eax,%es:(%rdi)
> 0x80520117 <+84>: sub    $0x4,%edx
> 0x8052011a <+87>: jmp    0x805200e3
>
> The thing to do in my opinion is to just provide dedicated asm funcs.
> This is the equivalent on FreeBSD (ifunc'ed):
>
> ENTRY(pagezero_std)
>         PUSH_FRAME_POINTER
>         movl    $PAGE_SIZE/8,%ecx
>         xorl    %eax,%eax
>         rep
>         stosq
>         POP_FRAME_POINTER
>         ret
> END(pagezero_std)
>
> ENTRY(pagezero_erms)
>         PUSH_FRAME_POINTER
>         movl    $PAGE_SIZE,%ecx
>         xorl    %eax,%eax
>         rep
>         stosb
>         POP_FRAME_POINTER
>         ret
> END(pagezero_erms)
>
> --
> Mateusz Guzik
Re: kernel stack usage
On Sat, Jul 4, 2020 at 15:30, Kamil Rytarowski wrote:
> > Kamil - what's the difference in gcc between -Wframe-larger-than= and
> > -Wstack-size= ?
>
> -Wstack-size doesn't exist?

Sorry, I meant -Wstack-usage=

> > I see according to gcc documentation -Wframe-larger-than doesn't count
> > size for alloca() and variable-length arrays, which makes it much less
> > useful than -Wstack-usage.
>
> It's not a problem.
>
> Whenever alloca or VLA is in use, it's already breaking the stack
> protector. There are a few exceptions registered in sys/conf/ssp.mk.
>
> We could add -Walloca -Wvla for USE_SSP=yes builds to catch quickly
> inappropriate usage (frequently violated by code optimizer).

alloca() is already not used in our kernel code; I checked and found it only in some arch-specific stand/ code. So -Walloca should be safe to turn on unconditionally, regardless of SSP, unless the compiler emits alloca() calls itself. Jaromir
Re: kernel stack usage
Can anybody using clang please confirm the kernel builds with -Wframe-larger-than=3584?

Kamil - what's the difference in gcc between -Wframe-larger-than= and -Wstack-size= ? I see according to the gcc documentation that -Wframe-larger-than doesn't count the size of alloca() and variable-length arrays, which makes it much less useful than -Wstack-usage.

Jaromir

On Sun, May 31, 2020 at 16:39, Kamil Rytarowski wrote:
>
> Can we adopt -Wframe-larger-than=1024 and mark it fatal?
>
> This option is supported by GCC and Clang.
>
> On 31.05.2020 15:55, Jaromír Doleček wrote:
> > I think it would make sense to add -Wstack-usage=X to kernel builds.
> >
> > Either 2KB or 1KB seems to be a good limit, not too many offenders
> > between 1KB and 2KB so maybe worth it to use 1KB and change them.
> >
> > I'm sure there would be something similar for LLVM too.
> >
> > Jaromir
> >
> > On Sun, May 31, 2020 at 15:39, Simon Burge wrote:
> >>
> >> matthew green wrote:
> >>
> >>> glad to see this effort and the clean up already!
> >>>
> >>> ideally, we can break the kernel build if large stack consumers
> >>> are added to the kernel. i'd be OK with it being default on,
> >>> with of course a way to skip it, and if necessary it can have
> >>> a whitelist of "OK large users."
> >>
> >> I started to look at -fstack-usage which gives a nice MI way of getting
> >> the data instead of parsing MD objdump output, but didn't get any
> >> further than the below patch.
> >> The find(1) command referenced in the patch gives the following
> >>
> >> 1008 nfs_serv.c:2181:1:nfsrv_link
> >> 1056 procfs_linux.c:422:1:procfs_do_pid_stat
> >> 1056 vfs_subr.c:1521:1:vfs_buf_print
> >> 1072 dl_print.c:80:1:sdl_print
> >> 1104 core_elf32.c:300:1:coredump_getseghdrs_elf32
> >> 1104 core_elf32.c:300:1:coredump_getseghdrs_elf64
> >> 1104 sys_ptrace_common.c:1595:1:proc_regio
> >> 1120 subr_bufq.c:131:1:bufq_alloc
> >> 1136 db_lwp.c:64:1:db_lwp_whatis
> >> 1152 kern_ktrace.c:1269:1:ktrwrite
> >> 1168 uvm_swap.c:768:1:uvm_swap_stats.part.1
> >> 1312 nfs_serv.c:1905:1:nfsrv_rename
> >> 2112 tty_60.c:68:1:compat_60_ptmget_ioctl
> >> 2144 memmem.c:83:14:twoway_memmem
> >>
> >> Anyone else want to run with adding a bit more to this to do some build
> >> failure as mrg@ suggests and documenting for options(4)?
> >>
> >> Cheers,
> >> Simon.
> >> --
> >> Index: Makefile.kern.inc
> >> ===
> >> RCS file: /cvsroot/src/sys/conf/Makefile.kern.inc,v
> >> retrieving revision 1.270
> >> diff -d -p -u -r1.270 Makefile.kern.inc
> >> --- Makefile.kern.inc 21 May 2020 18:44:19 - 1.270
> >> +++ Makefile.kern.inc 31 May 2020 13:34:13 -
> >> @@ -104,6 +104,11 @@ CFLAGS+= -ffreestanding -fno-zero-initia
> >> CFLAGS+= ${${ACTIVE_CC} == "gcc":? -fno-delete-null-pointer-checks :}
> >> CFLAGS+= ${DEBUG} ${COPTS}
> >> AFLAGS+= -D_LOCORE -Wa,--fatal-warnings
> >> +.if defined(KERN_STACK_USAGE)
> >> +# example usage to find largest stack users in kernel compile directory:
> >> +# find . -name \*.su | xargs awk '{ printf "%6d %s\n", $2, $1 }' | sort +1n
> >> +CFLAGS+= -fstack-usage
> >> +.endif
> >>
> >> # XXX
> >> .if defined(HAVE_GCC) || defined(HAVE_LLVM)
> >> @@ -338,8 +343,8 @@ ${_s:T:R}.o: ${_s}
> >> .if !target(__CLEANKERNEL)
> >> __CLEANKERNEL: .USE
> >> 	${_MKMSG} "${.TARGET}ing the kernel objects"
> >> -	rm -f ${KERNELS} *.map eddep tags *.[io] *.ko *.ln [a-z]*.s vers.c \
> >> -	    [Ee]rrs linterrs makelinks assym.h.tmp assym.h \
> >> +	rm -f ${KERNELS} *.map *.[io] *.ko *.ln [a-z]*.s *.su vers.c \
> >> +	    eddep tags [Ee]rrs linterrs makelinks assym.h.tmp assym.h \
> >> 	    ${EXTRA_KERNELS} ${EXTRA_CLEAN}
> >> .endif
Re: Straw proposal: MI kthread vector/fp unit API
On Tue, Jun 23, 2020 at 00:14, Eduardo Horvath wrote:
> The SPARC has always had a lazy FPU save logic. The fpstate structure is
> not part of the pcb and is allocated on first use.

No lazy FPU save logic, please. It was eradicated from x86 for a reason: https://access.redhat.com/solutions/3485131 The same should really be done for SPARC. Most certainly, we should not add any new interface using lazy FPU save logic, particularly not for something involving cryptography. Jaromir
Re: UBSan: Undefined Behavior in lf_advlock
On Fri, Jun 5, 2020 at 21:49, syzbot wrote:
> [ 44.1699615] panic: UBSan: Undefined Behavior in
> /syzkaller/managers/netbsd-kubsan/kernel/sys/kern/vfs_lockf.c:843:16, signed
> integer overflow: 131072 + 9223372036854771712 cannot be represented in type
> 'long int'
>
> [ 44.1931600] cpu0: Begin traceback...
> [ 44.1999503] vpanic() at netbsd:vpanic+0x287 sys/kern/subr_prf.c:290
> [ 44.2299515] isAlreadyReported() at netbsd:isAlreadyReported
> [ 44.2599494] HandleOverflow() at netbsd:HandleOverflow+0x1c9
> sys/../common/lib/libc/misc/ubsan.c:375
> [ 44.2899499] lf_advlock() at netbsd:lf_advlock+0x2124
> sys/kern/vfs_lockf.c:843

This happens in:

	if (fl->l_len > 0)
		end = start + fl->l_len - 1;
	else {

when the call to fcntl() is arranged so that 'start' ends up 0x20000 and fl->l_len 0x7ffffffffffff000, overflowing the off_t. I wonder, is it important to fix cases like that? Jaromir
UEFI boot and PCI interrupt routing (amd64)
Hi, I'm working on a driver for a PCI device. I'm far enough along to execute operations which should trigger an interrupt, but the interrupt handler (registered via pci_intr_establish()) is never called. It looks like there may be some kind of interrupt routing problem. Any hints on what I could or should check? Unfortunately I can't boot the machine in legacy mode to check whether it's due to missing initialisation by the BIOS. Jaromir
Re: kernel stack usage
I think it would make sense to add -Wstack-usage=X to kernel builds.

Either 2KB or 1KB seems to be a good limit; there are not too many offenders between 1KB and 2KB, so it may be worth using 1KB and changing them.

I'm sure there would be something similar for LLVM too.

Jaromir

On Sun, May 31, 2020 at 15:39, Simon Burge wrote:
>
> matthew green wrote:
>
> > glad to see this effort and the clean up already!
> >
> > ideally, we can break the kernel build if large stack consumers
> > are added to the kernel. i'd be OK with it being default on,
> > with of course a way to skip it, and if necessary it can have
> > a whitelist of "OK large users."
>
> I started to look at -fstack-usage which gives a nice MI way of getting
> the data instead of parsing MD objdump output, but didn't get any
> further than the below patch. The find(1) command referenced in the
> patch gives the following
>
> 1008 nfs_serv.c:2181:1:nfsrv_link
> 1056 procfs_linux.c:422:1:procfs_do_pid_stat
> 1056 vfs_subr.c:1521:1:vfs_buf_print
> 1072 dl_print.c:80:1:sdl_print
> 1104 core_elf32.c:300:1:coredump_getseghdrs_elf32
> 1104 core_elf32.c:300:1:coredump_getseghdrs_elf64
> 1104 sys_ptrace_common.c:1595:1:proc_regio
> 1120 subr_bufq.c:131:1:bufq_alloc
> 1136 db_lwp.c:64:1:db_lwp_whatis
> 1152 kern_ktrace.c:1269:1:ktrwrite
> 1168 uvm_swap.c:768:1:uvm_swap_stats.part.1
> 1312 nfs_serv.c:1905:1:nfsrv_rename
> 2112 tty_60.c:68:1:compat_60_ptmget_ioctl
> 2144 memmem.c:83:14:twoway_memmem
>
> Anyone else want to run with adding a bit more to this to do some build
> failure as mrg@ suggests and documenting for options(4)?
>
> Cheers,
> Simon.
> --
> Index: Makefile.kern.inc
> ===
> RCS file: /cvsroot/src/sys/conf/Makefile.kern.inc,v
> retrieving revision 1.270
> diff -d -p -u -r1.270 Makefile.kern.inc
> --- Makefile.kern.inc 21 May 2020 18:44:19 - 1.270
> +++ Makefile.kern.inc 31 May 2020 13:34:13 -
> @@ -104,6 +104,11 @@ CFLAGS+= -ffreestanding -fno-zero-initia
> CFLAGS+= ${${ACTIVE_CC} == "gcc":? -fno-delete-null-pointer-checks :}
> CFLAGS+= ${DEBUG} ${COPTS}
> AFLAGS+= -D_LOCORE -Wa,--fatal-warnings
> +.if defined(KERN_STACK_USAGE)
> +# example usage to find largest stack users in kernel compile directory:
> +# find . -name \*.su | xargs awk '{ printf "%6d %s\n", $2, $1 }' | sort +1n
> +CFLAGS+= -fstack-usage
> +.endif
>
> # XXX
> .if defined(HAVE_GCC) || defined(HAVE_LLVM)
> @@ -338,8 +343,8 @@ ${_s:T:R}.o: ${_s}
> .if !target(__CLEANKERNEL)
> __CLEANKERNEL: .USE
> 	${_MKMSG} "${.TARGET}ing the kernel objects"
> -	rm -f ${KERNELS} *.map eddep tags *.[io] *.ko *.ln [a-z]*.s vers.c \
> -	    [Ee]rrs linterrs makelinks assym.h.tmp assym.h \
> +	rm -f ${KERNELS} *.map *.[io] *.ko *.ln [a-z]*.s *.su vers.c \
> +	    eddep tags [Ee]rrs linterrs makelinks assym.h.tmp assym.h \
> 	    ${EXTRA_KERNELS} ${EXTRA_CLEAN}
> .endif
Re: kernel stack usage
On Sat, May 30, 2020 at 18:41, Jason Thorpe wrote:
> These two seem slightly bogus. coredump_note_elf64() was storing register
> state not the stack, but not nearly 3K worth. procfs_domounts() has nearly
> nothing on the stack as far as I can tell, and the one function that could be
> auto-inlined that it calls doesn't have much either.

struct statvfs is certainly over 3 KB - line 619. Jaromir
Re: kernel stack usage
I've fixed several where I felt comfortable, feel free to do more:

4096  pci_conf_print at pci_subr.c:4812
4096  dtv_demux_read at dtv_demux.c:493
3408  genfb_calc_hsize at genfb.c:630
2240  bwfm_rx_event_cb at bwfm.c:2099
1664  wdcprobe_with_reset at wdc.c:491

Jaromir

On Sat, May 30, 2020 at 16:18, Christos Zoulas wrote:
>
> In article <20200530095218.gb28...@mail.duskware.de>,
> Martin Husemann wrote:
> > Hey folks,
> >
> > triggered by some experiments simonb did on mips I wrote a script to find
> > the functions using the biggest stack frame in my current sparc64 kernel.
> >
> > The top 15 list is:
> >
> > Frame/b  Function
> > 4096  pci_conf_print at pci_subr.c:4812
> > 4096  dtv_demux_read at dtv_demux.c:493
> > 3536  SHA3_Selftest at sha3.c:430
> > 3408  genfb_calc_hsize at genfb.c:630
> > 3248  radeonfb_pickres at radeonfb.c:4127
> > 2304  radeonfb_set_cursor at radeonfb.c:3690
> > 2272  gem_pci_attach at if_gem_pci.c:147
> > 2256  twoway_memmem at memmem.c:84
> > 2240  bwfm_rx_event_cb at bwfm.c:2099
> > 2240  compat_60_ptmget_ioctl at tty_60.c:70
> > 2112  db_stack_trace_print at db_trace.c:77
> > 1664  wdcprobe_with_reset at wdc.c:491
> > 1424  nfsrv_rename at nfs_serv.c:1906
> > 1408  OF_mapintr at ofw_machdep.c:728
> > 1344  sysctl_hw_firmware_path at firmload.c:81
> > 1280  fw_bmr at firewire.c:2296
> > 1264  cdioctl at cd.c:1204
> > 1248  cpu_reset_fpustate at cpu.c:400
> > 1248  aubtfwl_attach_hook at aubtfwl.c:273
> > 1248  uvm_swap_stats at uvm_swap.c:726
> >
> > (left column is size of the frame on sparc64 in bytes)
> >
> > I think anything > 1k is dubious and should be checked.
>
> I agree, here is the same for x86_64/GENERIC...
>
> 4408 8027af14:pci_conf_print+0xd
> 4128 80a8dca0:dtv_demux_read+0xb
> 3352 80b940bb:procfs_domounts+0xd
> 3272 80e36b4b:SHA3_Selftest+0xd
> 3264 80c677da:coredump_note_elf64+0xb
> 3240 80b537c6:genfb_calc_hsize.isra.0+0x5
> 2704 80c66a88:coredump_note_elf32+0xb
> 2408 80227a71:process_machdep_doxstate+0xd
> 2184 804381fd:linux_ioctl_termios+0xd
> 2168 80440b2d:linux32_ioctl_termios+0xd
> 2112 802c5579:gem_pci_attach+0xb
> 2104 80e465c3:twoway_memmem+0xd
> 2088 806b5c18:bwfm_rx_event_cb+0xd
> 2072 8097e221:compat_60_ptmget_ioctl+0xd
> 2064 8053ce72:db_stack_trace_print+0x11
> 1488 8064f943:wdcprobe_with_reset+0xb
> 1384 80d7ee2b:ipmi_match+0x9
> 1328 80467a95:usb_add_event+0x7
> 1304 80ba85bb:nfsrv_rename+0xd
> 1256 8053069f:acpicpu_md_pstate_sysctl_all+0xd
> 1256 8052cd13:acpicpu_start+0x9
> 1240 8043b162:linux_sys_rt_sigreturn+0x9
> 1192 80b9392a:procfs_do_pid_stat+0xd
> 1176 80d70810:sysctl_hw_firmware_path+0xd
> 1160 807b30a3:radeon_cs_ioctl+0xd
> 1128 8044610f:oss_ioctl_mixer+0xd
> 1128 8024dd3c:cdioctl+0xd
> 1112 80c633ca:uvm_swap_stats.part.1+0xd
> 1104 804fdf59:fw_bmr+0xb
> 1096 80c8c0ab:ktrwrite+0xd
> 1080 80573ca6:ahc_print_register+0xd
> 1080 80550de8:procfs_getonecpu+0xd
> 1080 8048d95e:aubtfwl_attach_hook+0x9
> 1064 80cf925d:proc_regio+0xd
> 1064 80ccb407:bufq_alloc+0xd
> 1064 80ae8090:ar5112SetPowerTable+0xd
> 1064 80582aa0:ahd_print_register+0xd
> 1064 80568a53:tpmread+0xd
> 1064 80382262:txp_attach+0xd
> 1064 802636da:ata_probe_caps+0xd
> 1048 80e06097:ar9003_paprd_tx_tone_done+0xd
> 1048 80d87e9b:sdl_print+0x9
> 1048 80c67604:coredump_getseghdrs_elf64+0xd
ACPI SCI interrupt flooding - any ideas?
Hi,

we have an SCI interrupt storm problem - http://gnats.netbsd.org/53687

Linux has a quirk for this, and it claims such a thing happens due to some unsupported hw features:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9c4aa1eecb48cfac18ed5e3aca9d9ae58fbafc11

Would it make sense to have some quirk like that? Any ideas what to do? I can try things if given some guidance, but I simply don't know where to start looking.

Jaromir
Network interface deferred start - does it work?
Hi, I'm looking at some performance improvements for xennet(4) and particularly xvif(4). The drivers use deferred start, and there are lots of comments on how important it is to batch requests, but in practice it's really rare for the start routine to be called with more than one mbuf in the Tx queue. It seems the softnet interrupt is triggered more or less immediately after enqueueing each packet in most cases. Maybe this used to be different when the giant lock was held more, but not so much nowadays. So, is it still supposed to work? Jaromir
Re: config_mountroot - spinout while attaching nouveaufb0 on amd64 with LOCKDEBUG
On Mon, Feb 17, 2020 at 17:55, matthew green wrote:
> FWIW, i've been running my radeon with a patch that explicitly drops
> kernel lock around the "real attach" function (the one that config
> mountroot ends up calling.)
>
> we really need to MPSAFE-ify the autoconf subsystem. right now, it
> is expected that autoconf runs with kernel lock... i am not sure of
> the path we should take for this -- but let's actually have a design
> in place we are happy with, while my change below works, it's ugly
> and wrong.

Would it make sense to simply require all callers of config_mountroot() to have MPSAFE config routines and call the callbacks for them without the kernel lock? There are just a few of those, and they can simply be changed to do the KERNEL_LOCK()/UNLOCK() themselves. Jaromir
config_mountroot - spinout while attaching nouveaufb0 on amd64 with LOCKDEBUG
Hi, while debugging the MSI attachment for nouveaufb0, I've several times gotten a spinout panic like the one below. It doesn't happen on every boot, but on almost every one. I confirmed via ddb that this happens due to config_mountroot_thread() holding the kernel lock for too long - that's where the backtrace for cpu 0 (which holds the lock) leads.

Would it be worth spending some effort to run that thread without the kernel lock, or to postpone the driver initialization even further? Or possibly bump the spinout count (a spinout after 1.6 seconds seems a bit too quick), or maybe just disable the spinout check during boot altogether?

Jaromir

The end of boot:

[ 12.1233898] nouveau0: info: No connectors reported connected with modes
[ 12.1392346] kern info: [drm] Cannot find any crtc or sizes - going 1024x768
[ 12.1461457] nouveaufb0 at nouveau0
[ 13.7057542] Kernel lock error: _kernel_lock,244: spinout
[ 13.7157692] lock address : 0x80a330c0 type : spin
[ 13.7157692] initialized : 0x8070c259
[ 13.7257828] shared holds : 0 exclusive: 1
[ 13.7257828] shares wanted: 0 exclusive: 3
[ 13.7357995] relevant cpu : 3 last held: 0
[ 13.7458129] relevant lwp : 0x846d0bc7d740 last held: 0x846d0daf2b80
[ 13.7458129] last locked* : 0x804e82aa unlocked : 0x80238583
[ 13.7558296] curcpu holds : 0 wanted by: 0x846d0bc7d740
[ 13.7658445] panic: LOCKDEBUG: Kernel lock error: _kernel_lock,244: spinout
[ 13.7758590] cpu3: Begin traceback...
[ 13.7758590] vpanic() at netbsd:vpanic+0x146
[ 13.7758590] snprintf() at netbsd:snprintf
[ 13.7858758] lockdebug_more() at netbsd:lockdebug_more
[ 13.7858758] _kernel_lock() at netbsd:_kernel_lock+0x244
[ 13.7958892] ip_slowtimo() at netbsd:ip_slowtimo+0x1a
[ 13.7958892] pfslowtimo() at netbsd:pfslowtimo+0x31
[ 13.8059048] callout_softclock() at netbsd:callout_softclock+0x10f
[ 13.8059048] softint_dispatch() at netbsd:softint_dispatch+0x105
[ 13.8159175] DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0x9a00adca1ff0
[ 13.8259329] Xsoftintr() at netbsd:Xsoftintr+0x4f
[ 13.8259329] --- interrupt ---
[ 13.8259329] 28f783ada3ba8615:
[ 13.8359485] cpu3: End traceback...
[ 13.8359485] fatal breakpoint trap in supervisor mode
[ 13.8359485] trap type 1 code 0 rip 0x8021e3a5 cs 0x8 rflags 0x202 cr2 0 ilevel 0x2 rsp 0x9a00adca1d70
[ 13.8459661] curlwp 0x846d0bc7d740 pid 0.36 lowest kstack 0x9a00adc9d2c0
Stopped in pid 0.36 (system) at netbsd:breakpoint+0x5: leave
Re: Proposal to remove netsmb / smbfs
On Mon, Jan 20, 2020 at 21:07, Jason Thorpe wrote:
>
> (Cross-posted to tech-kern / tech-net because it affects networking and file
> systems.)
>
> I would like to propose that we remove netsmb and smbfs. Two reasons:

Yes, please.

> 1- They only support SMB1, which is an ancient flavor of the protocol. SMB2
> was introduced in 2006 and SMB3 in 2012. SMB3 is the flavor of the protocol
> favored by all of the major implementations, and for those that still support
> SMB1, you sometimes have to go out of your way to enable it.
>
> 2- It is not MP-safe. The effort to make it MP-safe would be non-trivial.

I remember netsmb in particular works in a fairly strange way internally; I'd been meaning to look at it ever since importing it, which however never happened. IIRC SMB1 is not even supported any more in current versions of Windows, so the utility of smbfs is close to 0. Jaromir
Re: New "ahcisata0 port 1: device present" messages with NetBSD 9
All right, let's do that. Penalising quirky old gear makes more sense than penalising new gear. I didn't like that the controller approach means we penalise ALL drives on the controller; however, it's better than penalising EVO drives everywhere, when we have evidence they work without errors on other controllers. Please go ahead. Jaromir

On Fri, Jan 17, 2020 at 10:24, Simon Burge wrote:
>
> matthew green wrote (in kern/54855):
>
> > most people with 860 EVOs are successful with NCQ, i'm one
> > of them with a 500GB model, with netbsd-9 for months now,
> > and they're a wildly successful line i'm sure many others
> > also have working happily with NCQ.
> >
> > we don't know what the real issue is. perhaps it is a
> > problem that the controller had that these drives trigger,
> > rather than a problem with them, so perhaps we should
> > disable it on the controller instead of the drive, if
> > we're going to pick one of the two triggers here.
>
> Izumi Tsutsui wrote:
>
> > Note mine is:
> > > ahcisata0 at pci0 dev 18 function 0: vendor 1002 product 4380 (rev. 0x00)
> [ ... ]
> > > 000:18:0: ATI Technologies SB600 SATA Controller (SATA mass storage,
> > > AHCI 1.0)
> >
> > Note it already has AHCI_QUIRK_BADPM, but the Samsung EVO 860 NCQ problem
> > should be an independent quirk?
> > https://nxr.netbsd.org/xref/src/sys/dev/ic/ahcisatavar.h?r=1.23#57
> > https://nxr.netbsd.org/xref/src/sys/dev/pci/ahcisata_pci.c?r=1.56#59
>
> Using the approach suggested by mrg, where we "penalise" the (older)
> AMD chipsets that exhibit the problem rather than the more recent and
> popular Samsung EVO 860 disks, I've added a new AHCI_QUIRK_BADNCQ quirk
> and enabled it for the same AMD chipsets that had the AHCI_QUIRK_BADPM
> quirk. I do wonder, out of caution, if maybe this quirk should be
> enabled for PCI_PRODUCT_ATI_SB600_SATA_1 as well?
>
> Before patch:
>
> wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6
> (Ultra/133) (using DMA), NCQ (31 tags)
>
> After patch:
>
> wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6
> (Ultra/133) (using DMA), WRITE DMA FUA EXT
>
> Jaromir - are you happy with this alternate approach? If you are
> happy with this I'll commit this patch and revert your EVO 860 quirk
> and request a pull-up to the netbsd-9 branch.
>
> tsutsui - are you able to test the patch too?
>
> Cheers,
> Simon.
> --
> Index: sys/dev/ic/ahcisata_core.c
> ===
> RCS file: /cvsroot/src/sys/dev/ic/ahcisata_core.c,v
> retrieving revision 1.75.4.2
> diff -d -p -u -r1.75.4.2 ahcisata_core.c
> --- sys/dev/ic/ahcisata_core.c 24 Dec 2019 17:34:33 - 1.75.4.2
> +++ sys/dev/ic/ahcisata_core.c 17 Jan 2020 08:45:41 -
> @@ -276,6 +276,11 @@ ahci_attach(struct ahci_softc *sc)
> 		    "ignoring broken port multiplier support\n");
> 		sc->sc_ahci_cap &= ~AHCI_CAP_SPM;
> 	}
> +	if (sc->sc_ahci_quirks & AHCI_QUIRK_BADNCQ) {
> +		aprint_verbose_dev(sc->sc_atac.atac_dev,
> +		    "ignoring broken NCQ support\n");
> +		sc->sc_ahci_cap &= ~AHCI_CAP_NCQ;
> +	}
> 	sc->sc_atac.atac_nchannels = (sc->sc_ahci_cap & AHCI_CAP_NPMASK) + 1;
> 	sc->sc_ncmds = ((sc->sc_ahci_cap & AHCI_CAP_NCS) >> 8) + 1;
> 	ahci_rev = AHCI_READ(sc, AHCI_VS);
> Index: sys/dev/ic/ahcisatavar.h
> ===
> RCS file: /cvsroot/src/sys/dev/ic/ahcisatavar.h,v
> retrieving revision 1.22
> diff -d -p -u -r1.22 ahcisatavar.h
> --- sys/dev/ic/ahcisatavar.h 14 Jan 2019 21:29:56 - 1.22
> +++ sys/dev/ic/ahcisatavar.h 17 Jan 2020 08:45:41 -
> @@ -59,6 +59,7 @@ struct ahci_softc {
> #define AHCI_PCI_QUIRK_BAD64 __BIT(1) /* broken 64-bit DMA */
> #define AHCI_QUIRK_BADPMP __BIT(2) /* broken PMP support, ignore */
> #define AHCI_QUIRK_SKIP_RESET __BIT(4) /* skip drive reset sequence */
> +#define AHCI_QUIRK_BADNCQ __BIT(5) /* possibly broken NCQ support, ignore */
>
> 	uint32_t sc_ahci_cap; /* copy of AHCI_CAP */
> 	int sc_ncmds; /* number of command slots */
> Index: sys/dev/pci/ahcisata_pci.c
> ===
> RCS file: /cvsroot/src/sys/dev/pci/ahcisata_pci.c,v
> retrieving revision 1.55.4.1
> diff -d -p -u -r1.55.4.1 ahcisata_pci.c
> --- sys/dev/pci/ahcisata_pci.c 23 Oct 2019 18:09:18 - 1.55.4.1
> +++ sys/dev/pci/ahcisata_pci.c 17 Jan 2020 08:45:41 -
> @@ -179,17 +179,17 @@ static const struct ahci_pci_quirk ahci_
> 	    AHCI_PCI_QUIRK_FORCE },
> 	/* ATI SB600 AHCI 64-bit DMA only works on some boards/BIOSes */
> 	{ PCI_VENDOR_ATI, PCI_PRODUCT_ATI_SB600_SATA_1,
> -	    AHCI_PCI_QUIRK_BAD64 | AHCI_QUIRK_BADPMP },
> +	    AHCI_PCI_QUIRK_BAD64 | AHCI_QUIRK_BADPMP | AHCI_QUIRK_BADNCQ },
> 	{ PCI_VENDOR_ATI, PCI_PRODUCT_ATI_SB7
Re: New "ahcisata0 port 1: device present" messages with NetBSD 9
Le lun. 13 janv. 2020 à 11:56, Simon Burge a écrit : > A bit more digging shows that this seems to be a (somewhat) known > problem with Samsung EVO 860 disks and AMD SB710/750 chipsets. The > problem also occurs on Windows and Linux with these drives and chipsets. > > Here's a couple of links: > > > https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813 > https://bugzilla.kernel.org/show_bug.cgi?id=201693 > > In the first, there's discussion of how Samsung haven't released a > firmware update to address this. > > Jaromir - do you have any suggestions on how we can deal with this, in > particular thinking about the soon to be released NetBSD 9.0? Maybe > somehow detecting this combination of disks and chipsets and disabling > NCQ? My concern right now is that people upgrading from NetBSD 8 or > earlier to NetBSD 9 may experience data corruption. If they are known broken, seems we indeed need to add a quirk entry to disable NCQ for these drives, and push it into NetBSD 9.0 branch. I'll make the change this week. Jaromir
Re: adding linux syscall fallocate
Le sam. 16 nov. 2019 à 17:14, HRISHIKESH GOYAL a écrit : > Does *posix_fallocate()* implemented (has support) for any of the > underlying filesystems in NetBSD/FreeBSD? > As far as I know no filesystem actually implements the support in NetBSD. > Also, as per my understanding, all calls to *posix_fallocate()* > internally go to *fallocate()* systemcall which are redirected to > *vop_fallocate_desc(). > *So in the project https://wiki.netbsd.org/projects/project/ffs-fallocate/ we > only need to implement *ffs_fallocate() *right? > Yes, pretty much. Jaromir
Re: adding linux syscall fallocate
R0ller, is your problem only that the linux binary used for cross-compiling errors out because the system call returns ENOSYS rather than the potentially expected EOPNOTSUPP? In that case the right fix is indeed to add the mapping into the compat code; then the call would fail the way the tool expects. Linux doesn't support fallocate() universally either - even when it is supported, some filesystems return EOPNOTSUPP when e.g. an unsupported flag combination is used. Jaromir Le mer. 6 nov. 2019 à 10:12, r0ller a écrit : > Hi Maciej, > > Thanks for the detailed answer! Unfortunately, I don't think that I could > accomplish this task in my spare time:( > > Please, don't take this as an offence but as a fix for my case I thought > of an utter hack: > > I do know that writing a file by calling fallocate can be tricked and > redirected into a named pipe of which the data can simply be pulled into a > normal file. This is what I'm already doing in my project as a workaround > when building it as 32bit arm lib: > > mkfifo mypipe > cat mypipe > myfile > > > The problem with this is that it cannot be used when crosscompiling > autoconf projects where a configure script starts creating many files as > I'd need to edit the script at too many places to implement this trick. > > However, if I could carry out this trick with the pipe when intercepting > the linux fallocate call, it could work. Do you think it feasible? > > Best regards, > r0ller > > Eredeti levél > Feladó: Maciej < maciej.grochow...@protonmail.com (Link -> mailto: > maciej.grochow...@protonmail.com) > > Dátum: 2019 november 4 23:32:56 > Tárgy: Re: adding linux syscall fallocate > Címzett: r0ller < r0l...@freemail.hu (Link -> mailto:r0l...@freemail.hu) > > Hi r0ller, > > A couple of weeks ago I also run to the issue when I found lack of > fallocate or POSIX_FALLOCATE(2) (to be precise) a little bit sad.
> From the very first view typical usage of POSIX_FALLOCATE(2) seems > straight forward, comparing to the Linux fallocate(2) where different modes > have to be supported. However, things can go a little bit more complicated > if you are dealing with an existing file with a more complex structure. > Because of that, I found difficult to provide good quality implementation > without a proper way to test it. > Before EuroBSD 2019 as a part of work about fuzzing the FFS, I decided to > bring some well known FS test (namely speaking "XFS tests') suit to make > sure that bugs that we fix did not introduce a regression. > The same thing applies to the new features of FS, is relatively easy to > port implementation from other OS/FS, but without a proper way to test > them, I would be very careful to introduce such things too quickly to the > end-users. > > One thing that I was missing for XFS tests, and going to publish part of > it soon, is a way to view the internal structure of inodes and other > metadata of Filesystem. My primary use case was to debug mount issues, in > the example the issue that I showed during my presentation about the > fuzzing. But also same apply to the code that manipulates inode blocks. > > Hopefully, we can elaborate on that, and as I pointed earlier I would be > careful with merging new FS features especially such wide used as > POSIX_FALLOCATE(2) without a proper FS test suit or extensive enough > testing that would require proper too like i.e. FSDB. > > Thanks > Maciej > > ‐‐‐ Original Message ‐‐‐ > On Sunday, November 3, 2019 6:06 PM, r0ller wrote: > > Hi Jaromir, > > Indeed. That's bad news but thanks for your answer! I've even found this: > https://wiki.netbsd.org/projects/project/ffs-fallocate/ > Are there any details for this project besides that page? I don't know > anything about NetBSD internals though if it's not meant for gurus, I'd > have a look at it and give it a try. 
> > Best regards, > r0ller > > Eredeti levél > Feladó: Jaromír Doleček < jaromir.dole...@gmail.com (Link -> mailto: > jaromir.dole...@gmail.com) > > Dátum: 2019 november 3 15:16:34 > Tárgy: Re: adding linux syscall fallocate > Címzett: r0ller < r0l...@freemail.hu (Link -> mailto:r0l...@freemail.hu) > > Le dim. 3 nov. 2019 à 08:57, r0ller a écrit : > > > As you can see on the attached screenshot, "line 4741" gets printed out
Re: adding linux syscall fallocate
Le dim. 3 nov. 2019 à 08:57, r0ller a écrit : > As you can see on the attached screenshot, "line 4741" gets printed out. So I > went on to check what happens in VOP_FALLOCATE but it gets really internal > there. > > Does anyone have any hint?

fallocate VOP is not implemented for FFS:

> grep fallocate *
ffs_vnops.c:	{ &vop_fallocate_desc, genfs_eopnotsupp },	/* fallocate */
ffs_vnops.c:	{ &vop_fallocate_desc, spec_fallocate },	/* fallocate */
ffs_vnops.c:	{ &vop_fallocate_desc, vn_fifo_bypass },	/* fallocate */

Jaromir
Re: Proposal, again: Disable autoload of compat_xyz modules
Le jeu. 26 sept. 2019 à 18:08, Manuel Bouyer a écrit : > > On Thu, Sep 26, 2019 at 05:10:01PM +0200, Maxime Villard wrote: > > issues for a clearly marginal use case, and given the current general > ^^^ > > This is where we dissagree. You guess it's marginal but there's no > evidence of that (and there's no evidence of the opposite either). FYI - I've also put a lot of effort into fixing & enhancing compat_linux in the past. I also greatly appreciate the work of other folks on the layer; it's super useful in some situations - browser flash support used to be important (thankfully not anymore), also VMware and Matlab, and I've used some Oracle dev tools. However, that is not the topic of this discussion. Let's concentrate on whether it should be enabled by default. Given the history, to me it's completely clear compat_linux shouldn't be on by default. Any possible linux-specific exploits should only be a problem for people who actually explicitly enabled it. Let's just stop pretending that we'd set up any kind of reasonable test suite for this - it has not been done in the last >20 years, and it's even less likely to happen now that most of the major use cases are moot. As Maya suggested, let's keep this concentrated on COMPAT_LINUX only to avoid further bikeshed flogging, so basically I propose doing this:

1) Comment out COMPAT_LINUX from all kernel configs for all archs which support modules
2) Disable autoload for compat_linux, requiring the user to explicitly configure the system to load it. No extra sysctl.

Any major and specific objections? Jaromir
Re: Proposal, again: Disable autoload of compat_xyz modules
Le jeu. 26 sept. 2019 à 10:54, Maxime Villard a écrit : > I recently made a big set of changes to fix many bugs and vulnerabilities in > compat_linux and compat_linux32, the majority of which have a security impact > bigger than the Intel CPU bugs we hear about so much. These compat layers are > enabled by default, so everybody is affected. > ... > Therefore, I am making today the same proposal as Taylor in 2017, because the > problem is still there exactly as-is and we just hit it again; the solution > however is more straightforward. Yes please, it's the right thing to do. Just please also keep the option to compile it into the kernel. I now have no objection to moving all the arch-specific bits under sys/compat/ as you & Taylor suggested in the old thread. Jaromir
Re: Enable functionality by default
Le mar. 16 avr. 2019 à 17:55, Sevan Janiyan a écrit : > altq, veriexec, BUFQ_PRIOCSCAN, CARP > ... > Are there any reasons not enable these features by default on all ports > (with the exception of resource constrained systems such as sun2, sun3, > zaurus) ? I think enabling veriexec used to incur some overhead for exec or file copy, but I don't recall the details. Jaromir
Re: Removing PF
Le lun. 1 avr. 2019 à 14:32, Stephen Borrill a écrit : >Your two statements are mutually inconsistent: > 1) No-one is maintaining ipf or pf > and > 2) If the effort had been on one firewall instead of three, the one chosen > would be more functional. IMO the two are consistent - if we had just one firewall, it would be clear which one to contribute to, and clear whether a feature is really missing or broken. As it is, nobody contributes anything because they don't know which of the firewalls to contribute to. In any case, let's return to a constructive discussion and see what needs to be done. NPF-only is the future, so let's get to that future. In the past discussion, I've seen people mention only two features missing in NPF and present in PF:

1. ftp-proxy support - Maxime volunteered to implement this in NPF; I'm sure help there would be welcome
2. group support for config (mentioned by Manuel) - anyone feel like taking this on? It might be enough to have some kind of config preprocessor initially, if that's easier to do.

Is there anything else? Jaromir
Re: Regarding the ULTRIX and OSF1 compats
Le sam. 16 mars 2019 à 16:12, Robert Elz a écrit : > Sorry, I must have missed that. All I ever seem to see is that xxx is > unmaintained and full of unspecified bugs, and that obviously no-one cares, > and so we should delete it. That's not an argument for anything. You suggest that there are people who care about the compat, who are not on the mailing list. How do you suggest we reach them? > Yes, this makes it a long slow, and tedious process (involving a bunch of > additional work) to delete something - but so it should be. Someone spent > a lot of time making the thing you want to discard work in the first place. Max did everything reasonable to gather feedback about this, waiting a long time for feedback to come. Now it's time to proceed with the removal. The stuff will stay in the Attic, so if somebody steps in to actually maintain it, they would be able to revive it just as easily as if it had simply been unlinked from the build. This is me speaking as a person who put a lot of effort into making compat_linux work, and also into trying to keep the compat code updated. I know what it means to update esoteric code down in the compat layers. Most of this is simply not testable by reasonable means, which unfortunately means it needs to go. Jaromir
Re: Reserve device major numbers for pkgsrc
Perhaps do not pre-reserve anything and simply reserve consecutive numbers as the need arises? Le sam. 16 févr. 2019 à 23:55, Kamil Rytarowski a écrit : > > On 16.02.2019 23:40, Jaromír Doleček wrote: > > Le sam. 16 févr. 2019 à 23:24, Kamil Rytarowski a écrit : > >> > >> We started to build and ship kernel modules through pkgsrc. > >> > >> I would like to reserve 3 major numbers for the HAXM case from the base > >> pool of devices and prevent potential future conflicts and compatibility > >> breakage due to picking up another major number in a 3rd party software. > >> > >> Where and how to reserve these major numbers? > >> > >> I expect that we will start growing software that uses our kernel module > >> framework. (I have got few candidates to get ported in future) > > > > See sys/conf/majors > > > > Jaromir > > > > Proposed patch: > > http://netbsd.org/~kamil/patch-00086-pkgsrc-majors.txt > > I have no opinion on numbers to pick, please suggest the right ranges > and I will follow it. >
Re: Reserve device major numbers for pkgsrc
Le sam. 16 févr. 2019 à 23:24, Kamil Rytarowski a écrit : > > We started to build and ship kernel modules through pkgsrc. > > I would like to reserve 3 major numbers for the HAXM case from the base > pool of devices and prevent potential future conflicts and compatibility > breakage due to picking up another major number in a 3rd party software. > > Where and how to reserve these major numbers? > > I expect that we will start growing software that uses our kernel module > framework. (I have got few candidates to get ported in future) See sys/conf/majors Jaromir
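For context, entries in sys/conf/majors assign a device name, a char (and optionally block) major number, and a config attribute; a reservation would just be more lines there. The format below is illustrative only - the major number 350 and the use of "haxm" as an attribute are made up for this sketch, not an actual assignment:

```text
device-major	wd		char 3   block 0	wd
# hypothetical pkgsrc reservation, numbers chosen for illustration only:
device-major	haxm		char 350		haxm
```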
Re: scsipi: physio split the request
> On Dec 27, 12:29pm, buh...@nfbcal.org (Brian Buhrow) wrote: > -- Subject: Re: scsipi: physio split the request > > | hello. Just out of curiosity, why did the tls-maxphys branch never > | get merged with head once the work was done or mostly done? Simply, nobody has finished it up yet. Le jeu. 27 déc. 2018 à 22:07, Christos Zoulas a écrit : > mostly done... I did the last tls-maxphys sync. It will need at least one resync, since more stuff has moved around (i.e. wd(4) was converted to dksubr since then). I want to eliminate some duplicated code in the PCI-IDE drivers before the merge. Other than that, I don't see any blockers for the merge. Of course, it requires a proper round of testing before it can really be considered merge-worthy. As others noted, the performance benefit of tls-maxphys is likely to be small. However, it removes artificial limits in the stack, unlocking new opportunities for further development, so it's both necessary and good to do nevertheless. Jaromir
Re: scsipi: physio split the request
Le jeu. 27 déc. 2018 à 15:41, Emmanuel Dreyfus a écrit : > > On Thu, Dec 27, 2018 at 02:33:28PM +, Christos Zoulas wrote: > > I think you need resurrect the tls-maxphys branch... It was close to working > > IIRC. > > What happens if I just #define MAXPHYS (1024*1204*1024) ? Several drivers use MAXPHYS to size their DMA memory allocations (e.g. nvme); they usually try to allocate MAXPHYS * (max number of scatter-gather vectors). Such allocations will either fail, or tie up huge amounts of RAM. MAXPHYS also happens to be the hard I/O size limit for ISA, and for ATA/IDE drives not supporting LBA48. Even for hardware which does support bigger I/O (stuff behind PCI, SCSI, ATA drives with LBA48), the drivers might be buggy and not cope. Nothing good comes of this; unfortunately the lower layers need to be fixed first. Jaromir
Re: svr4, again
Le mar. 18 déc. 2018 à 13:16, Maxime Villard a écrit : > It is clear that COMPAT_SVR4 is completely buggy, but to be clear on the > use of the code: +1 to removal for COMPAT_SVR4; there is always the Attic. I remember doing some mechanical changes in the area in the past, and encountering things which were visibly broken just from casual browsing. I did not fix them because I had no SVR4 binary I could run to test. It's simply not useful any more. Jaromir
Re: [uvm_hotplug] Fixing the build of tests
Le sam. 15 déc. 2018 à 15:27, Jason Thorpe a écrit : > We can buy ourselves some time on aarch64 by using a different page size, > yes? iOS, for example, uses 16KiB VM pages (backed by 4KiB or 16KiB physical > pages, depending on the specific CPU type). I don't think the ARM eabi has > the same section alignment issue that made the 4K/8K page size thing on m68k > such a pain. > Bigger physical pages also reduce the size of page tables (a big deal if it e.g. allows reducing a 4-level mapping to 3 levels, but even "just" a factor of two or four is very good) and hence speed up page faults, and improve TLB utilization. It's a really good thing overall and something which is very desirable on monster memory machines. Jaromir
Re: pci_intr_alloc() vs pci_intr_establish() - retry type?
Le mer. 5 déc. 2018 à 08:39, Masanobu SAITOH a écrit : > I suspect Serial ATA AHCI 1.2.1 specification page 111 has the hint. > > Figure 23: Port/CCC and MSI Message Mapping, Example 1 > Figure 24: Port and MSI Message Mapping, Example 2 > > I suspect MSI-X also assume this layout. pci_msix_alloc_map() might > be used. Very likely we need to disable the MSICAP.MME bits to force single-MSI operation if MSI happens to have more vectors. There is no such bit for MSI-X. Let's work on this off-list; I have some ideas for testing the interrupt mapping, and for actually adding support for multiple vectors. Since in your case the thing only fails for IDENTIFY, it looks like the registers mapped via the regular BAR work and just the interrupt routing is bad. Performance-wise it doesn't buy much besides avoiding one register read per interrupt, because the whole ATA subsystem is not MPSAFE and hence the interrupt handlers are serialized via KERNEL_LOCK() anyway. > My SuperMicro A2SDi-H-TP4F's disk is connected to not channel 0 > but channel 6. I also verified that GHC.MRSM is 0. GHC.MRSM is only set if the hardware has fewer MSI/MSI-X vectors than AHCI ports, and falls back to single MSI because of this. So it's actually normal for this to be 0 - it'd be quite strange for hardware to have fewer MSI/MSI-X vectors than supported ports. Jaromir
Re: pci_intr_alloc() vs pci_intr_establish() - retry type?
I've now disabled MSI-X for ahcisata(4); I still need to figure out what needs to be set up there. Le lun. 3 déc. 2018 à 22:41, Jared McNeill a écrit : > IIUC we don't actually have confirmation that ThunderX AHCI works yet.. > Nick is having issues with his board. Okay - I misunderstood, I thought it was confirmed working. The BAR used for Cavium is slightly suspicious. The AHCI spec (v1.3.1, section 2.4 Serial ATA Capability) mentions the 0x10 BAR as one of the 'legacy' access methods. It specifically talks only about 0x24 as the proper BAR for AHCI; it doesn't mention an option to have it elsewhere. Jaromir
Re: pci_intr_alloc() vs pci_intr_establish() - retry type?
Le lun. 3 déc. 2018 à 12:09, Masanobu SAITOH a écrit : > C3000's AHCI has multi-vector MSI-X table and it doesn't work since > Nobember 20th... Can you try whether by chance this code, adapted from the nvme(4) changes, changes anything on your system? http://www.netbsd.org/~jdolecek/ahcisata_msixoff.diff ahcisata(4) works for me on two different computers in MSI mode. I don't have any ahcisata device which supports MSI-X, but it is working on Cavium, which has MSI-X. Perhaps some different BAR weirdness than Cavium, or simply the boards using MSI-X should use the "cavium" BAR? So another thing to try would be to make ahci_pci_abar() return AHCI_PCI_ABAR_CAVIUM always and see if it changes things. Also you can try rev 1.44 of ahcisata_pci.c with AHCISATA_DISABLE_MSIX to see if it works when forced to MSI only. I wonder whether there is something we still need to adjust during general initialization; I see this on my hardware:

re(4) - works when attached via either MSI or MSI-X
nvme(4) - works only when attached via MSI-X (regardless of PCI_CAP_MSIX), doesn't work when forced to MSI
ahcisata(4) - works in MSI mode on my system, and in MSI-X for skrll@, but MSI-X doesn't work for you

Jaromir
Re: pci_intr_alloc() vs pci_intr_establish() - retry type?
Le mer. 28 nov. 2018 à 00:42, Taylor R Campbell a écrit : > > Date: Tue, 27 Nov 2018 21:14:04 + > > From: Robert Swindells > > > > It looks to me that drivers try MSI and/or MSIX first based on the > > device type not on whether the host controller can handle it. pci_intr_alloc() checks what the device claims to support down the call stack, reading the PCI config space. I assume there is some negotiation between the PCI bus and the device; I hope a device can't claim to support e.g. MSI-X if the bus doesn't. > Can we make the same API work for MSI, MSI-X, and INTx interrupts, so > that we don't have to remember two or three different APIs? Maybe > just pass a subset of PCI_INTR_INTx|PCI_INTR_MSI|PCI_INTR_MSIX to say > which ones to try or something? Yes, the goal would be to have some shared API to do this fallback logic. But I need to understand whether it's actually necessary - it would be much simpler if we could rely on establish working regardless of the type of interrupt. Especially in the 1/1/1 case, I don't see a path in the code where the MSI/MSI-X establish could fail but e.g. the INTx one wouldn't fail too. For sure, the MD allocation can fail due to some memory shortage, or because we run out of interrupts supported by the port. But I don't see how establishing 1 MSI/MSI-X interrupt could fail while establishing 1 INTx interrupt would then succeed. So perhaps it's okay to just try to establish once and remove all those fallbacks - unless the driver does something fancy, like trying to use ncpu MSI interrupts and falling back to 1 INTx interrupt. I'd be in favour of simplifying the code to do just that in xhci(4) and ahcisata(4) - opinions? Jaromir
pci_intr_alloc() vs pci_intr_establish() - retry type?
I see several drivers (e.g. xhci(4), ahcisata(4), bge(4), nvme(4)) retry the pci_intr_alloc()+pci_intr_establish() sequence with 'lower' types when pci_intr_establish() fails. Is this a real case - can it actually happen that pci_intr_alloc() returns interrupt handles, but pci_intr_establish() then fails for them? If it's not a real case, it would be nice to remove all that boilerplate code. Jaromir
Re: 8.0 performance issue when running build.sh?
2018-08-09 19:40 GMT+02:00 Thor Lancelot Simon :
> On Thu, Aug 09, 2018 at 10:10:07AM +0200, Martin Husemann wrote:
>> 100.00  2054  14.18  kernel_lock
>>  47.43   846   6.72  kernel_lock  fileassoc_file_delete+20
>>  23.73   188   3.36  kernel_lock  intr_biglock_wrapper+16
>>  16.01   203   2.27  kernel_lock  scsipi_adapter_request+63
> Actually, I wonder if we could kill off the time spent by fileassoc. Is
> it still used only by veriexec? We can easily option that out of the build
> box kernels.

Or even better, make it less heavy? It's not really intuitive that you could improve filesystem performance by removing this obscure component.

Jaromir
Re: panic: biodone2 already
2018-08-07 18:42 GMT+02:00 Emmanuel Dreyfus : > kern/53506 Thanks. Could you please try a -current kernel for the DOMU and see if it crashes the same way? If possible a DOMU kernel from the daily builds, to rule out a local compiler issue. There are not really many differences in the xbd/evtchn code itself between 8.0 and -current, however. There was some interrupt code reorganization which might have affected this, but that happened after netbsd-8 was branched. xbd is not MPSAFE, so it shouldn't even be a race due to parallel processing on different CPUs. Maybe it would be useful to check whether the problem still happens when you assign just a single CPU to the DOMU. Jaromir
Re: panic: biodone2 already
This is always a bug - the driver processed the same buf twice. It can do harm: if the buf has been reused for some other I/O, the system can fail to store data, or claim to have read data when it didn't. Can you give a full backtrace? Jaromir

2018-08-06 17:56 GMT+02:00 Emmanuel Dreyfus :
> Hello
>
> I have a Xen domU that now frequently panics with "biodone2 already". I
> suspect it started after upgrading to NetBSD 8.0.
>
> The offending code is in src/sys/kern/vfs_bio.c (see below). I wonder if
> we could not just release the mutex and exit in that case. Can it make
> some harm?
>
> static void
> biodone2(buf_t *bp)
> {
> 	void (*callout)(buf_t *);
>
> 	SDT_PROBE1(io, kernel, ,done, bp);
>
> 	BIOHIST_FUNC(__func__);
> 	BIOHIST_CALLARGS(biohist, "bp=%#jx", (uintptr_t)bp, 0, 0, 0);
>
> 	mutex_enter(bp->b_objlock);
> 	/* Note that the transfer is done. */
> 	if (ISSET(bp->b_oflags, BO_DONE))
> 		panic("biodone2 already");
>
> Cou
>
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org
Re: Removing dbregs
2018-07-13 22:54 GMT+02:00 Kamil Rytarowski : > I disagree with disabling it. The code is not broken, it's covered by > tests, it's in use. This looks like a perfect candidate for an optional (default off) feature. It is useless and dangerous for general-purpose use by virtue of being root-only, but useful for specialized use. Honestly, I don't see any reason to have this on by default. Jaromir
Re: CVS commit: src/sys/arch/x86/x86
Le dim. 8 juil. 2018 à 15:29, Kamil Rytarowski a écrit : > I've introduced the change to mpbios.c as it was small, selfcontained > and without the need to decorate the whole function. Am I reading the code wrong, or did you actually introduce a bug in mpbios.c? Shouldn't this:

memtop |= (uint16_t)mpbios_page[0x414] << 8;

actually be << 16 to keep the same semantics? Jaromir
Re: interrupt cleanup #1
ne 24. 6. 2018 v 12:13 odesílatel Cherry G.Mathew napsal: > Sometime tomorrow I'll send a first re-org patch (no functional changes) > followed by the actual meat later in the week. I have some small changes for x86/intr.c and xen/evtchn.c too. In intr.c it's pretty much just some #ifdef shuffling to make 'intrctl list' work on Xen; in evtchn.c there is some new code at the end. http://www.netbsd.org/~jdolecek/xen_intrctl_list.diff Would you mind if I pushed this before your changes, so that it's not interleaved with your intr refactoring? Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
2018-05-12 21:24 GMT+02:00 Chuck Silvers : > the problem with keeping the pages locked (ie. PG_BUSY) while accessing > the user address space is that it can lead to deadlock if the page Meanwhile I've tested this scenario - I wrote a test program which mmap(2)s a file and then calls write(2) on the same file offset, using the mapped memory as the source buffer. With the new code this causes the process to get SIGSEGV, where the old code just made write(2) return EINVAL. I think this happens because fault resolution returns EINVAL when the page has PG_BUSY. Funny thing is that on Mac OS X that code actually does end up in deadlock, with an unkillable process :D Maybe the SIGSEGV is acceptable behaviour? I don't think there would be a valid reason to do an mmap(2)/write(2) combination like that. Jaromir
Re: Revisiting uvm_loan() for 'direct' write pipes
2018-05-25 23:19 GMT+02:00 Jason Thorpe : > BTW, I was thinking about this, and I think you need to also handle the case > where you try the ubc_uiomove_direct() case, but then *fall back* onto the > non-direct case if some magic error is returned. > ... > These cache flushes could be potentially very expensive. I don't think doing this would actually make sense performance-wise - the decision to fall back would have to be made within the pmap_direct_process() function, which runs only after all the preparation steps have already been done. If we abort the operation there and call the non-direct code path, it could be even slower than the cache flush. For UBC the fallback would not be too terrible to code and debug, but it would be very messy e.g. for the uvm_loan() in sys_pipe.c. I don't know what to do for such archs; I guess it would need to be evaluated whether the direct map could be used there in a way which actually improves performance in the general case. 2018-05-26 16:52 GMT+02:00 Thor Lancelot Simon : > In this case, could we do better gathering the IPIs so the cost were > amortized over many pages? Yes, the IPIs could be gathered for this particular TLB flush. It seems that on my system it would be enough to gather at least 3-4 to match the speed. There is a limit to how far this could be scaled, though - right now we have only 16 TLB IPI slots, which would be enough for maybe one parallel pipe setup. I've noticed DragonFly BSD did some work on asynchronous TLB IPIs; maybe it would be possible to look there for inspiration. I do plan some tweaks for the TLB flush IPIs. I want to change the x86 pmap to not send TLB flush IPIs to idle processors, as was discussed earlier for kern/53124. This would reduce the performance impact of the IPIs when there are many idle processors. ACPI wakeup seems to take time linear in the number of CPUs - I think this is something we'd want to fix too.
That said, no matter which way we slice it, anything which triggers IPIs or TLB flushes inherently does not scale on MP systems. If uvm_loan() is to be used as a high-performance zero-copy optimization, it simply must avoid them. Anything which triggers an interrupt on all active processors for every processed page simply can't perform. Jaromir
Re: Revisiting uvm_loan() for 'direct' write pipes
2018-05-21 21:49 GMT+02:00 Jaromír Doleček : > It turned out uvm_loan() incurs most of the overhead. I'm still on my > way to figure what it is exactly which makes it so much slower than > uiomove(). I've now pinned the problem down to pmap_page_protect(..., VM_PROT_READ); that code does page-table manipulations and triggers synchronous IPIs. So basically the same problem as the UBC code in uvm_bio.c. If I comment out the pmap_page_protect() in uvm_loan.c, and hence do not change vm_page attributes, the uvm_loan() + direct map pipe variant manages about 13 GB/s, compared to about 12 GB/s for the regular pipe. An 8% speedup is not much, but as a bonus it removes all the KVA limits. Since it should scale well, it should be possible to lower the direct threshold, and to reduce the size of the fixed in-kernel pipe buffer to save kernel memory. So I'm actually thinking of changing uvm_loan() to not enforce R/O mappings and to leave page attributes unchanged. It would require the caller to deal with possible COW or PG_RDONLY if they need to do writes. In other words, allow using the 'loan' mechanics also for writes, and eventually use this for UBC writes too, to replace the global PG_BUSY lock there. WYT? Jaromir
Revisiting uvm_loan() for 'direct' write pipes
Hello, I've been playing a little with revisiting kern/sys_pipe.c to take advantage of the direct map, in order to avoid pmap_enter() et al. via the new uvm_direct_process() interface. Mostly because I want at least one other consumer of the interface before I consider it final, to make sure it covers the general use cases. I've managed to get it working. However, the loan mechanism is way slower than just copying the data via the intermediate 16KB pipe buffer - I get about 12 GB/s on the regular pipe, and about 4 GB/s on the 'direct' pipe, even when using huge input buffers like 1MB at a time. For the regular pipe, CPU usage is about the same for producer and consumer, around 50%; for the 'direct' pipe the producer (the one who does the loan) is at about 80% and the consumer at 5%. It turns out uvm_loan() incurs most of the overhead. I'm still trying to figure out what exactly makes it so much slower than uiomove(). Any hints on which parts I should concentrate on? Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
2018-05-12 21:24 GMT+02:00 Chuck Silvers : > the problem with keeping the pages locked (ie. PG_BUSY) while accessing > the user address space is that it can lead to deadlock if the page > we are accessing via the kernel mapping is the same page that we are > accessing via the user mapping. The pages marked PG_BUSY for the UBC transfer are part of the page cache. I sincerely hope it's impossible to have any user mapping for them. Is there any other scenario for the deadlock in this context? That said, the PG_BUSY is at least a scalability problem, since it restricts the copy operation within the UBC window to only one at a time. The old code holds PG_BUSY only during the fault - with the exception of the UBC_FAULTBUSY branch, which still keeps the lock during the whole operation. Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
2018-05-12 7:07 GMT+02:00 Michael van Elst : > On Fri, May 11, 2018 at 05:28:18PM +0200, Jaromír Doleček wrote: >> I've implemented a tweak to read-ahead code to skip the full >> read-ahead if last page of the range is already in cache, this >> improved things a lot: > > That looks a bit like a hack. Can't you just call uvm_pagelookup() > to determine if the last page is cached? Or if that's too UVM internal, > call uvm_findpages() with UFP_NOALLOC. I thought it would be less of an abstraction violation when done in getpages. But actually using uvm_pagelookup() seems to be fine from that standpoint. Furthermore, it avoids some page flag manipulation, and makes the change much more localized, to only uvm_readahead.c. So I'll go with that. Thanks. Jaromir
Removing uvm_emap (aka ephemeral mapping interface)?
Hello, I suggest removing the emap code. Unfortunately it turned out to be too tricky to implement and use correctly. It's not currently used anywhere; the only use in sys_pipe.c was ifdeffed out due to stability problems on 2009-08-31. While the commit states the stability problems were present also on sparc64, which used the emap fallback, I think the uvm_emap.c fallback pmap_kenter_pa()/pmap_kremove() is actually not safe either. Actual MD support was only ever implemented on x86. The emap support complicates the already quite unwieldy x86 pmap, and I think it makes sense to drop this unused code to lower the maintenance overhead. Dropping it would certainly simplify my other work on PCID support for x86. Opinions? Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
I've modified the uvm_bio.c code to use the direct map for amd64. The change seems to be stable; I've also run 'build.sh tools' to test out the system, the build succeeded and I didn't observe any stability problems. The test was on a 6-core Intel CPU with HT on (i.e. "12" cores visible to the OS), NVMe disk, circa 2GB file read done via: dd if=foo of=/dev/null bs=64k msgfmt=human Thanks to the proper formatting, the speeds are now in proper gibibytes. 1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s 2. cached read: old: 2.2 GiB/s, new: 5.6 GiB/s Seems the 1.9 GiB/s is the device limit. The patch for this is at: http://www.netbsd.org/~jdolecek/uvm_bio_direct.diff During testing I noticed that the read-ahead code slows down the cached case quite a lot: no read-ahead: 1. non-cached read: old: 851 MiB/s, new: 1.1 GiB/s 2. cached read: old: 2.3 GiB/s, new: 6.8 GiB/s I've implemented a tweak to the read-ahead code to skip the full read-ahead if the last page of the range is already in cache; this improved things a lot: smarter read-ahead: 1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s (no change compared to old read-ahead) 2. cached read: old: 2.2 GiB/s, new: 6.6 GiB/s The patch for this is at: http://www.netbsd.org/~jdolecek/uvm_readahead_skip.diff For the direct map, I've implemented a new helper pmap function, which in turn calls a callback with the appropriate virtual address. This way the arch pmap is more free to choose an alternative implementation for this, e.g. without actually having a direct map, like sparc64. The code right now has some instrumentation for testing, but the general idea should be visible. Opinions? Particularly, I'm not quite sure whether it is safe to avoid triggering a write fault in ubc_uiomove(), and whether the new read-ahead heuristic is good enough for the general case. Jaromir 2018-04-19 22:39 GMT+02:00 Jaromír Doleček : > I've finally got my test rig setup, so was able to check the > performance difference when using emap. 
> > Good news is there is significant speedup on NVMe device, without > observing any bad side effects so far. > > I've setup a test file of 2G, smaller than RAM to fit all in cache, test was: > dd if=foo of=/dev/null bs=64k > > First read (not-cached): old: 1.7 GB/s, new: 2.1 GB/s > Cached read: old: 2.2 GB/s, new: 3.1 GB/s > > Reads from raw device were the same in both cases, around 1.7 GB/s > which is a bit bizarre. > > If we want to modify the uvm_bio.c code to optionally use direct map, > there is the problem with the fault technique it uses for read I/O. > The code doesn't actually enter the pages into the KVA window, it lets > uiomove() trigger the faults. Is there some advantage to this, why > is this better than just mapping those 1-2 pages before calling > uiomove()? > > Jaromir > > 2018-04-02 21:28 GMT+02:00 Jaromír Doleček : >> 2018-03-31 13:42 GMT+02:00 Jaromír Doleček : >>> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger : >>>> Yeah, that's what ephemeral mappings were supposed to be for. The other >>>> question is whether we can't just use the direct map for this on amd64 >>>> and similar platforms? >>> >>> Right, we could/should use emap. I hadn't realized emap is actually already >>> implemented. It's currently used for pipe for the loan/"direct" write. >>> >>> I don't know anything about emap though. Are there any known issues, >>> do you reckon it's ready to be used for general I/O handling? >> >> Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral >> mapping: >> >> http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff >> >> Seems to boot, no idea what else it will break. >> >> Looking at the state of usage though, the emap is only used for a disabled code >> path for sys_pipe and nowhere else. That code had several on-and-off passes >> for being enabled in 2009, and no further use since then. Doesn't give too >> much >> confidence. >> >> The only port actually having optimization for emap is x86. 
Since amd64 >> is also the only one supporting direct map, we are really at liberty to pick >> either one. I'd lean towards direct map, since that doesn't require >> adding/removing any mapping in pmap_kernel() at all. From looking on the >> code, >> I gather direct map is quite easy to implement for other archs like >> sparc64. I'd say >> significantly easier than adding the necessary emap hooks into MD pmaps. >> >> Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
I've finally got my test rig setup, so I was able to check the performance difference when using emap. Good news is there is significant speedup on NVMe device, without observing any bad side effects so far. I've setup a test file of 2G, smaller than RAM to fit all in cache; the test was: dd if=foo of=/dev/null bs=64k First read (not-cached): old: 1.7 GB/s, new: 2.1 GB/s Cached read: old: 2.2 GB/s, new: 3.1 GB/s Reads from the raw device were the same in both cases, around 1.7 GB/s, which is a bit bizarre. If we want to modify the uvm_bio.c code to optionally use the direct map, there is the problem with the fault technique it uses for read I/O. The code doesn't actually enter the pages into the KVA window, it lets uiomove() trigger the faults. Is there some advantage to this, why is this better than just mapping those 1-2 pages before calling uiomove()? Jaromir 2018-04-02 21:28 GMT+02:00 Jaromír Doleček : > 2018-03-31 13:42 GMT+02:00 Jaromír Doleček : >> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger : >>> Yeah, that's what ephemeral mappings were supposed to be for. The other >>> question is whether we can't just use the direct map for this on amd64 >>> and similar platforms? >> >> Right, we could/should use emap. I hadn't realized emap is actually already >> implemented. It's currently used for pipe for the loan/"direct" write. >> >> I don't know anything about emap though. Are there any known issues, >> do you reckon it's ready to be used for general I/O handling? > > Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral > mapping: > > http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff > > Seems to boot, no idea what else it will break. > > Looking at the state of usage though, the emap is only used for a disabled code > path for sys_pipe and nowhere else. That code had several on-and-off passes > for being enabled in 2009, and no further use since then. Doesn't give too > much > confidence. > > The only port actually having optimization for emap is x86. 
Since amd64 > is also the only one supporting direct map, we are really at liberty to pick > either one. I'd lean towards direct map, since that doesn't require > adding/removing any mapping in pmap_kernel() at all. From looking on the code, > I gather direct map is quite easy to implement for other archs like > sparc64. I'd say > significantly easier than adding the necessary emap hooks into MD pmaps. > > Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
2018-03-31 13:42 GMT+02:00 Jaromír Doleček : > 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger : >> Yeah, that's what ephemeral mappings were supposed to be for. The other >> question is whether we can't just use the direct map for this on amd64 >> and similar platforms? > > Right, we could/should use emap. I hadn't realized emap is actually already > implemented. It's currently used for pipe for the loan/"direct" write. > > I don't know anything about emap though. Are there any known issues, > do you reckon it's ready to be used for general I/O handling? Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral mapping: http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff Seems to boot, no idea what else it will break. Looking at the state of usage though, the emap is only used for a disabled code path for sys_pipe and nowhere else. That code had several on-and-off passes for being enabled in 2009, and no further use since then. Doesn't give too much confidence. The only port actually having an optimization for emap is x86. Since amd64 is also the only one supporting a direct map, we are really at liberty to pick either one. I'd lean towards the direct map, since that doesn't require adding/removing any mapping in pmap_kernel() at all. From looking at the code, I gather a direct map is quite easy to implement for other archs like sparc64 - I'd say significantly easier than adding the necessary emap hooks into MD pmaps. Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger : > Yeah, that's what ephemeral mappings were supposed to be for. The other > question is whether we can't just use the direct map for this on amd64 > and similar platforms? Right, we could/should use emap. I hadn't realized emap is actually already implemented. It's currently used for pipe for the loan/"direct" write. I don't know anything about emap though. Are there any known issues, do you reckon it's ready to be used for general I/O handling? Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
2018-03-27 0:51 GMT+02:00 Michael van Elst : > UBC_WINSHIFT=17 (i.e. 2*MAXPHYS) works fine, but sure, for KVA limited > platforms you have to be careful. On the other hand, it's a temporary > mapping that only exists while doing I/O. How many concurrent I/O > operations would you expect on a 32bit system? I think fragmentation might be a bigger issue. It could happen that there just isn't a contiguous 128k KVA block any more, and then it would hang forever, like kern/47030. But maybe this is not a concern, e.g. i386 KVA uses the top 1G. > I'd also like to get another tuning to 8.0. Bumping RA_WINWIZE_MAX > from 8*MAXPHYS to 16*MAXPHYS helps iSCSI and maybe some nvme devices > too. Yes please :) For some reason I thought we had this already. I remember it being mentioned in the discussion around the time nvme(4) got integrated into the tree. Regarding your issue, I'd also like to turn attention back to ACPI idle too. The way I read dev/acpi/acpi_cpu_cstate.c, we try to go straight to C3 without any backoff. It may be good to make this more granular, maybe only going into the deeper states once enough idle time passes, for example longer than their wakeup latency. For example, wakeup from C1 might have 1 usec latency, from C2 150 usec, from C3 500 usec. Any opinion on that, any ACPI experts around? Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
2018-03-25 13:48 GMT+02:00 Michael van Elst : > For some reason it's not per block. There is a mechanism that moves > an 8kbyte window, independent of the block size. You can easily change > the window size (I'm currently experimenting with 128kbyte) by building > a kernel with e.g. options UBC_WINSHIFT=17. > > Independent of how the scaling problem will be handled, I suggest to > increase that window to at least MAXPHYS. I think the file mapping window doesn't really need to be tied to MAXPHYS. But values over MAXPHYS will probably currently [*] not give further significant speedup for filesystem I/O, unless data is being read from cache. One thing to note is that this size is used for all file mapping windows, so too big a value could cause KVA shortages on 32bit platforms. I think the original UBC_WINSHIFT tried to balance the most usual file size against overall KVA use in the kernel pmap, similar to how e.g. MAXSLP works. Maybe bump it to a higher value on 64-bit archs only? If the bump works, I think we should pull it up to 8.0 too, as this is probably the least likely to break anything out of the proposed options. Did the experiment work? Jaromir [*] I hope to get tls-maxphys integrated for 9.0; by then maybe we need to revisit this. It would also be nice to revisit the need to map the memory into the kernel pmap in order to do the I/O.
Fixing excessive shootdowns during FFS read (kern/53124) on x86
Hello, I'd like to gather some feedback on how to best tackle kern/53124. The problem there is that FFS triggers a pathological case. The I/O transfer maps and then unmaps each block in the kernel pmap, so that the data can be copied into user memory. This triggers TLB shootdown IPIs for each FS block, sent to all CPUs which happen to be idle, or otherwise running on the kernel pmap. On systems with many idle CPUs these TLB shootdowns cause a lot of synchronization overhead. I see three possible ways to fix this particular case: 1. make it possible to lazily invalidate the TLB also for the kernel pmap, i.e. make pmap_deactivate()/pmap_reactivate() work with the kernel pmap - this avoids sending the TLB shootdown IPIs to idle CPUs 2. make the idle lwp run in its own pmap and address space - a variant of #1, avoids changing the invariant about the kernel map being always loaded 3. change the UVM code to not do this mapping via the kernel map, so pmap_update() and the IPIs are not triggered in the first place I reckon #2 would add a lot of overhead to the idle routine; it would require at least an address space switch. This address space switch might be fairly cheap on architectures with address space IDs (avoiding TLB flushes), but still much more than what the idle entry/exit does now. Variants of the problems with #3 were discussed on and off over the years as I recall, but there is no resolution yet, and I'm not aware of anyone actively working on it. I understand this would be Hard, with nobody currently having the time and courage. This leaves #1 as a short-term practical solution. Does anyone foresee any particular problems with this, does it have a chance to fly? Any other ideas? Jaromir
Re: amd64: svs
2018-01-10 19:56 GMT+01:00 Maxime Villard : > That's what we do, too. Doing this is faster than "unmapping the kernel pages > on return to user space". > > Switching the address space costs one "movq %rax,%cr3". Yes, but an address space switch also causes an implicit TLB/paging structure cache flush when not using something like PCID, doesn't it? >> Do you reckon SVS has potential of having lower overall overhead than >> separate address spaces + PCID? Alternatively, do you think SVS would benefit >> if we added PCID support to x86 pmap? > > Yes, it would benefit if we added PCID. Okay, I'll look into this. The feature seems pretty simple to use, though it will need some care to allow inactive processes to relinquish their PCID when there is a shortage. I'll try not to disturb your SVS work. > Probably, but there are still many CPUs without PCID out there, and for them > a blunt separation will unfortunately still be needed. Haswell and later have it. The latest benchmarks for Linux and Windows confirm the performance impact of their software mitigations is significantly higher on CPUs without PCID, so it's all the more important to have this too. Jaromir
Re: amd64: svs
2018-01-08 10:24 GMT+01:00 Maxime Villard : > > As far as SVS is concerned, it is not needed: each time an L4 slot is added > (pmap_get_ptp) or removed (pmap_free_ptp), SVS only applies the change in the > user page tables. > > The TLB is then flushed as usual: the slots that are stale in the pmap page > tables are also stale in the user page tables, so the flush has the same effect > regardless of whether the CPU is executing in userland or in the kernel. > > There are possible optimizations, like not sending TLB-flush IPIs to CPUs that > are known to have performed a kernel<->user transition between the time when a > page got removed from the page tables and the time we decided to flush the TLB > as a result. If I understand correctly, basically the main idea is that we keep the same address space, but simply unmap the kernel pages on return to user space. Linux went with completely separate address spaces for kernel and user space on the affected Intel CPUs. The address space switch can be made cheaper when the CPU has PCID, which Linux already supports. Do you reckon SVS has the potential of having lower overall overhead than separate address spaces + PCID? Alternatively, do you think SVS would benefit if we added PCID support to the x86 pmap? Since we already have support for the Alpha ASN, which is a similar concept to the x86 PCID, perhaps it would not be overly complex to implement the PCID support. Jaromir
Re: amd64: svs
BTW Maxime, I've updated https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention to note your work on NetBSD support for SMAP and SMEP. Would it be feasible to get SMAP support pulled up to netbsd-8 by chance? The AMD Epyc processors support SMEP/SMAP as well, hopefully this will eventually trickle down to their consumer lines like Ryzen too. I really hope this huge security cpu bug by Intel for Meltdown will help propel AMD into being a viable competitor. Jaromir 2018-01-07 11:05 GMT+01:00 Maxime Villard : > Le 06/01/2018 à 16:03, cl...@csel.org a écrit : > >> Runtime detection and configuration is desired as the >> Linux did, rather than fixed compile-time option. >> > > It would be nice, but it is more complicated than it seems. We would have > to hotpatch the entry points, and to do that we need a way to hotpatch > macros, > and right now we don't have that. It would also be useful for SMAP. >
Re: increase softint_bytes
softint_establish() already fails if there is no space, and prints a WARNING in this case. The caller however needs to check for the failure, which ixgbe(4) mostly doesn't. I'm not sure if that is really the root cause; the panic seems to actually be in a different area, and I didn't see the warning in the dmesg. Making the softint size fully dynamic is unfortunately not straightforward - it would need to reallocate the softint memory on all CPUs in a way that any serviced softints would continue working during this time. It would be simple however to adjust it based on a percentage of physical memory, instead of a fixed value. If I count correctly, with the current 8192 bytes the system supports some 100 softints, which seems quite low - more advanced hardware and drivers usually use a queue and softint per CPU, so it can quickly run out on a system with e.g. 32 cores. I agree that the sysctl seems somewhat unnecessary, since the limit is only relevant during boot anyway, so it's not very helpful. Jaromir 2017-11-16 21:05 GMT+01:00 matthew green : > Masanobu SAITOH writes: > > Hi, all. > > > > Some device drivers now allocate a lot of softints. > > See: > > > > http://mail-index.netbsd.org/current-users/2017/11/09/msg032581.html > > > > To avoid this panic, I wrote the following patch: > > > > http://www.netbsd.org/~msaitoh/softint-20171116-0.dif > > > > Summary: > > > > - Increase the default size from 8192 bytes to 32768 bytes. > > - Add new option SOFTINT_BYTES to change the value. > > - Add two new read-only sysctls kern.softint.{max,count} > > > > Any comment? > > can't this be fixed by making it dynamic? ie, fix it properly so > that it never has this problem again -- if in 10 years we're adding > 5x as many softints as today, we'll hit it again. > > i also don't see why the crash can't be fixed to become an error. > the code tries to avoid adding more than it can, but something must > be wrong for the crash to occur anyway. 
ie, at worst we'd fix it > to return an error instead of crashing. > > if that is done, the sysctl exports seem unnecessary as well. > > thanks. > > > .mrg. >
Re: ext3 support
I share pretty much the same sentiment. ext3/ext4 journalling support is not very useful, as it's not likely to be used for anything more critical than sharing files between Linux and NetBSD. An ext3 journal would have to be a complete reimplementation - NetBSD wapbl is not usable nor sufficient for ext3 needs. Practically speaking it would be more useful to finish support for the other miscellaneous features, like extents, dirhash, or extended attributes. You'd be more than welcome to help there. Jaromir 2017-10-29 18:39 GMT+01:00 Christos Zoulas : > In article namprd13.prod.outlook.com>, > Abhirup Das wrote: > >Hi, > > > >I just came across the ext3 implementation project listed on the netbsd > >website, https://wiki.netbsd.org/projects/project/ext3fs/ > >and I would like to take the time to implement this. But I am not too > >sure on how to start this project, I have read up on journaling and I > >understand the added work that journaling needs to perform. > >However I am sure ext3 needs more than just journaling, but I haven't > >come across a specifications list for ext3 yet. Any advice on how to > >start this project would be greatly appreciated. > > So this is a bit of a complicated issue. I am not sure if adding journaling > support is worth it (compared to the amount of time it will take to > implement it) because you can always mount the filesystem on linux, > recover from the journal and then use it on NetBSD. OTOH there has > been some ext3/ext4 work done as part of GSoC 2016... > > https://summerofcode.withgoogle.com/archive/2016/projects/5183714946449408/ > > Finishing the RW support for it might be a more useful endeavor. > > christos > >
Re: Deadlock on fragmented memory?
Thanks for doing this. How difficult do you gather it would be to convert the execargs to use 4K pages rather than contiguous KVA - are there any particular gotchas? I would be interested to look at this - it sounds like a nice high-level kernel project for me, to make a break from the ATA oddities. Jaromir 2017-10-23 15:25 GMT+02:00 Manuel Bouyer : > On Mon, Oct 23, 2017 at 03:36:53AM +, Taylor R Campbell wrote: > > I should maybe also signal when pool_grow is done so it doesn't wait > > forever, and/or I should maybe just go to bed now. > > > [patch stripped] > > > I applied this patch to a netbsd-7 tree (along with the execargs one), > and I have a Xen i386 domU doing a bulk build now. It's still running > pbulk-scan so it's way too soon to see if it helps, but at least the > system boots so it's not worse than before :) > > -- > Manuel Bouyer > NetBSD: 26 ans d'experience feront toujours la difference > -- >
Removal of 'wd* at umass?' driver?
Hello, The code for this umass attachment (dev/usb/umass_isdata.c) doesn't use the atabus, and sidesteps the normal atabus processing. Today I discovered that after the NCQ branch was merged, some kernels which included dev/usb/usbdevices.config but no atabus failed to compile due to missing symbols, and I added a bandaid for that. As part of the NCQ branch, I made an effort to keep the umass_isdata.c code compiling, but had no way to test whether it actually still works. In fact, I have no idea if the code worked before I started on NCQ. Nobody responded either when I asked for testers for this particular device half a year ago. As the hardware is relatively rare, it's actually quite difficult to keep maintaining this while working on more advanced ATA features. I think it would be fair to simply drop it from the tree. If some brave soul wants to revive it, it would be available in the history. Opinions? Jaromir
Re: Snapshot performances
The WAPBL resource exhaustion caught my attention. Does this still happen with netbsd-8? I've fixed something like this in WAPBL a while ago, and the fixes are on the release branch. Jaromir 2017-09-26 14:52 GMT+02:00 Edgar Fuß : > > How is an internal snapshot defined? > I'm not completely sure. Ask hannken@. > > > The term seems to only exist in dump(8). > And fsck_ffs(8). > > > Is it a snapshot for which the backing store is on the same disk > > the snapshot is done? > I suppose. hannken@ should know. > > My experience is that I regularily take external snapshots (using > fssconfig(8)) > while doing TSM backups, and this works fine even for multi-TB filesystems. > However, taking an internal snapshot (fsck -X) on that filesystem starved > the > machine, eventually paniced it (some WAPBL resource exhaustion) and the log > replay on reboot took something like half an hour. On much smaller > filesystems, > fsck -X works fine. >
Re: Proposal: Disable autoload of compat_xyz modules
I like that all the arch-specific code is under sys/arch, and not randomly spread around the tree, i.e. I prefer to keep the compat things under sys/arch. For sure, the same argument could be used in the opposite direction - that it would be neater to have all the compat code together. But IMO the arch-specific bits are more tied to the specific platform, so it makes sense to have them together with their native counterparts - like e.g. the signal trampoline code. Jaromir 2017-08-01 13:12 GMT+02:00 Maxime Villard : > Yes, that's the right thing to do. > > I haven't rechecked, but as far as I remember, there will be some issues with > the compat layers that have several #ifdefs deep into the machdep code. > > Apart from that, several files could usefully be moved into the appropriate > compat_xyz directory, instead of polluting sys/arch/. For example: > > sys/arch/amd64/amd64/linux_*-> sys/compat/linux/arch/amd64/ > sys/arch/amd64/amd64/linux32_* -> sys/compat/linux32/arch/amd64/ > > This would provide a better encapsulation.
SATA PMP and ATAPI - apparently not working, is it supposed to?
Hi, during my testing on the jdolecek-ncq branch, I found that when I attach an ATAPI device via a port multiplier, the device is never actually detected. The behaviour I see is that atapibus is detected on the port, but not the actual device cd1, like this: ... atabus10 at siisata0 channel 0 ... atabus10: SATA port multiplier, 15 ports atabus10: vendor 0x197b, product 0x0325, revision 1.1, level 0 atabus10 PMP port 0: device present, speed: 1.5Gb/s atabus10 PMP port 1: device present, speed: 3.0Gb/s atabus10 PMP port 2: device present, speed: 3.0Gb/s atabus10 PMP port 3: device present, speed: 1.5Gb/s ... atapibus0 at atabus10: 1 targets ... It gets detected like this when attached directly: ... atabus11 at siisata0 channel 1 ... atapibus0 at atabus11: 1 targets cd1 at atapibus0 drive 0: cdrom removable cd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100) cd1(siisata0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100) (using DMA) This behaviour is the same on kernels from HEAD sources and my ncq branch. Is ATAPI behind a PMP supposed to work, or does this depend on the particular card maybe? I thought one of the use cases for a port multiplier is to attach all the 'slow' devices through it, to save direct ports for fast disks. Could this maybe be a driver issue? Do ATAPI devices connected via PMP work e.g. on ahcisata(4)? This particular port multiplier has an issue with port detection (satapmp claims it has 15 ports when it has only 5), so it's entirely possible it could have a problem with ATAPI pass-through also. Jaromir
Re: HEADS-UP: jdolecek-ncq branch merge imminent
I've made some improvements to the error handling, and fixed the PMP & dump issues I discovered on siisata/mvsata/ahcisata/pciide. I've also tested siisata(4) and ahcisata(4) with an ATAPI device to some extent, and didn't see any regression there so far. I do want to do some more ATAPI testing, and do the wd(4) chaos monkey to exercise and test the 'regular' error paths better, but other than that, I think the branch is in quite decent shape. I consider it ready for merge. Jonathan, do you have any specific concerns still? Particularly, do you think the remaining XXXs in siisata(4) prevent the branch from being merged? Jaromir 2017-06-13 22:34 GMT+02:00 Jaromír Doleček : > I'll see if I can do some chaos monkey thing for wd(4) to test the error > paths. > > I'm not quite sure what kind of special error recovery would be necessary > to add though. With NCQ, if there is any error, all pending commands are > cancelled by the device. There is nothing to recover by HBA. The commands > just need to be reissued after channel reset, which is done after a timeout > by wd(4). I think that eventual further error path tweaks really could be > done on HEAD already, and that it's not necessary to hold off making this > available for general public. > > I'd really prefer not to stretch this relatively simple branch to too > ambitious rework, like converting ata to thread context or doing base MP > work. > > I do want to test PMP though, thanks for reminder. In my opinion, once > that is confirmed working, there is no reason really to wait any further > with the merge. > > Is there anything on HEAD which is useful to sync to the branch? I'm > wondering whether it's useful to sync it to ncq branch right now at all. > > Jaromir > > 2017-06-08 10:45 GMT+02:00 Jonathan A. Kollasch : > >> On Wed, Jun 07, 2017 at 09:45:48PM +0200, Jaromír Doleček wrote: >> > Hello, >> > >> > I plan to merge the branch to HEAD very soon, likely over the weekend. 
>> > Eventual further fixes will be done on HEAD already, including mvsata(4) >> > restabilization, and potential switch of siisata(4) to support NCQ. >> > >> > The plan is to get this pulled up to netbsd-8 branch soon also, so that >> it >> > will be part of 8.0. >> > >> > Status: >> > - ahci(4) fully working with NCQ (confirmed with qemu, and real hw) >> > - piixide(4) continues working (no NCQ support of course) (confirmed in >> > qemu) >> > - siisata(4) continues working (without NCQ still) (confirmed with real >> hw) >> > - mvsata(4) not yet confirmed working after changes, mainly due the DMA >> not >> > really working on Marvell 88SX6042 which I have available - I have same >> > issue as kern/52126 >> > - other ide/sata drivers received mechanical changes, should continue >> > working as before >> > >> > Jaromir >> >> My limited testing has shown that the device error handling paths need >> work yet. It's unfortunately been a few weeks since I've tested things, >> so my memory isn't fresh with the details. The general issue is that >> ata(4) is doing error handling in interrupt context, and NCQ error >> handling is straining that assumption. >> >> At this point I advise that the branch could be synced with HEAD and >> further testing done before merge. I can not advise that this be >> pulled up to netbsd-8 before more extensive and comprehensive testing >> has been done. >> >> I've got siisata(4) NCQ stuff I should commit to the branch. >> >> Jonathan Kollasch >> > >
Re: HEADS-UP: jdolecek-ncq branch merge imminent
I'll see if I can do some chaos monkey thing for wd(4) to test the error paths. I'm not quite sure what kind of special error recovery would be necessary to add though. With NCQ, if there is any error, all pending commands are cancelled by the device. There is nothing for the HBA to recover. The commands just need to be reissued after channel reset, which is done after a timeout by wd(4). I think that eventual further error path tweaks really could be done on HEAD already, and that it's not necessary to hold off making this available for the general public. I'd really prefer not to stretch this relatively simple branch into too ambitious a rework, like converting ata to thread context or doing base MP work. I do want to test PMP though, thanks for the reminder. In my opinion, once that is confirmed working, there is no reason really to wait any further with the merge. Is there anything on HEAD which is useful to sync to the branch? I'm wondering whether it's useful to sync it to the ncq branch right now at all. Jaromir 2017-06-08 10:45 GMT+02:00 Jonathan A. Kollasch : > On Wed, Jun 07, 2017 at 09:45:48PM +0200, Jaromír Doleček wrote: > > Hello, > > > > I plan to merge the branch to HEAD very soon, likely over the weekend. > > Eventual further fixes will be done on HEAD already, including mvsata(4) > > restabilization, and potential switch of siisata(4) to support NCQ. > > > > The plan is to get this pulled up to netbsd-8 branch soon also, so that > it > > will be part of 8.0.
> > > > Status: > > - ahci(4) fully working with NCQ (confirmed with qemu, and real hw) > > - piixide(4) continues working (no NCQ support of course) (confirmed in > > qemu) > > - siisata(4) continues working (without NCQ still) (confirmed with real > hw) > > - mvsata(4) not yet confirmed working after changes, mainly due to the DMA > not > > really working on Marvell 88SX6042 which I have available - I have the same > > issue as kern/52126 > > - other ide/sata drivers received mechanical changes, should continue > > working as before > > > > Jaromir > > My limited testing has shown that the device error handling paths need > work yet. It's unfortunately been a few weeks since I've tested things, > so my memory isn't fresh with the details. The general issue is that > ata(4) is doing error handling in interrupt context, and NCQ error > handling is straining that assumption. > > At this point I advise that the branch could be synced with HEAD and > further testing done before merge. I cannot advise that this be > pulled up to netbsd-8 before more extensive and comprehensive testing > has been done. > > I've got siisata(4) NCQ stuff I should commit to the branch. > > Jonathan Kollasch >
HEADS-UP: jdolecek-ncq branch merge imminent
Hello, I plan to merge the branch to HEAD very soon, likely over the weekend. Eventual further fixes will be done on HEAD already, including mvsata(4) restabilization, and a potential switch of siisata(4) to support NCQ. The plan is to get this pulled up to the netbsd-8 branch soon also, so that it will be part of 8.0. Status: - ahci(4) fully working with NCQ (confirmed with qemu, and real hw) - piixide(4) continues working (no NCQ support of course) (confirmed in qemu) - siisata(4) continues working (without NCQ still) (confirmed with real hw) - mvsata(4) not yet confirmed working after changes, mainly due to the DMA not really working on the Marvell 88SX6042 which I have available - I have the same issue as kern/52126 - other ide/sata drivers received mechanical changes, should continue working as before Jaromir
Re: Modules and bus_dmamap_create() failing
The allocation size is a function of MAXPHYS / PAGE_SIZE * (queue size), which is e.g. for i386 and 128 queue size: 64k / 4k * 128 = 16 * 128 = 2048. It's MAXPHYS _divided_ by PAGE_SIZE. I've looked at virtio_alloc_vq() in a little bit of detail. The way I count how allocsize is computed, I arrive at something like 36104 bytes plus twice an alignment to a 4096 boundary. When I boot a kernel printing the allocsize, it prints 45056 allocsize for the vioscsi queue of size 128, and some 81920 bytes for vioif for a queue of 256. So in line with expectations. So is it really the case that physical memory would be so fragmented after boot to multiuser that you cannot get even 11 contiguous memory pages? It confuses me that this behaviour doesn't change when I increase the available physical memory, but I used relatively small increases (128MB/256MB/512MB IIRC). Maybe reducing the page cache size would help? Jaromir 2017-05-28 13:48 GMT+02:00 Jonathan A. Kollasch : > > On Sat, May 27, 2017 at 11:21:14AM +0200, Jaromír Doleček wrote: > > The driver allocates sizeable, but not excessive amounts of memory - mostly > > working with values like MAXPHYS / PAGE_SIZE * (queue size) for dmamaps/dma > > memory, where queue size is 128 or so. > > No, that's excessive for a contiguous chunk. You can only expect such > large contiguous allocations to work at boot. Physical memory becomes > too fragmented after the machine has been up and doing things. We > could really use a way to defrag physical memory, but doing that is not > easy. > > Jonathan Kollasch
Modules and bus_dmamap_create() failing
Hello, while working on vioscsi(4) improvements, I had quite tedious problems with the driver failing to allocate the dma map in bus_dmamap_create(). The symptoms were that the bus_dmamap_create()/bus_dmamem_alloc() call was very often, but not always, failing when the driver was loaded via module. For example, when the module was loaded immediately after a VM reboot, the dmamap/dmamem calls would succeed very often, but also not always. If it failed, it would consistently keep failing on further module unload/load cycles until another reboot. When the driver was compiled into the kernel and hence attached during boot, the bus dmamem/dmamap allocation never failed for me. So I mostly ended up rebooting kernels. The driver allocates sizeable, but not excessive amounts of memory - mostly working with values like MAXPHYS / PAGE_SIZE * (queue size) for dmamaps/dma memory, where queue size is 128 or so. The memory allocation failing was not dependent on the amount of available physical memory - it failed regardless of how much memory I allocated to the virtual host, and regardless of whether I used an i386 or x86_64 virtual machine. It also failed regardless of using ALLOCNOW or not, and I think also regardless of using WAITOK/NOWAIT. Is there something which differs in bus dmamem/dmamap allocation during boot and after, which would cause this behaviour? Could I possibly tweak some parameters to make bus_dmamem_alloc()/bus_dmamap_create() fail less for dynamically loaded modules? Failing that, is there maybe some other physical memory allocation limit I should observe in driver code and try to limit the allocation with? Jaromir
Initial ahci(4) NCQ support on jdolecek-ncq branch, test/review wanted
Hi, I've committed the initial NCQ support, it's available on the jdolecek-ncq branch (branch only over sys). It works under Parallels for now, and is ready for brave adventurers. There is sys/dev/ata/TODO.ncq which you might want to check, also. If people want to try it out and give feedback, I'd really appreciate it. I'll be busy for the next 2-3 weeks elsewhere, so probably won't have too much time for bugfixing during that time. After that I plan to start testing with real hardware, and look at the remaining bugs. One of the more annoying bugs right now is that AHCI NCQ transfers pretty frequently time out for me under Parallels. Maybe the interrupt handling is not quite right yet in the driver. I only got so far as to note that every transfer gets two interrupts, one with IS 0x8 and one with 0x0. The 0x8 is the Set Device Bits interrupt and is expected for NCQ, the other not so much. My understanding of the ATA and AHCI specs is not good enough yet, so if somebody could look there and sanity check, I would be really glad. Also, I've not modified the mvsata(4) and siisata(4) drivers yet to enable NCQ. While the drivers seem to be mostly prepared, I lack the hardware, so have no way to test. I would really appreciate it if folks with the hardware could check it out and possibly provide patches to enable the support there. Last but not least, there is simply no consideration in AHCI for PMP in the changes. I have no idea if it needs special support. I hope not, but it would be nice if our PMP experts would sanity check. Jaromir
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
2017-04-02 17:28 GMT+02:00 Thor Lancelot Simon : > However -- I believe for the 20-30% of SAS drives you mention as shipping > with WCE set, it should be possible to obtain nearly identical performance > and more safety by setting the Queue Algorithm Modifier bit in the control > mode page to 1. This allows the drive to arbitrarily reorder SIMPLE > writes so long as the precedence rules with HEAD and ORDERED commands > are respected. Is there any reason we wouldn't want to set QAM=1 by default for sd(4)? Seems like a pretty obvious performance improvement tweak. MODE SENSE has a flag to tell whether this field is settable, so it should be pretty safe to set it relying on that. I can make this change (maybe with a sysctl to control it?), unfortunately I don't have any hw to test whether it actually makes a measurable difference. Any volunteers to test a patch? Jaromir
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
2017-03-31 22:16 GMT+02:00 Thor Lancelot Simon : > It's not obvious, but in fact ORDERED gets set for writes > as a default, I believe -- in sd.c, I think? > > This confused me for some time when I last looked at it. It confused me also, that's why I changed the code a while back to be less confusing :) It used to be easy to draw the conclusion that we use ORDERED all the time, as that was the default if the tag flag was unset. But in fact sd(4) and cd(4) always set XS_CTL_SIMPLE_TAG explicitly, so it was actually always SIMPLE. scsipi_base.c revision 1.166 date: 2016-10-02 21:40:35 +0200; author: jdolecek; state: Exp; lines: +4 -4; commitid: iiAGFJk9looTyBoz; change scsipi_execute_xs() to default to simple tags for !XS_CTL_URGENT if not specified by caller; this is mostly for documentation purposes only, as sd(4) and cd(4) explicitly use simple tags already Jaromir
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
> The problem is that it does not always use SIMPLE and ORDERED tags in a > way that would facilitate the use of ORDERED tags to enforce barriers. Our scsipi layer actually never issues ORDERED tags right now as far as I can see, and there is currently no interface to get it set for an I/O. > Also, that we may not know enough about the behavior of our filesystems > in the real world to be 100% sure it's safe to set the other mode page > bits that allow the drive to arbitrarily reorder SIMPLE commands (which > under some conditions is necessary to match the performance of running > with WCE set). I lived under the assumption that SIMPLE tagged commands could be and are reordered by the controller/drive at will already, without setting any other flags. > When SCSI tagged queueing is used properly, it is not necessary to set WCE > to get good write performance, and doing so is in fact harmful, since it > allows the drive to return ORDERED commands as complete before any of the > data for those or prior commands have actually been committed to stable > storage. This was what I meant when I said "even ordered tags couldn't avoid the cache flushes". Using ORDERED tags doesn't provide on-media integrity when WCE is set. Now, it might be the case that on-media integrity is not the primary goal. Then the flush is only a write barrier, not an integrity measure. In that case yes, ORDERED does keep the semantics (e.g. earlier journal writes are written before later journal writes). It does make stuff much easier to code, too - simply mark the I/O as ORDERED and fire, no need to explicitly wait for completion, and e.g. journal locks can be dropped faster. I do think that it's important to concentrate on the case where WCE is on, since that is realistically what the majority of systems run with. Just for the record, I can see these practical problems with ORDERED: 1. only available on SCSI, so still needs fallback barrier logic for less awesome hw 2.
Windows and Linux used to always use SIMPLE tags and wait for completion; this suggests the avenue may have been explored already and found not interesting enough, or too buggy (remember scheduler activations?) 3. bufq processing needs special care for MPSAFE SCSI drivers, to prevent processing any further commands while an I/O with an ORDERED tag is being submitted to the controller. I still see my FUA effort as a more direct replacement of the cache flushes, for it keeps both the logical and on-media integrity. Also, it will benefit SATA disks too, once/if NCQ is integrated. I think that implementing barrier/ORDERED can be a parallel effort, similar to the maxphys branch. I don't think barriers will make FUA irrelevant, as it's still needed for systems with WCE on. Jaromir
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
Attached is the final version of the patch. It uses the MEDIA prefix for the flags, but keeps the FUA/DPO - i.e. the names are B_MEDIA_FUA, B_MEDIA_DPO. For wapbl it introduces a sysctl to use the feature, default is off for now. I plan to commit this later in the week or early next week, unless there are some serious objections. Jaromir 2017-03-05 23:22 GMT+01:00 Jaromír Doleček : > Here is an updated patch. It was updated to check for the FUA support > for SCSI, using the MODE SENSE device-specific flag. Code was tested > with QEMU emulated bha(4) and nvme. WAPBL code was updated to use the > flag. It keeps the flag naming for now. > > In the patch, WAPBL sets the flag for journal writes, and also for the > metadata buffer for bawrite() call after journal commit. > > There is possible layer violation for metadata write - b_flags are > supposed to be set by owner of the buffer. Not sure how strict we > want/need to be there - perhaps introduce another flag field? Also the > flag > probably needs to be unset in biodone hook, so that the code > guarantees the buffer in buffer cache doesn't accidentally keep it over > to another I/O. > > Jaromir ? dev/ic/TODO.nvme Index: sys/buf.h === RCS file: /cvsroot/src/sys/sys/buf.h,v retrieving revision 1.126 diff -u -p -r1.126 buf.h --- sys/buf.h 26 Dec 2016 23:12:33 - 1.126 +++ sys/buf.h 27 Mar 2017 22:31:22 - @@ -198,16 +198,21 @@ struct buf { #defineB_RAW 0x0008 /* Set by physio for raw transfers. */ #defineB_READ 0x0010 /* Read buffer. */ #defineB_DEVPRIVATE0x0200 /* Device driver private flag. */ +#defineB_MEDIA_FUA 0x0800 /* Set Force Unit Access for media. */ +#defineB_MEDIA_DPO 0x1000 /* Set Disable Page Out for media.
*/ #define BUF_FLAGBITS \ "\20\1AGE\3ASYNC\4BAD\5BUSY\10DELWRI" \ "\12DONE\13COWDONE\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \ -"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH" +"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH\34MEDIAFUA\35MEDIADPO" /* Avoid weird code due to B_WRITE being a "pseudo flag" */ #define BUF_ISREAD(bp) (((bp)->b_flags & B_READ) == B_READ) #define BUF_ISWRITE(bp)(((bp)->b_flags & B_READ) == B_WRITE) +/* Media flags, to be passed for nested I/O */ +#define B_MEDIA_FLAGS (B_MEDIA_FUA|B_MEDIA_DPO) + /* * This structure describes a clustered I/O. It is stored in the b_saveaddr * field of the buffer on which I/O is done. At I/O completion, cluster Index: sys/dkio.h === RCS file: /cvsroot/src/sys/sys/dkio.h,v retrieving revision 1.22 diff -u -p -r1.22 dkio.h --- sys/dkio.h 8 Dec 2015 20:36:15 - 1.22 +++ sys/dkio.h 27 Mar 2017 22:31:22 - @@ -85,6 +85,8 @@ #defineDKCACHE_RCHANGE 0x000100 /* read enable is changeable */ #defineDKCACHE_WCHANGE 0x000200 /* write enable is changeable */ #defineDKCACHE_SAVE0x01 /* cache parameters are savable/save them */ +#defineDKCACHE_FUA 0x02 /* Force Unit Access supported */ +#defineDKCACHE_DPO 0x04 /* Disable Page Out supported */ /* sync disk cache */ #defineDIOCCACHESYNC _IOW('d', 118, int) /* sync cache (force?) 
*/ Index: kern/vfs_bio.c === RCS file: /cvsroot/src/sys/kern/vfs_bio.c,v retrieving revision 1.271 diff -u -p -r1.271 vfs_bio.c --- kern/vfs_bio.c 21 Mar 2017 10:46:49 - 1.271 +++ kern/vfs_bio.c 27 Mar 2017 22:31:22 - @@ -2027,7 +2027,7 @@ nestiobuf_iodone(buf_t *bp) void nestiobuf_setup(buf_t *mbp, buf_t *bp, int offset, size_t size) { - const int b_read = mbp->b_flags & B_READ; + const int b_pass = mbp->b_flags & (B_READ|B_MEDIA_FLAGS); struct vnode *vp = mbp->b_vp; KASSERT(mbp->b_bcount >= offset + size); @@ -2035,14 +2035,14 @@ nestiobuf_setup(buf_t *mbp, buf_t *bp, i bp->b_dev = mbp->b_dev; bp->b_objlock = mbp->b_objlock; bp->b_cflags = BC_BUSY; - bp->b_flags = B_ASYNC | b_read; + bp->b_flags = B_ASYNC | b_pass; bp->b_iodone = nestiobuf_iodone; bp->b_data = (char *)mbp->b_data + offset; bp->b_resid = bp->b_bcount = size; bp->b_bufsize = bp->b_bcount; bp->b_private = mbp; BIO_COPYPRIO(bp, mbp); - if (!b_read && vp != NULL) { + if (BUF_ISWRITE(bp) && vp != NULL) { mutex_enter(vp->v_interlock); vp->v_numoutput++; mutex_exit(vp->v_interlock); Index: kern/vfs_wapbl.c =
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
2017-03-12 11:15 GMT+01:00 Edgar Fuß : > Some comments as I probably count as one of the larger WAPBL consumers (we > have ~150 employees' Home and Mail on NFS on FFS2+WAPBL on RAIDframe on SAS): I've not changed the code in RF to pass the cache flags, so the patch doesn't actually enable FUA there. Mainly because disks come and go, and I'm not aware of a mechanism to make WAPBL aware of such changes. It would be easy to add the passthrough if needed, reusing code from the DIOCCACHESYNC handling. Or maybe we could assume that everybody uses the same kind of disks in a raid volume :) Maybe we could handle this just by polling - i.e. adding some flag like DKCACHE_VOLATILE for use in RF, and the WAPBL code could implement a periodic callout checking for changes. The SAS drivers attach normal sd(4), right? My patch only updates sd(4) and nvme(4) to support the flags. Of course it would be neat to get anything else on board, if docs and testers are available. It would be good to first test it on real hardware, to see if there is an actual benefit. > Also, I remember each journal flush to actually cause two cache syncs, one > before and one after writing the actual data. Yes. It needs to do this because it needs to make sure that the journal data are saved before we save the journal commit block. Implicitly, the pre-commit flush also makes sure that all asynchronously written metadata updates are written to the media, before the commit makes them impossible to replay. The pre-commit sync can be avoided by adding a checksum to the journal commit record - I have a plan to add this sometime later, too. > JD> it adds (another) incentive to actually integrate AHCI NCQ support > What about SCSI TCQ? I seem to remember all that flushing could be avoided > if the FS used queueing. DHs comments on barriers seem to second that. Even SCSI ORDERED tags wouldn't help to avoid the need for cache flushes. Jaromir
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
Here is an updated patch. It was updated to check for the FUA support for SCSI, using the MODE SENSE device-specific flag. The code was tested with QEMU-emulated bha(4) and nvme. The WAPBL code was updated to use the flag. It keeps the flag naming for now. In the patch, WAPBL sets the flag for journal writes, and also for the metadata buffer for the bawrite() call after journal commit. There is a possible layer violation for the metadata write - b_flags are supposed to be set by the owner of the buffer. Not sure how strict we want/need to be there - perhaps introduce another flag field? Also the flag probably needs to be unset in the biodone hook, so that the code guarantees the buffer in the buffer cache doesn't accidentally keep it over to another I/O. Jaromir ? dev/ic/TODO.nvme Index: sys/buf.h === RCS file: /cvsroot/src/sys/sys/buf.h,v retrieving revision 1.126 diff -u -p -r1.126 buf.h --- sys/buf.h 26 Dec 2016 23:12:33 - 1.126 +++ sys/buf.h 5 Mar 2017 22:08:35 - @@ -198,11 +198,13 @@ struct buf { #defineB_RAW 0x0008 /* Set by physio for raw transfers. */ #defineB_READ 0x0010 /* Read buffer. */ #defineB_DEVPRIVATE0x0200 /* Device driver private flag. */ +#defineB_FUA 0x0800 /* Force Unit Access flag (mandatory). */ +#defineB_DPO 0x1000 /* Disable Page Out flag (advisory).
*/ #define BUF_FLAGBITS \ "\20\1AGE\3ASYNC\4BAD\5BUSY\10DELWRI" \ "\12DONE\13COWDONE\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \ -"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH" +"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH\34FUA\35DPO" /* Avoid weird code due to B_WRITE being a "pseudo flag" */ #define BUF_ISREAD(bp) (((bp)->b_flags & B_READ) == B_READ) Index: sys/dkio.h === RCS file: /cvsroot/src/sys/sys/dkio.h,v retrieving revision 1.22 diff -u -p -r1.22 dkio.h --- sys/dkio.h 8 Dec 2015 20:36:15 - 1.22 +++ sys/dkio.h 5 Mar 2017 22:08:35 - @@ -85,6 +85,8 @@ #defineDKCACHE_RCHANGE 0x000100 /* read enable is changeable */ #defineDKCACHE_WCHANGE 0x000200 /* write enable is changeable */ #defineDKCACHE_SAVE0x01 /* cache parameters are savable/save them */ +#defineDKCACHE_FUA 0x02 /* Force Unit Access supported */ +#defineDKCACHE_DPO 0x04 /* Disable Page Out supported */ /* sync disk cache */ #defineDIOCCACHESYNC _IOW('d', 118, int) /* sync cache (force?) */ Index: kern/vfs_wapbl.c === RCS file: /cvsroot/src/sys/kern/vfs_wapbl.c,v retrieving revision 1.87 diff -u -p -r1.87 vfs_wapbl.c --- kern/vfs_wapbl.c5 Mar 2017 13:57:29 - 1.87 +++ kern/vfs_wapbl.c5 Mar 2017 22:08:35 - @@ -70,6 +70,7 @@ __KERNEL_RCSID(0, "$NetBSD: vfs_wapbl.c, static struct sysctllog *wapbl_sysctl; static int wapbl_flush_disk_cache = 1; static int wapbl_verbose_commit = 0; +static int wapbl_use_fua = 1; static inline size_t wapbl_space_free(size_t, off_t, off_t); @@ -229,6 +230,12 @@ struct wapbl { u_char *wl_buffer; /* l: buffer for wapbl_buffered_write() */ daddr_t wl_buffer_dblk; /* l: buffer disk block address */ size_t wl_buffer_used; /* l: buffer current use */ + + int wl_dkcache; /* r: disk cache flags */ +#define WAPBL_USE_FUA(wl) \ + (wapbl_use_fua && ISSET(wl->wl_dkcache, DKCACHE_FUA)) + int wl_jwrite_flags;/* r: journal write flags */ + int wl_mwrite_flags;/* r: metadata write flags */ }; #ifdef WAPBL_DEBUG_PRINT @@ -280,6 +287,8 @@ static void wapbl_deallocation_free(stru static void 
wapbl_evcnt_init(struct wapbl *); static void wapbl_evcnt_free(struct wapbl *); +static void wapbl_dkcache_init(struct wapbl *); + #if 0 int wapbl_replay_verify(struct wapbl_replay *, struct vnode *); #endif @@ -390,6 +399,30 @@ wapbl_evcnt_free(struct wapbl *wl) evcnt_detach(&wl->wl_ev_cacheflush); } +static void +wapbl_dkcache_init(struct wapbl *wl) +{ + int error; + + /* Get disk cache flags */ + error = VOP_IOCTL(wl->wl_devvp, DIOCGCACHE, &wl->wl_dkcache, + FWRITE, FSCRED); + if (error) { + /* behave as if there is a write cache */ + wl->wl_dkcache = DKCACHE_WRITE; + } + + /* Use FUA instead of cache flush if available */ + if (WAPBL_USE_FUA(wl)) { + wl->wl_jwrite_flags |= B_FUA; + wl->wl_mwrite_flags |= B_FUA; + } + + /* Use DPO for journal writes if available */ + if (ISSET(wl->wl_dkcache, DKCACHE_DPO)) + wl->wl_jwrite_flags |= B_DPO; +} + static int wapbl_start_flush_inodes(struct wapbl *wl, struct wapbl_replay *wr) { @@ -562,6 +595,8 @@ wapb
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
2017-03-03 18:11 GMT+01:00 David Holland : > Yes and no; there's also standard terminology for talking about > caches, so my inclination would be to call it something like > B_MEDIASYNC: synchronous at the media level. Okay, this might be good. Words better than acronyms :) > > For DPO it's not so clear cut maybe. We could reuse B_NOCACHE maybe > > for the same functionality, but not sure if it matches with what swap > > is using this flag for. DPO is ideal for journal writes however, > > that's why I want to add the support for it now. > > What does DPO do? It tells the hardware not to store the data in its cache. Or more precisely, not to put it into the cache if that would mean evicting something else from it. It should improve the general performance of the disk - the journal writes will not trash the device cache. B_MEDIANOCACHE? > Perfect abstractions for any of these would be more complex, but > barriers serve pretty well. Perhaps we could start with reworking DIOCCACHESYNC into a barrier :) Currently it is not actually guaranteed to be executed after the already queued writes - the ioctl is executed out of band, bypassing the bufq queue. Hence it doesn't actually quite work if there are any in-flight async writes, as queued e.g. by the bawrite() calls in ffs_fsync(). In Linux, the block write interface accounts for that: there are flags to ask for a sync to be done before or after the I/O, and it is also possible to send an empty I/O with only the sync flags. Thus the sync is always queued along with the writes. It would be good to adopt something like this, but that would require bufq interface changes and possibly device driver changes, with much broader tree disturbance. At least with FUA the caller can ensure all the writes are safely on media, and wouldn't depend on an out-of-band ioctl.
> Single synchronous block writes are a bad way to implement barriers > and it maybe makes sense to have two models and force every fs to be > able to do things two different ways; but single synchronous block > writes are also a bad way to implement any of the above invariants. > E.g. I'm not convinced that writing out journal blocks synchronously > one at a time will be faster than flushing the cache at the end of a > journal write, even though the latter inflicts collateral damage in > the sense of waiting for perhaps many blocks that don't need to be > waited for. Indeed - writing journal blocks synchronously one by one is unlikely to be faster than sending them all async and doing a cache flush at the end, that wouldn't make sense. I plan to change WAPBL to do the journal writes partially async. It will use several bufs, issue the I/O asynchronously, and only wait if it runs out of buffers, or if it needs to do the commit. It seems there are usually three or four block writes done as part of the transaction commit, so there is a decent parallelism opportunity. > I guess it would help if I knew what you were intending to do with > wapbl in this regard; have you posted that? (I've been at best > skimming tech-kern the past few months...) I haven't posted details on the WAPBL part of the changes. I'll put together a patch over the weekend, and send it over. It will be useful to show my thinking on how the proposed interface could be used. > Getting it working first is great but I'm not sure a broadly exposed > piece of infrastructure should be committed in a preliminary design > state... especially in a place (the legacy buffer cache) that's > already a big ol' mess. That's one of the reasons I want to keep the current changes minimal :) The proposed patch doesn't actually touch the legacy buffer cache code at all. It only adds another B_* flag, and changes hardware device drivers to react upon it. The flag is supposed to be set by the caller, for example by WAPBL itself.
Nothing in e.g. ffs would set the flags. > I guess what worries me is the possibility of getting interrupted by > real life and then all this remaining in a half-done state in the long > term; there are few things worse for maintainability in general than > half-finished reorgs that end up getting left to bitrot. :-/ There is a semi-good chance this will be finished into a workable state soon - I picked up journaling improvements as my Bachelor thesis material, so it either gets done or I will fail :) > Is there something more generic / less hardware-specific that we can > put in the fs in the near term? Well, the FUA support looks like a good candidate for being useful and could have a direct positive performance impact, so I picked that up. If we have code taking advantage of FUA, it adds (another) incentive to actually integrate AHCI NCQ support, as that is the only way to get FUA support on more contemporary hardware. Also, it's my understanding that using FUA instead of a full cache sync should be a huge win for raid also, so it's worth pursuing for that avenue too. > keep in mind that whatever it is might end up in -8 and needing to be > mai
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
> Some quick thoughts, though: > > (1) ultimately it's necessary to patch each driver to crosscheck the > flag, because otherwise eventually there'll be silent problems. Maybe. I think I like having this as a responsibility of the caller for now, it avoids too broad tree changes. Ultimately it might indeed be necessary, if we find out that it can't reasonably be handled by the caller. Like maybe raidframe kicking in a spare disk without FUA into a set with FUA. > (2) it would be better not to expose hardware-specific flags in the > buffercache, so it would be better to come up with a name that > reflects the semantics, and a semantic guarantee that's at least > notionally not hardware-specific. I want to avoid unnecessary private NetBSD nomenclature. If the storage industry calls it FUA, it's probably good to just call it FUA. For DPO it's not so clear cut maybe. We could reuse B_NOCACHE maybe for the same functionality, but I'm not sure if it matches what swap is using this flag for. DPO is ideal for journal writes however, that's why I want to add the support for it now. > (3) as I recall (can you remind those of us not currently embedded in > this stuff what the semantics of FUA actually are?) FUA is *not* a > write barrier (as in, all writes before happen before all writes > after) and since write barriers are a natural expression of the > requirements for many fses, it would be well to make sure the > implementation of this doesn't conflict with that. FUA doesn't enforce any barriers. It merely changes the semantics of the write request - the hardware will return a success response only after the data is written to non-volatile media. Any barriers required by filesystem semantics need to be handled by the fs code, same as now with DIOCCACHESYNC. I've talked about adding some kind of generic barrier support in the previous thread. After thinking about it, and reading more, I'm not convinced it's necessary.
Incidentally, Linux has moved away from generic barriers and pushed the logic into their fs code, which can DTRT with e.g. journal transactions, too. > (3a) Also, past discussion of this stuff has centered around trying to > identify a single coherent interface for fs code to use, with the > expansion into whatever hardware semantics are available happening in > the bufferio layer. This would prevent needing conditional logic on > device features in every fs. However, AFAICR these discussions have > never reached any clear conclusion. Do you have any opinion on that? I think that I'd like to have at least two different places in the kernel needing a particular interface before generalizing this into a bufferio level. Or at minimum, I'd like to have it working in one place correctly, and then it can be generalized before being used in a second place. It would be awesome to use FUA e.g. for fsync(2), but let's not get too far ahead of ourselves. We don't commit to too much right now besides a B_* flag. I'd rather keep this raw and lean for now, and concentrate on fixing the device drivers to work with the flags correctly. Only then maybe come up with an interface to make it easier for general use. I want to avoid broadening the scope too much. Especially since I want to introduce SATA NCQ support within the next few months, which might need some tweaks to the semantics again. > We don't want to block improvements to wapbl while we figure out the > one true device interface, but on the other hand I'd rather not > acquire a new set of long-term hacks. Stuff like the "logic" wapbl > uses to intercept the synchronous writes issued by the FFS code is > very expensive to get rid of later. Yes, that funny bwrite() not being a real bwrite() until issued for the second time from WAPBL :) Quite ugly. It's a shame the B_LOCKED hack is not really extensible to cover also the data in the journal, as it holds all transaction data in memory. Jaromir
Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
Hi,

I'm working on an interface for WAPBL to use the Force Unit Access (FUA) feature on compatible hardware (currently SCSI and NVMe), as a replacement for full disk cache flushes. I'd also like to add support for DPO (Disable Page Out), as that is a trivial extension of the FUA support, at least for SCSI.

The scope is currently limited to I/O via the traditional buffer cache (metadata only) interface for now, as that is all that is needed for the WAPBL case. First support direct use over sd(4)/ld(4), then fix the layered drivers like dk(4), cgd(4) and eventually raid(4).

In order to be a reliable cache flush replacement, the FUA flag needs to be exposed in a way that it's used only when the underlying driver and hardware support it. When it's specified, the driver needs to honor it. For DPO, it should be safe to simply ignore the flag when not supported, because it's just a disk cache optimization.

Since I wasn't able to spot any specific support for this in FreeBSD or OpenBSD, I've come up with a new, and rather simplistic, kernel change to expose the feature. It basically relies on the caller to DTRT. The code extends DIOCGCACHE to also return information about FUA/DPO support by the underlying hardware. Then, I added new flags for buffer cache I/O, and modified the drivers to pass the flags to the hardware when present in the struct buf.

It's supposed to be used in such a way that the caller checks the FUA support via DIOCGCACHE once at the beginning, then issues any further I/O requests with the FUA flag when DIOCGCACHE indicated FUA is supported. This way there is no need to modify drivers to refuse the flag when not supported. It assumes that if DIOCGCACHE indicates support, then the I/O will reach the same hardware with the struct buf flag kept. So far it seems all drivers keep the struct buf b_flags when splitting or processing the I/O buffer, i.e. cgd(4), dk(4), raid(4) all seem to do it.
This needs to be tested of course, and we need to fix any disk pseudo-devices which don't pass the flags intact.

Attached is the patch. It's really small and compile-tested only at the moment. The main goal now is to get feedback on whether this approach is sufficient and suitable, or whether we need something completely different or more complex.

Comments?

Jaromir

Index: sys/buf.h
===
RCS file: /cvsroot/src/sys/sys/buf.h,v
retrieving revision 1.126
diff -u -p -r1.126 buf.h
--- sys/buf.h	26 Dec 2016 23:12:33 -	1.126
+++ sys/buf.h	1 Mar 2017 21:19:41 -
@@ -198,11 +198,13 @@ struct buf {
 #define	B_RAW		0x0008	/* Set by physio for raw transfers. */
 #define	B_READ		0x0010	/* Read buffer. */
 #define	B_DEVPRIVATE	0x0200	/* Device driver private flag. */
+#define	B_FUA		0x0800	/* Force Unit Access flag (mandatory). */
+#define	B_DPO		0x1000	/* Disable Page Out flag (advisory). */
 #define BUF_FLAGBITS \
     "\20\1AGE\3ASYNC\4BAD\5BUSY\10DELWRI" \
     "\12DONE\13COWDONE\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \
-    "\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH"
+    "\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH\34FUA\35DPO"
 
 /* Avoid weird code due to B_WRITE being a "pseudo flag" */
 #define BUF_ISREAD(bp)	(((bp)->b_flags & B_READ) == B_READ)

Index: sys/dkio.h
===
RCS file: /cvsroot/src/sys/sys/dkio.h,v
retrieving revision 1.22
diff -u -p -r1.22 dkio.h
--- sys/dkio.h	8 Dec 2015 20:36:15 -	1.22
+++ sys/dkio.h	1 Mar 2017 21:19:41 -
@@ -85,6 +85,8 @@
 #define	DKCACHE_RCHANGE	0x000100	/* read enable is changeable */
 #define	DKCACHE_WCHANGE	0x000200	/* write enable is changeable */
 #define	DKCACHE_SAVE	0x01	/* cache parameters are savable/save them */
+#define	DKCACHE_FUA	0x02	/* Force Unit Access supported */
+#define	DKCACHE_DPO	0x04	/* Disable Page Out supported */
 
 /* sync disk cache */
 #define	DIOCCACHESYNC	_IOW('d', 118, int)	/* sync cache (force?)
*/

Index: dev/ic/ld_nvme.c
===
RCS file: /cvsroot/src/sys/dev/ic/ld_nvme.c,v
retrieving revision 1.14
diff -u -p -r1.14 ld_nvme.c
--- dev/ic/ld_nvme.c	28 Feb 2017 20:55:09 -	1.14
+++ dev/ic/ld_nvme.c	1 Mar 2017 21:19:41 -
@@ -152,11 +152,15 @@ static int
 ld_nvme_start(struct ld_softc *ld, struct buf *bp)
 {
 	struct ld_nvme_softc *sc = device_private(ld->sc_dv);
+	int flags = BUF_ISWRITE(bp) ? 0 : NVME_NS_CTX_F_READ;
+
+	if (bp->b_flags & B_FUA)
+		flags |= NVME_NS_CTX_F_FUA;
 
 	return nvme_ns_dobio(sc->sc_nvme, sc->sc_nsid, sc, bp,
 	    bp->b_data, bp->b_bcount, sc->sc_ld.sc_secsize, bp->b_rawblkno,
-	    BUF_ISWRITE(bp) ? 0 : NVME_NS_CTX_F_READ,
+	    flags,
 	    ld_nvme_biodone)
Re: Plan: journalling fixes for WAPBL
2017-01-02 18:31 GMT+01:00 David Holland :
> Well, there's two things going on here. One is the parallelism limit,
> which you can work around by having more drives, e.g. in an array.
> The other is netbsd's 64k MAXPHYS issue, which is our own problem that
> we could put time into. (And in fact, it would be nice to get
> tls-maxphys into -8... anyone have time? I don't. sigh)

It would be very nice to have this integrated, yes. It won't have a dramatic performance effect, but it's relatively low-hanging fruit: it's already almost done, and we should just get rid of this arbitrary system limit.

I'd like to look into this, but I won't manage sooner than autumn 2017 - I'd like to first work on FUA/DPO support, and then SATA NCQ. I think those could have a bigger performance impact. I hope I'll have some patches to make FUA available for I/O on SCSI drives by the second half of February, plus the changes for WAPBL to use it. It's slightly difficult to DTRT with FUA on nested drivers like raidframe/cgd/vnd, but maybe we can ignore those for the first iteration :) I'll send a proposal once I figure out the details.

Jaromir
Re: Possible buffer cache race?
2016-10-24 19:34 GMT+02:00 David Holland :
> Is this correlated with the syncer running? I have been seeing a
> problem where every time the syncer runs it locks out everything else,
> and once that happens some things take several seconds to complete
> afterwards.

It doesn't seem to be just the syncer being slow - it behaves like a regular deadlock. If I just wait, nothing happens - it will not recover by itself. Actually, when I waited too long (a couple of minutes), the sync command itself blocked and the system didn't recover.

> The tstile ones are unlikely to be interesting (if you have a bunch of
> tstiles, pick one that isn't, which'll be what they're waiting behind)
> but the others might be. Knowing where it gets stuck will help a lot;
> or if it doesn't get stuck in any one place that too.

Got these example tracebacks:

#1 trace: pid 540 lid 1 at 0xfe90f090fb10
sleepq_block() at sleepq_block+0x97
cv_timedwait() at cv_timedwait+0x123
bbusy() at bbusy+0xad
getblk() at getblk+0x5a
bio_doread() at bio_doread+0x1d
bread() at bread+0x1a
ffs_update.part.3() at ffs_update.part.3+0x112
ufs_inactive() at ufs_inactive+0x18e
VOP_INACTIVE() at VOP_INACTIVE+0x33
vrelel() at vrelel+0x259
vn_close() at vn_close+0x41
closef() at closef+0x54
fd_close() at fd_close+0x2da
sys_close() at sys_close+0x20
syscall() at syscall+0x164

#2 trace: pid 222 lid 1 at 0xfe90f0927c20
sleepq_block() at sleepq_block+0x97
turnstile_block() at turnstile_block+0x3a9
mutex_enter() at mutex_enter+0x36c
genfs_renamelock_enter() at genfs_renamelock_enter+0xd
do_sys_renameat() at do_sys_renameat+0x478
sys_rename() at sys_rename+0x25
syscall() at syscall+0x164

#3 trace: pid 697 lid 1 at 0xfe90f089f850
sleepq_block() at sleepq_block+0x97
cv_timedwait() at cv_timedwait+0x123
bbusy() at bbusy+0xad
getblk() at getblk+0x5a
bio_doread() at bio_doread+0x1d
bread() at bread+0x1a
ffs_update.part.3() at ffs_update.part.3+0x112
ufs_inactive() at ufs_inactive+0x18e
VOP_INACTIVE() at VOP_INACTIVE+0x33
vrelel() at vrelel+0x259
genfs_rename_exit() at genfs_rename_exit+0x10a
genfs_sane_rename() at genfs_sane_rename+0x7a3
ufs_sane_rename() at ufs_sane_rename+0x3a
genfs_insane_rename() at genfs_insane_rename+0x15d
VOP_RENAME() at VOP_RENAME+0x43
do_sys_renameat() at do_sys_renameat+0x77f
sys_rename() at sys_rename+0x25
syscall() at syscall+0x164

Those seem to be for the inode updates, then. I've spent some time trying to trigger this locally, so far without success. Once I get the remote machine reset again, I'll try to trigger the condition using an ATA disk - I want to rule out a driver bug.

> The most common cause of missed wakeups is not going to sleep
> atomically, so it's usually a property of a particular sleep site.
> Next most common is races between changing things affecting the sleep
> condition and posting a wakeup; in particular, unlocked wakeup before
> change (or unlocked wakeup after change without a memory barrier to
> enforce the ordering) leads to problems.

This seems more likely to be an async mount issue than a concurrency one - similar symptoms are described in PR kern/47030 for a sysinst stall. Sysinst also mounts the new filesystem async, and the condition there is observed on a VM with 1 CPU. I didn't manage to repeat this with qemu 2.7.0, but I'll check later with the qemu0 package. The traceback in that PR is slightly different though (it stalls waiting for a new buf), so that may actually be a different issue.

Jaromir
Re: WAPBL fix for deallocation exhaustion + slow file removal
The fix was committed, with only minor changes (some comments, and fixed mishandling of the error return value in ffs_indirtrunc()).

Jaromir

2016-10-06 22:36 GMT+02:00 Jaromír Doleček :
> I've incorporated the mutex fix, here is the final patch relative to
> trunk. I'd like to commit this sometime next week.
>
> Jaromir
>
> 2016-10-01 19:00 GMT+02:00 Taylor R Campbell :
>> Date: Sat, 1 Oct 2016 18:40:31 +0200
>> From: Jaromír Doleček
>>
>>> Thanks for taking a shot at this! But I think it needs a little more
>>> time for review -- certainly I can't digest it in the 24 hours you're
>>> giving.
>>
>> Sure.
>>
>> OK, thanks! I'll try to find some time to review this more carefully
>> in the next few days. For now, one comment:
>>
>>> From a quick glance at the patch, I see one bug immediately in
>>> vfs_wapbl.c that must have been introduced in a recent change:
>>> pool_get(PR_WAITOK) is forbidden while holding a mutex, but
>>> wapbl_register_deallocation does just that.
>>
>> I'll recheck this, thank you. I ran it under rump only, I think I had
>> it compiled LOCKDEBUG, so it should trigger the same assertions as a
>> LOCKDEBUG kernel. But apparently not.
>>
>> Since wl->wl_mtx is an adaptive lock, not a spin lock, nothing
>> automatically detects this. But in general, with minor exceptions,
>> one is not supposed to sleep while holding an adaptive lock.
>> (Exception example: waiting for a cheap cross-call to complete in
>> pserialize_perform(9).)
>>
>> It's also suboptimal that we sleep while holding rwlocks for vnode
>> locks, since rw_enter is uninterruptable, so if a wait inside the
>> kernel hangs with a vnode lock held, anyone else trying to examine
>> that vnode will hang uninterruptably too. But it will take some
>> effort to undo that state of affairs by teaching all callers of
>> vn_lock to accept failure.
>>
>> The same goes for the long-term file system transaction lock
>> wl->wl_rwlock.
>> However, suboptimality of sleeping with those
>> long-term locks aside, certainly one should not sleep while holding
>> the short-term wl->wl_mtx.
Re: PCIVERBOSE causing kernel stack overflow during boot - why?
In my case it crashed on the same device, Core i7-6xxxK/Xeon-D Memory Controller (Target Address, Thermal, RAS), ID 0x6fa8. The last pci line before the trap was for the device immediately preceding that one. Thanks Paul for getting to the bottom of this.

Jaromir

2016-10-25 7:02 GMT+02:00 Paul Goyette :
> OK, here's the problem...
>
> The device in question is Intel product code 6fa8 which has a product name
> of (deliberately line-split to facilitate character counting)
>
> "Core i7-6xxxK/Xeon-D Memory Cont"
> "roller (Target Address, Thermal,"
> " RAS)"
>
> That's a total of 65 characters plus trailing '\0'.
>
> Unfortunately pci/pci_verbose.h defines PCI_PRODUCTSTR_LEN to 64 (which
> would include the trailing '\0' as well as the space after the final word -
> this final space is removed in dev_untokenstring()), so this device's name
> is too long.
>
> It's still not clear to me why this is triggering SSP checks, but I think we
> can sweep the problem under the rug if we simply increase the value of
> PCI_PRODUCTSTR_LEN to a larger value.
>
> Index: pci_verbose.h
> ===
> RCS file: /cvsroot/src/sys/dev/pci/pci_verbose.h,v
> retrieving revision 1.4
> diff -u -p -r1.4 pci_verbose.h
> --- pci_verbose.h	21 Sep 2014 14:30:22 -	1.4
> +++ pci_verbose.h	25 Oct 2016 04:58:08 -
> @@ -36,7 +36,7 @@ typedef uint32_t pcireg_t;	/* configura
>  #endif
>
>  #define	PCI_VENDORSTR_LEN	64
> -#define	PCI_PRODUCTSTR_LEN	64
> +#define	PCI_PRODUCTSTR_LEN	96
>
>  DEV_VERBOSE_DECLARE(pci);
>
> On Sun, 23 Oct 2016, Jaromír Doleček wrote:
>
>> Here is the output from lspci/pcictl.
>>
>> I'll try that DDB_COMMANDONENTER also - the machine is remote though,
>> so I'll send it later when I get it.
>>
>> Thanks.
>>
>> Jaromir
>>
>> 2016-10-19 7:23 GMT+02:00 Paul Goyette :
>>>
>>> On Tue, 18 Oct 2016, Paul Goyette wrote:
>>>> Just as an added experiment, can you try to boot the non-PCIVERBOSE
>>>> kernel, and at the boot prompt enter
>>>>
>>>> load pciverbose
>>>>
>>>> before actually booting?
>>>> As far as getting a back-trace, you could set DDB_COMMANDONENTER="bt"
>>>> in your config file
>>>>
>>>> The dmesg looks interesting, especially with that strange pci9 bus. My
>>>> machine has a similar "management devices" pci bus.
>>>
>>> Also, if you have installed pkgsrc/sysutils/pciutils it would be useful
>>> to get the output from
>>>
>>> lspci -tvnn
>>>
>>> Otherwise, please provide output from following two commands:
>>>
>>> pcictl pci0 list -N
>>> pcictl pci0 list -n
>>>
>>> On Mon, 17 Oct 2016, Jaromír Doleček wrote:
> Hi,
>
> I've got an amd64 system which panics with 'stack overflow detected'
> on boot, somewhere halfway through probing pci9 bus, when booted with
> kernel with PCIVERBOSE. Same kernel config without PCIVERBOSE boots
> fine. dmesg without PCIVERBOSE is attached.
>
> Any idea what might be causing this?
>
> I've had cursory look at pci code, it doesn't seem as if anything
> would be allocating extra space there. Maybe some interaction with
> dev_verbose module code? Unfortunately can't get backtrace as this
> happens before the keyboard is probed and attached.
>
> Jaromir

+--+--++
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+--+--++
Re: PCIVERBOSE causing kernel stack overflow during boot - why?
Here is the output from lspci/pcictl.

I'll try that DDB_COMMANDONENTER also - the machine is remote though, so I'll send it later when I get it.

Thanks.

Jaromir

2016-10-19 7:23 GMT+02:00 Paul Goyette :
> On Tue, 18 Oct 2016, Paul Goyette wrote:
>
>> Just as an added experiment, can you try to boot the non-PCIVERBOSE
>> kernel, and at the boot prompt enter
>>
>> load pciverbose
>>
>> before actually booting?
>>
>> As far as getting a back-trace, you could set DDB_COMMANDONENTER="bt" in
>> your config file
>>
>> The dmesg looks interesting, especially with that strange pci9 bus. My
>> machine has a similar "management devices" pci bus.
>
> Also, if you have installed pkgsrc/sysutils/pciutils it would be useful to
> get the output from
>
> lspci -tvnn
>
> Otherwise, please provide output from following two commands:
>
> pcictl pci0 list -N
> pcictl pci0 list -n
>
>> On Mon, 17 Oct 2016, Jaromír Doleček wrote:
>>
>>> Hi,
>>>
>>> I've got an amd64 system which panics with 'stack overflow detected'
>>> on boot, somewhere halfway through probing pci9 bus, when booted with
>>> kernel with PCIVERBOSE. Same kernel config without PCIVERBOSE boots
>>> fine. dmesg without PCIVERBOSE is attached.
>>>
>>> Any idea what might be causing this?
>>>
>>> I've had cursory look at pci code, it doesn't seem as if anything
>>> would be allocating extra space there. Maybe some interaction with
>>> dev_verbose module code? Unfortunately can't get backtrace as this
>>> happens before the keyboard is probed and attached.
>>>
>>> Jaromir
>>> >> >> +--+--++ >> | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | >> | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com | >> | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org | >> +--+--++ > > > +--+--++ > | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | > | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com | > | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org | > +--+--++ -[:00]-+-00.0 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2 [8086:6f00] +-01.0-[01]00.0 LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] [1000:005d] +-02.0-[02]--+-00.0 Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 [8086:1528] |\-00.1 Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 [8086:1528] +-02.2-[03]00.0 Intel Corporation PCIe Data Center SSD [8086:0953] +-03.0-[04]-- +-04.0 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 0 [8086:6f20] +-04.1 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 1 [8086:6f21] +-04.2 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 2 [8086:6f22] +-04.3 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 3 [8086:6f23] +-04.4 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 4 [8086:6f24] +-04.5 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 5 [8086:6f25] +-04.6 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 6 [8086:6f26] +-04.7 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 7 [8086:6f27] +-05.0 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Map/VTd_Misc/System Management [8086:6f28] +-05.1 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO Hot Plug [8086:6f29] +-05.2 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO 
RAS/Control Status/Global Errors [8086:6f2a] +-05.4 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D I/O APIC [8086:6f2c] +-11.0 Intel Corporation C610/X99 series chipset SPSR [8086:8d7c] +-14.0 Intel Corporation C610/X99 series chipset USB xHCI Host Controller [8086:8d31] +-16.0 Intel Corporation C610/X99 series chipset MEI Controller #1 [8086:8d3a] +-16.1 Intel Corporation C610/X99 series chipset MEI Controller #2 [8086:8d3b] +-1a.0 Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2 [8086:8d2d] +-1c.0-[05-06]00.0-[06]00.0 ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] +-1c.2-[07]00.0 Intel Corporation I210 Gigabit Network Connection [8086:1533] +-1c.3-[
Possible buffer cache race?
Hi,

I'm doing some testing with nvme(4) with quite deep I/O queues, with a -current kernel on an MP system. At this moment I'm basically just confirming functionality, so it's just a bunch of parallel tar processes extracting files within a filesystem to a subdirectory. I have the filesystem mounted async and the machine has a huge amount of RAM, without logging at the moment. So it's mostly a buffer cache exercise, with I/O spikes on sync.

I see an interesting thing - periodically, all of the tar processes get blocked sleeping on either tstile, biolock or pager_map. All the tar processes block. When I just wait, they stay blocked. When I call sync(8), all of them unblock and continue running, until they all hit the same condition again later. When I keep calling sync, eventually all the processes finish.

Most often they block on biolock, then somewhat less frequently on tstile; pager_map is rarer. It's usually a mix of these - most processes blocked on biolock, some on tstile, and zero/one/two on pager_map.

I understand "biolock" is used to wait for a busy buf. I looked into the code; nothing really obviously wrong there. The code uses cv_broadcast(&bp->b_busy) and cv_timedwait(), so there shouldn't be a race.

Any idea where I should try to start poking?

Jaromir
PCIVERBOSE causing kernel stack overflow during boot - why?
Hi, I've got an amd64 system which panics with 'stack overflow detected' on boot, somewhere halfway through probing pci9 bus, when booted with kernel with PCIVERBOSE. Same kernel config without PCIVERBOSE boots fine. dmesg without PCIVERBOSE is attached. Any idea what might be causing this? I've had cursory look at pci code, it doesn't seem as if anything would be allocating extra space there. Maybe some interaction with dev_verbose module code? Unfortunately can't get backtrace as this happens before the keyboard is probed and attached. Jaromir Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016 The NetBSD Foundation, Inc. All rights reserved. Copyright (c) 1982, 1986, 1989, 1991, 1993 The Regents of the University of California. All rights reserved. NetBSD 7.99.39 (BEAST) #8: Mon Oct 17 22:17:26 CEST 2016 dolecek@.../netbsd/sys/arch/amd64/compile/BEAST total memory = 127 GB avail memory = 124 GB cpu_rng: RDSEED timecounter: Timecounters tick every 10.000 msec Kernelized RAIDframe activated timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100 Supermicro SYS-1028R-MCTR (0123456789) mainbus0 (root) ACPI: RSDP 0x000F0580 24 (v02 SUPERM) ACPI: XSDT 0x797330A8 CC (v01 01072009 AMI 00010013) ACPI: FACP 0x79763EB8 00010C (v05 SUPERM SMCI--MB 01072009 AMI 00010013) ACPI: DSDT 0x79733208 030CB0 (v02 SUPERM SMCI--MB 01072009 INTL 20091013) ACPI: FACS 0x79BDEF80 40 ACPI: APIC 0x79763FC8 0001E0 (v03 SUPERM SMCI--MB 01072009 AMI 00010013) ACPI: FPDT 0x797641A8 44 (v01 SUPERM SMCI--MB 01072009 AMI 00010013) ACPI: FIDT 0x797641F0 9C (v01 SUPERM SMCI--MB 01072009 AMI 00010013) ACPI: SPMI 0x79764290 40 (v05 SUPERM SMCI--MB AMI. 
) ACPI: MCFG 0x797642D0 3C (v01 SUPERM SMCI--MB 01072009 MSFT 0097) ACPI: UEFI 0x79764310 42 (v01 SUPERM SMCI--MB 01072009 ) ACPI: HPET 0x79764358 38 (v01 SUPERM SMCI--MB 0001 INTL 20091013) ACPI: MSCT 0x79764390 90 (v01 SUPERM SMCI--MB 0001 INTL 20091013) ACPI: SLIT 0x79764420 2D (v01 SUPERM SMCI--MB 0001 INTL 20091013) ACPI: SRAT 0x79764450 001158 (v03 SUPERM SMCI--MB 0001 INTL 20091013) ACPI: WDDT 0x797655A8 40 (v01 SUPERM SMCI--MB INTL 20091013) ACPI: SSDT 0x797655E8 016F57 (v02 SUPERM PmMgt0001 INTL 20120913) ACPI: SSDT 0x7977C540 00264C (v02 SUPERM SpsNm0002 INTL 20120913) ACPI: SSDT 0x7977EB90 64 (v02 SUPERM SpsNvs 0002 INTL 20120913) ACPI: PRAD 0x7977EBF8 000102 (v02 SUPERM SMCI--MB 0002 INTL 20120913) ACPI: DMAR 0x7977ED00 C4 (v01 SUPERM SMCI--MB 0001 INTL 20091013) ACPI: HEST 0x7977EDC8 00027C (v01 SUPERM SMCI--MB 0001 INTL 0001) ACPI: BERT 0x7977F048 30 (v01 SUPERM SMCI--MB 0001 INTL 0001) ACPI: ERST 0x7977F078 000230 (v01 SUPERM SMCI--MB 0001 INTL 0001) ACPI: EINJ 0x7977F2A8 000130 (v01 SUPERM SMCI--MB 0001 INTL 0001) ACPI: 4 ACPI AML tables successfully acquired and loaded ioapic0 at mainbus0 apid 1: pa 0xfec0, version 0x20, 24 pins ioapic1 at mainbus0 apid 2: pa 0xfec01000, version 0x20, 24 pins cpu0 at mainbus0 apid 0 cpu0: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu1 at mainbus0 apid 2 cpu1: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu2 at mainbus0 apid 4 cpu2: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu3 at mainbus0 apid 6 cpu3: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu4 at mainbus0 apid 8 cpu4: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu5 at mainbus0 apid 10 cpu5: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu6 at mainbus0 apid 12 cpu6: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu7 at mainbus0 apid 16 cpu7: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu8 at mainbus0 apid 18 cpu8: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu9 at 
mainbus0 apid 20 cpu9: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu10 at mainbus0 apid 22 cpu10: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu11 at mainbus0 apid 24 cpu11: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu12 at mainbus0 apid 26 cpu12: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu13 at mainbus0 apid 28 cpu13: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu14 at mainbus0 apid 1 cpu14: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu15 at mainbus0 apid 3 cpu15: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu16 at mainbus0 apid 5 cpu16: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu17 at mainbus0 apid 7 cpu17: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1 cpu18 at mainbus0 apid 9 cpu18: Inte
Re: WAPBL fix for deallocation exhaustion + slow file removal
I've incorporated the mutex fix, here is the final patch relative to trunk. I'd like to commit this sometime next week.

Jaromir

2016-10-01 19:00 GMT+02:00 Taylor R Campbell :
> Date: Sat, 1 Oct 2016 18:40:31 +0200
> From: Jaromír Doleček
>
>> Thanks for taking a shot at this! But I think it needs a little more
>> time for review -- certainly I can't digest it in the 24 hours you're
>> giving.
>
> Sure.
>
> OK, thanks! I'll try to find some time to review this more carefully
> in the next few days. For now, one comment:
>
>> From a quick glance at the patch, I see one bug immediately in
>> vfs_wapbl.c that must have been introduced in a recent change:
>> pool_get(PR_WAITOK) is forbidden while holding a mutex, but
>> wapbl_register_deallocation does just that.
>
> I'll recheck this, thank you. I ran it under rump only, I think I had
> it compiled LOCKDEBUG, so it should trigger the same assertions as a
> LOCKDEBUG kernel. But apparently not.
>
> Since wl->wl_mtx is an adaptive lock, not a spin lock, nothing
> automatically detects this. But in general, with minor exceptions,
> one is not supposed to sleep while holding an adaptive lock.
> (Exception example: waiting for a cheap cross-call to complete in
> pserialize_perform(9).)
>
> It's also suboptimal that we sleep while holding rwlocks for vnode
> locks, since rw_enter is uninterruptable, so if a wait inside the
> kernel hangs with a vnode lock held, anyone else trying to examine
> that vnode will hang uninterruptably too. But it will take some
> effort to undo that state of affairs by teaching all callers of
> vn_lock to accept failure.
>
> The same goes for the long-term file system transaction lock
> wl->wl_rwlock. However, suboptimality of sleeping with those
> long-term locks aside, certainly one should not sleep while holding
> the short-term wl->wl_mtx.

wapbl_fix_dealloc_2.patch
Description: Binary data
Re: WAPBL fix for deallocation exhaustion + slow file removal
> Thanks for taking a shot at this! But I think it needs a little more
> time for review -- certainly I can't digest it in the 24 hours you're
> giving.

Sure.

> From a quick glance at the patch, I see one bug immediately in
> vfs_wapbl.c that must have been introduced in a recent change:
> pool_get(PR_WAITOK) is forbidden while holding a mutex, but
> wapbl_register_deallocation does just that.

I'll recheck this, thank you. I ran it under rump only; I think I had it compiled with LOCKDEBUG, so it should trigger the same assertions as a LOCKDEBUG kernel. But apparently not.

> What happens if wapbl_register_deallocation fails in the middle of
> ffs_truncate, and then the power goes out before the caller retries
> the truncation? It looks like we may end up committing a transaction
> to disk that partially zeros a file -- and it is not immediately clear
> to me who will be responsible for freeing the remaining blocks when
> the system comes back up.

The indirect disk blocks go via the journal also, so if power fails while the journal is not yet committed, it would work just the same as usual, i.e. no change would be visible on the filesystem. If the truncate fails once, the partial truncate would be committed; if power then fails before the truncate is completely finished, the log replay would replay just the first part, creating the hole.

> I don't understand why the possible failure in the fragment extension
> code should not be worrying. `Only 7 times during a file's lifetime'
> says to me that if I check out and delete the NetBSD CVS repository
> once, there are about two million times this can happen. If we can't
> restart the transaction safely, we need to find another way to do it.

The fragment deallocation is only registered when a file is extended from a size less than a full block, and only when there is no free block available right next to the fragment. This can't happen more than once per data write operation, hence once per wapbl_begin()/wapbl_end() block.
The two million file creates are all in different vnode calls, each call checking the limits in wapbl_start() and flushing if necessary. It would be extremely difficult to trigger a transaction overflow there. The deletions won't cause any problem, since on that path we use the !force path, so they can't overflow the transaction.

> Can you say anything about alternatives you might have considered that
> don't entail adding undo/restart logic to every user of UFS_TRUNCATE
> and why they were unfit?

Right now every caller of UFS_TRUNCATE() does it implicitly, via ufs_truncate(). That function currently tries to avoid a big transaction by truncating on an indirect block boundary with wapbl, but that fails to account for bigger directories and is of course slow.

There is an alternative patch in kern/49175, which merely changes ufs_truncate() to check the number of real blocks and adjust the truncate size to fit the transaction within that limit. That patch however exclusively locks the transaction in order to keep parallel operations from exhausting the dealloc limit, and I don't like how it moves logic about allocated blocks away from ffs_alloc/ffs_inode.

I considered another alternative with dealloc pre-registration, avoiding the exclusive lock as well. But that would still require partial truncate support similar to my patch in order to avoid a stall when there are several parallel requests, so basically it doesn't bring any extra value. Pre-registration also requires some accounting, hence extra code, so it looked simpler to go without.

Jaromir
WAPBL fix for deallocation exhaustion + slow file removal
Hi,

the attached patch contains a fix for the WAPBL deallocation structure exhaustion and panic (kern/47146), and avoids the need to do slow partial truncates in a loop, fixing kern/49175.

The patch changes wapbl_register_deallocation() to fail with EAGAIN when we run into the limit, and changes ffs_truncate() and ffs_indirtrunc() to bail out early as well when this happens. Callers were updated to either explicitly not care, or to loop with a transaction restart when required by semantics (inactive, setattr).

There is one place in the code that really can't easily cope with failure - the fragment extension code, when we deallocate the previous fragment block. Since that particular condition is very rare (it happens at most 7 times during a file's lifetime, and only when the filesystem is very fragmented), I've just added forced registration there.

There are some slight semantic changes. ffs_indirtrunc() now always holds the block buf for the upper indirect node when descending, and the block is written only when everything is done. Also, the code assumes that nothing in ffs_truncate() triggers EAGAIN for any other purpose. Besides this, the partial deallocation might leave holes in the file, as the code deallocates the whole indirect blocks before the last partial one. I think this is okay even for dirs, so it shouldn't cause problems.

I plan to commit this sometime tomorrow. Comments welcome.

Jaromir

wapbl_fix_dealloc.patch
Description: Binary data
Re: Plan: journalling fixes for WAPBL
I think it's a fair assessment to say that on SATA with NCQ and 31 tags (the maximum is actually 31, not 32), it's pretty much impossible to get acceptable write performance without using the write cache. We could never saturate even a drive with a 16MB cache with just 31 tags and a 64k MAXPHYS. So it's IMO not useful to design for a world without a disk drive write cache.

Back to the discussion about B_ORDERED: as was said before, the SCSI ORDERED tag does precisely what we want for the journal commit record - it forces all previous commands sent to the controller to finish before the ORDERED-tagged one is processed, and any commands sent after the ORDERED-tagged one execute only after that command finishes. No need for any bufq magic there, which is wonderful. Too bad that NCQ doesn't provide this.

That said, we still need to be sure that all the previous commands were sent before pushing the ORDERED command to the SCSI controller. Are there any SCSI controllers with multiple submission queues (like NVMe), regardless of our scsipi layer MP limitations? FWIW AHCI is single-threaded by design; every command submission has to write to the same set of registers.

Jaromir

2016-09-23 19:51 GMT+02:00 Manuel Bouyer :
> On Fri, Sep 23, 2016 at 01:46:09PM -0400, Thor Lancelot Simon wrote:
>> > > This seems like the key thing needed to avoid FUA: to implement fsync()
>> > > you just wait for notifications of completion to be received, and once
>> > > you have those for all requests pending when fsync was called, or
>> > > started as part of the fsync, then you're done.
>> >
>> > *if you have the write cache disabled*
>>
>> *Running with the write cache enabled is a bad idea*
>
> On ATA devices, you can't permanently disable the write cache. You have
> to do it on every power cycle.
>
> Well this really needs to be carefully evaluated.
> With only 32 tags I'm not sure you can efficiently use recent devices
> with the write cache disabled (most enterprise disks have a 64M cache
> these days)
>
> --
> Manuel Bouyer
> NetBSD: 26 years of experience will always make the difference
> --
Re: CVS commit: src/sys/arch
Hey Maxime,

Seems the KASSERTs() are too aggressive, or there is some other bug. I can trigger the KASSERT by simply attaching to rump_ffs, setting a breakpoint and continuing, i.e.:

> rump_ffs -o log ./ffs ./mnt
> gdb rump_ffs
...
(gdb) attach RUMP_PID
(gdb) break ffs_truncate
Breakpoint 1 at 0xad0b951f: file /usr/home/dolecek/netbsd/sys/rump/fs/lib/libffs/../../../../ufs/ffs/ffs_inode.c, line 210.
(gdb) cont
panic: kernel diagnostic assertion "onfault == kcopy_fault || rcr2() < VM_MAXUSER_ADDRESS" failed: file "../../../../arch/i386/i386/trap.c", line 358

Could you please look at it? I'll disable the KASSERT() in my local tree so that I can keep developing. But it would be a good idea to check what's so special about what gdb is doing that it trips over this.

Thank you.

Jaromir

2016-09-16 13:48 GMT+02:00 Maxime Villard :
> Module Name:    src
> Committed By:   maxv
> Date:           Fri Sep 16 11:48:10 UTC 2016
>
> Modified Files:
>         src/sys/arch/amd64/amd64: trap.c
>         src/sys/arch/i386/i386: trap.c
>
> Log Message:
> Put two KASSERTs, to make sure the fault is happening in the correct
> half of the vm space when using special copy functions. It can detect
> bugs where the kernel would fault when copying a kernel buffer which
> it wrongly believes comes from userland.
>
> To generate a diff of this commit:
> cvs rdiff -u -r1.84 -r1.85 src/sys/arch/amd64/amd64/trap.c
> cvs rdiff -u -r1.278 -r1.279 src/sys/arch/i386/i386/trap.c
>
> Please note that diffs are not public domain; they are subject to the
> copyright notices on the relevant files.