Re: weird hangs in current (ghc, gnucash)
... and probably 3. PR kern/57660

https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=57660

Markus

On Sun, 22 Oct 2023 at 23:10, Thomas Klausner wrote:
>
> On Sun, Oct 22, 2023 at 11:06:25PM +0200, Thomas Klausner wrote:
> > On Sun, Oct 22, 2023 at 10:37:54PM +0200, Thomas Klausner wrote:
> > > I've just updated my kernel from 10.99.10 to 10.99.10 (~ Oct 11 to Oct
> > > 20) to test the rge(4) changes, and started a bulk build, and the
> > > packages using ghc seem to wait for something and make no progress.
> > ...
> > > I see one other new weird behaviour on that machine - gnucash doesn't
> > > finish starting up.
> >
> > I've backed out ad's changes from the 13th, and both problems are gone.
> >
> > I'll attach my local change.
> >
> > Andrew, can you please take a look?
>
> Two test cases to see the problem I have:
>
> 1. start gnucash, it doesn't finish starting up, the splash screen hangs.
>
> 2. cd /usr/pkgsrc/devel/hs-data-array-byte && make
>
> The 'build' step has two parts, it hangs after the first one.
>
>  Thomas
daily CVS update output
Updating src tree:
P src/external/bsd/top/dist/machine/m_netbsd.c
P src/sys/arch/amd64/conf/GENERIC
P src/sys/arch/i386/conf/GENERIC
P src/tests/usr.bin/indent/lsym_unary_op.c
P src/tests/usr.bin/indent/opt_bad.c
P src/tests/usr.bin/xlint/lint1/decl_direct_abstract.c

Updating xsrc tree:

Killing core files:

Updating release-8 src tree (netbsd-8):

Updating release-8 xsrc tree (netbsd-8):

Updating release-9 src tree (netbsd-9):

Updating release-9 xsrc tree (netbsd-9):

Updating release-10 src tree (netbsd-10):
U doc/CHANGES-10.0
P sys/arch/x86/pci/pci_machdep.c
P sys/arch/x86/x86/genfb_machdep.c
P sys/dev/pci/if_rge.c

Updating release-10 xsrc tree (netbsd-10):

Updating file list:
-rw-rw-r-- 1 srcmastr netbsd 43598829 Oct 23 03:13 ls-lRA.gz
Re: weird hangs in current (ghc, gnucash)
On Sun, Oct 22, 2023 at 11:06:25PM +0200, Thomas Klausner wrote:
> On Sun, Oct 22, 2023 at 10:37:54PM +0200, Thomas Klausner wrote:
> > I've just updated my kernel from 10.99.10 to 10.99.10 (~ Oct 11 to Oct
> > 20) to test the rge(4) changes, and started a bulk build, and the
> > packages using ghc seem to wait for something and make no progress.
> ...
> > I see one other new weird behaviour on that machine - gnucash doesn't
> > finish starting up.
>
> I've backed out ad's changes from the 13th, and both problems are gone.
>
> I'll attach my local change.
>
> Andrew, can you please take a look?

Two test cases to see the problem I have:

1. start gnucash, it doesn't finish starting up, the splash screen hangs.

2. cd /usr/pkgsrc/devel/hs-data-array-byte && make

The 'build' step has two parts, it hangs after the first one.

 Thomas
Re: weird hangs in current (ghc, gnucash)
On Sun, Oct 22, 2023 at 10:37:54PM +0200, Thomas Klausner wrote:
> I've just updated my kernel from 10.99.10 to 10.99.10 (~ Oct 11 to Oct
> 20) to test the rge(4) changes, and started a bulk build, and the
> packages using ghc seem to wait for something and make no progress.
...
> I see one other new weird behaviour on that machine - gnucash doesn't
> finish starting up.

I've backed out ad's changes from the 13th, and both problems are gone.

I'll attach my local change.

Andrew, can you please take a look?

Thanks,
 Thomas

Module Name:	src
Committed By:	ad
Date:		Fri Oct 13 18:48:56 UTC 2023

Modified Files:
	src/sys/kern: kern_condvar.c kern_sleepq.c
	src/sys/rump/librump/rumpkern: locks.c locks_up.c
	src/sys/sys: condvar.h lwp.h

Log Message:
Add cv_fdrestart() (better name suggestions welcome):

Like cv_broadcast(), but make any LWPs that share the same file descriptor
table as the caller return ERESTART when resuming. Used to dislodge LWPs
waiting for I/O that prevent a file descriptor from being closed, without
upsetting access to the file (not descriptor) made from another direction.

To generate a diff of this commit:
cvs rdiff -u -r1.59 -r1.60 src/sys/kern/kern_condvar.c
cvs rdiff -u -r1.83 -r1.84 src/sys/kern/kern_sleepq.c
cvs rdiff -u -r1.86 -r1.87 src/sys/rump/librump/rumpkern/locks.c
cvs rdiff -u -r1.12 -r1.13 src/sys/rump/librump/rumpkern/locks_up.c
cvs rdiff -u -r1.17 -r1.18 src/sys/sys/condvar.h
cvs rdiff -u -r1.227 -r1.228 src/sys/sys/lwp.h

Module Name:	src
Committed By:	ad
Date:		Fri Oct 13 18:50:39 UTC 2023

Modified Files:
	src/sys/kern: uipc_socket.c uipc_syscalls.c
	src/sys/sys: socketvar.h

Log Message:
Use cv_fdrestart() to implement fo_restart.

To generate a diff of this commit:
cvs rdiff -u -r1.305 -r1.306 src/sys/kern/uipc_socket.c
cvs rdiff -u -r1.208 -r1.209 src/sys/kern/uipc_syscalls.c
cvs rdiff -u -r1.165 -r1.166 src/sys/sys/socketvar.h

Module Name:	src
Committed By:	ad
Date:		Fri Oct 13 19:07:09 UTC 2023

Modified Files:
	src/sys/ddb: db_command.c db_interface.h db_xxx.c
	src/sys/kern: sys_pipe.c
	src/sys/sys: pipe.h
	src/usr.bin/fstat: fstat.c

Log Message:
Simplify/streamline pipes a little bit:

- Allocate only one struct pipe not two (no need to be bidirectional here).
- Then use f_flag (FREAD/FWRITE) to figure out what to do in the fileops.
- Never wake the other side or acquire long-term (I/O) lock unless needed.
- Whenever possible, defer wakeups until after locks have been released.
- Do some things locklessly in pipe_ioctl() and pipe_poll().

Some notable results:

- -30% latency on a 486DX2/66 doing 1 byte ping-pong within a single process.
- 2.5x less lock contention during "make cleandir" of src on a 48 CPU machine.
- 1.5x bandwith with 1kB messages on the same 48 CPU machine (8kB: same b/w).

To generate a diff of this commit:
cvs rdiff -u -r1.186 -r1.187 src/sys/ddb/db_command.c
cvs rdiff -u -r1.41 -r1.42 src/sys/ddb/db_interface.h
cvs rdiff -u -r1.77 -r1.78 src/sys/ddb/db_xxx.c
cvs rdiff -u -r1.164 -r1.165 src/sys/kern/sys_pipe.c
cvs rdiff -u -r1.39 -r1.40 src/sys/sys/pipe.h
cvs rdiff -u -r1.118 -r1.119 src/usr.bin/fstat/fstat.c

ad.backed.out.diff.gz
Description: Binary data
weird hangs in current (ghc, gnucash)
Hi!

I've just updated my kernel from 10.99.10 to 10.99.10 (~ Oct 11 to Oct
20) to test the rge(4) changes, and started a bulk build, and the
packages using ghc seem to wait for something and make no progress.

In one of my sandboxes there is a hs-data-array-byte build but it's
not doing anything. The log stops at:

===> Creating toolchain wrappers for hs-data-array-byte-0.1.0.1nb2
===> Configuring for hs-data-array-byte-0.1.0.1nb2
=> Checking for portability problems in extracted files
[1 of 2] Compiling Main ( Setup.hs, Setup.o )

From ps:

pbulk 26131 0.0 0.1 1073923564 140684 ? Il8:23PM 0:00.23 /usr/pkg/lib/ghc-9.4.7/bin/./ghc-9.4.7 -B/usr/pkg/lib/ghc-9.4.7/lib -package-env - --make Setup -dynamic

(btw, that is a really huge process size?!)

Attaching with gdb shows me:

[Switching to LWP 20090 of process 26131]
0x7195fa607a1a in ___lwp_park60 () from /usr/lib/libc.so.12
(gdb) bt
#0 0x7195fa607a1a in ___lwp_park60 () from /usr/lib/libc.so.12
#1 0x7195fa97dc4d in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2 0x7195faae1472 in waitCondition (pCond=pCond@entry=0x7195fa22f010, pMut=pMut@entry=0x7195fa22f038) at rts/posix/OSThreads.c:143
#3 0x7195faa903e1 in waitForWorkerCapability (task=) at rts/Capability.c:707
#4 yieldCapability (pCap=pCap@entry=0x7195f77fff10, task=task@entry=0x7195fa22f000, gcAllowed=gcAllowed@entry=true) at rts/Capability.c:1011
#5 0x7195faab0026 in scheduleYield (task=0x7195fa22f000, pcap=0x7195f77fff08) at rts/Schedule.c:709
#6 schedule (initialCapability=initialCapability@entry=0x7195fab21cc0 , task=task@entry=0x7195fa22f000) at rts/Schedule.c:319
#7 0x7195faab20b9 in scheduleWorker (cap=cap@entry=0x7195fab21cc0 , task=task@entry=0x7195fa22f000) at rts/Schedule.c:2668
#8 0x7195faab78a2 in workerStart (task=0x7195fa22f000) at rts/Task.c:444
#9 0x7195fa97f2df in pthread.create_tramp () from /usr/lib/libpthread.so.1
#10 0x7195fa5f0c60 in ?? () from /usr/lib/libc.so.12
#11 0x0020 in ?? ()
#12 0x in ?? ()
(gdb) thread apply all bt

Thread 6 (LWP 26131 of process 26131 ""):
#0 0x7195fa607a1a in ___lwp_park60 () from /usr/lib/libc.so.12
#1 0x7195fa97dc4d in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2 0x7195faae1472 in waitCondition (pCond=pCond@entry=0x7195fa2b2010, pMut=pMut@entry=0x7195fa2b2038) at rts/posix/OSThreads.c:143
#3 0x7195faa903e1 in waitForWorkerCapability (task=) at rts/Capability.c:707
#4 yieldCapability (pCap=pCap@entry=0x7f7fff2287c0, task=task@entry=0x7195fa2b2000, gcAllowed=gcAllowed@entry=true) at rts/Capability.c:1011
#5 0x7195faab0026 in scheduleYield (task=0x7195fa2b2000, pcap=0x7f7fff2287b8) at rts/Schedule.c:709
#6 schedule (initialCapability=initialCapability@entry=0x7195fab21cc0 , task=task@entry=0x7195fa2b2000) at rts/Schedule.c:319
#7 0x7195faab2069 in scheduleWaitThread (tso=0x4200406ce8, ret=ret@entry=0x0, pcap=pcap@entry=0x7f7fff228940) at rts/Schedule.c:2651
#8 0x7195faaa85fb in rts_evalLazyIO (cap=cap@entry=0x7f7fff228940, p=p@entry=0x1071e60, ret=ret@entry=0x0) at rts/RtsAPI.c:566
#9 0x7195faaabb48 in hs_main (argc=, argv=, main_closure=0x1071e60, rts_config=...) at rts/RtsMain.c:72
#10 0x01063124 in main ()

Thread 5 (LWP 7329 of process 26131 "ghc_ticker"):
#0 0x7195fa607a1a in ___lwp_park60 () from /usr/lib/libc.so.12
#1 0x7195fa97dc4d in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2 0x7195faae1472 in waitCondition (pCond=pCond@entry=0x7195fab21bc0 , pMut=pMut@entry=0x7195fab21b80 ) at rts/posix/OSThreads.c:143
#3 0x7195faae040e in itimer_thread_func (_handle_tick=0x7195faab9c57 ) at rts/posix/ticker/Pthread.c:140
#4 0x7195fa97f2df in pthread.create_tramp () from /usr/lib/libpthread.so.1
#5 0x7195fa5f0c60 in ?? () from /usr/lib/libc.so.12
#6 0x in ?? ()

Thread 4 (LWP 15032 of process 26131 "ghc_worker"):
#0 0x7195fa5a030a in _sys___kevent100 () from /usr/lib/libc.so.12
#1 0x7195fa97a8a7 in __kevent100 () from /usr/lib/libpthread.so.1
#2 0x7195fba014f2 in base_GHCziEventziKQueue_new12_info () from /usr/pkg/lib/ghc-9.4.7/lib/x86_64-netbsd-ghc-9.4.7/libHSbase-4.17.2.0-ghc9.4.7.so
#3 0x in ?? ()

Thread 3 (LWP 17781 of process 26131 "ghc_worker"):
#0 0x7195fa5a016a in poll () from /usr/lib/libc.so.12
#1 0x7195fa97ae63 in poll () from /usr/lib/libpthread.so.1
#2 0x7195fba0ff55 in ?? () from /usr/pkg/lib/ghc-9.4.7/lib/x86_64-netbsd-ghc-9.4.7/libHSbase-4.17.2.0-ghc9.4.7.so
#3 0x in ?? ()

Thread 2 (LWP 23219 of process 26131 "ghc_worker"):
#0 0x7195fa607a1a in ___lwp_park60 () from /usr/lib/libc.so.12
#1 0x7195fa97dc4d in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2 0x7195faae1472 in waitCondition (pCond=pCond@entry=0x7195fa2b2190, pMut=pMut@entry=0x7195fa2b21b8) at rts/posix/OSThreads.c:143
#3
Re: file-backed cgd backup question
mlel...@serpens.de (Michael van Elst) writes:

> g...@lexort.com (Greg Troxel) writes:
>
>>> vnd opens the backing file when the unit is created and closes
>>> the backing file when the unit is destroyed. Then you can access
>>> the file again.
>
>>Is there a guarantee of cache consistency for writes before and reads
>>after?
>
> Before the unit is created you can access the file and after the
> unit is destroyed you can access the file. That's always safe.

Sorry if I'm failing to understand something obvious, but with a caching
layer that has file contents, how are the cache contents invalidated?
Specifically (but loosely in commands), let's assume the vnd is small and
there is a lot of RAM available:

  process opens the file and reads it
  vnconfig
  mount vnd0 /mnt
  date > /mnt/somefile
  umount /mnt
  vnconfig -u
  process opens the file and reads it

Without fs cache invalidation, stale data can be returned. If there is
explicit invalidation, it would be nice to say that precisely, but I am
not understanding that it is there.

Reading vnd.c, I don't see any cache invalidation on detach. The only
explicit invalidation I find is in setcred from VNDIOCSET. I guess that
prevents the above, but doesn't prevent

  vnconfig
  mount
  read backing file
  write to mount
  unmount
  detach
  read backing file

so maybe we need a vinvalbuf on detach?

> I also think that when the unit is configured but not opened
> (by device access or mounts) it is safe to access the file.

As I read the code, reads are ok but will leave possibly stale data in
the cache for post-close.

>>> The data is written directly to the allocated blocks of the file.
>>> So exclusively opening the backing file _or_ the vnd unit should
>>> also be safe. But that's not much different from accessing any file
>>> concurrently, which also leads to "corrupt", inconsistent backups.
>
>>That's a different kind of corrupt.
>
> Yes, but in the end it's the same, the "backup" isn't usable.
I am expecting that after deconfiguring, a read of the entire file is
guaranteed consistent, but I think we're missing invalidate on close.

> You cannot access the backing file to get a consistent state of the
> data while a unit is in use. And that's independent of how vnd accesses
> the bits.

Agreed; that's more or less like using a backup program on database
files while the database is running.

> N.B. if you want to talk about dangers, think about fdiscard(). I
> doubt that it is safe in the context of the vnd optimization.

It seems clear that pretty much any file operations are unsafe while the
vnd is configured. That seems like an entirely reasonable situation and
if that's the rule, easy to document.

I wrote a test script and it shows that stale reads happen. When I run
this on UFS2 (netbsd-10), I find that all 4 files are all zero. When I
run it on zfs (also netbsd-10), I find that 000 and 001 are all zero and
002 and 003 are the same. (I am guessing that zfs doesn't use the direct
operations, or caches differently; here I haven't the slightest idea
what is happening.)

10 minutes later, reading VND is still all zeros. With a new vnconfig,
it still reads as all zeros.

#!/bin/sh

dd if=/dev/zero of=VND bs=1m count=1
cat VND > VND.000
vnconfig vnd0 VND
cat VND > VND.001
newfs /dev/rvnd0a
cat VND > VND.002
vnconfig -u vnd0
cat VND > VND.003
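The snapshots written by the test script above can be checked mechanically
rather than by eye. What follows is only a sketch: the classify function is
a hypothetical helper that is not part of the original script, and it
assumes the VND.000 to VND.003 files sit in the current directory. It
reports whether each snapshot consists solely of NUL bytes.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original test script.

# classify FILE: print "zero" if FILE holds only NUL bytes, "data" otherwise.
classify() {
    # Strip every NUL byte and count what is left.
    n=$(tr -d '\000' < "$1" | wc -c)
    # Arithmetic expansion drops the padding some wc implementations emit.
    if [ "$((n))" -eq 0 ]; then
        echo zero
    else
        echo data
    fi
}

# Report on whichever snapshots exist in the current directory.
for f in VND.000 VND.001 VND.002 VND.003; do
    if [ -f "$f" ]; then
        printf '%s: %s\n' "$f" "$(classify "$f")"
    fi
done
```

If reads after newfs were not stale, VND.002 and VND.003 would be expected
to report "data"; the UFS2 result described above corresponds to all four
reporting "zero". Running cmp on VND.002 and VND.003 additionally shows
whether those two snapshots are identical.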
Re: dtracing unlink
On Sun, 22 Oct 2023, Thomas Klausner wrote:

> Yes, then we're back at the start:
>
> dtrace: error on enabled probe ID 2 (ID 405: syscall::unlink:return): invalid address (0x77002a73f7ce) in action #1 at DIF offset 12
> : No such file or directory

Eh? Then we'd better ring for a dtrace guru because I'm stumped. The right
one works for me...and also the wrong one!

-RVP
Re: dtracing unlink
On Sun, Oct 22, 2023 at 07:40:17AM +, RVP wrote:
> Ah, that attachment is still based on _my_ version which is plain wrong: You
> can't do copyinstr(arg0) in the :entry action because the kernel may not have
> paged in the memory containing the pathname (yet).
>
> Use your version (which is correct--it does copyinstr() in :return when the
> kernel is sure to have the pathname already in memory):

Yes, then we're back at the start:

dtrace: error on enabled probe ID 2 (ID 405: syscall::unlink:return): invalid address (0x77002a73f7ce) in action #1 at DIF offset 12
: No such file or directory

 Thomas
Re: dtracing unlink
On Sun, 22 Oct 2023, Thomas Klausner wrote:

> I tried that (see attachment), didn't help.
>
> dtrace: error on enabled probe ID 1 (ID 404: syscall::unlink:entry): invalid address (0x7a8e0685a7ce) in action #1 at DIF offset 12
> : No such file or directory
> dtrace: error on enabled probe ID 2 (ID 405: syscall::unlink:return): invalid address (0x0) in action #2
> : No such file or directory
> dtrace: error on enabled probe ID 1 (ID 404: syscall::unlink:entry): invalid address (0x7a8e0685a7ce) in action #1 at DIF offset 12
> : No such file or directory
> dtrace: error on enabled probe ID 2 (ID 405: syscall::unlink:return): invalid address (0x0) in action #2
> : No such file or directory

Ah, that attachment is still based on _my_ version which is plain wrong: You
can't do copyinstr(arg0) in the :entry action because the kernel may not have
paged in the memory containing the pathname (yet).

Use your version (which is correct--it does copyinstr() in :return when the
kernel is sure to have the pathname already in memory):

```
#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::unlink:entry
/pid == MY_PID/
{
	self->file = arg0;
}

syscall::unlink:return
/pid == MY_PID/
{
	printf("%s\n", copyinstr(self->file));
	self->file = 0;
}
```

> The machine has 128 GB RAM and ~450 GB swap. I haven't tried limiting
> the RAM from BIOS yet.

Don't bother: this is a red herring barking up the wrong tree.

-RVP
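The MY_PID placeholder in the script above has to be replaced by hand
before each run. As a rough sketch only, a small shell function can do the
substitution; unlink_script is a hypothetical name that does not come from
the thread, and the D program it prints is the one quoted above with the
pid filled in. It is not claimed to resolve the "invalid address" errors
reported in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper, not from the thread: fill in MY_PID from an argument.

# unlink_script PID: print the D program with MY_PID replaced by PID.
unlink_script() {
    cat <<EOF
#pragma D option quiet

syscall::unlink:entry
/pid == $1/
{
	self->file = arg0;
}

syscall::unlink:return
/pid == $1/
{
	printf("%s\n", copyinstr(self->file));
	self->file = 0;
}
EOF
}
```

A possible use is to write the program to a file and hand it to dtrace,
for example `unlink_script 28651 > unlink.d` followed by
`dtrace -s unlink.d`. Both probes carry the same pid predicate, so the
:return clause only fires for the process whose :entry stored the pointer.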
Re: dtracing unlink
On Sun, Oct 22, 2023 at 06:00:43AM +, RVP wrote:
> On Fri, 20 Oct 2023, Thomas Klausner wrote:
>
> > # dtrace -n syscall::unlink:entry'/pid == 27647/{ self->file = arg0; }' -n
> > syscall::unlink:return'{ trace(copyinstr(self->file)); self->file = 0; }'
> >
> > but this just gives me lots of
> >
> > dtrace: error on enabled probe ID 2 (ID 405: syscall::unlink:return):
> > invalid address (0x79c4586577ce) in action #1 at DIF offset 12
> > : No such file or directory
> >
>
> Actually, this command-line is almost correct. What's missing is the paired
> /pid == 27647/ for syscall::unlink:return. Without it, unlink:return is called
> for _every_ pid and there's not going to be a valid self->file for almost every
> one of them.

I tried that (see attachment), didn't help.

dtrace: error on enabled probe ID 1 (ID 404: syscall::unlink:entry): invalid address (0x7a8e0685a7ce) in action #1 at DIF offset 12
: No such file or directory
dtrace: error on enabled probe ID 2 (ID 405: syscall::unlink:return): invalid address (0x0) in action #2
: No such file or directory
dtrace: error on enabled probe ID 1 (ID 404: syscall::unlink:entry): invalid address (0x7a8e0685a7ce) in action #1 at DIF offset 12
: No such file or directory
dtrace: error on enabled probe ID 2 (ID 405: syscall::unlink:return): invalid address (0x0) in action #2
: No such file or directory

The machine has 128 GB RAM and ~450 GB swap. I haven't tried limiting
the RAM from BIOS yet.

 Thomas

#!/usr/sbin/dtrace -s

#pragma D option destructive
#pragma D option quiet

syscall::unlink:entry
/pid == 28651/
{
	self->file = copyinstr(arg0);
}

syscall::unlink:return
/pid == 28651/
{
	printf("%d %s\n", pid, self->file);
	self->file = 0;
}
Re: file-backed cgd backup question
g...@lexort.com (Greg Troxel) writes:

>> vnd opens the backing file when the unit is created and closes
>> the backing file when the unit is destroyed. Then you can access
>> the file again.

>Is there a guarantee of cache consistency for writes before and reads
>after?

Before the unit is created you can access the file and after the
unit is destroyed you can access the file. That's always safe.

I also think that when the unit is configured but not opened
(by device access or mounts) it is safe to access the file.

>> The data is written directly to the allocated blocks of the file.
>> So exclusively opening the backing file _or_ the vnd unit should
>> also be safe. But that's not much different from accessing any file
>> concurrently, which also leads to "corrupt", inconsistent backups.

>That's a different kind of corrupt.

Yes, but in the end it's the same, the "backup" isn't usable.

You cannot access the backing file to get a consistent state of the
data while a unit is in use. And that's independent of how vnd accesses
the bits.

N.B. if you want to talk about dangers, think about fdiscard(). I
doubt that it is safe in the context of the vnd optimization.
Re: dtracing unlink
On Fri, 20 Oct 2023, Thomas Klausner wrote:

> # dtrace -n syscall::unlink:entry'/pid == 27647/{ self->file = arg0; }' -n syscall::unlink:return'{ trace(copyinstr(self->file)); self->file = 0; }'
>
> but this just gives me lots of
>
> dtrace: error on enabled probe ID 2 (ID 405: syscall::unlink:return): invalid address (0x79c4586577ce) in action #1 at DIF offset 12
> : No such file or directory

Actually, this command-line is almost correct. What's missing is the paired
/pid == 27647/ for syscall::unlink:return. Without it, unlink:return is called
for _every_ pid and there's not going to be a valid self->file for almost every
one of them.

HTH,

-RVP