Re: our little daemon abused as symbol of the evil
I have gotten word from the authors that they are aware of the problem and are correcting it (e.g., taking out the daemon). Kirk McKusick =-=-=-= From: Engin Kirda e...@iseclab.org Date: Wed, 3 Feb 2010 19:03:49 +0100 To: mckus...@mckusick.com Subject: BSD logo misuse Cc: Gilbert Wondracek gilb...@iseclab.org, Thorsten Holz t...@iseclab.org, Christopher Kruegel ch...@cs.ucsb.edu Kirk, A colleague from Symantec pointed out the discussion about the BSD logo that we have, apparently, misused in our paper without realizing that it was the BSD logo :-/ We'd like to apologize for this. It was not intentional. The PDF we put up is a technical report and we can easily correct this. We'll make sure that we do not use it in the camera-ready version of the published paper. Best regards, --Engin ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: our little daemon abused as symbol of the evil
Thanks for the pointer. As you note, the damage (or benefit :-) is done. Still, I have sent an email to the editor at Spiegel notifying them of my copyright, in the hope that they will at least ask in the future. Kirk McKusick =-=-=-= From: Julian H. Stacey j...@berklix.com Date: Tue, 02 Feb 2010 19:30:29 +0100 To: Christoph Kukulies k...@kukulies.org Subject: Re: our little daemon abused as symbol of the evil Cc: freebsd-hackers@freebsd.org, Kirk McKusick mckus...@mckusick.com Organization: http://www.berklix.com BSD Unix Linux Consultancy, Munich Germany Christoph Kukulies wrote: Look here: http://www.spiegel.de/fotostrecke/fotostrecke-51396-2.html (Well spotted, Christoph!) For those that don't read German, tracing back, the text article starts here: http://www.spiegel.de/netzwelt/web/0,1518,675395,00.html That is in German (some might like a translation site, e.g. http://babelfish.org ). I did read the German article (but skipped the graphics). Key paragraph (translated from the German): "What Thorsten Holz, Gilbert Wondracek, Engin Kirda and Christopher Kruegel describe in their 15-page paper (PDF file here, 803 KB) is a horror scenario for privacy advocates: the experts from the Isec research laboratory for IT security, a cooperation of the Technische Universität Wien, the Institute Eurecom and the University of California, document a technically rather simple attack that exploits a security hole that has been known for ten years." In that key paragraph I could click to download sonda-TR.pdf (though now I can't seem to re-download http://www.iseclab.org/papers/sonda-TR.pdf ). A 15-page article in English. Page 4 uses the Firefox & BSD logos. I haven't read that English version [yet], but with it, anyone interested here can now read it and form their own opinions on whether it seems fair to use the daemon logo, especially the cc'd copyright holder of the BSD daemon: Kirk McKusick mckus...@mckusick.com IMO the German article by the weekly magazine Spiegel.de didn't really seem to have anything to do with BSD; they just copied the graphics.
Personally, my 2c: my initial reaction was that I'd be happier if a generic PC graphic had been used on the spiegel.de web page, but maybe it's the price of fame. I guess the tests were done using BSD, and Spiegel thought it was a nice colourful graphic. (Politicians never looked good on the British TV programme Spitting Image, but they learnt it was better to look bad there and be talked about than to be not seen, not recognised, ignored.) Cheers, Julian -- Julian Stacey: BSD Unix Linux C Sys Eng Consultants Munich http://berklix.com Mail plain text, not quoted-printable, HTML or Base64 http://www.asciiribbon.org
Re: Possible softupdates bug when an indirect block buffer is reused
This has been a long-nagging problem that was finally tracked down and fixed by Stephan Uphoff [EMAIL PROTECTED]. See revision 1.182 on 2005/07/31 to sys/ufs/ffs/ffs_softdep.c. Kirk McKusick =-=-=-=-=-=-= Date: Sun, 31 Jul 2005 11:40:32 -0700 (PDT) From: Matthew Dillon [EMAIL PROTECTED] To: Kirk McKusick [EMAIL PROTECTED] Cc: freebsd-hackers@freebsd.org Subject: Possible softupdates bug when an indirect block buffer is reused X-ASK-Info: Whitelist match [from [EMAIL PROTECTED] (2005/07/31 11:40:52) Hi Kirk, hackers! I'm trying to track down a bug that is causing a buffer to be left in a locked state, which then causes the filesystem to lock up. The symptoms are that a heavily used filesystem suddenly starts running out of space. It isn't due to deleted files with open descriptors; it's due to the syncer getting stuck in a "getblk" state. This is in DragonFly, but I can't find anything DFlyish wrong, so I'm beginning to think it may be an actual bug in softupdates. I have wound up with a situation where a getblk()'d bp has been associated with an indirdep dependency, i.e. stored in indirdep->ir_savebp, but is never released. When something like the syncer comes along and tries to access it, it locks up, and this of course leads to inodes not getting cleared, and the filesystem eventually runs out of space when a lot of files are being created and deleted. What has got me really confused is that the buffer in question seems to wind up with a D_INDIRDEP dependency that points back to itself. Here's the situation from a live gdb.
Here is where the syncer is stuck:

(kgdb) back
#0  lwkt_switch () at thread2.h:95
#1  0xc02a8a79 in tsleep (ident=0x0, flags=0, wmesg=0xc04eadb0 "getblk", timo=0) at /usr/src-125beta/sys/kern/kern_synch.c:428
#2  0xc02956bb in acquire (lkp=0xc758b4e0, extflags=33554464, wanted=1536) at /usr/src-125beta/sys/kern/kern_lock.c:127
#3  0xc0295a92 in lockmgr (lkp=0xc758b4e0, flags=33620002, interlkp=0x0, td=0xd68f6400) at /usr/src-125beta/sys/kern/kern_lock.c:354
#4  0xc02d6828 in getblk (vp=0xc71b3058, blkno=94440240, size=8192, slpflag=0, slptimeo=0) at thread.h:79
#5  0xc02d4404 in bread (vp=0xc71b3058, blkno=0, size=0, bpp=0x0) at /usr/src-125beta/sys/kern/vfs_bio.c:567
#6  0xc03f24fe in indir_trunc (ip=0xe048fc0c, dbn=94440240, level=1, lbn=2060, countp=0xe048fbf8) at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2221
#7  0xc03f22df in handle_workitem_freeblocks (freeblks=0xe2fcef98) at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2138
#8  0xc03f0462 in process_worklist_item (matchmnt=0x0, flags=0) at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:726
#9  0xc03f026c in softdep_process_worklist (matchmnt=0x0) at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:625
#10 0xc02e5ff3 in sched_sync () at /usr/src-125beta/sys/kern/vfs_sync.c:244
#11 0xc0294863 in kthread_create_stk (func=0, arg=0x0, tdp=0xff80, stksize=0, fmt=0x0) at /usr/src-125beta/sys/kern/kern_kthread.c:104
(kgdb)

The buffer it is stuck on:

(kgdb) print bp
$62 = (struct buf *) 0xc758b4b8
(kgdb) print *bp
$63 = {
  b_hash = { le_next = 0x0, le_prev = 0xc7391348 },
  b_vnbufs = { tqe_next = 0xc739b890, tqe_prev = 0xc76f32b8 },
  b_freelist = { tqe_next = 0xc768d610, tqe_prev = 0xc0565ac0 },
  b_act = { tqe_next = 0x0, tqe_prev = 0x0 },
  b_flags = 536870912, <- 0x20000000 (getblk with no bread, etc)
  b_qindex = 0,
  b_xflags = 2 '\002',
  b_lock = {
    lk_interlock = { t_cpu = 0xff80, t_reqcpu = 0xff80, t_unused01 = 0 },
    lk_flags = 2098176, lk_sharecount = 0, lk_waitcount = 1,
    lk_exclusivecount = 1, lk_prio = 0, lk_wmesg = 0xc04eadb0 "getblk",
    lk_timo = 0, lk_lockholder = 0xfffe
  },
  b_error = 0, b_bufsize = 8192, b_runningbufspace = 0, b_bcount = 8192,
  b_resid = 0, b_dev = 0xde0f0e38,
  b_data = 0xcf824000 "¨\205Ð\002", b_kvabase = 0xcf824000 "¨\205Ð\002",
  b_kvasize = 16384, b_lblkno = 94440240, b_blkno = 94440240,
  b_offset = 48353402880, b_iodone = 0, b_iodone_chain = 0x0,
  b_vp = 0xc71b3058, b_dirtyoff = 0, b_dirtyend = 0, b_pblkno = 87503631,
  b_saveaddr = 0x0, b_driver1 = 0x0, b_caller1 = 0x0,
  b_pager = { pg_spc = 0x0, pg_reqpage = 0 },
  b_cluster = {
    cluster_head = { tqh_first = 0x0, tqh_last = 0xc768d6bc },
    cluster_entry = { tqe_next = 0x0, tqe_prev = 0xc768d6bc }
  },
  b_xio = {
    xio_pages = 0xc758b584, xio_npages = 2, xio_offset = 0, xio_bytes = 0,
    xio_flags = 0, xio_error = 0,
    xio_internal_pages = {0xc34e5870, 0xc4aeb2b4, 0x0 <repeats 30 times>}
  },
  b_dep = { lh_first = 0xc7045040 },
  b_chain = { parent = 0x0, count = 0
Re: snapshots and innds
Excellent detective work on your part. The invariant that is being broken here is that you are never supposed to hold a vnode locked when you call vn_start_write. The call to vn_start_write should be done in vm_object_sync before acquiring the vnode lock rather than later in vnode_pager_putpages. Of course, moving the vn_start_write out of vnode_pager_putpages means that we have to track down every other caller of vnode_pager_putpages to make sure that they have also done the vn_start_write call as well. Jeff Roberson has come up with a much cleaner way of dealing with the suspension code that I believe he is using in the -current tree. It puts a hook in the ufs_lock code that tracks the number of locks held in each filesystem. To do a suspend, it blocks all new lock requests on that filesystem by any thread that does not already hold a lock and waits for all the existing locks to be released. This obviates the need for the vn_start_write calls sprinkled all through the system. I have copied Jeff on this email so that he can comment further on this issue, as he is much more up to speed on it at the moment than I am. Kirk McKusick =-=-=-=-=-=-= From: [EMAIL PROTECTED] (Steve Watt) Date: Sun, 22 May 2005 14:02:39 -0700 In-Reply-To: [EMAIL PROTECTED] (Steve Watt) snapshots and innds (Dec 18, 17:39) To: freebsd-hackers@freebsd.org Subject: Re: snapshots and innds Cc: [EMAIL PROTECTED] X-Archived: [EMAIL PROTECTED] X-ASK-Info: Whitelist match [from [EMAIL PROTECTED] (2005/05/22 14:03:00) [ OK, there's a lot of text in here, but I have definitively found a deadlock between ffs_mksnap and msync(). ] Waaay back on Dec 18, 17:39, I wrote: } Subject: snapshots and innds } I'm getting a strong hunch that snapshots and inn don't get along } well, presumably having something to do with inn's extensive use } of mmap().
} } Just for an example, my system panic()ed earlier today (different } problem) and during the reboot, I'm stuck with an fsck_ufs on wchan } "ufs" and innd on wchan "suspfs", and neither of them responding } in any way.

And I have been seeing hangs periodically since December that all seem to implicate innd (msync()) arguing with dump (mksnap_ffs). The system is 5.4-STABLE, updated last on the (PDT) morning of 2 May. Finally, this morning, I got a kernel core dump that I can do useful stuff with. The system was mostly operating normally, except that any attempt to access the /news partition (which has articles, tradspool.map, overviews, and incoming/outgoing data) would get stuck in "suspfs". So I forced a dump from ddb. The mount point does (as one would expect) have MNTK_SUSPEND set. I see mksnap_ffs sitting waiting for "ufs" (really vnode 0xc19af318), which it got to via:

(kgdb) info stack
#0  sched_switch (td=0xc1ede780, newtd=0xc146f480, flags=1) at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc19af3c4, mtx=0xc095e4cc, priority=80, wmesg=0xc08a3f13 "ufs", timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06505d6 in acquire (lkpp=0xd02df680, extflags=16777280, wanted=1536) at /usr/src/sys/kern/kern_lock.c:161
#6  0xc0650a14 in lockmgr (lkp=0xc19af3c4, flags=16842754, interlkp=0x0, td=0xc1ede780) at /usr/src/sys/kern/kern_lock.c:389
#7  0xc07bd6e3 in ufs_lock (ap=0xd02df6bc) at /usr/src/sys/ufs/ufs/ufs_vnops.c:2007
#8  0xc07be380 in ufs_vnoperate (ap=0x0) at /usr/src/sys/ufs/ufs/ufs_vnops.c:2828
#9  0xc06c0501 in vn_lock (vp=0xc19af318, flags=65538, td=0xc1ede780) at vnode_if.h:1013
#10 0xc06b4195 in vget (vp=0xc19af318, flags=65538, td=0x0) at /usr/src/sys/kern/vfs_subr.c:2028
#11 0xc07af408 in ffs_sync (mp=0xc15e5c00, waitfor=1, cred=0xc2953080, td=0xc1ede780) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1151
#12 0xc06c0840 in vfs_write_suspend (mp=0xc15e5c00) at /usr/src/sys/kern/vfs_vnops.c:1084
#13 0xc079db18 in ffs_snapshot (mp=0xc15e5c00, snapfile=0xbfbfef1b <Address 0xbfbfef1b out of bounds>) at /usr/src/sys/ufs/ffs/ffs_snapshot.c:317
#14 0xc07ad5d8 in ffs_omount (mp=0xc15e5c00, path=0xc2a8c380 "/news", data=0x0, td=0xc1ede780) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:313
#15 0xc06af787 in vfs_domount (td=0xc1ede780, fstype=0xc1eea730 "ffs", fspath=0xc2a8c380 "/news", fsflags=18944000, fsdata=0xbfbfe7d4, compat=1) at /usr/src/sys/kern/vfs_mount.c:861
#16 0xc06aef16 in mount (td=0x0, uap=0xd02dfd04) at /usr/src/sys/kern/vfs_mount.c:620
#17 0xc0828553 in syscall (frame= [ snip ]

And inn is sitting waiting for the suspended filesystem:

(kgdb) info stack
#0  sched_switch (td=0xc1c16c00, newtd=0xc1ede780, flags=1) at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4
Re: bleh. Re: ufs_rename panic
Date: Fri, 21 Feb 2003 15:26:01 -0800 From: Terry Lambert [EMAIL PROTECTED] To: Yevgeniy Aleynikov [EMAIL PROTECTED] CC: Kirk McKusick [EMAIL PROTECTED], Matt Dillon [EMAIL PROTECTED], Ian Dowse [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], Ken Pizzini [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: bleh. Re: ufs_rename panic Yevgeniy Aleynikov wrote: As pointed out by Ken, we do have a lot of file renames (qmail). But the second solution, directory-only rename serialization, probably won't affect performance as much. But I believe it's not only us who are going to have a problem once exploit code becomes known to everybody, sooner or later. Dan's non-atomicity assumption on renames is incorrect. Even if it were correct, it's possible to recover fully following a failure, because metadata updates are ordered (there is a real synchronization between dependent operations). I think that a workaround would be to comment the directory fsync() code out of qmail, which apparently thinks it's running on extfs or an async-mounted FFS. -- Terry You cannot get rid of the fsync calls in qmail. You have to distinguish between a filesystem that is recoverable and one which loses data. When receiving an incoming message, SMTP requires that the receiver have the message in stable store before acknowledging receipt. The only way to know that it is in stable store is to fsync it before responding. Kirk McKusick To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: bleh. Re: ufs_rename panic
The potentially slow, but utterly effective, way to fix this race is to allow only one rename at a time per filesystem. It is implemented by adding a flag in the mount structure and using it to serialize calls to rename. When only one rename can happen at a time, the race cannot occur. If this proves to be too much of a slowdown, it would be possible to serialize only directory renames. As these are (presumably) much rarer, the slowdown would be less noticeable. Kirk McKusick =-=-=-=-=-= Date: Wed, 19 Feb 2003 15:10:09 -0800 From: Yevgeniy Aleynikov [EMAIL PROTECTED] To: Matt Dillon [EMAIL PROTECTED] CC: Kirk McKusick [EMAIL PROTECTED], Ian Dowse [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], Ken Pizzini [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: bleh. Re: ufs_rename panic X-ASK-Info: Confirmed by User Just a reminder that this problem is a local kernel-panic DoS (which can cause filesystem corruption) with very simple trigger code, and it still exists. And it's been almost 2 years since I wrote about it. The workaround (commenting out the panic call) doesn't fix the problem. The server still crashes (not so often, though) from virtual memory failures like this:

panic: vm_fault: fault on nofault entry, addr: d0912000
mp_lock = 0102; cpuid = 1; lapic.id =
boot() called on cpu#1

(kgdb) bt
#0  0xc0175662 in dumpsys ()
#1  0xc017542c in boot ()
#2  0xc0175894 in poweroff_wait ()
#3  0xc01e7c18 in vm_fault ()
#4  0xc0219d32 in trap_pfault ()
#5  0xc021990b in trap ()
#6  0xc01e008a in ufs_dirrewrite ()
#7  0xc01e31a4 in ufs_rename ()
#8  0xc01e4645 in ufs_vnoperate ()
#9  0xc01a9121 in rename ()
#10 0xc021a44d in syscall2 ()
#11 0xc02077cb in Xint0x80_syscall ()

How can I help to resolve this problem ASAP? Thanks! Matt Dillon wrote: Well, I've gone through hell trying to fix the rename()/rmdir()/remove() races and failed utterly.
There are far more race conditions than even my last posting indicated, and there are *severe* problems fixing NFS to deal with even Ian's suggestion... it turns out that NFS's nfs_namei() permanently adjusts the mbuf while processing the path name, making restarts impossible. The only solution is to implement namei cache path locking and formalize the 'nameidata' structure, which means ripping up a lot of code, because nearly the entire code base currently plays with the contents of 'nameidata' willy-nilly. Nothing else will work. It's not something that I can consider doing now. In the meantime I am going to remove the panic()s in question. This means that in ufs_rename() the machine will silently ignore the race (not do the rename) instead of panicking. It's all that can be done for the moment. It solves the security/attack issue. We'll have to attack the races as a separate issue. The patch to remove the panics is utterly trivial and I will commit it after I test it. -Matt -- Yevgeniy Aleynikov | Sr. Systems Engineer - USE InfoSpace INC 601 108th Ave NE | Suite 1200 | Bellevue, WA 98004 Tel 425.709.8214 | Fax 425.201.6160 | Mobile 425.418.8924 [EMAIL PROTECTED] | http://www.infospaceinc.com Discover what you can do.TM
Re: fsck -p
Date: Wed, 20 Nov 2002 13:09:55 +0200 From: Ruslan Ermilov [EMAIL PROTECTED] To: Ian Dowse [EMAIL PROTECTED], Kirk McKusick [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: fsck -p Hi! Today I got a hard lockup with a 4.7 box. Upon reboot, ``fsck -p'' was run, and it resulted in the following, in particular:

/dev/da0s1h: UNREF FILE I=591 OWNER=nobody MODE=100644
/dev/da0s1h: SIZE=81269024 MTIME=Nov 20 09:50 2002 (CLEARED)
/dev/da0s1h: FREE BLK COUNT(S) WRONG IN SUPERBLK (SALVAGED)
/dev/da0s1h: SUMMARY INFORMATION BAD (SALVAGED)
/dev/da0s1h: BLK(S) MISSING IN BIT MAPS (SALVAGED)

I thought that the correct action here would be to reconnect this file under the filesystem's lost+found, but it did not happen. Why? (I've lost a week of useful squid access.log.) Cheers, Ruslan Ermilov Sysadmin and DBA, [EMAIL PROTECTED] Sunbay Software AG, [EMAIL PROTECTED] FreeBSD committer, +380.652.512.251 Simferopol, Ukraine http://www.FreeBSD.org The Power To Serve http://www.oracle.com Enabling The Information Age

The reference count on the file was zero, so the assumption by fsck is that you were in the process of removing it at the time of the crash (e.g., the name had been removed from the directory but the inode had not yet been cleared). Thus the default behavior is to finish the removal. FYI, if you had run fsck manually, it would have given you the option to save the file. Kirk McKusick
Re: bleh. Re: ufs_rename panic
The problems all arise from the fact that we unlock the source while we look up the destination, and when we return to relookup the source, it may have changed/moved/disappeared. The reason to unlock the source before looking up the destination was to avoid deadlocking against ourselves on a lock that we held associated with the source. Since we now allow recursive locks on vnodes, it is no longer necessary to release the source before looking up the destination. So, it seems to me that the correct fix is to *not* release the source after looking it up, but rather hold it locked while we look up the destination. We can completely get rid of relookup and lots of other hairy code and generally make rename much simpler. Am I missing something here? ~Kirk
Re: utilizing write caching
Sorry for the slow response. I only read my freebsd.org email very occasionally. Soft updates does do most of its writes asynchronously, but it still needs to know when the data has really hit stable store. With SCSI disks, we can use tag queuing to reliably get this information. With IDE disks, the only way to get this information is to disable write-caching. Most failure scenarios allow IDE disks to write out their caches - software crashes, plug pulled out of the wall, etc. Where they cannot write out their caches are instances where the power drops nearly instantly, such as a power supply failure or the battery being pulled out of a laptop. We could decide that we are willing to lump those sorts of failures in with media failure as a class of problems that we choose not to protect against, but I think that should be a decision that users have to take an active role in making (much as they can choose to mount their filesystems async). So, I agree with the decision to turn off write caching by default, though there should be an easy way to re-enable it for those users that want to run the associated risks. Kirk McKusick =-=-=-=-=-= Date: Thu, 19 Apr 2001 00:07:12 -0700 From: Alfred Perlstein [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: Kirk McKusick [EMAIL PROTECTED] Subject: utilizing write caching I'm sure you guys remember the recent discussion wrt write caching on disks possibly causing inconsistencies for UFS and just about any filesystem or program that expects things like fsync() to actually work. The result of the discussion was that write caching was disabled for all disks. I really think this is suboptimal. I mean _really_ suboptimal; my laptop disk is a pig since the default went in for ata disks. Or maybe it's just a pig anyway, but I'd like to take a look at this. The most basic fix to gain performance back would be to have the device examine the B_ASYNC flags and decide there whether or not to perform write caching.
However, I have this strange feeling that softupdates is actually able to issue the meta-data writes with B_ASYNC set. Kirk, is this true? If so, would it be possible to tag the buffer with yet another flag saying "yes, write me async, but safely" when doing softdep disk I/O? If softupdates doesn't use B_ASYNC, then it seems trivial to make DEV_STRATEGY propagate B_ASYNC into the bio request (BIO_STRATEGY) via OR'ing something like BIO_CACHE so that the device driver could then choose to activate write caching. This is still suboptimal because we'll be turning off caching when the buffer system is experiencing a shortage and issuing sync writes in order not to deadlock, but it's still better IMO than turning it off completely. If on the other hand Kirk can figure out a quick hack to flag buffers that need completely stable storage (including fsync(2)*) ops, then I think we've got a solution. (*) I'll look at fsync and physio if the scope of fixing those seems to be too much wrt the time available.
If softupdates doesn't use B_ASYNC, something like this:

Index: sys/bio.h
===
RCS file: /home/ncvs/src/sys/sys/bio.h,v
retrieving revision 1.104
diff -u -r1.104 bio.h
--- sys/bio.h	2001/01/14 18:48:42	1.104
+++ sys/bio.h	2001/04/19 06:53:52
@@ -91,6 +91,7 @@
 #define BIO_ERROR	0x0001
 #define BIO_ORDERED	0x0002
 #define BIO_DONE	0x0004
+#define BIO_ASYNC	0x0008	/* Device may choose to write cache */
 #define BIO_FLAG2	0x4000	/* Available for local hacks */
 #define BIO_FLAG1	0x8000	/* Available for local hacks */
Index: sys/conf.h
===
RCS file: /home/ncvs/src/sys/sys/conf.h,v
retrieving revision 1.126
diff -u -r1.126 conf.h
--- sys/conf.h	2001/03/26 12:41:26	1.126
+++ sys/conf.h	2001/04/19 06:52:08
@@ -157,6 +157,8 @@
 		(bp)->b_io.bio_offset = (bp)->b_offset;		\
 	else							\
 		(bp)->b_io.bio_offset = dbtob((bp)->b_blkno);	\
+	if ((bp)->b_flags & B_ASYNC)				\
+		(bp)->b_io.bio_flags |= BIO_ASYNC;		\
 	(bp)->b_io.bio_done = bufdonebio;			\
 	(bp)->b_io.bio_caller2 = (bp);				\
 	BIO_STRATEGY(&(bp)->b_io, dummy);			\

could do the trick, no? -- -Alfred Perlstein - [[EMAIL PROTECTED]] "Instead of asking why a piece of software is using 1970s technology, start asking why software is ignoring 30 years of accumulated wisdom." - End forwarded message - -- -Alfred Perlstein - [[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
Re: vm balance
Date: Tue, 17 Apr 2001 09:49:54 -0400 (EDT) From: Robert Watson [EMAIL PROTECTED] To: Kirk McKusick [EMAIL PROTECTED] cc: Julian Elischer [EMAIL PROTECTED], Rik van Riel [EMAIL PROTECTED], [EMAIL PROTECTED], Matt Dillon [EMAIL PROTECTED], David Xu [EMAIL PROTECTED] Subject: Re: vm balance On Mon, 16 Apr 2001, Kirk McKusick wrote: I am still of the opinion that merging VM objects and vnodes would be a good idea. Although it would touch a huge number of lines of code, when the dust settled, it would simplify some nasty bits of the system. This merger is really independent of making the number of vnodes dynamic. Under the old name cache implementation, decreasing the number of vnodes was slow and hard. With the current name cache implementation, decreasing the number of vnodes would be easy. I concur that adding a dynamically sized vnode cache would help performance on some workloads. I'm interested in this idea, although profess a gaping blind spot in expertise in the area of the VM system. However, one of the aspects of our VFS that has always concerned me is that use of a single vnode simplelock funnels most of the relevant (and performance-sensitive) calls. The result is that all accesses to an object represented by a vnode are serialized, which can represent a substantial performance hit for applications such as databases, where simultaneous write would be advantageous, or for various vn-backed oddities (possibly including vnode-backed swap?). At some point, apparently an effort was made to mark up vnode_if.src with possible alternative locking using read/write locks, but given that all the consumers use exclusive locks right now, I assume that was not followed through on. 
A large part of the cost is mitigated through caching on the under-side of VFS, allowing vnode operations to return rapidly, but while this catches a number of common cases (where the file is already in the cache), there are sufficient non-common cases that I would anticipate this being a problem. Are there any performance figures available that either confirm this concern, or demonstrate that in fact it is not relevant? :-) Would this concern introduce additional funneling in the VM system, or is the granularity of locks in the VM sufficiently low that it might improve performance by combining existing broad locks? Robert N M Watson FreeBSD Core Team, TrustedBSD Project [EMAIL PROTECTED] NAI Labs, Safeport Network Services Every vnode in the system has an associated object. Every object backed by a file (e.g., everything but anonymous objects) has an associated vnode. So, the performance of one is pretty tied to the performance of the other. Matt is right that the VM does locking on a page level, but then has to get a lock on the associated vnode to do a read or a write, so really is pretty tied to the vnode lock performance. Merging the two data structures is not likely to change the performance characteristics of the system for either better or worse. But it will save a lot of headaches having to do with lock ordering that we have to deal with at the moment. Kirk McKusick
Re: vm balance
Date: Tue, 10 Apr 2001 22:14:28 -0700 From: Julian Elischer [EMAIL PROTECTED] To: Rik van Riel [EMAIL PROTECTED] CC: Matt Dillon [EMAIL PROTECTED], David Xu [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: vm balance Rik van Riel wrote: I'm curious about the other things though ... FreeBSD still seems to have the early 90's abstraction layer from Mach and the vnode cache doesn't seem to grow and shrink dynamically (which can be a big win for systems with lots of metadata activity). So while it's true that FreeBSD's VM balancing seems to be the best one out there, I'm not quite sure about the rest of the VM... Many years ago Kirk was talking about merging the vm objects and the vnodes.. (they tend to come in pairs anyhow) I still think it might be an idea worth investigating further. kirk? -- __--_|\ Julian Elischer / \ [EMAIL PROTECTED] ( OZ) World tour 2000-2001 --- X_.---._/ v I am still of the opinion that merging VM objects and vnodes would be a good idea. Although it would touch a huge number of lines of code, when the dust settled, it would simplify some nasty bits of the system. This merger is really independent of making the number of vnodes dynamic. Under the old name cache implementation, decreasing the number of vnodes was slow and hard. With the current name cache implementation, decreasing the number of vnodes would be easy. I concur that adding a dynamically sized vnode cache would help performance on some workloads. Kirk McKusick
Re: Where is the syncer kernel process implemented?
From: Sheldon Hearn [EMAIL PROTECTED] To: Alfred Perlstein [EMAIL PROTECTED] Cc: [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: Where is the syncer kernel process implemented? In-reply-to: Your message of "Fri, 14 Jul 2000 05:38:58 MST." [EMAIL PROTECTED] Date: Fri, 14 Jul 2000 14:51:13 +0200 Sender: Sheldon Hearn [EMAIL PROTECTED] On Fri, 14 Jul 2000 05:38:58 MST, Alfred Perlstein wrote:

/*
 * System filesystem synchronizer daemon.
 */
void
sched_sync(void)

It seems that the default sync delay, syncer_maxdelay, is no longer controllable via sysctl(8). Are there complex issues restricting the changing of this value in real time, or is it just not something people feel the need to change these days? Ciao, Sheldon.

The value of syncer_maxdelay was never settable, as it is used to set the size of the array used to hold the timing events. It was formerly possible to set syncdelay, but that variable was replaced by three variables:

time_t filedelay = 30;	/* time to delay syncing files */
time_t dirdelay = 29;	/* time to delay syncing directories */
time_t metadelay = 28;	/* time to delay syncing metadata */

Each of these variables is individually settable. Kirk McKusick
Re: mmap/write case brought up again - maybe it's time to...
Date: Wed, 8 Dec 1999 21:30:37 -0800 (PST) From: Matthew Dillon [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: Kirk McKusick [EMAIL PROTECTED] Subject: mmap/write case brought up again - maybe it's time to... Someone brought up the mmap/write case again - that's a deadlock case that we haven't fixed yet, where you write from one descriptor into a shared writeable file-backed memory area and, from another process, do the vice versa. Maybe it's time to make filesystem locks recursive by default. Doing so will allow the above case to be fixed 100% by allowing the read() and write() code to pre-lock the underlying vnodes in the correct order (by pointer comparison) prior to digging into them. I think Kirk may be the best person to make this determination - I seem to recall there being some (minor?) issues. Implementing recursive locks may be as simple as adding LK_RECURSE to vn_lock(), but I haven't researched it heavily. This may also tie in well with the revamping of the VOP code later on. There is a significant amount of complexity in the VOP code in having to deal with non-recursive locks when a passed argument is supposed to be locked and remain locked on return, the return argument is supposed to be locked, and the returned argument winds up being the same as the passed argument. With recursive locks as the norm we can remove nearly all of those special cases, leaving just the one that deals with ".." (or perhaps dealing with namei directory locks in a different way). -Matt Recursive locks are easy to implement. Just add LK_CANRECURSE as the final argument to the call to lockinit at line 1077 in ffs_vget of ufs/ffs/ffs_vfsops.c. That's it. From there on out all FFS locks will be recursive and you can begin simplifying away. Kirk