Re: our little daemon abused as symbol of the evil

2010-02-05 Thread Kirk McKusick
I have gotten word from the authors that they are aware of the
problem and are correcting it (e.g., taking out the daemon).

Kirk McKusick

=-=-=-=

From: Engin Kirda e...@iseclab.org
Date: Wed, 3 Feb 2010 19:03:49 +0100
To: mckus...@mckusick.com
Subject: BSD logo misuse
Cc: Gilbert Wondracek gilb...@iseclab.org,
 Thorsten Holz t...@iseclab.org,
 Christopher Kruegel ch...@cs.ucsb.edu

Kirk,

A colleague from Symantec pointed out the discussion about the BSD
logo that we have, apparently, misused in our paper without realizing
that it was the BSD logo :-/ We'd like to apologize for this. It was
not intentional.

The PDF we put up is a technical report and we can easily correct  
this. We'll make sure that we do not use it in the camera-ready  
version of the published paper.

Best regards,

--Engin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: our little daemon abused as symbol of the evil

2010-02-02 Thread Kirk McKusick
Thanks for the pointer. As you note, the damage (or benefit :-) is
done. Still, I have sent an email to the editor at Spiegel notifying
them of my copyright, in the hope that they will at least ask in the
future.

Kirk McKusick

=-=-=-=

From: Julian H. Stacey j...@berklix.com
Date: Tue, 02 Feb 2010 19:30:29 +0100
To: Christoph Kukulies k...@kukulies.org
Subject: Re: our little daemon abused as symbol of the evil
Cc: freebsd-hackers@freebsd.org, Kirk McKusick mckus...@mckusick.com
Organization: http://www.berklix.com BSD Unix Linux Consultancy, Munich Germany

Christoph Kukulies wrote:
 Look here:
 
 http://www.spiegel.de/fotostrecke/fotostrecke-51396-2.html

( Well spotted Christoph ! )
For those that don't read German, tracing back, the text article starts here:
http://www.spiegel.de/netzwelt/web/0,1518,675395,00.html

That is in German (some might want a web translator, e.g. http://babelfish.org).
I did read the German article (but skipped the graphics).

Key paragraph (translated from the German):
What Thorsten Holz, Gilbert Wondracek, Engin Kirda and
Christopher Kruegel describe in their 15-page paper (PDF file
here, 803 KB) is a horror scenario for privacy advocates: the
experts from the Isec research laboratory for IT security, a
cooperation of the Vienna University of Technology, the
Institute Eurécom and the University of California, document
a technically rather simple attack that exploits a security
hole that has been known for ten years.

In that key paragraph I could click & download
sonda-TR.pdf
(though now I can't seem to redownload
http://www.iseclab.org/papers/sonda-TR.pdf )
A 15-page article in English.
Page 4 uses the Firefox & BSD logos.

I haven't read that English version [yet], but with it, anyone interested
here can now read & form their own opinions on whether it seems fair to
use the Daemon logo; I've especially cc'd the copyright holder of the
BSD daemon:
Kirk McKusick mckus...@mckusick.com

IMO the German article by the weekly magazine Spiegel.de didn't really seem
to have anything to do with BSD; they just copied the graphics.

Personally, my 2c:
  My initial reaction was that I'd be happier if a generic PC graphic had
  been used on the spiegel.de web page, but maybe it's the price of fame;
  I guess the tests were done using BSD, & Spiegel thought it was a nice
  colourful graphic.  (Politicians never looked good on the British TV
  Spitting Image programme, but they learnt it was better to look
  bad there & be talked about, than to be not seen, not recognised &
  ignored.)

Cheers,
Julian
-- 
Julian Stacey: BSD Unix Linux C Sys Eng Consultants Munich http://berklix.com
Mail plain text not quoted-printable, HTML or Base64 http://www.asciiribbon.org


Re: Possible softupdates bug when an indirect block buffer is reused

2005-08-02 Thread Kirk McKusick
This has been a long nagging problem that was finally tracked down and
fixed by Stephan Uphoff [EMAIL PROTECTED]. See revision 1.182 on 2005/07/31
to sys/ufs/ffs/ffs_softdep.c.

Kirk McKusick

=-=-=-=-=-=-=

Date: Sun, 31 Jul 2005 11:40:32 -0700 (PDT)
From: Matthew Dillon [EMAIL PROTECTED]
To: Kirk McKusick [EMAIL PROTECTED]
Cc: freebsd-hackers@freebsd.org
Subject: Possible softupdates bug when an indirect block buffer is reused
X-ASK-Info: Whitelist match [from [EMAIL PROTECTED] (2005/07/31 11:40:52)

Hi Kirk, hackers!

I'm trying to track down a bug that is causing a buffer to be left
in a locked state and then causes the filesystem to lock up because
of that.

The symptoms are that a heavily used filesystem suddenly starts running
out of space.  It isn't due to deleted files with open descriptors, it's
due to the syncer getting stuck in a getblk state.  This is in DragonFly,
but I can't find anything DFlyish wrong so I'm beginning to think it may
be an actual bug in softupdates.

I have wound up with a situation where a getblk()'d bp has been
associated with an indirdep dependency, i.e. stored in
indirdep->ir_savebp, but is never released.  When something like
the syncer comes along and tries to access it, it locks up, and this
of course leads to inodes not getting cleared, and the filesystem
eventually runs out of space when a lot of files are being created and
deleted.

What has got me really confused is that the buffer in question seems to
wind up with a D_INDIRDEP dependency that points back to itself.

Here's the situation from a live gdb.  Here is where the syncer is 
stuck:

(kgdb) back
#0  lwkt_switch () at thread2.h:95
#1  0xc02a8a79 in tsleep (ident=0x0, flags=0, wmesg=0xc04eadb0 getblk, 
timo=0) at /usr/src-125beta/sys/kern/kern_synch.c:428
#2  0xc02956bb in acquire (lkp=0xc758b4e0, extflags=33554464, wanted=1536)
at /usr/src-125beta/sys/kern/kern_lock.c:127
#3  0xc0295a92 in lockmgr (lkp=0xc758b4e0, flags=33620002, interlkp=0x0, 
td=0xd68f6400) at /usr/src-125beta/sys/kern/kern_lock.c:354
#4  0xc02d6828 in getblk (vp=0xc71b3058, blkno=94440240, size=8192, slpflag=0, 
slptimeo=0) at thread.h:79
#5  0xc02d4404 in bread (vp=0xc71b3058, blkno=0, size=0, bpp=0x0)
at /usr/src-125beta/sys/kern/vfs_bio.c:567
#6  0xc03f24fe in indir_trunc (ip=0xe048fc0c, dbn=94440240, level=1, lbn=2060, 
countp=0xe048fbf8) at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2221
#7  0xc03f22df in handle_workitem_freeblocks (freeblks=0xe2fcef98)
at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2138
#8  0xc03f0462 in process_worklist_item (matchmnt=0x0, flags=0)
at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:726
#9  0xc03f026c in softdep_process_worklist (matchmnt=0x0)
at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:625
#10 0xc02e5ff3 in sched_sync () at /usr/src-125beta/sys/kern/vfs_sync.c:244
#11 0xc0294863 in kthread_create_stk (func=0, arg=0x0, tdp=0xff80, 
stksize=0, fmt=0x0) at /usr/src-125beta/sys/kern/kern_kthread.c:104
(kgdb) 

The buffer it is stuck on:

(kgdb) print bp
$62 = (struct buf *) 0xc758b4b8
(kgdb) print *bp
$63 = {
  b_hash = {
le_next = 0x0, 
le_prev = 0xc7391348
  }, 
  b_vnbufs = {
tqe_next = 0xc739b890, 
tqe_prev = 0xc76f32b8
  }, 
  b_freelist = {
tqe_next = 0xc768d610, 
tqe_prev = 0xc0565ac0
  }, 
  b_act = {
tqe_next = 0x0, 
tqe_prev = 0x0
  }, 
  b_flags = 536870912,   <- 0x20000000 (getblk with no bread, etc)
  b_qindex = 0, 
  b_xflags = 2 '\002', 
  b_lock = {
lk_interlock = {
  t_cpu = 0xff80, 
  t_reqcpu = 0xff80, 
  t_unused01 = 0
}, 
lk_flags = 2098176, 
lk_sharecount = 0, 
lk_waitcount = 1, 
lk_exclusivecount = 1, 
lk_prio = 0, 
lk_wmesg = 0xc04eadb0 getblk, 
lk_timo = 0, 
lk_lockholder = 0xfffe
  }, 
  b_error = 0, 
  b_bufsize = 8192, 
  b_runningbufspace = 0, 
  b_bcount = 8192, 
  b_resid = 0, 
  b_dev = 0xde0f0e38, 
  b_data = 0xcf824000 ¨\205Ð\002, 
  b_kvabase = 0xcf824000 ¨\205Ð\002, 
  b_kvasize = 16384, 
  b_lblkno = 94440240, 
  b_blkno = 94440240, 
  b_offset = 48353402880, 
  b_iodone = 0, 
  b_iodone_chain = 0x0, 
  b_vp = 0xc71b3058, 
  b_dirtyoff = 0, 
  b_dirtyend = 0, 
  b_pblkno = 87503631, 
  b_saveaddr = 0x0, 
  b_driver1 = 0x0, 
  b_caller1 = 0x0, 
  b_pager = {
pg_spc = 0x0, 
pg_reqpage = 0
  }, 
  b_cluster = {
cluster_head = {
  tqh_first = 0x0, 
  tqh_last = 0xc768d6bc
---Type <return> to continue, or q <return> to quit---
}, 
cluster_entry = {
  tqe_next = 0x0, 
  tqe_prev = 0xc768d6bc
}
  }, 
  b_xio = {
xio_pages = 0xc758b584, 
xio_npages = 2, 
xio_offset = 0, 
xio_bytes = 0, 
xio_flags = 0, 
xio_error = 0, 
    xio_internal_pages = {0xc34e5870, 0xc4aeb2b4, 0x0 <repeats 30 times>}
  }, 
  b_dep = {
lh_first = 0xc7045040
  }, 
  b_chain = {
parent = 0x0, 
count = 0

Re: snapshots and innds

2005-05-23 Thread Kirk McKusick
Excellent detective work on your part. The invariant that is being
broken here is that you are never supposed to hold a vnode locked
when you call vn_start_write. The call to vn_start_write should
be done in vm_object_sync before acquiring the vnode lock rather
than later in vnode_pager_putpages. Of course, moving the
vn_start_write out of vnode_pager_putpages means that we have to
track down every other caller of vnode_pager_putpages to make sure
that they have also done the vn_start_write call as well.
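The invariant can be modeled in a small userland sketch: the write gate (playing the role of vn_start_write) is always entered before any per-vnode lock would be taken, so a suspended filesystem can refuse the writer while it still holds nothing. All types and names here are invented for illustration, not the kernel API:

```c
#include <assert.h>
#include <pthread.h>

/*
 * Toy model of the invariant above: a filesystem-wide write gate
 * (standing in for vn_start_write) is entered BEFORE any vnode
 * lock is taken, never while one is held.
 */
struct fs_gate {
	pthread_mutex_t	mtx;
	int		suspended;	/* snapshot in progress */
	int		writers;	/* active write threads */
};

static int
write_begin(struct fs_gate *fs)		/* vn_start_write stand-in */
{
	int error = 0;

	pthread_mutex_lock(&fs->mtx);
	if (fs->suspended)
		error = -1;	/* no vnode lock held, so refusing is safe */
	else
		fs->writers++;
	pthread_mutex_unlock(&fs->mtx);
	return (error);
}

static void
write_end(struct fs_gate *fs)
{
	pthread_mutex_lock(&fs->mtx);
	fs->writers--;
	pthread_mutex_unlock(&fs->mtx);
}
```

Because write_begin can fail only while the caller holds no vnode lock, the mksnap_ffs/msync deadlock described below cannot arise in this ordering.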

Jeff Roberson has come up with a much cleaner way of dealing with
the suspension code that I believe he is using in the -current tree.
It puts a hook in the ufs_lock code that tracks the number of locks
held in each filesystem. To do a suspend, it blocks all new lock
requests on that filesystem by any thread that does not already
hold a lock and waits for all the existing locks to be released.
This obviates the need for the vn_start_write calls sprinkled all
through the system. I have copied Jeff on this email so that he
can comment further on this issue as he is much more up to speed
on it at the moment than I am.

Kirk McKusick

=-=-=-=-=-=-=

From: [EMAIL PROTECTED] (Steve Watt)
Date: Sun, 22 May 2005 14:02:39 -0700
In-Reply-To: [EMAIL PROTECTED] (Steve Watt)
   snapshots and innds (Dec 18, 17:39)
To: freebsd-hackers@freebsd.org
Subject: Re: snapshots and innds
Cc: [EMAIL PROTECTED]
X-Archived: [EMAIL PROTECTED]
X-ASK-Info: Whitelist match [from [EMAIL PROTECTED] (2005/05/22 14:03:00)

[ OK, there's a lot of text in here, but I have definitively found a
  deadlock between ffs_mksnap and msync(). ]

Waaay back on Dec 18, 17:39, I wrote:
} Subject: snapshots and innds
} I'm getting a strong hunch that snapshots and inn don't get along
} well, presumably having something to do with inn's extensive use
} of mmap().
} 
} Just for an example, my system panic()ed earlier today (different
} problem) and during the reboot, I'm stuck with an fsck_ufs on wchan
} "ufs" and innd on wchan "suspfs", and neither of them responding
} in any way.

And I have been seeing hangs periodically since December that all
seem to implicate innd(msync()) arguing with dump(mksnap_ffs).

The system is 5.4-STABLE, updated last on the (PDT) morning of 2 May.

Finally, this morning, I got a kernel core dump that I can do useful
stuff with.  The system was mostly operating normally, except that
any attempt to access the /news partition (which has articles,
tradspool.map, overviews, and incoming/outgoing data) would get
stuck in "suspfs".

So I forced a dump from ddb.  The mount point does (as one would
expect) have MNTK_SUSPEND set.

I see mksnap_ffs sitting waiting for "ufs" (really vnode 0xc19af318),
which it got to via:

(kgdb) info stack
#0  sched_switch (td=0xc1ede780, newtd=0xc146f480, flags=1)
at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at 
/usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0) at 
/usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0) at 
/usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc19af3c4, mtx=0xc095e4cc, priority=80, 
wmesg=0xc08a3f13 ufs, timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06505d6 in acquire (lkpp=0xd02df680, extflags=16777280, wanted=1536)
at /usr/src/sys/kern/kern_lock.c:161
#6  0xc0650a14 in lockmgr (lkp=0xc19af3c4, flags=16842754, interlkp=0x0, 
td=0xc1ede780)
at /usr/src/sys/kern/kern_lock.c:389
#7  0xc07bd6e3 in ufs_lock (ap=0xd02df6bc) at 
/usr/src/sys/ufs/ufs/ufs_vnops.c:2007
#8  0xc07be380 in ufs_vnoperate (ap=0x0) at 
/usr/src/sys/ufs/ufs/ufs_vnops.c:2828
#9  0xc06c0501 in vn_lock (vp=0xc19af318, flags=65538, td=0xc1ede780) at 
vnode_if.h:1013
#10 0xc06b4195 in vget (vp=0xc19af318, flags=65538, td=0x0)
at /usr/src/sys/kern/vfs_subr.c:2028
#11 0xc07af408 in ffs_sync (mp=0xc15e5c00, waitfor=1, cred=0xc2953080, 
td=0xc1ede780)
at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1151
#12 0xc06c0840 in vfs_write_suspend (mp=0xc15e5c00) at 
/usr/src/sys/kern/vfs_vnops.c:1084
#13 0xc079db18 in ffs_snapshot (mp=0xc15e5c00, 
snapfile=0xbfbfef1b Address 0xbfbfef1b out of bounds)
at /usr/src/sys/ufs/ffs/ffs_snapshot.c:317
#14 0xc07ad5d8 in ffs_omount (mp=0xc15e5c00, path=0xc2a8c380 /news, data=0x0, 
td=0xc1ede780) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:313
#15 0xc06af787 in vfs_domount (td=0xc1ede780, fstype=0xc1eea730 ffs, 
fspath=0xc2a8c380 /news, fsflags=18944000, fsdata=0xbfbfe7d4, compat=1)
at /usr/src/sys/kern/vfs_mount.c:861
#16 0xc06aef16 in mount (td=0x0, uap=0xd02dfd04) at 
/usr/src/sys/kern/vfs_mount.c:620
#17 0xc0828553 in syscall (frame=
   [ snip ]

And inn is sitting waiting for the suspended filesystem:
(kgdb) info stack
#0  sched_switch (td=0xc1c16c00, newtd=0xc1ede780, flags=1)
at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at 
/usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4

Re: bleh. Re: ufs_rename panic

2003-02-21 Thread Kirk McKusick
Date: Fri, 21 Feb 2003 15:26:01 -0800
From: Terry Lambert [EMAIL PROTECTED]
To: Yevgeniy Aleynikov [EMAIL PROTECTED]
CC: Kirk McKusick [EMAIL PROTECTED],
   Matt Dillon [EMAIL PROTECTED],
   Ian Dowse [EMAIL PROTECTED],
   [EMAIL PROTECTED], [EMAIL PROTECTED],
   Ken Pizzini [EMAIL PROTECTED],
   [EMAIL PROTECTED], [EMAIL PROTECTED],
   [EMAIL PROTECTED],
   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED],
   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: bleh. Re: ufs_rename panic

Yevgeniy Aleynikov wrote:
 As pointed out by Ken - we do have a lot of file renames (qmail).
 But the 2nd solution, directory-only rename serialization, probably
 won't affect performance as much.
 
 But I believe it's not only us who's going to have a problem when the
 exploit code becomes known to everybody, sooner or later.

Dan's non-atomicity assumption on renames is incorrect.

Even if it were correct, it's possible to recover fully following
a failure, because metadata updates are ordered (there is a real
synchronization between dependent operations).

I think that a workaround would be to comment the directory fsync()
code out of qmail, which apparently thinks it's running on extfs
or an async mounted FFS.

-- Terry

You cannot get rid of the fsync calls in qmail. You have to distinguish
between a filesystem that is recoverable and one which loses data.
When receiving an incoming message, SMTP requires that the receiver
have the message in stable store before acknowledging receipt. The
only way to know that it is in stable store is to fsync it before
responding.

Kirk McKusick
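The rule Kirk states (the message must be on stable store before the SMTP acknowledgement) looks roughly like this in a delivery path. A minimal userland sketch; the path and function name are invented and error handling is abbreviated:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Commit a received message to stable store before the caller
 * sends the SMTP acknowledgement.  Returns 0 on success.
 */
static int
commit_message(const char *path, const char *msg)
{
	int fd;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return (-1);
	if (write(fd, msg, strlen(msg)) < 0 || fsync(fd) < 0) {
		close(fd);	/* the fsync is the part qmail cannot skip */
		return (-1);
	}
	/* A real MTA would also fsync the containing directory so the
	 * new name itself survives a crash, then reply "250 OK". */
	return (close(fd));
}
```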

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message


Re: bleh. Re: ufs_rename panic

2003-02-19 Thread Kirk McKusick
The potentially slow, but utterly effective way to fix this race
is to only allow one rename at a time per filesystem. It is
implemented by adding a flag in the mount structure and using
it to serialize calls to rename. When only one rename can happen
at a time, the race cannot occur.

If this proves to be too much of a slowdown, it would be possible
to only serialize directory renames. As these are (presumably) much
rarer, the slowdown would be less noticeable.

Kirk McKusick
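A minimal userland sketch of the per-filesystem rename serialization described above. The structure and field names are invented for illustration; the real implementation would put a flag in the mount structure and use the kernel's sleep/wakeup primitives rather than pthreads:

```c
#include <assert.h>
#include <pthread.h>

/*
 * Hypothetical per-filesystem rename serialization: the
 * rename_active field stands in for a flag in the mount
 * structure; only one rename may run per "mount" at a time.
 */
struct mount_sketch {
	pthread_mutex_t	mtx;
	pthread_cond_t	cv;
	int		rename_active;
};

static void
rename_enter(struct mount_sketch *mp)
{
	pthread_mutex_lock(&mp->mtx);
	while (mp->rename_active)	/* wait out the current rename */
		pthread_cond_wait(&mp->cv, &mp->mtx);
	mp->rename_active = 1;
	pthread_mutex_unlock(&mp->mtx);
}

static void
rename_exit(struct mount_sketch *mp)
{
	pthread_mutex_lock(&mp->mtx);
	mp->rename_active = 0;
	pthread_cond_signal(&mp->cv);	/* let the next rename in */
	pthread_mutex_unlock(&mp->mtx);
}
```

With every rename bracketed by rename_enter/rename_exit, the source cannot change under a concurrent rename, which is exactly why the race cannot occur.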

=-=-=-=-=-=

Date: Wed, 19 Feb 2003 15:10:09 -0800
From: Yevgeniy Aleynikov [EMAIL PROTECTED]
To: Matt Dillon [EMAIL PROTECTED]
CC: Kirk McKusick [EMAIL PROTECTED], Ian Dowse [EMAIL PROTECTED],
   [EMAIL PROTECTED], [EMAIL PROTECTED], Ken Pizzini [EMAIL PROTECTED],
   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED],
   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED],
   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: bleh. Re: ufs_rename panic
X-ASK-Info: Confirmed by User

Just a reminder that this problem is a local kernel panic DoS (which can
cause filesystem corruption) with very simple trigger code, and it still
exists.

And it's been almost 2 years since I wrote about it.


The workaround (commenting out the panic call) doesn't fix the problem.
The server still crashes (not so often, though) from virtual memory
failures like this:

panic: vm_fault: fault on nofault entry, addr: d0912000
mp_lock = 0102; cpuid = 1; lapic.id = 
boot() called on cpu#1


(kgdb) bt
#0  0xc0175662 in dumpsys ()
#1  0xc017542c in boot ()
#2  0xc0175894 in poweroff_wait ()
#3  0xc01e7c18 in vm_fault ()
#4  0xc0219d32 in trap_pfault ()
#5  0xc021990b in trap ()
#6  0xc01e008a in ufs_dirrewrite ()
#7  0xc01e31a4 in ufs_rename ()
#8  0xc01e4645 in ufs_vnoperate ()
#9  0xc01a9121 in rename ()
#10 0xc021a44d in syscall2 ()
#11 0xc02077cb in Xint0x80_syscall ()


How can i help to resolve this problem ASAP?

Thanks!

Matt Dillon wrote:
 Well, I've gone through hell trying to fix the rename()/rmdir()/remove()
 races and failed utterly.  There are far more race conditions than even
 my last posting indicated, and there are *severe* problems fixing NFS
 to deal with even Ian's suggestion... it turns out that NFS's nfs_namei()
 permanently adjusts the mbuf while processing the path name, making
 restarts impossible.
 
 The only solution is to implement namei cache path locking and formalize
 the 'nameidata' structure, which means ripping up a lot of code because
 nearly the entire code base currently plays with the contents of 
 'nameidata' willy-nilly.  Nothing else will work.  It's not something
 that I can consider doing now.
 
 In the mean time I am going to remove the panic()'s in question.  This
 means that in ufs_rename() the machine will silently ignore the race 
 (not do the rename) instead of panicking.  It's all that can be done for
 the moment.  It solves the security/attack issue.  We'll have to attack
 the races as a separate issue.  The patch to remove the panics is utterly
 trivial and I will commit it after I test it.
 
   -Matt
 
 
 

-- 
Yevgeniy Aleynikov | Sr. Systems Engineer - USE
InfoSpace INC 601 108th Ave NE | Suite 1200 | Bellevue, WA 98004
Tel 425.709.8214 | Fax 425.201.6160 | Mobile 425.418.8924
[EMAIL PROTECTED] | http://www.infospaceinc.com

Discover what you can do.TM





Re: fsck -p

2002-11-26 Thread Kirk McKusick
Date: Wed, 20 Nov 2002 13:09:55 +0200
From: Ruslan Ermilov [EMAIL PROTECTED]
To: Ian Dowse [EMAIL PROTECTED],
Kirk McKusick [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: fsck -p

Hi!

Today I've got a hard lockup with 4.7 box.  Upon reboot,
``fsck -p'' was run, and it resulted in the following,
in particular:

/dev/da0s1h: UNREF FILE I=591  OWNER=nobody MODE=100644
/dev/da0s1h: SIZE=81269024 MTIME=Nov 20 09:50 2002  (CLEARED)
/dev/da0s1h: FREE BLK COUNT(S) WRONG IN SUPERBLK (SALVAGED)
/dev/da0s1h: SUMMARY INFORMATION BAD (SALVAGED)
/dev/da0s1h: BLK(S) MISSING IN BIT MAPS (SALVAGED)

I thought that the correct action here would be to reconnect
this file under fs's lost+found, but it did not happen.  Why?

(I've lost a week of useful squid access.log data.)


Cheers,

Ruslan Ermilov  Sysadmin and DBA,
[EMAIL PROTECTED]   Sunbay Software AG,
[EMAIL PROTECTED]  FreeBSD committer,
+380.652.512.251Simferopol, Ukraine

http://www.FreeBSD.org  The Power To Serve
http://www.oracle.com   Enabling The Information Age

The reference count on the file was zero, so the assumption by
fsck is that you were in the process of removing it at the time
of the crash (e.g., the name had been removed from the directory
but the inode had not yet been cleared). Thus the default behavior
is to finish the removal. FYI, if you had run fsck manually, it
would have given you the option to save the file.

Kirk McKusick




Re: bleh. Re: ufs_rename panic

2001-10-02 Thread Kirk McKusick

The problems all arise from the fact that we unlock the source
while we look up the destination, and when we return to relookup
the source, it may have changed/moved/disappeared. The reason to
unlock the source before looking up the destination was to avoid
deadlocking against ourselves on a lock that we held associated 
with the source. Since we now allow recursive locks on vnodes, it
is no longer necessary to release the source before looking up
the destination. So, it seems to me that the correct fix is to
*not* release the source after looking it up, but rather hold it
locked while we look up the destination. We can completely get
rid of relookup and lots of other hairy code and generally make
rename much simpler. Am I missing something here?

~Kirk




Re: utilizing write caching

2001-05-10 Thread Kirk McKusick

Sorry for the slow response. I only read my freebsd.org email
very occasionally.

Soft updates does do most of its writes asynchronously, but it
still needs to know when the data has really hit stable store.
With SCSI disks, we can use tag queueing to reliably get this
information. With IDE disks the only way to get this information
is to disable write caching. Most failure scenarios allow IDE
disks to write out their caches - software crashes, plug pulled
out of the wall, etc. Where they cannot write out their caches
are instances where the power drops nearly instantly, such as a
power supply failure, or the battery being pulled out of a laptop.
We could decide that we are willing to lump those sorts of
failures in with media failure as a class of problems that we
choose not to protect against, but I think that should be a
decision that users have to take an active role in making (much
as they can choose to mount their filesystems async). So, I
agree with the decision to turn off write caching by default,
though there should be an easy way to re-enable it for those
users who want to run the associated risks.

Kirk McKusick
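For reference, FreeBSD later exposed exactly this opt-in as a loader tunable. Assuming the hw.ata.wc knob (check the ata(4) manual page on your release; the name here is an assumption about your version), re-enabling IDE write caching would look like:

```shell
# /boot/loader.conf -- opt back in to IDE write caching,
# accepting the sudden-power-loss risk described above
hw.ata.wc="1"
```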

=-=-=-=-=-=

Date: Thu, 19 Apr 2001 00:07:12 -0700
From: Alfred Perlstein [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: Kirk McKusick [EMAIL PROTECTED]
Subject: utilizing write caching

I'm sure you guys remember the recent discussion wrt write caching
on disks possibly causing inconsistencies for UFS and just about
any filesystem or program that expects things like fsync() to actually
work.

The result of the discussion was that write caching was disabled
for all disks.

I really think this is suboptimal.  I mean _really_ suboptimal,
my laptop disk is a pig since the default went in for ata disks.
Or maybe it's just a pig anyway, but I'd like to take a look at
this.

The most basic fix to gain performance back would be to have the
device examine the B_ASYNC flag and decide there whether or
not to perform write caching.

However, I have this strange feeling that softupdates is actually
able to issue the meta-data writes with B_ASYNC set.  Kirk, is this
true?  If so, would it be possible to tag the buffer with yet another
flag saying "yes, write me async, but safely" when doing softdep
disk I/O?

If softupdates doesn't use B_ASYNC, then it seems trivial to make
DEV_STRATEGY propagate B_ASYNC into the bio request (BIO_STRATEGY)
via OR'ing something like BIO_CACHE so that the device driver could
then choose to activate write caching.

This is still suboptimal because we'll be turning off caching when
the buffer system is experiencing a shortage and issuing sync writes
in order not to deadlock, but it's still better IMO than turning
it off completely.

If on the other hand Kirk can figure out a quick hack to flag buffers
that need completely stable storage (including fsync(2)*) ops, then
I think we've got a solution.

  (*) I'll look at fsync and physio if the scope of fixing those
  seems to be too much wrt the time available.

If softupdates doesn't use B_ASYNC something like this:

Index: sys/bio.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/bio.h,v
retrieving revision 1.104
diff -u -r1.104 bio.h
--- sys/bio.h   2001/01/14 18:48:42 1.104
+++ sys/bio.h   2001/04/19 06:53:52
@@ -91,6 +91,7 @@
 #define BIO_ERROR  0x0001
 #define BIO_ORDERED0x0002
 #define BIO_DONE   0x0004
+#define BIO_ASYNC  0x0008  /* Device may choose to write cache */
 #define BIO_FLAG2  0x4000  /* Available for local hacks */
 #define BIO_FLAG1  0x8000  /* Available for local hacks */
 
Index: sys/conf.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/conf.h,v
retrieving revision 1.126
diff -u -r1.126 conf.h
--- sys/conf.h  2001/03/26 12:41:26 1.126
+++ sys/conf.h  2001/04/19 06:52:08
@@ -157,6 +157,8 @@
 	(bp)->b_io.bio_offset = (bp)->b_offset;	\
 	else	\
 	(bp)->b_io.bio_offset = dbtob((bp)->b_blkno);	\
+	if ((bp)->b_flags & B_ASYNC)	\
+		(bp)->b_io.bio_flags |= BIO_ASYNC;	\
 	(bp)->b_io.bio_done = bufdonebio;	\
 	(bp)->b_io.bio_caller2 = (bp);	\
 	BIO_STRATEGY(&(bp)->b_io, dummy);	\

could do the trick, no?

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]]
Instead of asking why a piece of software is using 1970s technology,
start asking why software is ignoring 30 years of accumulated wisdom.

- End forwarded message -

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/




Re: vm balance

2001-04-17 Thread Kirk McKusick

Date: Tue, 17 Apr 2001 09:49:54 -0400 (EDT)
From: Robert Watson [EMAIL PROTECTED]
To: Kirk McKusick [EMAIL PROTECTED]
cc: Julian Elischer [EMAIL PROTECTED],
   Rik van Riel [EMAIL PROTECTED], [EMAIL PROTECTED],
   Matt Dillon [EMAIL PROTECTED], David Xu [EMAIL PROTECTED]
Subject: Re: vm balance 

On Mon, 16 Apr 2001, Kirk McKusick wrote:

 I am still of the opinion that merging VM objects and vnodes would be a
 good idea. Although it would touch a huge number of lines of code, when
 the dust settled, it would simplify some nasty bits of the system. This
 merger is really independent of making the number of vnodes dynamic.
 Under the old name cache implementation, decreasing the number of vnodes
 was slow and hard. With the current name cache implementation,
 decreasing the number of vnodes would be easy. I concur that adding a
 dynamically sized vnode cache would help performance on some workloads. 

I'm interested in this idea, although I profess a gaping blind spot in
expertise in the area of the VM system.  However, one of the aspects of
our VFS that has always concerned me is that use of a single vnode
simplelock funnels most of the relevant (and performance-sensitive) calls.
The result is that all accesses to an object represented by a vnode are
serialized, which can represent a substantial performance hit for
applications such as databases, where simultaneous writes would be
advantageous, or for various vn-backed oddities (possibly including
vnode-backed swap?).

At some point, apparently an effort was made to mark up vnode_if.src with
possible alternative locking using read/write locks, but given that all
the consumers use exclusive locks right now, I assume that was not
followed through on.  A large part of the cost is mitigated through
caching on the under-side of VFS, allowing vnode operations to return
rapidly, but while this catches a number of common cases (where the file
is already in the cache), there are sufficient non-common cases that I
would anticipate this being a problem.  Are there any performance figures
available that either confirm this concern, or demonstrate that in fact it
is not relevant? :-)  Would this concern introduce additional funneling in
the VM system, or is the granularity of locks in the VM sufficiently low
that it might improve performance by combining existing broad locks?

Robert N M Watson FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]  NAI Labs, Safeport Network Services

Every vnode in the system has an associated object. Every object
backed by a file (e.g., everything but anonymous objects) has an
associated vnode. So, the performance of one is pretty tied to the
performance of the other. Matt is right that the VM does locking
on a page level, but then has to get a lock on the associated
vnode to do a read or a write, so really is pretty tied to the
vnode lock performance. Merging the two data structures is not
likely to change the performance characteristics of the system for
either better or worse. But it will save a lot of headaches having
to do with lock ordering that we have to deal with at the moment.

Kirk McKusick




Re: vm balance

2001-04-16 Thread Kirk McKusick

Date: Tue, 10 Apr 2001 22:14:28 -0700
From: Julian Elischer [EMAIL PROTECTED]
To: Rik van Riel [EMAIL PROTECTED]
CC: Matt Dillon [EMAIL PROTECTED], David Xu [EMAIL PROTECTED],
   [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: vm balance

Rik van Riel wrote:

 
 I'm curious about the other things though ... FreeBSD still seems
 to have the early 90's abstraction layer from Mach and the vnode
 cache doesn't seem to grow and shrink dynamically (which can be a
 big win for systems with lots of metadata activity).
 
 So while it's true that FreeBSD's VM balancing seems to be the
 best one out there, I'm not quite sure about the rest of the VM...
 

Many years ago Kirk was talking about merging the vm objects
and the vnodes..  (they tend to come in pairs anyhow)

I still think it might be an idea worth investigating further.

kirk?

-- 
  __--_|\  Julian Elischer
 /   \ [EMAIL PROTECTED]
(   OZ) World tour 2000-2001
--- X_.---._/  
v

I am still of the opinion that merging VM objects and vnodes would
be a good idea. Although it would touch a huge number of lines of
code, when the dust settled, it would simplify some nasty bits of
the system. This merger is really independent of making the number
of vnodes dynamic. Under the old name cache implementation, decreasing
the number of vnodes was slow and hard. With the current name cache
implementation, decreasing the number of vnodes would be easy. I
concur that adding a dynamically sized vnode cache would help
performance on some workloads.

Kirk McKusick




Re: Where is the syncer kernel process implemented?

2000-07-24 Thread Kirk McKusick

From: Sheldon Hearn [EMAIL PROTECTED]
To: Alfred Perlstein [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: Where is the syncer kernel process implemented? 
In-reply-to: Your message of "Fri, 14 Jul 2000 05:38:58 MST."
 [EMAIL PROTECTED] 
Date: Fri, 14 Jul 2000 14:51:13 +0200
Sender: Sheldon Hearn [EMAIL PROTECTED]

On Fri, 14 Jul 2000 05:38:58 MST, Alfred Perlstein wrote:

 /*
  * System filesystem synchronizer daemon.
  */
 void 
 sched_sync(void)

It seems that the default sync delay, syncer_maxdelay, is no longer
controllable via sysctl(8).  Are there complex issues restricting the
changing of this value in real time, or is it just not something people
feel the need to change these days?

Ciao,
Sheldon.

The value of syncer_maxdelay was never settable, as it is used
to set the size of the array used to hold the timing events.
It was formerly possible to set syncdelay, but that variable
was replaced by three variables:

time_t filedelay = 30;  /* time to delay syncing files */
time_t dirdelay = 29;   /* time to delay syncing directories */
time_t metadelay = 28;  /* time to delay syncing metadata */

Each of these variables is individually settable.

    Kirk McKusick





Re: mmap/write case brought up again - maybe its time to...

1999-12-08 Thread Kirk McKusick

Date: Wed, 8 Dec 1999 21:30:37 -0800 (PST)
From: Matthew Dillon [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: Kirk McKusick [EMAIL PROTECTED]
Subject: mmap/write case brought up again - maybe its time to...

Someone brought up the mmap/write case again - that's a deadlock
case that we haven't fixed yet where you write from one descriptor into
a shared writeable file-backed memory area and, from another process,
do the reverse.

Maybe it's time to make filesystem locks recursive by default.  Doing
so will allow the above case to be fixed 100% by allowing the read()
and write() code to pre-lock the underlying vnodes in the correct order
(by pointer comparison) prior to digging into them.
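The pre-locking by pointer comparison that Matt describes can be sketched in userland, with pthread mutexes standing in for vnode locks (illustrative only, not kernel code):

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/*
 * Deadlock-free pair locking: always take the lock at the lower
 * address first, so two threads locking the same pair in opposite
 * roles still acquire them in the same order.
 */
static void
lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if (a == b) {		/* same "vnode" on both sides */
		pthread_mutex_lock(a);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {	/* canonical order */
		pthread_mutex_t *t = a;
		a = b;
		b = t;
	}
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}

static void
unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	if (a != b)
		pthread_mutex_unlock(b);
}
```

The a == b branch is where recursive locks matter: without them, the case where both descriptors resolve to the same vnode would self-deadlock.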

I think Kirk may be the best person to make this determination - I
seem to recall there being some (minor?) issues.  Implementing recursive
locks may be as simple as adding LK_RECURSE to vn_lock() but I haven't
researched it heavily.

This may also tie-in well with the revamping of the VOP code later on.
There is a significant amount of complexity in the VOP code in having to
deal with non-recursive locks when a passed argument is supposed to be
locked and remain locked on return, the return argument is supposed to
be locked, and the returned argument winds up being the same as the
passed argument.  With recursive locks as the norm we can remove nearly
all of those special cases leaving just the one that deals with ".."
(or perhaps dealing with namei directory locks in a different way).

-Matt

Recursive locks are easy to implement. Just add LK_CANRECURSE as
the final argument to the call to lockinit at line 1077 in ffs_vget
of ufs/ffs/ffs_vfsops.c. That's it. From there on out all FFS locks
will be recursive and you can begin simplifying away.

Kirk

