Re: I'm leaving the project

2002-12-18 Thread Matt Dillon
On Tue, 17 Dec 2002 10:56:19 -0800


Does anyone know why this person is trying to (poorly) 
impersonate MD?

Unfortunately not. We do not yet know who this fake Dillon 
is (the guy posting from that backplane.com address).

I've been working hard on the new ipfw[2] patch for 5.0; the 
new patch is released under the DPL. For those of you not 
familiar with it, here's the most important paragraph:

The de Raadt Public License 1.0

Redistribution and use in source and binary forms, with or 
without modification, are permitted provided that the 
following conditions are met:

* Redistributions of source code must retain the 
above copyright notice,
this list of conditions and the following 
disclaimer. 
* Redistributions in binary form must reproduce 
the above copyright notice,
this list of conditions and the following 
disclaimer in the documentation 
and/or other materials provided with the 
distribution. 
* FUCK YOU ALL ASSHOLES!

Take care,
  Matthew Dillon




To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message


I'm leaving the project

2002-12-17 Thread Matt Dillon
Thanks to my dear friend Warner Losh. I've decided to 
leave FreeBSD and flame in another project. Maybe I could 
join OpenBSD; they seem to share my views on how to deal 
with other people.

I hereby give maintainership of all my code to Warner or, 
for that matter, to whoever wants it.

Thank you,
  Matthew


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message


I want to apologize

2002-12-16 Thread Matt Dillon
Hey dudes, I want to apologize for being a total *asshole* 
wrt the ipfw thingie. Sorry. I know my patch was shit 
anyway, and that ipfw blows dead goats when compared to 
ipf, but even with that in mind, I had to pull a deraadt, 
sorry. I'm so sorry. I mean, I've had my commit bit taken 
away many times already, and yet I'm stupid enough to keep 
with the same attitude. Damn.

Yours,
  Matthew.


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message


I can't believe this!

2002-08-21 Thread Matt Dillon

I mean, WTF? 14 people answered what was nothing but a *blatant* troll! Come on, even 
Rick 'shittiest VM subsystem' van Riel answered! What can I say, pathetic, simply 
pathetic. No wonder FreeBSD is dead. I'm just talking on behalf of myself and my 3 
friends, Bavid O'Drien, Piten Handya, and Muli Jallett, but I think I speak for all of us 
when I say: FreeBSD is dying!

FWIW, some people have privately e-mailed me asking: Why is Hiten an IMBECILE?

Here's the answer.. http://www.linuxforlesbians.org/~pjs/hiten-idiot.txt

Hiten is an idiot, discuss

Yours faithfully,
Matthew


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: contigfree, free what?

2001-10-15 Thread Matt Dillon


:
:I have a potentially silly question about contigmalloc1(), if the very
:unlikely occurance that the kernel VM space ran out, (the vm_map_findspace()
:failed) wouldn't we want to return the chunk of contiguous physical space
:back on the free queue before we return an allocation failure?
:
:--mark tinguely.

Ah, you came across that XXX comment?  You are absolutely right.  The
original implementor rushed writing the routine and didn't finish it.
contigmalloc() is only supposed to be used in the early life of the
system when it's loading device drivers that need contiguous space,
so the case is not supposed to come up.  Of course, that means that it
does come up from time to time :-(.

If you want to have a go at fixing it I will be happy to review.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: RE: RE: Imagestream WanIC-520 interface cards

2001-10-13 Thread Matt Dillon

:on the Internet has been routers costing in the $100,000 range.  Now, maybe
:BEST Internet is now wealthy enough that you can blow that kind of money on
:Cisco gear without thinking about it, but a lot of smaller ISP's are not.
:
:If you look at what happened last weekend on Sunday, and the number of people
:that screamed about it, it's quite obvious that there are a huge number of
:gated and zebra boxes out there handling global routing.  Take off those
:Cisco blinders, boy! ;-)
:
:Ted Mittelstaedt   [EMAIL PROTECTED]
:Author of:   The FreeBSD Corporate Networker's Guide

Hmm.  Well, as a person who ran gated at BEST, has hacked on gated on
same, had to deal with BSDI and FreeBSD route table bugs, tracked down
OSPF bugs for a friend running gated, and otherwise spent hundreds of
hours (at least!) keeping boxes running gated operational... well, I'll
take the Cisco any day thank you very much!

If you are a small ISP and you have enough money to pay for two T1's,
you have enough money to buy a used router that can do BGP for you. 
IMHO.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: sin_zero & bind problems

2001-10-13 Thread Matt Dillon


:
:The following was initially formatted as PR, but I suppose it is reasonable
:to discuss first here. There were some vague mentions that sin_zero field
:of struct sockaddr_in may be used in future for some extensions; but this
:future is already expired;) without any real step.
:If the verdict will be to keep current behavior, it should be strictly
:documented to remove this permanent rake field.
:
:>Description:
:
:If bind() syscall is called for PF_INET socket with address another than
:INADDR_ANY, and sin_zero field in addr parameter is not filled with
:...

Nobody in their right mind uses a struct sockaddr_in or any other 
struct sock* type of structure without zeroing it first.  I suppose
we can document that in the man pages, but we certainly should not go
hacking up the kernel code to work around bad programmers.
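
For reference, here is a minimal sketch of the intended usage (the
address and port are made up; error handling is trimmed):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>

    int
    bind_example(void)
    {
        struct sockaddr_in sin;
        int s = socket(PF_INET, SOCK_STREAM, 0);

        /* zero the whole structure, sin_zero included, before filling it in */
        memset(&sin, 0, sizeof(sin));
        sin.sin_len = sizeof(sin);
        sin.sin_family = AF_INET;
        sin.sin_port = htons(8080);
        sin.sin_addr.s_addr = inet_addr("10.0.0.1");
        return (bind(s, (struct sockaddr *)&sin, sizeof(sin)));
    }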

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: RE: Imagestream WanIC-520 interface cards

2001-10-12 Thread Matt Dillon

The Cisco 2600 series is great for T1's.  A 2620 with a T1 card (it
can take up to two) and you are done.  The 2501's are ancient, don't
even bother any more.  You can find 2620's on EBay in the $700-$1500
range, many of which appear (in my quick look) to include a T1 card.

As much as I like to support running things on BSD, I stopped trying to
run T1's from general purpose unix boxes 4 years ago.  When BEST Internet
first started we ran the (old) Riscom cards from a BSDI box w/ an external
csu/dsu, and they were great for that, but these days the overall
cost of ownership is much, much lower with a used cisco and a WAN
card with an integrated csu/dsu in it.  It's fire and forget... once
you set the thing up you don't have to touch it ever again.

One advantage of the dot-com crash is that EBay and other sites are
saturated with high quality, barely used hardware. 

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: contigfree, free what?

2001-10-12 Thread Matt Dillon


:I have also looked into this a while ago, but got stuck at some
:point. I have just looked at it again, and I think I have found a solution.
:
:...
:
:This is probably because the map entries do have a NULL object
:pointer. vm_map_pageable() calls vm_fault_wire(), so this will fail.
:
:I have attached a patch which works for me. It duplicates most of the
:logic of kmem_alloc in that it calls vm_map_findspace() first, then
:vm_map_insert() (which basically is what is done in
:kmem_alloc_pageable() too, but here, kernel_object is passed instead
:of a NULL pointer, so that the map entry will have a valid object
:pointer). Then, the pages are inserted into the object as before, and
:finally, the map entries are marked as wired by using
:vm_map_pageable(). Because this will also call vm_fault_wire(), which
:will among other things do a vm_page_wire(), contigmalloc does not
:need to wire the pages itself. 
:
:The pmap_kenter() calls can also be removed, since the pages will be
:mapped in any case by vm_fault(). 
:
:   - thomas

Ach, of course.  I see what's happening now!  Thomas, your patch looks
good!  I'm going to patch it in and test it a bit.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Patch to allow 4.2 driver .o to run on 4.4

2001-10-12 Thread Matt Dillon


:Sorry about the crosspost but I estimate that this reaches those who need
:to see this..
:
:There was a change in the 4.x kernel.h on June 15 that broke backwards
:compatibility for binary distributed driver files
:(distributed as .o files)  It was an MFC of a patch by peter..
:but we didn't understand the ramifications in 4.x.

This seems very reasonable to me... a nice quick and easy solution.

-Matt


:my fix (a reversion in part) is as follows:
:
:Index: kernel.h
:===
:RCS file: /repos/cvs/mod/freebsd/src/sys/sys/kernel.h,v
:retrieving revision 1.63.2.5
:diff -u -r1.63.2.5 kernel.h
:--- kernel.h   2001/07/26 23:27:53 1.63.2.5
:+++ kernel.h   2001/10/11 21:30:03
:@@ -113,13 +113,13 @@
:   SI_SUB_VM   = 0x100,/* virtual memory system
:init*/
:   SI_SUB_KMEM = 0x180,/* kernel memory*/
:   SI_SUB_KVM_RSRC = 0x1A0,/* kvm operational
:limits*/
:-  SI_SUB_CPU  = 0x200,/* CPU resource(s)*/
:-  SI_SUB_KLD  = 0x210,/* KLD and module setup */
:...
:
:The result of this was that old .o  drivers are initialise at sequence 
:0x240 but the device framework is not initialised until 0x310.
:
:my patch shifts all the numbers back to below that that binary
:...
:Julian

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: contigfree, free what?

2001-10-12 Thread Matt Dillon


:Mark,
:
:> I also placed some checks on vm_map_delete
:
:I did that also, and as far as I understand everything works fine.
:The only thing I found was the fact that when contigmalloc() grabs the
:contig pages it sets the value of pga[i] (for i in allocated pages)
:note that: vm_page_t pga = vm_page_array;
:
:Then contigfree() does a pretty good job, but does not reset the values
:of pga[i] to pqtype == PQ_FREE (pqtype = pga[i].queue - pga[i].pc)
:
:So the next contigmalloc() requiring the same number of pages fails on
:the previously released pages because they are not PQ_FREE
:
:The other thing that puzzled me is the fact that in vm_map_delete()
:called by contgigfree() has a variable
:...

I think what is going on is that contigmalloc() is wiring the pages
but placing them in a pageable container (entry->wired_count == 0),
so when contigfree() kmem_free()'s the block the system does not know
that it must unwire the pages.  This leaves the pages wired and prevents
them from being freed.

I haven't found a quick and easy solution to the problem yet.  kmem_alloc()
doesn't do what we want either.  I tried calling vm_map_pageable() in
contigmalloc1() but it crashed the machine, so there might be something
else going on as well.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: contigfree, free what?

2001-10-11 Thread Matt Dillon

:We are currently working with FreeBSD 4.3 and we found out that
:kldloading/kldunloading modules working with contigmalloc()/contigfree()
:like if_xl.ko produces a memory leak.
:
:This is due to the contigfree() function which seems to uncompletely release
:the memory ressource allocated in vm_page_array.
:
:When contigmalloc() steps in vm_page_array, it does not find back
:the pages previously released by contigfree()
:The loop vm/vm_page.c is this one:
:
:  for (i = start; i < cnt.v_page_count; i++) {
:int pqtype; 
:phys = VM_PAGE_TO_PHYS(&pga[i]); 
:pqtype = pga[i].queue - pga[i].pc; 
:if (pqtype == PQ_FREE 
:
:
:It fails on the `pqtype == PQ_FREE' test
:and the previously allocated (and supposedly released by contigfree)
:pages can't be reallocated.
:
:Anyone has a patch?
:
:Thanx,
:Patrick.

This meshes with a bug report I received a couple of weeks ago,
though you provide a great deal more information.  I'll take a look
at it (if others have a patch and want to jump in, then by all means
post away!).

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Memory allocation question

2001-10-10 Thread Matt Dillon


:
:On Tue, 2 Oct 2001, Matt Dillon wrote:
:
:> 
:> :
:> :Dwayne wrote:
:> :>  I'm creating an app where I want to use memory to store data so I
:> :> can get at it quickly. The problem is, I can't afford the delays that
:> :> would occur if the memory gets swapped out. Is there any way in FreeBSD
:> :> to allocate memory so that the VM system won't swap it out?
:> :> 
:> :I think mlock(2) is what you want.
:> :
:> :Maxime Henrion
:> :-- 
:> :Don't be fooled by cheap finnish imitations ; BSD is the One True Code
:> 
:> Don't use mlock().
:
:Could you please explain that. Thanks.

mlock() can only be used by root, and it isn't really all that portable
an interface.  It cannot guarantee that the memory will actually be
locked into core.

:> 
:> Use SysV Shared memory segments.  If you tell the kernel to use 
:> physical ram for SysV shared memory (kern.ipc.shm_use_phys=1)
:> then any shm segments you allocate (see manual pages for
:> shmctl, shmget, and shmat) will reside in unswappable shared memory.
:> 
:>  -Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Heads up! My interview....

2001-10-08 Thread Matt Dillon

OSNews interviewed me; it's up on today's page!  I think it's a really
good interview but, of course, I am biased :-)

http://osnews.com/story.php?news_id=153

On the side:  Oh my god, they listed my personal web page!  It's like
having your parents show your friends your messy room!  (read: I haven't
been keeping it cleaned up.)  8-|

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: bleh. Re: ufs_rename panic

2001-10-07 Thread Matt Dillon

Well, I've gone through hell trying to fix the rename()/rmdir()/remove()
races and failed utterly.  There are far more race conditions than even
my last posting indicated, and there are *severe* problems fixing NFS
to deal with even Ian's suggestion... it turns out that NFS's nfs_namei()
permanently adjusts the mbuf while processing the path name, making
restarts impossible.

The only solution is to implement namei cache path locking and formalize
the 'nameidata' structure, which means ripping up a lot of code because
nearly the entire code base currently plays with the contents of 
'nameidata' willy-nilly.  Nothing else will work.  It's not something
that I can consider doing now.

In the meantime I am going to remove the panic()'s in question.  This
means that in ufs_rename() the machine will silently ignore the race 
(not do the rename) instead of panicking.  It's all that can be done for
the moment.  It solves the security/attack issue.  We'll have to attack
the races as a separate issue.  The patch to remove the panics is utterly
trivial and I will commit it after I test it.

-Matt



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: bleh. Re: ufs_rename panic

2001-10-05 Thread Matt Dillon


:
:It seems like there's no activity on this subject.
:This is local DoS, guys. if it gets on public (which is probably gonna
:be soon) everything everywhere will be crashing, and there's no stable
:fix ready.
:How can i help to accelerate this problem solution? 
:
: And why FreeBSD security officer's email address always bounces my
:mail?
:
:Thanks!

My most recently posted patch will solve your problem.  I am reworking
it as per Ian's suggestions before I commit, and will also implement
the same feature in rename() (for files).  Then I will do a standard
commit-to-current-wait-commit-to-stable sequence.  But due to the
complexity of the changes (even after simplifying them), it is going to
be another few days before anything gets into -stable officially.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: patch #3 (was Re: bleh. Re: ufs_rename panic)

2001-10-03 Thread Matt Dillon

:The addition of the SOFTLOCKLEAF code is quite a major change, so
:it would be very useful if you could describe exactly what it does,
:what its semantics are, and how it fits into the rename problem.

Setting SOFTLOCKLEAF in namei will set the VSOFTLOCK flag in the
returned vnode (whether the returned vnode is locked or not), and
namei() will fail with EAGAIN if the VSOFTLOCK flag is already set.
An extra reference is added to the returned vnode which either
the caller must free or NDFREE must free (note that VOP_RENAME is
not responsible for freeing this extra reference, so the API itself
does not actually change).  The caller must either call vclearsoftlock()
or call NDFREE() with the appropriate flags to clear the flag and
dereference the vnode.

int gc = 0;
vagain(&gc);

vagain() is a routine that falls through the first time, initializing
'gc' to a global counter.  If later on you get an EAGAIN from a
SOFTLOCK failure and loop back up to the vagain(), it will block the
process until whoever owns the softlock has 'probably' released it.
This allows the system call to restart internally and attempt the
operation again from scratch.
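
Roughly, a system call using this would be structured as below (the
NDINIT flags and surrounding names are illustrative, not copied from
the actual patch):

    struct nameidata nd;
    int gc = 0;
    int error;

again:
    vagain(&gc);            /* falls through the first time, blocks on retry */
    NDINIT(&nd, DELETE, LOCKPARENT | SOFTLOCKLEAF, UIO_USERSPACE,
        SCARG(uap, path), p);
    if ((error = namei(&nd)) != 0) {
        if (error == EAGAIN)        /* leaf was already soft-locked */
            goto again;
        return (error);
    }
    /* ... do the actual VOP work here ... */
    NDFREE(&nd, 0);         /* clears VSOFTLOCK, drops the extra reference */
    return (error);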

:Because the source node and parent are not locked, there is the
:possibility that the source node could be renamed or removed at
:any time before VOP_RENAME finally gets around to locking it and
:removing it. Something needs to protect the source node against
:being renamed/removed between the point that the source node is
:initially looked up and the point that it is finally locked. Both
:Matt's SOFTLOCKLEAF and the VRENAME flag are there to provide this
:protection.
:
:It is the fact that this problem is entirely unique to VOP_RENAME
:that leads me to think that adding the generic SOFTLOCKLEAF code
:is overkill. The following fragment also suggests that maybe the
:approach doesn't actually fit in that well:

Well, maybe.  I have my eye on possibly seeing a way to fix
the race-to-root directory scanning program too, so I decided
to implement VSOFTLOCK formally rather than as a hack.

:The way that vclearsoftlock() is used to clear a flag in an unlocked
:vnode is also not ideal. This should probably be protected at least
:by v_interlock as other flags are.
:
:Ian

In -current, definitely.  I'm not sure why v_interlock even exists
in -stable.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: patch #3 (was Re: bleh. Re: ufs_rename panic)

2001-10-03 Thread Matt Dillon


:The rename routine is probably the most convoluted in the entire file
:system code (FFS). Now that everybody's memory is fresh, I would like to
:ask something about it:
:
:(1) I am always wondering why not use a global rename lock so that there
:is only one rename operation in progress at any time. This method is
:used by GFS and probably Linux.  This could make the code simpler. Maybe 
:we can even get rid of the relookup() stuff.
:
:This may reduce concurrency, but rename should not be a frequent
:operation.

Well, you could say that about virtually any filesystem operation.
Bitmaps are shared, for example.  It is a bad idea to try to code
simplistic solutions to complex problems.  Throughout the code history
of the BSDs we have had to constantly make adjustments to algorithms that
were not designed to scale past what the authors originally
believed was reasonable.

:(2) In the code of 4.3-release, we grab the source inode while holding the
:locks of target inodes.  In ufs_rename(), we have:
:
:if ((error = vn_lock(fvp, LK_EXCLUSIVE, p)) != 0)
: goto abortit;
:
:I wonder whether this could cause deadlock. I think locking more than
:one inode should be done in some sequence (ie. order them by inode 
:number).

Hmm.  Yes, there might possibly be a problem there.  We may be
safe due to the fact that only directory scans and rename hold multiple
vnodes locked, and in this case the destination directory holding the
destination file is already locked.  However, if the source directory/file
gets ripped out from under rename() the 'new' location of the source
directory/file could cause a deadlock against another process.  It
would be very difficult to generate it though.
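
For reference, the ordering idea in (2) would look roughly like this
(a sketch only, not something ufs_rename() does today):

    /*
     * Lock two vnodes in a canonical order (here by inode number) so
     * that two processes locking the same pair can never deadlock.
     */
    if (VTOI(fvp)->i_number < VTOI(tvp)->i_number) {
        vn_lock(fvp, LK_EXCLUSIVE | LK_RETRY, p);
        vn_lock(tvp, LK_EXCLUSIVE | LK_RETRY, p);
    } else {
        vn_lock(tvp, LK_EXCLUSIVE | LK_RETRY, p);
        vn_lock(fvp, LK_EXCLUSIVE | LK_RETRY, p);
    }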

:(4) This is not related to rename().  But ufs_mknod() reload the inode 
:through VFS_VGET() to avoid duplicate aliases.  I can not see why it
:is necessary. I asked this before, but have not got any satisfactory
:responses.
:...
:Any ideas are welcome. Thanks,
:
:-Zhihui

I don't know the answer to this at the moment.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: bleh. Re: ufs_rename panic

2001-10-02 Thread Matt Dillon


:The problems all arise from the fact that we unlock the source
:while we look up the destination, and when we return to relookup
:the source, it may have changed/moved/disappeared. The reason to
:unlock the source before looking up the destination was to avoid
:deadlocking against ourselves on a lock that we held associated 
:with the source. Since we now allow recursive locks on vnodes, it
:is no longer necessary to release the source before looking up
:the destination. So, it seems to me that the correct fix is to
:*not* release the source after looking it up, but rather hold it
:locked while we look up the destination. We can completely get
:rid of relookup and lots of other hairy code and generally make
:rename much simpler. Am I missing something here?
:
:   ~Kirk

   That was the first thing I thought of, but unfortunately it
   is still possible to deadlock against another process...
   for example, a process doing an (unrelated) rename in the reverse 
   direction.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Memory allocation question

2001-10-02 Thread Matt Dillon


:
:Dwayne wrote:
:>  I'm creating an app where I want to use memory to store data so I
:> can get at it quickly. The problem is, I can't afford the delays that
:> would occur if the memory gets swapped out. Is there any way in FreeBSD
:> to allocate memory so that the VM system won't swap it out?
:> 
:I think mlock(2) is what you want.
:
:Maxime Henrion
:-- 
:Don't be fooled by cheap finnish imitations ; BSD is the One True Code

Don't use mlock().

Use SysV Shared memory segments.  If you tell the kernel to use 
physical ram for SysV shared memory (kern.ipc.shm_use_phys=1)
then any shm segments you allocate (see manual pages for
shmctl, shmget, and shmat) will reside in unswappable shared memory.
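
A minimal userland sketch of that (the 64MB size is made up, and
kern.ipc.shmmax may need to be raised for large segments):

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define SEGSIZE (64 * 1024 * 1024)      /* hypothetical 64MB */

    void *
    alloc_unswappable(void)
    {
        void *p;
        /* with kern.ipc.shm_use_phys=1 this segment is backed by
           physical memory and will not be swapped */
        int id = shmget(IPC_PRIVATE, SEGSIZE, IPC_CREAT | 0600);

        if (id == -1)
            return (NULL);
        p = shmat(id, NULL, 0);
        shmctl(id, IPC_RMID, NULL); /* removed when the last process detaches */
        return (p == (void *)-1 ? NULL : p);
    }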

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: patch #3 (was Re: bleh. Re: ufs_rename panic)

2001-10-02 Thread Matt Dillon


:
:Matt Dillon wrote:
:> Here is the latest patch I have.  It appears to completely solve the
:> problem.  I have shims in unionfs and nfs for the moment.
:
:This seems rather large compared to Ian Dowse's version..  Are you sure that
:you're doing this the right way?  Adding a whole new locking mechanism
:when the simple VRENAME flag to be enough seems like a bit of overkill..

Ian's doesn't fix any of the filesystem semantics bugs; it only prevents
the panic from occurring.  For example, if two hardlinked files
residing in different directories both get renamed simultaneously, one
of the rename()s can fail even though there is no conflict.  If you
have a simultaneous rmdir() and rename(), the rename() can return an
unexpected error code.  And so forth.

If you remove the filesystem semantics fixes from my patch you 
essentially get Ian's patch except that I integrated the vnode flag
in namei/lookup whereas Ian handles it manually in the syscall code.

Also, Ian's patch only affects system calls.  It doesn't do the same
fixes for nfs (server side) or unionfs.

-Matt

:I'm not criticizing your work, I am just wondering if you have considered
:Ian's work and feel that it is wrong or the wrong direction..  His certainly
:appears more elegant than yours so I want to understand why you feel yours
:is better than his.
:
:freebsd-hackers
:Message-id: <[EMAIL PROTECTED]>
:
:Cheers,
:-Peter
:--
:Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
:"All of this is for nothing if we don't go to the stars" - JMS/B5

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



patch #3 (was Re: bleh. Re: ufs_rename panic)

2001-10-02 Thread Matt Dillon

Here is the latest patch I have.  It appears to completely solve the
problem.  I have shims in unionfs and nfs for the moment.

The patch is against -stable.

* Implements SOFTLOCKLEAF namei() option

* Implements EAGAIN error & appropriate tsleep/retry code

* Universal for rename() & rmdir(). Final patch will also probably
  implement it on unlink() to solve (pre-existing) unlink/rename regular
  file race.

* Tested very well w/ UFS, should be compatible with and work for
  direct access to other filesystems that still use IN_RENAME.

* Tested for collision probability.  Answer: Very low even when
  one tries on purpose.  There is no need to implement a more
  sophisticated fine-grained tsleep.

-Matt


Index: kern/vfs_lookup.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_lookup.c,v
retrieving revision 1.38.2.3
diff -u -r1.38.2.3 vfs_lookup.c
--- kern/vfs_lookup.c   2001/08/31 19:36:49 1.38.2.3
+++ kern/vfs_lookup.c   2001/10/02 20:06:33
@@ -372,6 +372,11 @@
error = EISDIR;
goto bad;
}
+   if (cnp->cn_flags & SOFTLOCKLEAF) {
+   if ((error = vsetsoftlock(dp)) != 0)
+   goto bad;
+   VREF(dp);
+   }
if (wantparent) {
ndp->ni_dvp = dp;
VREF(dp);
@@ -565,13 +570,17 @@
error = EROFS;
goto bad2;
}
+   if (cnp->cn_flags & SOFTLOCKLEAF) {
+   if ((error = vsetsoftlock(dp)) != 0)
+   goto bad2;
+   VREF(dp);
+   }
if (cnp->cn_flags & SAVESTART) {
ndp->ni_startdir = ndp->ni_dvp;
VREF(ndp->ni_startdir);
}
if (!wantparent)
vrele(ndp->ni_dvp);
-
if ((cnp->cn_flags & LOCKLEAF) == 0)
VOP_UNLOCK(dp, 0, p);
return (0);
@@ -654,6 +663,11 @@
error = ENOTDIR;
goto bad;
}
+   if (cnp->cn_flags & SOFTLOCKLEAF) {
+   if ((error = vsetsoftlock(dp)) != 0)
+   goto bad;
+   VREF(dp);
+   }
if (!(cnp->cn_flags & LOCKLEAF))
VOP_UNLOCK(dp, 0, p);
*vpp = dp;
@@ -707,6 +721,11 @@
error = EROFS;
goto bad2;
}
+   if (cnp->cn_flags & SOFTLOCKLEAF) {
+   if ((error = vsetsoftlock(dp)) != 0)
+   goto bad2;
+   VREF(dp);
+   }
/* ASSERT(dvp == ndp->ni_startdir) */
if (cnp->cn_flags & SAVESTART)
VREF(dvp);
@@ -715,8 +734,9 @@
vrele(dvp);
 
if (vn_canvmio(dp) == TRUE &&
-   ((cnp->cn_flags & (NOOBJ|LOCKLEAF)) == LOCKLEAF))
+   ((cnp->cn_flags & (NOOBJ|LOCKLEAF)) == LOCKLEAF)) {
vfs_object_create(dp, cnp->cn_proc, cnp->cn_cred);
+   }
 
if ((cnp->cn_flags & LOCKLEAF) == 0)
VOP_UNLOCK(dp, 0, p);
Index: kern/vfs_subr.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.249.2.11
diff -u -r1.249.2.11 vfs_subr.c
--- kern/vfs_subr.c 2001/09/11 09:49:53 1.249.2.11
+++ kern/vfs_subr.c 2001/10/02 22:55:38
@@ -130,6 +132,8 @@
 #endif
 struct nfs_public nfs_pub; /* publicly exported FS */
 static vm_zone_t vnode_zone;
+static int vagain_count = 1;
+static int vagain_waiting = 0;
 
 /*
  * The workitem queue.
@@ -2927,6 +2963,13 @@
  struct nameidata *ndp;
  const uint flags;
 {
+   if (!(flags & NDF_NO_FREE_SOFTLOCKLEAF) &&
+   (ndp->ni_cnd.cn_flags & SOFTLOCKLEAF) &&
+   ndp->ni_vp) {
+   vclearsoftlock(ndp->ni_vp);
+   ndp->ni_cnd.cn_flags &= ~SOFTLOCKLEAF;
+   vrele(ndp->ni_vp);
+   }
if (!(flags & NDF_NO_FREE_PNBUF) &&
(ndp->ni_cnd.cn_flags & HASBUF)) {
zfree(namei_zone, ndp->ni_cnd.cn_pnbuf);
@@ -2955,3 +2998,55 @@
ndp->ni_startdir = NULL;
}
 }
+
+/*
+ * vsetsoftlock() -set the VSOFTLOCK flag on the vnode, return
+ * EAGAIN if it has already been set by someone else.
+ *
+ * note: we could further refine the collision by setting a VSOFTLOCKCOLL
+ * flag and then only waking up waiters when the colliding vnode is
+ * released.  However, this sort of collision does not happen often
+ * enough for such an addition to yield any improvement in performance.
+ */
+
+int
+vsetsoftlock(struct vnode *vp)
+{
+   int s;
+   int error = 0;
+
+   s = splbio();
+   if (vp->v_flag & VSOFTLOCK)
+ 

Re: bleh. Re: ufs_rename panic

2001-10-02 Thread Matt Dillon


Ok, I'm adding -hackers... another email thread got going in -committers.

Here is a patch set for -stable.  It isn't perfect but it does appear
to solve the problem.  The one case I don't handle right is when you have
a hardlinked file and two renames in two different directories occur on
the same file at the same time... that will (improperly) return an error
code when, in fact, it's perfectly acceptable to do that.

This patch appears to fix the utterly trivial crash reproduction that
Yevgeniy was able to produce with a couple of simple race scripts running
in the background.

What I've done is add a SOFTLOCKLEAF capability to namei().  If set, and
the file/directory exists, namei() will generate an extra VREF() on 
the vnode and set the VSOFTLOCK flag in vp->v_flag.  If the vnode already
has VSOFTLOCK set, namei() will return EINVAL.

Then in rename() and rmdir() I set SOFTLOCKLEAF for the namei resolution
and, of course, clean things up when everything is done.

The ufs_rename() and ufs_rmdir() code no longer have to do the IN_RENAME
hack at all, because it's now handled.

This patch set does not yet include fixes to unionfs or the nfs server
and is for informational purposes only.  Comments?

-Matt

Index: kern/vfs_lookup.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_lookup.c,v
retrieving revision 1.38.2.3
diff -u -r1.38.2.3 vfs_lookup.c
--- kern/vfs_lookup.c   2001/08/31 19:36:49 1.38.2.3
+++ kern/vfs_lookup.c   2001/10/02 19:04:21
@@ -372,11 +372,20 @@
error = EISDIR;
goto bad;
}
+   if ((cnp->cn_flags & SOFTLOCKLEAF) &&
+   (dp->v_flag & VSOFTLOCK)) {
+   error = EINVAL;
+   goto bad;
+   }
if (wantparent) {
ndp->ni_dvp = dp;
VREF(dp);
}
ndp->ni_vp = dp;
+   if (cnp->cn_flags & SOFTLOCKLEAF) {
+   VREF(dp);
+   vsetsoftlock(dp);
+   }
if (!(cnp->cn_flags & (LOCKPARENT | LOCKLEAF)))
VOP_UNLOCK(dp, 0, p);
/* XXX This should probably move to the top of function. */
@@ -565,13 +574,20 @@
error = EROFS;
goto bad2;
}
+   if ((cnp->cn_flags & SOFTLOCKLEAF) && (dp->v_flag & VSOFTLOCK)) {
+   error = EINVAL;
+   goto bad2;
+   }
if (cnp->cn_flags & SAVESTART) {
ndp->ni_startdir = ndp->ni_dvp;
VREF(ndp->ni_startdir);
}
if (!wantparent)
vrele(ndp->ni_dvp);
-
+   if (cnp->cn_flags & SOFTLOCKLEAF) {
+   VREF(dp);
+   vsetsoftlock(dp);
+   }
if ((cnp->cn_flags & LOCKLEAF) == 0)
VOP_UNLOCK(dp, 0, p);
return (0);
@@ -654,6 +670,15 @@
error = ENOTDIR;
goto bad;
}
+   if ((cnp->cn_flags & SOFTLOCKLEAF) &&
+   (dp->v_flag & VSOFTLOCK)) {
+   error = EINVAL;
+   goto bad;
+   }
+   if (cnp->cn_flags & SOFTLOCKLEAF) {
+   VREF(dp);
+   vsetsoftlock(dp);
+   }
if (!(cnp->cn_flags & LOCKLEAF))
VOP_UNLOCK(dp, 0, p);
*vpp = dp;
@@ -707,6 +732,10 @@
error = EROFS;
goto bad2;
}
+   if ((cnp->cn_flags & SOFTLOCKLEAF) && (dp->v_flag & VSOFTLOCK)) {
+   error = EINVAL;
+   goto bad2;
+   }
/* ASSERT(dvp == ndp->ni_startdir) */
if (cnp->cn_flags & SAVESTART)
VREF(dvp);
@@ -718,6 +747,10 @@
((cnp->cn_flags & (NOOBJ|LOCKLEAF)) == LOCKLEAF))
vfs_object_create(dp, cnp->cn_proc, cnp->cn_cred);
 
+   if (cnp->cn_flags & SOFTLOCKLEAF) {
+   VREF(dp);
+   vsetsoftlock(dp);
+   }
if ((cnp->cn_flags & LOCKLEAF) == 0)
VOP_UNLOCK(dp, 0, p);
return (0);
Index: kern/vfs_subr.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.249.2.11
diff -u -r1.249.2.11 vfs_subr.c
--- kern/vfs_subr.c 2001/09/11 09:49:53 1.249.2.11
+++ kern/vfs_subr.c 2001/10/02 18:45:55
@@ -2927,6 +2961,12 @@
  struct nameidata *ndp;
  const uint flags;
 {
+   if (!(flags & NDF_NO_FREE_SOFTLOCKLEAF) &&
+   (ndp->ni_cnd.cn_flags & SOFTLOCKLEAF) &&
+   ndp->ni_vp) {
+   vclearsoftlock(ndp->ni_vp);
+   vrele(ndp->ni_vp);
+  

Re: dump/restore and DIRPREF

2001-10-02 Thread Matt Dillon

I recommend using cpdup ( /usr/ports/sysutils/cpdup ), mainly because
you can ^C it and restart it at any time so it's a lot easier to 
play around with your directory dup'ing.


-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM: dynamic swap remapping (patch)

2001-09-30 Thread Matt Dillon

:> :Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
:> :[EMAIL PROTECTED] | TCP/IP since RFC 956
:> 
:> I think the file descriptor problem can be solved easily... simply
:> open the file, mmap() the entire 1G segment for this special application,
:> and then close() the file.  Then have sbrk() just eats out of the mapped 
:> segment.  Alternatively sbrk() could open/mmap/close in large 1MB or 4MB
:> segments, again leaving no file descriptors dangling.
:
:Won't that cause fragmentation?  You're forgettng the need to 
:ftruncate or pre-zero the file unless that's been fixed.
:
:-- 
:-Alfred Perlstein [[EMAIL PROTECTED]]

You have to pre-zero the file.  You can do it in reasonably-sized
chunks (like 4M) without causing fragmentation.  You *CANNOT* use 
ftruncate() to extend the file - that will virtually guarantee massive
fragmentation.
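
For example, something along these lines (a sketch; the helper name is
made up and it assumes the size is a multiple of the chunk):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define CHUNK   (4 * 1024 * 1024)       /* 4MB at a time */

    static char zbuf[CHUNK];                /* BSS, so already zero */

    int
    prezero_file(const char *path, off_t size)
    {
        off_t off;
        int fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd < 0)
            return (-1);
        /* write real zero-filled blocks; ftruncate() would just leave
           holes that fragment badly when they are filled in later */
        for (off = 0; off < size; off += CHUNK) {
            if (write(fd, zbuf, CHUNK) != CHUNK) {
                close(fd);
                return (-1);
            }
        }
        return (fd);
    }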

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: precise timing

2001-09-30 Thread Matt Dillon

   You definitely need to use a microcontroller.  Something like the 
   68HC11F1 is a good single-chip solution (though the F1 only has
   512 bytes of E^2).  I'm sure Motorola has newer chips with more
   on-board E^2.  Stepper motors can be manipulated from a PC parallel
   port but you will never get smooth output and you can forget about
   momentum acceleration.

   There are also a huge number of Intel-derivative microcontrollers that
   are just as self-contained and come in much smaller packages than typical
   Motorola parts.  

   I'm most familiar with the Motorolas... For stepper or waveform output
   I've always liked the Motorola MCUs because they have timer output
   compare registers that will automatically flip a bit for you on an
   output port, giving you timer resolution down to crystal / 4 and 
   accuracy equal to that of the crystal.

   But the Intel derivatives are going to be much, much cheaper... $2 or $3
   for an MCU that does what you want and is extremely easy to program.  Look
   at the MCS51 and MCS96 series.  Note that there are dozens of manufacturers
   of Intel-style controllers.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM: dynamic swap remapping (patch)

2001-09-30 Thread Matt Dillon


:
:In message <[EMAIL PROTECTED]>, Matt Dillon writes:
:>:  Second, application not always grows to 1G, most of the time it keeps
:>:  as small as 500M ;). Why should we precommit 1G for 500M data? Doing
:>:  multi-mmap memory management is additional pain.
:>
:>Even using file-backed memory is fairly trivial.  You don't need to
:>do multi-mmap memory management or do any kernel tweaking.  Just
:>reserve 1G and use a single mmap() and file per process.
:
:I once had a patch to phkmalloc() which backed all malloc'ed VM with
:hidden files in the users homedir.  It was written to put the VM
:usage under QUOTA control, but it had many useful side effects as well.
:
:I can't seem to find it right now, but it is trivial to do: just
:replace the sbrk(2) with mmap().  Only downside is the needed 
:filedescriptor which some shells don't like.
:
:-- 
:Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
:[EMAIL PROTECTED] | TCP/IP since RFC 956

I think the file descriptor problem can be solved easily... simply
open the file, mmap() the entire 1G segment for this special application,
and then close() the file.  Then have sbrk() just eat out of the mapped 
segment.  Alternatively sbrk() could open/mmap/close in large 1MB or 4MB
segments, again leaving no file descriptors dangling.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM: dynamic swap remapping (patch)

2001-09-30 Thread Matt Dillon


:  Second, application not always grows to 1G, most of the time it keeps
:  as small as 500M ;). Why should we precommit 1G for 500M data? Doing
:  multi-mmap memory management is additional pain.

Why not?  Disk space is cheap.  For a problem like this I would simply
throw in two 30G+ hard drives and partition them with 16G of swap each,
giving me 32G of swap for the machine.  If you needed to do it cheaply
you could even use IDE, though personally I would use SCSI for 
reliability.  Depending on the amount of real memory in the machine
you might have to tweak a few kernel options (like matching NSWAP to
the actual number of swap devices), but basically it should just work.

Even using file-backed memory is fairly trivial.  You don't need to
do multi-mmap memory management or do any kernel tweaking.  Just
reserve 1G and use a single mmap() and file per process.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM: dynamic swap remapping (patch)

2001-09-29 Thread Matt Dillon

:> overcommit?  I've always wanted the ability to turn off overcommit
:> for exactly the same reasons you do.
:
:FWIW: Tru64 has had this capability since day one. You can select
:swap-overcommit mode by removing a symlink (/sbin/swapdefault -> /dev/foob)
:were /dev/foob is the primary swap partition.
:
:W/
:
:-- 
:|   / o / /_  _email:  [EMAIL PROTECTED]
:|/|/ / / /(  (_)  BulteArnhem, The Netherlands 

Well, the overcommit argument comes up once or twice a year.  Frankly
I don't see much of a point to it.  While it is true that you could 
implement a signal, the plain fact of the matter is that having to deal
with the possibility in a program at the N points (generally hundreds of
points) where that program allocates memory, either directly or 
indirectly, virtually guarantees that you will introduce bugs into the
system.  You also cannot guarantee that your process will have time to
clean up before the system kills it, nor can you guarantee that all the
standard system utilities and daemons will be able to gracefully handle
the out-of-memory condition.  In other words, you could implement
the signal and even have the program use it, but you will still likely
leave gaping holes in the implementation that will result in lost data.

It is much easier to manage memory manually.  For example, if these
programs require 1G of independent memory to run, it ought to be a
fairly simple matter to simply create a 1GB file for each process
(using dd rather than ftruncate() to create the file so the blocks are
preallocated), mmap() it using PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NOSYNC,
and do your memory management out of that.  The memory space will be
backed by the file rather than by swap.  You get all the benefits of
the standard overcommit capabilities of the system as well as the
ability to pre-reserve the main workspace for the programs, and you
automatically get persistent storage for the data.  Problem solved.
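
Once the file exists, the mapping step is just something like this
(a sketch with a made-up helper name; error handling trimmed):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define ARENASIZE   (1024UL * 1024 * 1024)      /* the 1GB workspace */

    void *
    map_workspace(const char *path)
    {
        void *base;
        int fd = open(path, O_RDWR);

        if (fd < 0)
            return (MAP_FAILED);
        base = mmap(NULL, ARENASIZE, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_NOSYNC, fd, 0);
        close(fd);      /* the mapping survives the close() */
        return (base);  /* carve the program's allocations out of this */
    }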

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: more on Re: Please review: bugfix for vinvalbuf()

2001-09-28 Thread Matt Dillon

:@@ -721,9 +721,9 @@
:   }
:   }
:
:-  while (vp->v_numoutput > 0) {
:-  vp->v_flag |= VBWAIT;
:-  tsleep(&vp->v_numoutput, PVM, "vnvlbv", 0);
:+  if (VOP_GETVOBJECT(vp, &object) == 0) {
:+  while (object->paging_in_progress)
:+  vm_object_pip_sleep(object, "vnvlbv");
:   }
:
:   splx(s);


Hey Douglas, try the patch fragment below and see if you can reproduce the
problem.

-Matt
 
@@ -721,10 +746,21 @@
}
}
 
-   while (vp->v_numoutput > 0) {
-   vp->v_flag |= VBWAIT;
-   tsleep(&vp->v_numoutput, PVM, "vnvlbv", 0);
-   }
+   /*
+* Wait for I/O to complete.  XXX needs cleaning up.  The vnode can
+* have write I/O in-progress but if there is a VM object then the
+* VM object can also have read-I/O in-progress.
+*/
+   do {
+   while (vp->v_numoutput > 0) {
+   vp->v_flag |= VBWAIT;
+   tsleep(&vp->v_numoutput, PVM, "vnvlbv", 0);
+   }
+   if (VOP_GETVOBJECT(vp, &object) == 0) {
+   while (object->paging_in_progress)
+   vm_object_pip_sleep(object, "vnvlbx");
+   }
+   } while (vp->v_numoutput > 0);
 
splx(s);
 

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: bind : address already inuse

2001-09-28 Thread Matt Dillon


:
:When an app binds an address and port to a listen socket, what
:variables can I adjust so the address may be reused immediately after
:the app exits.  My understanding was that
:
:int on = 1;
:setsockopt(s,SOL_SOCKET,SO_REUSEADDR,&on,sizeof(on));
:
:would do it but there still seems to be a significant amount of time
:between the exit and bind allowing a new app to use the address, even
:though there are no inbound connections pending in the listen queue
:when the exit occurs.
:I am debugging a server and the process requires restarting often.
:
:Thanks,
:Rick

SO_REUSEADDR is the correct socket opt and it will allow the address
to be reused immediately.  If you still get 'address already in use'
then there is still another process listening on the socket...
probably an older named that you missed, or perhaps the other named
simply wasn't exiting quickly enough before you started the new one.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: more on Re: Please review: bugfix for vinvalbuf()

2001-09-27 Thread Matt Dillon

I totally forgot about that one.  Your fix looks good, I'll start testing
it.

The bigger picture here is that vinvalbuf() is not typically called
while a vnode is still active.  NFS calls it on active vnodes in order
to invalidate the cache when the client detects that the file mtime
has been changed by someone else (ugly ugly ugly).  So this sort of 
crash can occur if one client is mmap()ing a file while another
client (or the server) writes to the file.

-Matt

:I recently mentioned on freebsd-stable in message
:
:  <[EMAIL PROTECTED]>
:
:a recurring rslock panic which I believe has been caused by the below
:mentioned problem in vinvalbuf(). I have worked up a patch for -STABLE
:(which should also apply to -CURRENT if there have not been major changes
:to this code). The patch is appended to this message for review.
:
:Data from the crash dump revealed the following:
:
:SMP 2 cpus
:IdlePTD 3555328
:initial pcb at 2cf280
:panicstr: rslock: cpu: 0, addr: 0xd7be66ec, lock: 0x0001
:panic messages:
:---
:panic: rslock: cpu: 0, addr: 0xd7be66ec, lock: 0x0001
:mp_lock = 0001; cpuid = 0; lapic.id = 0100
:boot() called on cpu#0
:
:---
:
:#0  dumpsys () at /usr/src/sys/kern/kern_shutdown.c:473
:#1  0xc016cf8f in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:313
:#2  0xc016d3a9 in panic (fmt=0xc025bcce "rslock: cpu: %d, addr: 0x%08x, lock: 0x%08x")
:at /usr/src/sys/kern/kern_shutdown.c:581
:#3  0xc025bcce in bsl1 ()
:#4  0xc021eeac in _unlock_things (fs=0xd7a6dec4, dealloc=0)
:at /usr/src/sys/vm/vm_fault.c:147
:#5  0xc021f8c7 in vm_fault (map=0xd7a6bf40, vaddr=673968128, fault_type=1 '\001',
:  fault_flags=0) at /usr/src/sys/vm/vm_fault.c:826
:#6  0xc025d016 in trap_pfault (frame=0xd7a6dfa8, usermode=1, eva=673972223)
:at /usr/src/sys/i386/i386/trap.c:829
:#7  0xc025ca8b in trap (frame={tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 99049,
:  tf_esi = 0, tf_ebp = -1077937884, tf_isp = -676929580, tf_ebx = 48110729,
:  tf_edx = 0, tf_ecx = 1835007, tf_eax = 672137216, tf_trapno = 12, tf_err = 4,
:  tf_eip = 134517190, tf_cs = 31, tf_eflags = 66070, tf_esp = -1077937940,
:  tf_ss = 47})
:at /usr/src/sys/i386/i386/trap.c:359
:#8  0x80491c6 in ?? ()
:#9  0x8048d9e in ?? ()
:#10 0x804a078 in ?? ()
:#11 0x8048b45 in ?? ()
:
:---
:
:A quick review of processes revealed a process stuck in vmopar:
:
:(kgdb) ps
:...
:46479 d7ffc560 d806e000 235886 1 46394  004006  3  tail vmopar c09120c8
:...
:
:which was sleeping in vm_object_page_remove() in vinvalbuf():
:  
:(kgdb) btp 46479
: frame 0 at 0xd806fc8c: ebp d806fcb8, eip 0xc0170051 :  mov
:0x141(%ebx),%al
: frame 1 at 0xd806fcb8: ebp d806fce0, eip 0xc02262e8 : 
:  add$0x10,%esp
: frame 2 at 0xd806fce0: ebp d806fd2c, eip 0xc019a667 :   add
:$0x10,%esp
: frame 3 at 0xd806fd2c: ebp d806fd60, eip 0xc01d0aa0 :   add   
: $0x18,%esp
: frame 4 at 0xd806fd60: ebp d806fe2c, eip 0xc01cf5d8 : mov
:%eax,0xff74(%ebp)
: frame 5 at 0xd806fe2c: ebp d806fe44, eip 0xc01f6842 : jmp
:0xc01f6849 
: frame 6 at 0xd806fe44: ebp d806fe78, eip 0xc01a22cc : mov
:%eax,0xffe8(%ebp)
: frame 7 at 0xd806fe78: ebp d806fef4, eip 0xc017b690 :  mov
:%eax,%esi
: frame 8 at 0xd806fef4: ebp d806ff28, eip 0xc017b556 : mov%eax,%esi
: frame 9 at 0xd806ff28: ebp d806ffa0, eip 0xc025d745 :mov
:%eax,0xffb8(%ebp)
:
:
:The patch is below, against vfs_subr.c 1.249.2.11 2001/09/11
:
:--- vfs_subr.c  Tue Sep 11 04:49:53 2001
:+++ vfs_subr.c.new  Wed Sep 26 15:23:09 2001
:@@ -721,9 +721,9 @@
:   }
:   }
:
:-  while (vp->v_numoutput > 0) {
:-  vp->v_flag |= VBWAIT;
:-  tsleep(&vp->v_numoutput, PVM, "vnvlbv", 0);
:+  if (VOP_GETVOBJECT(vp, &object) == 0) {
:+  while (object->paging_in_progress)
:+  vm_object_pip_sleep(object, "vnvlbv");
:   }
:
:   splx(s);

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Conclusions on... was Re: More on the cache_purgeleafdirs() routine

2001-09-27 Thread Matt Dillon


:
:On Sunday, 23rd September 2001, Poul-Henning Kamp wrote:
:
:>Things to look out for:
:>
:>1. !ufs filesystems
:
:I am irredeemably slack for not testing this a lot but...
:
:I believe I saw bad interactions between vmiodirenable and isofs on 4.3-R.
:
:I mounted a CD, looked at stuff on it, did a lot of other work, went back
:to the CD and files were screwy (files contained the contents of other
:files, files were zero size).  I unmounted and remounted the CD and
:everything was fine.  The machine is a reliable old workhorse, and has
:no hardware errors.
:
:Since then, I've not had a chance to go back and check.  It's only because
:you are making vmiodirenable the default that I'm mentioning it.  Sorry
:for not making a proper bug report containing actual facts. :-(
:
:Stephen.

Hmm.  Well, if someone can reproduce the problem it sounds like it
ought to be easy to track down.  I am somewhat skeptical that 
vmiodirenable could cause that but I suppose it's possible.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Conclusions on... was Re: More on the cache_purgeleafdirs() routine

2001-09-26 Thread Matt Dillon

:
:Then I suggest the following to be changed:
:
:#
:#  This file is read when going to multi-user and its contents piped thru
:#  ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for details.
:#
:
:# $FreeBSD: src/etc/sysctl.conf,v 1.5 2001/08/26 02:37:22 dd Exp $
:
:vfs.vmiodirenable=1 # Set to 1 to enable the use of the VM subsystem to
:# back UFS directory memory requirements. Because of
:# the amount of wasted memory this causes it's not
:# advised for machines with less than 64MB of RAM,
:# on machines with more than 64MB it can provide a
:# substantial benefit related to directory caching.
:
:That was what I read and confused me ;)

Yah.  That isn't correct any more.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-25 Thread Matt Dillon

I really don't think it is necessary to hack up GCC to figure
out stack utilization.  We have issues with only a few drivers
and it is fairly trivial (as my patch shows) to throw a pattern
into the kernel stack to determine how much is actually used.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-25 Thread Matt Dillon

:So, Matt, does this solve the original question? (VM Corruption) or 
:is it just a fruitful red-herring?
:-- 
:++   __ _  __
:|   __--_|\  Julian Elischer |   \ U \/ / hard at work in 

It seems unlikely to me, but you never know.  Certainly this is a
problem that has to be fixed now.  I've bumped -stable's UPAGES up to 3
but we absolutely have to MFC the fixes for the two devices allocating
2K stacks.  Maybe I should have bumped UPAGES up to 4 :-)

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-25 Thread Matt Dillon

:I had been contemplating making a fake 'struct user' in userland only in
:order to keep the a.out coredump reader code happy.  The a.out coredump
:code (see cpu_coredump() in */*/vm_machdep.c) can generate this fake
:structure in order to keep gdb happy.  But then I realized that a.out
:coredump debugging was almost totally irrelevant these days.
:
:Cheers,
:-Peter
:--
:Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]

Hmm.  How about this... if we keep the guard field at the end of 
struct user we could #ifdef _KERNEL it so userland doesn't notice it.
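
i.e. something like this at the tail of struct user (a sketch based on
the guard patch posted earlier):

    struct  md_coredump u_md;   /* machine dependent glop */
    #ifdef _KERNEL
        u_int32_t u_guard;      /* guard the base of the kstack */
    #endif
    };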

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-25 Thread Matt Dillon

:Ok, time to take a good stab at sticking my foot in my mouth here.
:
:Would it be possible to have a kernel mode where the read-only bit was
:turned on for malloc pools which shouldn't currently be accessed?  This
:could be gated through the spl() calls (or specific mutexes on -current),
:ensuring that something like getpid couldn't stomp on the vm structures
:w/o first doing a splvm().

Kinda sounds like Multics :-)... no, it would be too messy trying
to protect kernel structures in one subsystem from another.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Patch to test kstack usage.

2001-09-24 Thread Matt Dillon

:stack size = 4688

Sep 24 22:47:22 test1 /kernel: process 29144 exit kstackuse 4496

closer... :-)

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Patch to test kstack usage.

2001-09-24 Thread Matt Dillon

:
:Matt Dillon wrote:
:> This isn't perfect but it should be a good start in regards to 
:> testing kstack use.  This patch is against -stable.  It reports
:> kernel stack use on process exit and will generate a 'Kernel stack
:> underflow' message if it detects an underflow.  It doesn't panic,
:> so for a fun time you can leave UPAGES at 2 and watch in horror.
:
:It is checking against the wrong guard value. It should be u_guard2.
:
:FWIW; the max stack available is 4688 bytes on a standard 4.x system. Yes,
:that is too freaking close.  Also, the maximum usage depends on what sort
:of cards you have in the system.. If you have a heavy tty user (eg: a 32+

I looked at it fairly carefully.  It has got to be u_guard... at the
end of struct user, at least until you do that MFC.  The ptrace code
appears to mess around with u_kproc quite a bit.  And when you rip out
u_kproc it still needs to be at the end, after the coredump structure
(though for i386 the coredump structure is empty)... because interrupts
can occur during a core dump.

:port serial card) then you have lots of tty interrupts nesting as well.
:Having the ppp/sl/plip drivers in the system partly negates the effect of
:this though since it wires the net/tty interrupt masks together.
:...
:Cheers,
:-Peter
:--
:Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
:"All of this is for nothing if we don't go to the stars" - JMS/B5
:

Yah... the test I ran was just a couple of seconds worth of playing
around over ssh.  I expect the worst case to be a whole lot worse.

We're going to have to bump up UPAGES to 3 in 4.x, there's no question
about it.  I'm going to do it tonight.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Patch to test kstack usage.

2001-09-24 Thread Matt Dillon

This isn't perfect but it should be a good start in regards to 
testing kstack use.  This patch is against -stable.  It reports
kernel stack use on process exit and will generate a 'Kernel stack
underflow' message if it detects an underflow.  It doesn't panic,
so for a fun time you can leave UPAGES at 2 and watch in horror.

note: make sure you make depend before making a new kernel, or use
buildkernel.

-Matt


Index: sys/user.h
===
RCS file: /home/ncvs/src/sys/sys/user.h,v
retrieving revision 1.24
diff -u -r1.24 user.h
--- sys/user.h  1999/12/29 04:24:49 1.24
+++ sys/user.h  2001/09/25 03:41:04
@@ -109,9 +109,13 @@
 * Remaining fields only for core dump and/or ptrace--
 * not valid at other times!
 */
+   u_int32_t u_guard2; /* guard the base of the kstack */
struct  kinfo_proc u_kproc; /* proc + eproc */
struct  md_coredump u_md;   /* machine dependent glop */
+   u_int32_t u_guard;  /* guard the base of the kstack */
 };
+
+#define U_GUARD_MAGIC   0x51A2C3D4
 
 /*
  * Redefinitions to make the debuggers happy for now...  This subterfuge
Index: kern/init_main.c
===
RCS file: /home/ncvs/src/sys/kern/init_main.c,v
retrieving revision 1.134.2.6
diff -u -r1.134.2.6 init_main.c
--- kern/init_main.c2001/06/15 09:37:55 1.134.2.6
+++ kern/init_main.c2001/09/25 01:39:05
@@ -358,6 +358,7 @@
 */
p->p_stats = &p->p_addr->u_stats;
p->p_sigacts = &p->p_addr->u_sigacts;
+   p->p_addr->u_guard = U_GUARD_MAGIC; /* bottom of kernel stack */
 
/*
 * Charge root for one process.
Index: kern/kern_exit.c
===
RCS file: /home/ncvs/src/sys/kern/kern_exit.c,v
retrieving revision 1.92.2.5
diff -u -r1.92.2.5 kern_exit.c
--- kern/kern_exit.c2001/07/27 14:06:01 1.92.2.5
+++ kern/kern_exit.c2001/09/25 04:09:32
@@ -123,6 +123,16 @@
WTERMSIG(rv), WEXITSTATUS(rv));
panic("Going nowhere without my init!");
}
+   {
+   int *ua;
+   int *addrend = (int *)((char *)p->p_addr + UPAGES * PAGE_SIZE);
+   for (ua = &p->p_addr->u_guard + 1; ua < addrend; ++ua) {
+   if (*ua != 0x11111111)
+   break;
+   }
+   printf("process %d exit kstackuse %d\n",
+   p->p_pid, (char *)addrend - (char *)ua);
+   }
 
aio_proc_rundown(p);
 
Index: kern/kern_synch.c
===
RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.87.2.3
diff -u -r1.87.2.3 kern_synch.c
--- kern/kern_synch.c   2000/12/31 22:10:45 1.87.2.3
+++ kern/kern_synch.c   2001/09/25 02:54:46
@@ -44,13 +44,17 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
+#include 
 #ifdef KTRACE
 #include 
 #include 
@@ -792,6 +796,13 @@
register struct proc *p = curproc;  /* XXX */
register struct rlimit *rlim;
int x;
+
+   /*
+* Check to see if the kernel stack underflowed (XXX)
+*/
+   if (p->p_addr->u_guard != U_GUARD_MAGIC) {
+   printf("Kernel stack underflow! %p %p %08x\n", p, p->p_addr, 
+p->p_addr->u_guard);
+   }
 
/*
 * XXX this spl is almost unnecessary.  It is partly to allow for
Index: i386/i386/pmap.c
===
RCS file: /home/ncvs/src/sys/i386/i386/pmap.c,v
retrieving revision 1.250.2.10
diff -u -r1.250.2.10 pmap.c
--- i386/i386/pmap.c2001/07/30 23:27:59 1.250.2.10
+++ i386/i386/pmap.c2001/09/25 04:03:52
@@ -891,6 +891,7 @@
}
if (updateneeded)
invltlb();
+   memset(up, 0x11, UPAGES * PAGE_SIZE);
 }
 
 /*
Index: i386/include/param.h
===
RCS file: /home/ncvs/src/sys/i386/include/param.h,v
retrieving revision 1.54.2.5
diff -u -r1.54.2.5 param.h
--- i386/include/param.h2001/09/15 00:50:36 1.54.2.5
+++ i386/include/param.h2001/09/25 03:41:11
@@ -110,7 +110,7 @@
 #define MAXDUMPPGS (DFLTPHYS/PAGE_SIZE)
 
 #define IOPAGES2   /* pages of i/o permission bitmap */
-#define UPAGES 2   /* pages of u-area */
+#define UPAGES 4   /* pages of u-area */
 
 /*
  * Ceiling on amount of swblock kva space.
Index: vm/vm_glue.c
===
RCS file: /home/ncvs/src/sys/vm/vm_glue.c,v
retrieving revision 1.94.2.1
diff -u -r1.94.2.1 vm_glue.c
--- vm/vm_glue.c2000/08/02 22:15:09   

Re: VM Corruption - stumped, anyone have any ideas?

2001-09-24 Thread Matt Dillon

:Oh, one other thing...  When we had PCIBIOS active for pci config space
:read/write support, we had stack overflows on many systems when the SSE
:stuff got MFC'ed.  The simple act of trimming about 300 bytes from the
:pcb_save structure was enough to make the difference between it working or
:not.  We are *way* too close to the wire.  I asked about raising UPAGES
:from 2 to 3 before RELENG_4_4 but it never happened.
:
:Julian cleaned up a couple of places stuff where we were allocating 2K of
:local data *twice* on local stack frames.  There are some gcc patches
:floating around that enable you to generate a warning if your local stack
:frame exceedes a certain amount or the arguments are bigger than a
:specified amount.
:
:Cheers,
:-Peter
:--
:Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]

I'm getting stack underflows with UPAGES set to 2.  I've set UPAGES to 4
and preinitialized the UAREA to 0x11 and then scan it in exit1() to
determine how much stack was actually used.  If these numbers are
correct, we are screwed with UPAGES set to 2.  This is just 4 seconds
worth of a buildworld.  Note the '3664's showing up.  That's too close.
Note the 3984 that came up after playing with the system for a few 
seconds!
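
A hedged user-land illustration of the watermark trick described above (not
the kernel patch itself, which is posted separately; the 4096-byte region and
the 1000-byte figure are made-up numbers): fill a region with a known pattern,
consume part of it from the top, then scan up from the guarded end to see how
much was touched.

    #include <stdio.h>
    #include <string.h>

    #define REGION  4096    /* stand-in for the UPAGES kernel stack area */

    int
    main(void)
    {
        char region[REGION];
        int i;

        memset(region, 0x11, sizeof(region));     /* preinitialize, like the UAREA */
        memset(region + REGION - 1000, 0, 1000);  /* pretend 1000 bytes were used from the top */

        for (i = 0; i < REGION; i++) {            /* scan upward from the guarded base */
            if (region[i] != 0x11)
                break;
        }
        printf("region use %d\n", REGION - i);    /* prints: region use 1000 */
        return (0);
    }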

I'll post the patch set to use to test this stuff in a moment.

-Matt

process 323 exit kstackuse 2272
...
process 333 exit kstackuse 2272
process 225 exit kstackuse 3664
process 233 exit kstackuse 2272
...
process 237 exit kstackuse 2272
process 322 exit kstackuse 2676
process 334 exit kstackuse 2272
...
process 319 exit kstackuse 2272

test1# dmesg | fgrep process | sort -n +4 | tail -10
process 6 exit kstackuse 3640
process 89 exit kstackuse 3640
process 176 exit kstackuse 3664
process 186 exit kstackuse 3664
process 225 exit kstackuse 3664
process 290 exit kstackuse 3664
process 299 exit kstackuse 3664
process 300 exit kstackuse 3664
process 303 exit kstackuse 3664
process 138 exit kstackuse 3984


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-24 Thread Matt Dillon

:
:I did it as part of the KSE work in 5.x.  It would be quite easy to do it
:for 4.x as well, but it makes a.out coredumps problematic.
:
:Also, "options UPAGES=4" is a pretty good defensive measure.
:
:Cheers,
:-Peter
:--
:Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]

Well, in 4.x:

(kgdb) print p->p_addr  
$6 = (struct user *) 0xcb7b9000
(kgdb) print &p->p_addr->u_sigacts
$7 = (struct sigacts *) 0xcb7b9260
(kgdb) print &p->p_addr->u_stats  
$8 = (struct pstats *) 0xcb7b9cd0
(kgdb) print &p->p_addr->u_kproc
$9 = (struct kinfo_proc *) 0xcb7b9db0
(kgdb) print &p->p_addr->u_md   
$10 = (struct md_coredump *) 0xcb7ba1d0
(kgdb) print &p->p_addr->u_guard(my new field)
$11 = (u_int32_t *) 0xcb7ba1d0
(kgdb) 

cb7b9000start of kstack
cb7ba1d4end of struct user
cb7bb000top of kstack

Leaving us 3628 bytes for the kernel stack.

Something really weird is going on... I added u_guard to the end
of the struct user structure and there are two or three processes
hitting the guard immediately.  All the rest are ok.  I'm going
to investigate further but this is very odd.  Am I missing something
about the UAREA?

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: ecc on i386

2001-09-24 Thread Matt Dillon


:What happens on an ECC equipped PC when you have a multi-bit memory
:error that hardware scrubbing can't fix?  Will there be some sort of
:NMI or something that will panic the box?
:
:I'm used to alphas (where you'll get a fatal machine check panic) and
:I am just wondering if PCs are as safe.
:
:Thanks,
:
:Drew

ECC can typically detect and correct single bit errors and detect
double bit errors.  Anything beyond that is problematic... it may or
may not detect the problem or may mis-correct a multi-bit error. 
An NMI is generated if an uncorrectable error is detected.
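
As a hedged aside (illustrative C, not FreeBSD code): plain parity is the
simplest form of such a check and shows why any single flipped bit is
detectable; SECDED ECC adds enough check bits to also locate and correct that
bit and to flag double-bit errors, which is when the NMI path above fires.

    #include <stdint.h>

    /* XOR-fold parity of a 64-bit word: 1 if an odd number of bits are set. */
    static int
    parity64(uint64_t w)
    {
        w ^= w >> 32;
        w ^= w >> 16;
        w ^= w >> 8;
        w ^= w >> 4;
        w ^= w >> 2;
        w ^= w >> 1;
        return ((int)(w & 1));
    }

    int
    main(void)
    {
        uint64_t w = 0x123456789abcdef0ULL;

        /*
         * Flipping any one bit changes the parity, so the comparison below
         * is always false (0).  Flipping two bits restores the parity, which
         * is why parity alone cannot catch double-bit errors and ECC needs
         * the extra check bits.
         */
        return (parity64(w) == parity64(w ^ 0x200ULL));
    }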

On PC's, ECC is optional.  Desktops typically do not ship with ECC
memory.  Branded servers typically do.  A year or two ago I would
have been happy to use non-ECC rams (finding bad RAM through trial
and error), but now with capacities as they are and memory prices down
ECC is definitely the way to go.

Bit errors can come from many sources, memory being only one.  Bit errors
can occur inside the cpu chip, in the L1 and L2 caches, in memory, in
controller chips... all over the place.  Many modern processors implement
parity on their caches to try to cover the problem areas.  I'm not sure
how Pentium III's and IV's are setup.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-24 Thread Matt Dillon


:
:In message <[EMAIL PROTECTED]>, Matt Dillon writes:
:>
:>Hmm.  Do we have a guard page at the base of the per process kernel
:>stack?
:
:As I understand it, no. In RELENG_4 there are UPAGES (== 2 on i386)
:pages of per-process kernel state at p->p_addr. The stack grows
:down from the top, and struct user (sys/user.h) sits at the bottom.
:According to the comment in the definition of struct user, only
:the first three items in struct user are valid in normal running
:conditions:

Er, I mean I'll add a magic number to struct pstats.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-24 Thread Matt Dillon

:In message <[EMAIL PROTECTED]>, Matt Dillon writes:
:>
:>Hmm.  Do we have a guard page at the base of the per process kernel
:>stack?
:
:As I understand it, no. In RELENG_4 there are UPAGES (== 2 on i386)
:pages of per-process kernel state at p->p_addr. The stack grows
:down from the top, and struct user (sys/user.h) sits at the bottom.
:According to the comment in the definition of struct user, only
:the first three items in struct user are valid in normal running
:conditions:

Ok.  I'm going to add a magic number to the end of the process
structure and check it in mi_switch() in -stable.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-24 Thread Matt Dillon


:
:remember that we hit almost this problem with the KSE stuff during
:debugging?
:
:The pointers in the last few entries of the vm_page_buckets array got
:corrupted when an argument to a function that manipulated whatever was next
:in ram was 0, and it turned out that it was 0 because
: of some PTE flushing thing (you are the one that found it... remember?)
:(there was a line of asm code missing)

I've kept that in mind, but I think this may be a different issue.
The memory involved is 100% statically mapped in the kernel page table
array, and the errors are more like bit errors than anything else.  Either
the memory is bad or something in our kernel is setting or clearing flags
through a bad pointer.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-24 Thread Matt Dillon


:>The pointers in the last few entries of the vm_page_buckets array got
:>corrupted when an argument to a function that manipulated whatever was next
:>in ram was 0, and it turned out that it was 0 because
:> of some PTE flushing thing (you are the one that found it... remember?)
:
:I think I've also seen a few reports of programs exiting with
:"Profiling timer expired" messages with 4.4. These can be caused
:by stack overflows, since the p_timer[] array in struct pstats is
:one of the things that I think lives below the per-process kernel
:stack. I wonder if they are related? Stack overflows could result
:in corruption of local variables, after which anything could happen.
:
:That said, hardware problems are still a possiblilty.
:
:Ian

Hmm.  Do we have a guard page at the base of the per process kernel
stack?

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: VM Corruption - stumped, anyone have any ideas?

2001-09-24 Thread Matt Dillon


:
:In message <[EMAIL PROTECTED]>, Matt Dillon writes:
:>
:>$8 = 58630
:>(kgdb) print vm_page_buckets[$8]
:
:What is vm_page_hash_mask? The chunk of memory you printed out below
:looks alright; it is consistent with vm_page_array == 0xc051c000. Is
:it just the vm_page_buckets[] pointer that is corrupt?
:
:The address 0xc08428cc is (char *)&vm_page_array[55060] + 28, and
:sizeof(struct vm_page) is 60, so 0xc08428cc is in the middle of
:a vm_page within vm_page_array[].
:
:Ian

(kgdb) print vm_page_buckets[58630]
$5 = (struct vm_page *) 0xc08428cc
(kgdb) print vm_page_array
$6 = 0xc051c000
(kgdb) print vm_page_hash_mask
$7 = 262143
(kgdb) print &vm_page_array[55060]
$11 = (struct vm_page *) 0xc08428b0
(kgdb) print &vm_page_array[55061]
$10 = (struct vm_page *) 0xc08428ec

Yowzer.  How the hell did that happen!  Yes, you're right, the
vm_page_array[] pointer has gotten corrupted.  If we assume that
the vm_page_t is valid (0xc0842acc), then the vm_page_buckets[]
pointer should be that.

vm_page_buckets[58630]  -> c08428cc
panic on vm_page_t m-> c0842acc

Ok, so the corruption here is that an 'a' turned into an '8'. 1010 turned
into 1000... a bit got cleared.
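
A hedged three-line check (user-land, not part of the kernel) makes the same
point: XOR of the stored and expected pointers isolates exactly which bit
changed.

    #include <stdio.h>

    int
    main(void)
    {
        unsigned long stored   = 0xc08428ccUL;  /* what vm_page_buckets[58630] held */
        unsigned long expected = 0xc0842accUL;  /* the vm_page_t the panic reported */

        printf("xor = %#lx\n", stored ^ expected);  /* prints: xor = 0x200 -- a single bit */
        return (0);
    }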

This is very similar to the corruption I found on one of Yahoo's 
machines.  Except on that machine two bits were changed.  It's as though
some other subsystem is trying to manipulate a flag in a structure using
a bad structure pointer.

-Matt



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Second set of stable buildworld results w/ vmiodirenable & nameileafonly combos

2001-09-24 Thread Matt Dillon

Ok, here is the second set of results.  I didn't run all the tests
because nothing I did appeared to really have much of an effect.  In
this set of tests I set MAXMEM to 128M.  As you can see the buildworld
took longer versus 512M (no surprise), and vmiodirenable still helped
versus having it off.  If one takes into consideration the standard
deviation, the directory vnode reclamation parameters made absolutely
no difference in the tests.

The primary differentiator in all the tests is 'block input ops'.  With
vmiodirenable turned on it sits at around 51000.  With it off it sits
at around 56000.  In the 512M tests the pass-1 numbers were 26000 with
vmiodirenable turned on and 33000 with it off.  Pass-2 numbers were
9000 with it on and 18000 with it off.  The directory leaf reuse 
parameters had almost no effect on either the 128M or 512M numbers.

I'm not sure why test2 wound up doing a better job then test1 in the
128M tests with vmiodirenable disabled.  Both machines are configured
identically with only some extra junk on test1's /usr from prior tests.
In any case, the differences point to a rather significant error spread
in regards to possible outcomes, at least with vmiodirenable=0.

My conclusion from all of this is:

* vmiodirenable should be turned on by default.

* We should rip out the cache_purgeleafdirs() code entirely and use my
  simpler version to fix the vnode-growth problem.

* We can probably also rip out my cache_leaf_test() .. we do not need 
  to add any sophistication to reuse only directory vnodes without 
  subdirectories in the cache.  If it had been a problem we would have
  seen it.

I can leave the sysctl's in place on the commit to allow further testing,
and I can leave it conditional on vmiodirenable.  I'll set the default
vmiodirenable to 1 (which will also enable directory vnode reuse) and
the default nameileafonly to 0 (i.e. to use the less sophisticated check).
In a few weeks I will rip-out nameileafonly and cache_leaf_test().

-Matt


WIDE TERMINAL WINDOW REQUIRED! 
---

TEST SUITE 2 (128M ram)

buildworld of -stable.  DELL2550 (Dual PIII-1.2GHz / 128M ram (via MAXMEM) / 
SCSI)
23 September 2001   SMP kernel, softupdates-enabled, dirpref'd local 
/usr/src (no nfs),
make -j 12 buildworld   UFS_DIRHASH.  2 identical machines tested in parallel 
(test1, test2)
/usr/bin/time -l timings        note: atime updates left enabled in all tests

REUSE LEAF DIR VNODES:  directory vnodes with no subdirectories in the namei cache can 
be reused
REUSE ALL DIR VNODES:   directory vnodes can be reused (namei cache ignored)
DO NOT REUSE DIR...:(Poul's original 1995 algo) directory vnode can only be reused 
if no subdirectories or files in the
 namei cache

I stopped bothering with pass-2 after it became evident that the numbers
were not changing significantly.

VMIODIRENABLE ENABLED   [ A ]   [ B ]  
 [ C ]
[BEST CASE  ]   [BEST CASE  ]  
 [BEST CASE  ]
machine test1   test2   test1   test2   test1   test2   test1   test2  
 test1   test2   test1   test2
pass (2)R   1   1   2   2   R   1   1   2   2R 
 1   1   2   2
vfs.vmiodirenable   E   1   1   1   1   E   1   1   1   1E 
 1   1   1   1
vfs.nameileafonly   B   1   1   1   1   B   0   0   0   0B 
 -1  -1  -1  -1
O   OO
O   REUSE LEAF DIR VNODES   O   REUSE ALL DIR VNODES O 
 DO NOT REUSE DIR VNODES W/ACTIVE NAMEI
T   TT
26:49   26:30   26:41   26:24
real1609159016011584
user1361135413611356
sys 617 615 617 614
max resident16264   16256   16260   16264
avg shared mem  1030103010301030
avg unshared data   1004100510061004
avg unshared stack  129 129 129 129
page reclaims   11.16M  11.16M  11.15M  11.15M
page faults 3321367429402801
swaps   0   0   0   0
block input ops 51748   51881   50777   50690
block output ops5532649756806089
messages sent   35847   35848   35789   35715
messages received   35848   35852   35792   3

stable buildworld results w/ vmiodirenable & nameileafonly combos

2001-09-23 Thread Matt Dillon

Ok, here are the first set of results.  I am going to rerun the entire
suite of tests again with the machines limited to 128M of ram to see
what happens then.

BTW, these are really nice machines!  I highly recommend DELL2550's.

The results w/ 512M are basically that it doesn't matter what we do
with the namei cache.  vmiodirenable is the only thing that really makes
a difference.  I expect we will see more differentiation in the 128M
tests.

-Matt

WIDE TERMINAL WINDOW REQUIRED! 
---

TEST SUITE 1 (512M ram)

buildworld of -stable.  DELL2550 (Dual PIII-1.2GHz / 512M ram / SCSI)
23 September 2001   SMP kernel, softupdates-enabled, dirpref'd local 
/usr/src (no nfs),
make -j 12 buildworld   UFS_DIRHASH.  2 identical machines tested in parallel 
(test1, test2)
/usr/bin/time -l timings        note: atime updates left enabled in all tests

REUSE LEAF DIR VNODES:  directory vnodes with no subdirectories in the namei cache can 
be reused
REUSE ALL DIR VNODES:   directory vnodes can be reused (namei cache ignored)
DO NOT REUSE DIR...:(Poul's original 1995 algo) directory vnode can only be reused 
if no subdirectories or files in the
 namei cache


VMIODIRENABLE ENABLED   [ A ]   [ B ]  
 [ C ]
[BEST CASE  ]   [BEST CASE  ]  
 [BEST CASE  ]
machine test1   test2   test1   test2   test1   test2   test1   test2  
 test1   test2   test1   test2
pass (2)R   1   1   2   2   R   1   1   2   2R 
 1   1   2   2
vfs.vmiodirenable   E   1   1   1   1   E   1   1   1   1E 
 1   1   1   1
vfs.nameileafonly   B   1   1   1   1   B   0   0   0   0B 
 -1  -1  -1  -1
O   OO
O   REUSE LEAF DIR VNODES   O   REUSE ALL DIR VNODES O 
 DO NOT REUSE DIR VNODES W/ACTIVE NAMEI
T   TT
25:46   25:44   25:19   25:05   25:49   25:40   25:14   25:04  
 25:46   25:42   25:07   25:14
real15461544151915051549154015141504   
 1546154215071514
user13611352135913561362135413611354   
 1361135213581355
sys 636 637 645 640 632 633 642 641
 636 637 642 640
max resident16292   16276   16268   16288   16284   16288   16280   16280  
 16288   16288   16284   16288
avg shared mem  10261025102510251027102810251023   
 1025102610261025
avg unshared data   10181009101410071007101010071008   
 1010101810061002
avg unshared stack  129 129 129 129 129 129 129 128
 129 129 129 129
page reclaims   11.15M  11.15M  11.16M  11.15M  11.15M  11.16M  11.15M  11.16M 
 11.15M  11.15M  11.15M  11.15M
page faults 18121800131613481797179813201273   
 1795179713071321
swaps   0   0   0   0   0   0   0   0  
 0   0   0   0
block input ops 26542   26535   9272923326555   26577   92588720   
 26470   26552   89059237
block output ops54005217524852665450510953285389   
 5332537153005341
messages sent   34582   34572   33533   33538   34579   34538   33539   33510  
 34610   34587   33525   33543
messages received   34582   34578   33533   33539   34580   34540   33546   33524  
 34610   34589   33525   33543
signals received8   8   8   8   8   8   8   8  
 8   8   8   8
voluntary ctx sw594390  595868  575038  574626  594677  594028  571265  572912 
 593500  593548  571394  573575
invol. ctx switch   380583  381897  378850  376941  380393  381540  374408  376260 
 380385  379585  375318  374844

desiredvnodes   36157   36157   36157   36157   36157   36157   36157   36157  
 36157   36157   36157   36157
maxvnodes (sysstat)(1)  37099   37064   37122   37208   37180   36928   37193   37314  
 37175   37152   37152   37152


VMIODIRENABLE DISABLED  [ D ]   [ E ]
[BEST CASE  ]   [BEST CASE  ]
machine   

another correction (self negating)

2001-09-23 Thread Matt Dillon

I got the cache_leaf_test() return values backwards.  But that's ok,
because my cache_leaf_test() if() statement is also backwards :-).  I'll
turn them around in a later patch set.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Conclusions on... was Re: More on the cache_purgeleafdirs() routine

2001-09-23 Thread Matt Dillon


:Notice that both the user and system times increased..
:if there had been another parallel task, the overall system throughout may have
:decreased..
:
:I'm not saying this is wrong, just that we should look at other workloads too.
:no point in optimising the system for compiling itself.. that's not really a 
:real-world task..
:almost no-one buys a FreeBSD box so that they can compile FreeBSD.. they usually
:have other plans for what they want out of it...
:
:(of course this may effect other tasks in an even more positive way, but we
:should know that..)

Yah, but buildworld is all we can really do on test boxes.  We will have
to wait for people to try these things on production squid, news, and
web server systems to get real-world numbers.

The timing difference is 1.1%.  On two identical machines I got 1564
and 1553 seconds with the same config, which is 0.7%, so I expect the
std-dev is going to be, what, 1.4% or so?  I'm running some comprehensive
buildworld tests now using /usr/bin/time -l and will post them when they
finish.  These are very fast machines: 26 minute buildworld on -stable!
So I don't expect this to take very long.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



correction... nameileafonly=-1 is 'do not purge dirs on vnode reclaim'. (was cache purge cache for ... cache_purgeleafdirs())

2001-09-23 Thread Matt Dillon


:
:Here is the latest patch for -stable.  vmiodirenable is turned on by
:default, the cache purge code is enabled based on vmiodirenable, and
:I added a new sysctl called nameileafonly which defaults to ON (1).
:
:nameileafonly vmiodirenableaction
:   1   1   (DEFAULT) purge leaf dirs on vnode reclaim
:   0   1   purge any dir on vnode reclaim
:   1   0   purge leaf dirs on vnode reclaim
:   0   0   do not purge dirs on vnode reclaim
:   -1  0   purge any dir on vnode reclaim

Correction.

-1  any do NOT purge dirs on vnode reclaim

Sorry for the mixup.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



cache purge cache for -current (Was Re: Conclusions on... cache_purgeleafdirs())

2001-09-23 Thread Matt Dillon

Here is the latest patch for -current.  vmiodirenable is turned on by
default, the cache purge code is enabled based on vmiodirenable, and
I added a new sysctl called nameileafonly which defaults to ON (1).
The old cache_purgeleafdirs() stuff is #if 0'd out.

nameileafonly vmiodirenable action
1   1   (DEFAULT) purge leaf dirs on vnode reclaim
0   1   purge any dir on vnode reclaim
1   0   purge leaf dirs on vnode reclaim
0   0   do not purge dirs on vnode reclaim
-1  0   purge any dir on vnode reclaim

Index: sys/vnode.h
===
RCS file: /home/ncvs/src/sys/sys/vnode.h,v
retrieving revision 1.157
diff -u -r1.157 vnode.h
--- sys/vnode.h 2001/09/13 22:52:42 1.157
+++ sys/vnode.h 2001/09/23 21:17:23
@@ -559,7 +559,7 @@
struct componentname *cnp));
 void   cache_purge __P((struct vnode *vp));
 void   cache_purgevfs __P((struct mount *mp));
-void   cache_purgeleafdirs __P((int ndir));
+intcache_leaf_test __P((struct vnode *vp));
 void   cvtstat __P((struct stat *st, struct ostat *ost));
 void   cvtnstat __P((struct stat *sb, struct nstat *nsb));
 intgetnewvnode __P((enum vtagtype tag,
Index: kern/vfs_bio.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.288
diff -u -r1.288 vfs_bio.c
--- kern/vfs_bio.c  2001/09/12 08:37:46 1.288
+++ kern/vfs_bio.c  2001/09/23 21:13:02
@@ -90,7 +90,7 @@
  * but the code is intricate enough already.
  */
 vm_page_t bogus_page;
-int vmiodirenable = FALSE;
+int vmiodirenable = TRUE;
 int runningbufspace;
 static vm_offset_t bogus_offset;
 
Index: kern/vfs_cache.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_cache.c,v
retrieving revision 1.61
diff -u -r1.61 vfs_cache.c
--- kern/vfs_cache.c2001/09/12 08:37:46 1.61
+++ kern/vfs_cache.c2001/09/23 21:17:47
@@ -101,8 +101,10 @@
 SYSCTL_ULONG(_debug, OID_AUTO, numcache, CTLFLAG_RD, &numcache, 0, "");
 static u_long  numcachehv; /* number of cache entries with vnodes held */
 SYSCTL_ULONG(_debug, OID_AUTO, numcachehv, CTLFLAG_RD, &numcachehv, 0, "");
+#if 0
 static u_long  numcachepl; /* number of cache purge for leaf entries */
 SYSCTL_ULONG(_debug, OID_AUTO, numcachepl, CTLFLAG_RD, &numcachepl, 0, "");
+#endif
 struct nchstats nchstats;  /* cache effectiveness statistics */
 
 static int doingcache = 1; /* 1 => enable the cache */
@@ -247,6 +249,31 @@
 }
 
 /*
+ * cache_leaf_test()
+ * 
+ *  Test whether this (directory) vnode's namei cache entry contains
+ *  subdirectories or not.  Used to determine whether the directory is
+ *  a leaf in the namei cache or not.  Note: the directory may still   
+ *  contain files in the namei cache.
+ *
+ *  Returns 0 if the directory is a leaf, -1 if it isn't.
+ */
+int
+cache_leaf_test(struct vnode *vp)
+{
+   struct namecache *ncpc;
+
+   for (ncpc = LIST_FIRST(&vp->v_cache_src);
+ncpc != NULL;
+ncpc = LIST_NEXT(ncpc, nc_src)
+   ) {
+   if (ncpc->nc_vp != NULL && ncpc->nc_vp->v_type == VDIR)
+   return(0);
+   }
+   return(-1);
+}
+
+/*
  * Lookup an entry in the cache
  *
  * We don't do this if the segment name is long, simply so the cache
@@ -499,6 +526,8 @@
}
 }
 
+#if 0
+
 /*
  * Flush all dirctory entries with no child directories held in
  * the cache.
@@ -554,6 +583,8 @@
}
numcachepl++;
 }
+
+#endif
 
 /*
  * Perform canonical checks and cache lookup and pass on to filesystem
Index: kern/vfs_subr.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.319
diff -u -r1.319 vfs_subr.c
--- kern/vfs_subr.c 2001/09/12 08:37:47 1.319
+++ kern/vfs_subr.c 2001/09/23 21:19:26
@@ -110,6 +110,8 @@
 /* Number of vnodes in the free list. */
 static u_long freevnodes = 0;
 SYSCTL_LONG(_debug, OID_AUTO, freevnodes, CTLFLAG_RD, &freevnodes, 0, "");
+
+#if 0
 /* Number of vnode allocation. */
 static u_long vnodeallocs = 0;
 SYSCTL_LONG(_debug, OID_AUTO, vnodeallocs, CTLFLAG_RD, &vnodeallocs, 0, "");
@@ -125,6 +127,7 @@
 /* Number of vnodes attempted to recycle at a time. */
 static u_long vnoderecyclenumber = 3000;
 SYSCTL_LONG(_debug, OID_AUTO, vnoderecyclenumber, CTLFLAG_RW, &vnoderecyclenumber, 0, 
"");
+#endif
 
 /*
  * Various variables used for debugging the new implementation of
@@ -142,6 +145,8 @@
 /* Set to 0 for old insertion-sort based reassignbuf, 1 for modern method. */
 static int reassignbufmethod = 1;
 SYSCTL_INT(_vfs, OID_AUTO, reassignbufmethod, CTLFLAG_RW, &reassignbufmethod, 0, "");
+static int nameileafonly = 1;
+SYSCTL_INT(_vfs, OID_AUT

cache purge cache for -stable (Was Re: Conclusions on... cache_purgeleafdirs())

2001-09-23 Thread Matt Dillon

Here is the latest patch for -stable.  vmiodirenable is turned on by
default, the cache purge code is enabled based on vmiodirenable, and
I added a new sysctl called nameileafonly which defaults to ON (1).

nameileafonly vmiodirenable action
1   1   (DEFAULT) purge leaf dirs on vnode reclaim
0   1   purge any dir on vnode reclaim
1   0   purge leaf dirs on vnode reclaim
0   0   do not purge dirs on vnode reclaim
-1  0   purge any dir on vnode reclaim

Index: sys/vnode.h
===
RCS file: /home/ncvs/src/sys/sys/vnode.h,v
retrieving revision 1.111.2.12
diff -u -r1.111.2.12 vnode.h
--- sys/vnode.h 2001/09/22 09:21:48 1.111.2.12
+++ sys/vnode.h 2001/09/23 21:18:00
@@ -550,6 +550,7 @@
struct componentname *cnp));
 void   cache_purge __P((struct vnode *vp));
 void   cache_purgevfs __P((struct mount *mp));
+intcache_leaf_test __P((struct vnode *vp));
 void   cvtstat __P((struct stat *st, struct ostat *ost));
 void   cvtnstat __P((struct stat *sb, struct nstat *nsb));
 intgetnewvnode __P((enum vtagtype tag,
Index: kern/vfs_bio.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.242.2.9
diff -u -r1.242.2.9 vfs_bio.c
--- kern/vfs_bio.c  2001/06/03 05:00:09 1.242.2.9
+++ kern/vfs_bio.c  2001/09/23 20:24:47
@@ -82,7 +82,7 @@
  * but the code is intricate enough already.
  */
 vm_page_t bogus_page;
-int vmiodirenable = FALSE;
+int vmiodirenable = TRUE;
 int runningbufspace;
 static vm_offset_t bogus_offset;
 
Index: kern/vfs_cache.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_cache.c,v
retrieving revision 1.42.2.4
diff -u -r1.42.2.4 vfs_cache.c
--- kern/vfs_cache.c2001/03/21 10:50:58 1.42.2.4
+++ kern/vfs_cache.c2001/09/23 21:18:24
@@ -405,6 +405,31 @@
 }
 
 /*
+ * cache_leaf_test()
+ *
+ * Test whether this (directory) vnode's namei cache entry contains
+ * subdirectories or not.  Used to determine whether the directory is
+ * a leaf in the namei cache or not.  Note: the directory may still
+ * contain files in the namei cache.
+ *
+ * Returns 0 if the directory is a leaf, -1 if it isn't.
+ */
+int
+cache_leaf_test(struct vnode *vp)
+{
+   struct namecache *ncpc;
+
+   for (ncpc = LIST_FIRST(&vp->v_cache_src);
+ncpc != NULL;
+ncpc = LIST_NEXT(ncpc, nc_src)
+   ) {
+   if (ncpc->nc_vp != NULL && ncpc->nc_vp->v_type == VDIR)
+   return(0);
+   }
+   return(-1);
+}
+
+/*
  * Perform canonical checks and cache lookup and pass on to filesystem
  * through the vop_cachedlookup only if needed.
  */
Index: kern/vfs_subr.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.249.2.11
diff -u -r1.249.2.11 vfs_subr.c
--- kern/vfs_subr.c 2001/09/11 09:49:53 1.249.2.11
+++ kern/vfs_subr.c 2001/09/23 21:18:20
@@ -113,6 +113,8 @@
 SYSCTL_INT(_vfs, OID_AUTO, reassignbufsortbad, CTLFLAG_RW, &reassignbufsortbad, 0, 
"");
 static int reassignbufmethod = 1;
 SYSCTL_INT(_vfs, OID_AUTO, reassignbufmethod, CTLFLAG_RW, &reassignbufmethod, 0, "");
+static int nameileafonly = 1;
+SYSCTL_INT(_vfs, OID_AUTO, nameileafonly, CTLFLAG_RW, &nameileafonly, 0, "");
 
 #ifdef ENABLE_VFS_IOOPT
 int vfs_ioopt = 0;
@@ -506,13 +508,32 @@
TAILQ_REMOVE(&vnode_free_list, vp, v_freelist);
TAILQ_INSERT_TAIL(&vnode_tmp_list, vp, v_freelist);
continue;
-   } else if (LIST_FIRST(&vp->v_cache_src)) {
-   /* Don't recycle if active in the namecache */
-   simple_unlock(&vp->v_interlock);
-   continue;
-   } else {
-   break;
}
+   if (LIST_FIRST(&vp->v_cache_src)) {
+   /* 
+* If nameileafonly is set, do not throw
+* away directory vnodes unless they are
+* leaf nodes in the namecache.
+*
+* If nameileafonly is not set then throw-aways
+* are based on vmiodirenable.  If
+* vmiodirenable is turned off we do not throw
+* away directory vnodes active in the
+* namecache.  The (nameileafonly < 0) test
+* is for further debugging only.
+*/

Re: Conclusions on... was Re: More on the cache_purgeleafdirs() routine

2001-09-23 Thread Matt Dillon


:Has the problem of small-memory machines (< 64M IIRC) solved now? As I
:understand it vmiodirenable is counter-productive for these boxes. 
:Maybe one could decide on-boot whether the amount of mem is enough to 
:make it useful?
:
:Just a thought of course.
:
:|   / o / /_  _email:  [EMAIL PROTECTED]

Small memory machines never had a problem.  Even though there can be
considerable memory inefficiencies using vmiodirenable (e.g. a directory
less than 512 bytes eats 512 bytes of physical ram with vmiodirenable
turned off, and 4K with it turned on), there are two compensating factors:
First, the VM Paging cache is a cache, so the cached directory blocks can
be thrown away.  Second, the VM Page cache has all of memory to play
with while the buffer cache with vmiodirenable turned off on a 64MB
machine will reserve less than a megabyte to cache directories.  #2 is
important because it leads to more I/O which has a far greater effect
on the system than memory waste.  Also, if you look at the typical
program's memory footprint and assume that actively accessed directories
eat 4K, the memory used to hold the active directories winds up being
almost nothing compared to the RSS of the program using those directories.
Single-use directories (e.g. make buildworld) are recycled in the VM Page
cache and do not eat as much memory as you might otherwise think.

So even though the memory inefficiency can be up to 8:1 (4096/512),
this factor is somewhat deceptive in regards to the actual effect on
the system.
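
A hedged back-of-the-envelope for that 8:1 figure (the 50,000 directory count
below is a made-up example, not a measurement):

    #include <stdio.h>

    int
    main(void)
    {
        long ndirs = 50000;     /* hypothetical small (<512 byte) directories */

        printf("malloc-backed (vmiodirenable=0): %ld KB\n", ndirs * 512 / 1024);
        printf("VMIO-backed   (vmiodirenable=1): %ld KB\n", ndirs * 4096 / 1024);
        /*
         * Prints 25000 KB vs 200000 KB -- but the VMIO pages are reclaimable
         * cache backed by all of memory, while the malloc-backed copies are
         * capped far below even the smaller figure, which is what forces the
         * extra I/O described above.
         */
        return (0);
    }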

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Conclusions on... was Re: More on the cache_purgeleafdirs() routine

2001-09-23 Thread Matt Dillon


:Block input operations is the one notable exception and it tells a
:very interesting story: Matts patch results in a 4% increase, but
:combined with vmdirioenable it results in a 21.5% decrease.
:
:That's pretty darn significant: one out of every five I/O have
:been saved.
:
:The reason it has not manifested itself in the "real" number is
:probably the high degree of parallelism in the task which practically
:ensures that the CPU will not go idle.
:
:I suggest we let Matt's patch depend on the vmiodirenable sysctl
:and change the default for that.
:
:If there are no bad side effects found in the next couple of months,
:then kill the sysctl and lets be done with it.
:
:Poul-Henning

Very interesting!  I agree completely.  I will instrument the code
to make the namei cache check conditional on vmiodirenable.  I will
also add a sysctl to change the check condition to test for leaf
directories or not (closer to what Seigo's code accomplished) so we can
test for differences there.  If we can get that 4% down the code will
be more acceptable.

The results make a lot of sense, especially if the machine has sufficient
memory to cache most of the source tree.  Increasing the number of
vnode reclaims will increase I/O on directories, especially for something
like a buildworld which is doing major path expansions up the wazoo.
Turning on vmiodirenable in a buildworld situation will decrease I/O 
significantly.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Conclusions on... was Re: More on the cache_purgeleafdirs() routine

2001-09-23 Thread Matt Dillon


:>VM Page Cache, and thus not be candidates for reuse anyway.  So my patch
:>has a very similar effect but without the overhead.
:
:Back when I rewrote the VFS namecache back in 1997 I added that
:clause because I saw directories getting nuked in no time because
:there were no pages holding on to them (device nodes were even worse!)
:
:So refresh my memory here, does directories get pages cached in VM if
:you have vfs.vmiodirenable=0 ?  
:
:What about !UFS filesystems ?  Do they show a performance difference ?
:
:Also, don't forget that if the VM system gave preferential caching to
:directory pages, we wouldn't need the VFS-cache very much in the first
:place...
:
:-- 
:Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20

Ah yes, vmiodirenable.  We should just turn it on by default now.  I've
been waffling too long on that.  With it off the buffer cache will 
remember at most vfs.maxmallocspace worth of directory data (read: not
very much), and without VMIO backing, which means vnodes could be
reclaimed immediately.  Ah!  Now I see why that clause was put
in... but it's obsolete now if vmiodirenable is turned on, and it
doesn't scale well to large-memory machines if it is left in.

If we turn vmiodirenable on then directory blocks get cached by the 
VM system.  There is no preferential treatment of directory blocks
but there doesn't need to be, the VM system does a very good job figuring
out which blocks to keep and which not to.

vfs.vmiodirenable=0 works well for small lightly loaded systems but
doesn't scale at all.  vfs.vmiodirenable=1 works well for any sized 
system, even though there are considerable storage inefficiencies with
small directories, because the VM Page algorithms compensate (and
scale).  Small systems with fewer directories don't see the vnode 
scaling problem because there are simply not enough directories to
saturate the vnode/inode malloc areas.  Large systems with a greater
number of directories blow up the vnode/inode malloc space.

I'll run some buildworld tests tomorrow.  Er, later today.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Conclusions on... was Re: More on the cache_purgeleafdirs() routine

2001-09-23 Thread Matt Dillon

:   Hmmm. This would seem to be a step back to the days when caching was done
:relative to the device as opposed to the file-relative scheme we have now.
:One of the problems with the old scheme as I recall is that some filesystems
:like NFS don't have a 'device' and thus no physical block numbers to
:associate the cached pages with. There is also some cost in moving the pages
:between the file object and the device object. For these reasons, I would
:prefer that we keep the existing model, but just make sure that we can
:handle the degenerate case of one page per file object.
:
:-DG
:
:David Greenman

Yah.  For NFS we'd have to just throw the pages away.  I'm not changing
anything any time soon, I've got lots of other interesting stuff on my
plate.  It's just an idea.  The filesystem already has a VM Object
that it uses for inodes and bitmap blocks and such, this would simply
be a way to leverage it.  Ultimately it may not matter... perhaps we
can devise a way of hanging onto the inode/vnode reference without
eating up so much KVM.  For example, the VM Object could be associated
with the filesystem inode cache at the filesystem level.  Or something
like that.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Conclusions on... was Re: More on the cache_purgeleafdirs() routine

2001-09-23 Thread Matt Dillon


:>Well, this has turned into a rather sticky little problem.  I've
:>spent all day going through the vnode/name-cache reclaim code, looking
:>both at Seigo's cache_purgeleafdirs() and my own patch.
:
:   Can you forward me your patch? I'd like to try it out on some machines in
:the TSI lab.

Absolutely, any and all testing is good.

Here's the patch for stable and current.  Stable first:

Index: kern/vfs_subr.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.249.2.11
diff -u -r1.249.2.11 vfs_subr.c
--- kern/vfs_subr.c 2001/09/11 09:49:53 1.249.2.11
+++ kern/vfs_subr.c 2001/09/23 07:33:51
@@ -506,10 +506,12 @@
TAILQ_REMOVE(&vnode_free_list, vp, v_freelist);
TAILQ_INSERT_TAIL(&vnode_tmp_list, vp, v_freelist);
continue;
+#if 0
} else if (LIST_FIRST(&vp->v_cache_src)) {
/* Don't recycle if active in the namecache */
simple_unlock(&vp->v_interlock);
continue;
+#endif
} else {
break;
}


 And here is the patch for current:

Index: kern/vfs_cache.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_cache.c,v
retrieving revision 1.61
diff -u -r1.61 vfs_cache.c
--- kern/vfs_cache.c2001/09/12 08:37:46 1.61
+++ kern/vfs_cache.c2001/09/23 07:27:05
@@ -101,8 +101,10 @@
 SYSCTL_ULONG(_debug, OID_AUTO, numcache, CTLFLAG_RD, &numcache, 0, "");
 static u_long  numcachehv; /* number of cache entries with vnodes held */
 SYSCTL_ULONG(_debug, OID_AUTO, numcachehv, CTLFLAG_RD, &numcachehv, 0, "");
+#if 0
 static u_long  numcachepl; /* number of cache purge for leaf entries */
 SYSCTL_ULONG(_debug, OID_AUTO, numcachepl, CTLFLAG_RD, &numcachepl, 0, "");
+#endif
 struct nchstats nchstats;  /* cache effectiveness statistics */
 
 static int doingcache = 1; /* 1 => enable the cache */
@@ -499,6 +501,8 @@
}
 }
 
+#if 0
+
 /*
  * Flush all dirctory entries with no child directories held in
  * the cache.
@@ -554,6 +558,8 @@
}
numcachepl++;
 }
+
+#endif
 
 /*
  * Perform canonical checks and cache lookup and pass on to filesystem
Index: kern/vfs_subr.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.319
diff -u -r1.319 vfs_subr.c
--- kern/vfs_subr.c 2001/09/12 08:37:47 1.319
+++ kern/vfs_subr.c 2001/09/22 20:15:11
@@ -110,6 +110,8 @@
 /* Number of vnodes in the free list. */
 static u_long freevnodes = 0;
 SYSCTL_LONG(_debug, OID_AUTO, freevnodes, CTLFLAG_RD, &freevnodes, 0, "");
+
+#if 0
 /* Number of vnode allocation. */
 static u_long vnodeallocs = 0;
 SYSCTL_LONG(_debug, OID_AUTO, vnodeallocs, CTLFLAG_RD, &vnodeallocs, 0, "");
@@ -125,6 +127,7 @@
 /* Number of vnodes attempted to recycle at a time. */
 static u_long vnoderecyclenumber = 3000;
 SYSCTL_LONG(_debug, OID_AUTO, vnoderecyclenumber, CTLFLAG_RW, &vnoderecyclenumber, 0, 
"");
+#endif
 
 /*
  * Various variables used for debugging the new implementation of
@@ -556,8 +559,13 @@
 * Don't recycle if active in the namecache or
 * if it still has cached pages or we cannot get
 * its interlock.
+*
+* XXX the namei cache can hold onto vnodes too long,
+* causing us to run out of MALLOC space.  Instead, we 
+* should make path lookups requeue any vnodes on the free
+* list.
 */
-   if (LIST_FIRST(&vp->v_cache_src) != NULL ||
+   if (/* LIST_FIRST(&vp->v_cache_src) != NULL || */
(VOP_GETVOBJECT(vp, &object) == 0 &&
 (object->resident_page_count || object->ref_count)) ||
!mtx_trylock(&vp->v_interlock)) {
@@ -636,6 +644,7 @@
 
vfs_object_create(vp, td, td->td_proc->p_ucred);
 
+#if 0
vnodeallocs++;
if (vnodeallocs % vnoderecycleperiod == 0 &&
freevnodes < vnoderecycleminfreevn &&
@@ -643,6 +652,7 @@
/* Recycle vnodes. */
cache_purgeleafdirs(vnoderecyclenumber);
}
+#endif
 
return (0);
 }
Index: sys/vnode.h
===
RCS file: /home/ncvs/src/sys/sys/vnode.h,v
retrieving revision 1.157
diff -u -r1.157 vnode.h
--- sys/vnode.h 2001/09/13 22:52:42 1.157
+++ sys/vnode.h 2001/09/23 07:26:54
@@ -559,7 +559,6 @@
struct componentname *cnp));
 void   cache_purge __P((struct vnode *vp));
 void   cache_purgevfs __P((struct mount *mp));
-void   cache_purgeleafdirs __P((

Conclusions on... was Re: More on the cache_purgeleafdirs() routine

2001-09-22 Thread Matt Dillon

Well, this has turned into a rather sticky little problem.  I've
spent all day going through the vnode/name-cache reclaim code, looking
both at Seigo's cache_purgeleafdirs() and my own patch.

This is what is going on:  The old code refused to reuse any vnode that
had (A) Cached VM Pages associated with it *AND* (B) refused to reuse
directory vnodes residing in the namei cache that contained any 
subdirectory or file.  (B) does not apply to file vnodes since they
obviously cannot have subdirectories or files 'under' them in the namei
cache.  The problem is that when you take the union of (A) and (B),
just about every directory vnode in the system winds up being immune
from reclamation.  Thus directory vnodes appear to grow forever... so
it isn't just the fact that most directories are small that is the
problem, it's the presence of (B).  This is why small files don't cause
the same problem (or at least do not cause it to the same degree).
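
A hedged sketch of that old candidate test, condensed from the getnewvnode()
logic shown in the patches further down (locking details omitted), rather than
a literal copy:

        /*
         * Skip a free-list vnode while it still has resident or referenced
         * pages (A), or while the namei cache still has entries sourced at
         * it (B).  Files rarely trip (B); directories almost always do,
         * which produces the union effect described above.
         */
        if (LIST_FIRST(&vp->v_cache_src) != NULL ||                 /* (B) */
            (VOP_GETVOBJECT(vp, &object) == 0 &&
             (object->resident_page_count || object->ref_count))) { /* (A) */
                continue;               /* leave it cached; try the next one */
        }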

Both Seigo's cache_purgeleafdirs() and my simpler patch simply remove
the (B) requirement, making directory reclamation work approximately
the same as file reclamation.  The only difference between Seigo's
patch and mine is that Seigo's makes an effort to remove directories
intelligently... it tries to avoid removing higher level directories.
My patch doesn't make a distinction but assumes that (A) will tend to
hold for higher level directories: that is, that higher level directories
tend to be accessed more often and thus will tend to have pages in the 
VM Page Cache, and thus not be candidates for reuse anyway.  So my patch
has a very similar effect but without the overhead.

In all the testing I've done I cannot perceive any performance difference
between Seigo's patch and mine, but from an algorithmic point of view
mine ought to scale much, much better.   Even if we adjust 
cache_purgeleafdirs() to run even less often, we still run up against
the fact that the scanning algorithm is O(N*M) and we know from history
that this can create serious breakage.

People may recall that we had similar problems with the VM Pageout 
daemon, where under certain load conditions the pageout daemon wound
up running continuously, eating enormous amounts of cpu.  We lived with
the problem for years because the scaling issues didn't rear their
heads until machines got hefty enough to have enough pages for the
algorithms to break down.

People may also recall that we had similar problems with the buffer
cache code specifically, the scan 'restart' conditions could
break down algorithmically and result in massive cpu use by bufdaemon.

I think cache_purgeleafdirs() had the right idea.  From my experience
with the VM system, however, I have to recommend that we remove it
from the system and, at least initially, replace it with my simpler
patch.  We could extend my patch to do the same check -- that is, only
remove directory vnodes at lower levels in the namei cache, simply
by scanning the namei cache list at the vnode in question.  So in fact
it would be possible to adjust my patch to have the same effect that
cache_purgeleafdirs() had, but without the scaling issue (or at least
with less of an issue.. it would be O(N) rather than O(M*N)).

-

The bigger problem is exactly as DG has stated... it isn't the namei
cache that is our enemy, it's the VM Page cache preventing vnodes
from being recycled.

For the moment I believe that cache_purgeleafdirs() or my patch solves
the problem well enough that we can run with it for a while.  The real
solution, I believe, is to give us the ability to take cached VM Pages
associated with a file and rename them to cached VM Pages associated
with the filesystem device - we can do this only for page-aligned
blocks of course, not fragments (which we would simply throw away)...
but it would allow us to reclaim vnodes independent of the VM Page cache
without losing the cached pages.  I think this is doable but it will
require a considerable amount of work.  It isn't something I can do in a
day.  I also believe that this can dovetail quite nicely into the I/O
model that we have slowly been moving towards over the last year
(Poul's work).  Inevitably we will have to manage device-based I/O
on a page-by-page basis and being able to do it via a VM Object seems
to fit the bill in my opinion.

-Matt



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: More on the cache_purgeleafdirs() routine

2001-09-22 Thread Matt Dillon

:>Well, wait a sec... all I do is zap the namei cache for the vnode.  The
:>check to see if the vnode's object still has resident pages is still in
:>there so I don't quite understand how I turned things around.  In my
:>tests it appears to cache vnodes as long as there are resident pages
:>associated with them.
:
:   Sounds like a very good first step. I would like to point out that the
:problem may still occur on large memory systems with a few hundred thousand
:tiny files (that consume just one page of memory). There really needs to
:be a hard limit as well - something low enough so that the FFS node KVM malloc
:limit isn't reached, but still large enough to not significantly pessimize
:the use of otherwise free physical memory for file caching. Considering that
:a 4GB machine has about 1 million pages and that the malloc limit hits at
:about 250,000 vnodes, this is an impossible goal to achieve in that case
:without increasing the malloc limit by at least 4X. Of course this many
:1 page files is extremely rare, however, and I don't think we should optimize
:for it.
:
:-DG

Yes, it's easy to hit 250,000 vnodes.  On a 512M machine I can hit 80,000
vnodes without breaking a sweat.  The reason is simply due to the way 
the VM Page cache works... even medium sized files can wind up with
just one page in core after a while.  It does stabilize, but at a level
that is too high for comfort.

On -stable we've exacerbated the problem because inode->i_dirhash
increased the size of struct inode to 260 bytes, causing it to use a 512
byte MALLOC chunk (FFS node malloc type) instead of a 256 byte chunk. 
I'm working with Ian to get struct inode back down to 256 bytes (It isn't
a problem in -current). Unfortunately this introduced additional issues 
in our 4.4 release for people with > 2GB of ram and at least one person
is seeing system deadlocks from it.
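
A hedged illustration of that bucket effect (kernel malloc rounds a request up
to a power-of-two chunk, so 4 extra bytes doubles the per-inode allocation):

    #include <stdio.h>

    static unsigned
    roundup_pow2(unsigned n)
    {
        unsigned r = 1;

        while (r < n)
            r <<= 1;
        return (r);
    }

    int
    main(void)
    {
        printf("256 -> %u\n", roundup_pow2(256));   /* 256-byte chunk */
        printf("260 -> %u\n", roundup_pow2(260));   /* 512-byte chunk */
        return (0);
    }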

I think it would be fairly easy to release VM pages underlying objects
associated with vnodes being reused when we determine that we have too
many vnodes.  We would have to be careful, because what Poul said really
would come true if we blindly released VM pages based on the vnode free
list!  Alternatively we might be able to adjust the VM paging system to
release nearby VM pages in idle VM objects, getting them cleared out more
quickly and allowing their vnodes to be reused without compromising 
the VM system's excellent page selection code.  At least then it would
truly take a million 'small' files to create an issue.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: More on the cache_purgeleafdirs() routine

2001-09-22 Thread Matt Dillon

:I agree, I've never been too fond of the purgeleafdirs() code myself
:for that reason and others.
:
:If we disregard the purgeleafdirs() workaround, the current cache code
:was built around the assumption that VM page reclaims would be enough
:to keep the vnode cache flushed and any vnode which could be potentially
:useful was kept around until it wasn't.
:
:Your patch changes this to the opposite: we kill vnodes as soon as
:possible, and pick them off the freelist next time we hit them,
:if they survive that long.
:
:I think that more or less neuters the vfs cache for anything but
:open files, which I think is not in general an optimal solution
:either.
:
:I still lean towards finding a dynamic limit on the number of vnodes
:and have the cache operation act accordingly as the least generally
:lousy algorithm we can employ.
:
:Either way, I think that we should not replace the current code with
:a new algorithm until we have some solid data for it, it is a complex
:interrelationship and some serious benchmarking is needed before we
:can know what to do.
:
:In particular we need to know:
:
:   What ratio of directories are reused as a function of
:   the number of children they have in the cache.
:
:   What ratio of files are reused as a function of them
:   being open or not.
:
:   What ratio of files are being reused as a function of
:   the number of pages they have in-core.
:
:-- 
:Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
:[EMAIL PROTECTED] | TCP/IP since RFC 956

Well, wait a sec... all I do is zap the namei cache for the vnode.  The
check to see if the vnode's object still has resident pages is still in
there so I don't quite understand how I turned things around.  In my
tests it appears to cache vnodes as long as there are resident pages
associated with them.

We could also throw a flag into the namei structure for a sub-directory
count and only blow leaf nodes - which would be roughly equivalent to
what the existing code does except without the overhead.  But I don't
think it is necessary.

In regards to directory reuse... well, if you thought it was complex
before consider how complex it is now with the ufs dirhash code.  Still,
the core of directory caching is still the buffer cache and if 
vfs.vmiodirenable is turned on, it becomes the VM Page cache which is
already fairly optimal.  This alone will prevent actively used directories
from being blown out.

And, in fact, in the tests I've done so far the system still merrily
caches tens of thousands of vnodes while doing a 'tar cf /dev/null /'.
I'm setting up a postfix test too.  As far as I can tell, my changes
have had no effect on namei cache efficiency other than eating less cpu.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



More on the cache_purgeleafdirs() routine

2001-09-22 Thread Matt Dillon


Hi guys.  I've been tracking down issues with vnode recycling.. 
specifically, getnewvnode() deadlocks in -stable in large-memory
configurations.  I took a real good look at cache_purgeleafdirs() 
in -current to see if it was MFCable as a solution.  But

There are a number of issues... well, there is really one big issue, and
that is the simple fact that there can be upwards of 260,000+ entries
in the name cache and cache_purgeleafdirs() doesn't scale.  It is an
O(N*M) algorithm.  Any system that requires a great deal of vnode
recycling -- for example Yahoo's userbase lookup (one file per userid)
would be terribly impacted by this algorithm.

It seems to me that the best way to deal with this is to simply have
getnewvnode() zap the namei-cache, and have vfs_lookup() requeue
(for LRU purposes) any namei cache associated vnodes that are on the
freelist.  The patch is much less complex... here's is a preliminary
patch (without the LRU requeueing, I haven't gotten to that yet, and 
I haven't completely removed the non-scaleable recycling code).

This code seems to work quite well even without the LRU requeueing,
especially if you turn on vfs.vmiodirenable.  I don't see any reason
to be totally rabid about keeping top level directory entries in-core...
at least not at the cost of the current algorithm in -current.  We 
want to try to keep them in core, but I believe that LRU requeueing
in vfs_lookup() may be sufficient for that... or possibly even something
as simple as a flag in the namei caching structure that guarantees the
first N directory levels are left in-core.  Or something like that. 
But not a complete scan of the namei cache!

-Matt

Index: kern/vfs_cache.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_cache.c,v
retrieving revision 1.61
diff -u -r1.61 vfs_cache.c
--- kern/vfs_cache.c2001/09/12 08:37:46 1.61
+++ kern/vfs_cache.c2001/09/22 20:13:47
@@ -101,8 +101,10 @@
 SYSCTL_ULONG(_debug, OID_AUTO, numcache, CTLFLAG_RD, &numcache, 0, "");
 static u_long  numcachehv; /* number of cache entries with vnodes held */
 SYSCTL_ULONG(_debug, OID_AUTO, numcachehv, CTLFLAG_RD, &numcachehv, 0, "");
+#if 0
 static u_long  numcachepl; /* number of cache purge for leaf entries */
 SYSCTL_ULONG(_debug, OID_AUTO, numcachepl, CTLFLAG_RD, &numcachepl, 0, "");
+#endif
 struct nchstats nchstats;  /* cache effectiveness statistics */
 
 static int doingcache = 1; /* 1 => enable the cache */
@@ -476,6 +478,20 @@
 }
 
 /*
+ * Flush the namei cache references associated with a vnode.
+ * The vnode remains valid.
+ */
+void
+cache_flush(vp)
+   struct vnode *vp;
+{
+   while (!LIST_EMPTY(&vp->v_cache_src)) 
+   cache_zap(LIST_FIRST(&vp->v_cache_src));
+   while (!TAILQ_EMPTY(&vp->v_cache_dst)) 
+   cache_zap(TAILQ_FIRST(&vp->v_cache_dst));
+}
+
+/*
  * Flush all entries referencing a particular filesystem.
  *
  * Since we need to check it anyway, we will flush all the invalid
@@ -499,6 +515,8 @@
}
 }
 
+#if 0
+
 /*
  * Flush all dirctory entries with no child directories held in
  * the cache.
@@ -554,6 +572,8 @@
}
numcachepl++;
 }
+
+#endif
 
 /*
  * Perform canonical checks and cache lookup and pass on to filesystem
Index: kern/vfs_subr.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.319
diff -u -r1.319 vfs_subr.c
--- kern/vfs_subr.c 2001/09/12 08:37:47 1.319
+++ kern/vfs_subr.c 2001/09/22 20:15:11
@@ -110,6 +110,8 @@
 /* Number of vnodes in the free list. */
 static u_long freevnodes = 0;
 SYSCTL_LONG(_debug, OID_AUTO, freevnodes, CTLFLAG_RD, &freevnodes, 0, "");
+
+#if 0
 /* Number of vnode allocation. */
 static u_long vnodeallocs = 0;
 SYSCTL_LONG(_debug, OID_AUTO, vnodeallocs, CTLFLAG_RD, &vnodeallocs, 0, "");
@@ -125,6 +127,7 @@
 /* Number of vnodes attempted to recycle at a time. */
 static u_long vnoderecyclenumber = 3000;
 SYSCTL_LONG(_debug, OID_AUTO, vnoderecyclenumber, CTLFLAG_RW, &vnoderecyclenumber, 0, 
"");
+#endif
 
 /*
  * Various variables used for debugging the new implementation of
@@ -556,8 +559,13 @@
 * Don't recycle if active in the namecache or
 * if it still has cached pages or we cannot get
 * its interlock.
+*
+* XXX the namei cache can hold onto vnodes too long,
+* causing us to run out of MALLOC space.  Instead, we 
+* should make path lookups requeue any vnodes on the free
+* list.
 */
-   if (LIST_FIRST(&vp->v_cache_src) != NULL ||
+   if (/* LIST_FIRST(&vp->v_cache_src) != NULL || */
 

Re: Proposed patch (was Re: bug in sshd - signal during free())

2001-09-17 Thread Matt Dillon

   I forwarded the whole thing to Brian.  We'll wait to see what he decides
   to do.  Obviously a fix like this needs to go, it's just a matter of who,
   how, and when.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: bug in sshd - signal during free()

2001-09-17 Thread Matt Dillon


:
:* Matt Dillon <[EMAIL PROTECTED]> [010917 15:32] wrote:
:> sshd died on one of our machines today.  The traceback seems to 
:> indicate that a signal is interrupting a free().  I'm going to 
:> play with the code a bit to see if there's an easy fix.
:> 
:> This bug can't occur very often... the key regeneration signal
:> has to occur *just* as sshd is trying to free() something.
:
:The bug seems more likely to be caused by use of unsafe functions
:in a signal handler.
:
:I'm really suprised that the OpenSSH team didn't slap whomever decided
:to do so much processing within a signal handler silly.

It's funny... they had an XXX comment in there so obviously someone
was a little jittery about it.  I think they just didn't realize that
a malloc() might occur inside the signal handler or they would have
fixed it long ago.
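
A minimal hedged sketch of the async-signal-safe pattern being discussed
(illustrative only, not the sshd code; the actual fix is in the patch posted
separately): the handler just records the event, and the malloc()/log()-using
work runs in the main loop.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t regen_pending;

    static void
    alarm_handler(int sig)
    {
        (void)sig;
        regen_pending = 1;      /* record the event; no malloc(), no stdio here */
    }

    int
    main(void)
    {
        signal(SIGALRM, alarm_handler);
        alarm(1);
        for (;;) {
            if (regen_pending) {
                regen_pending = 0;
                printf("regenerating key (outside the handler)\n");
                alarm(1);       /* reschedule, as the real fix does */
            }
            sleep(1);           /* stand-in for sshd's select() loop */
        }
    }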

UNIX signals suck.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Proposed patch (was Re: bug in sshd - signal during free())

2001-09-17 Thread Matt Dillon

I looked at the code and there is definitely a serious issue.  This
proposed patch should solve the problem.  Here it is for review before
I commit it and send a bug report off to the openssh folks.  I am testing
it now.

-Matt

Index: sshd.c
===
RCS file: /home/ncvs/src/crypto/openssh/sshd.c,v
retrieving revision 1.6.2.7
diff -u -r1.6.2.7 sshd.c
--- sshd.c  2001/03/04 15:13:08 1.6.2.7
+++ sshd.c  2001/09/17 20:45:54
@@ -134,6 +134,11 @@
 char *server_version_string = NULL;
 
 /*
+ * Indicates that a key-regeneration alarm occured.
+ */
+int received_regeneration;
+
+/*
  * Any really sensitive data in the application is contained in this
  * structure. The idea is that this structure could be locked into memory so
  * that the pages do not get written into swap.  However, there are some
@@ -260,19 +265,26 @@
fatal("Timeout before authentication for %s.", get_remote_ipaddr());
 }
 
-/*
- * Signal handler for the key regeneration alarm.  Note that this
- * alarm only occurs in the daemon waiting for connections, and it does not
- * do anything with the private key or random state before forking.
- * Thus there should be no concurrency control/asynchronous execution
- * problems.
- */
-/* XXX do we really want this work to be done in a signal handler ? -m */
 void
 key_regeneration_alarm(int sig)
 {
-   int save_errno = errno;
+   received_regeneration = 1;
+   /* Reschedule the alarm. */
+   signal(SIGALRM, key_regeneration_alarm);
+   alarm(options.key_regeneration_time);
+}
 
+/*
+ * Regenerate the keys.  Note that this alarm only occurs in the daemon
+ * waiting for connections, and it does not do anything with the
+ * private key or random state before forking.  However, it calls routines
+ * which may malloc() so we do not call this routine directly from the 
+ * signal handler.
+ */
+void
+key_regeneration(void)
+{
+   received_regeneration = 0;
/* Check if we should generate a new key. */
if (key_used) {
/* This should really be done in the background. */
@@ -292,10 +304,6 @@
key_used = 0;
log("RSA key generation complete.");
}
-   /* Reschedule the alarm. */
-   signal(SIGALRM, key_regeneration_alarm);
-   alarm(options.key_regeneration_time);
-   errno = save_errno;
 }
 
 void
@@ -854,6 +862,8 @@
for (;;) {
if (received_sighup)
sighup_restart();
+   if (received_regeneration)
+   key_regeneration();
if (fdset != NULL)
xfree(fdset);
fdsetsz = howmany(maxfd, NFDBITS) * sizeof(fd_mask);
@@ -994,6 +1004,8 @@
 */
alarm(0);
signal(SIGALRM, SIG_DFL);
+   received_regeneration = 0;
+
signal(SIGHUP, SIG_DFL);
signal(SIGTERM, SIG_DFL);
signal(SIGQUIT, SIG_DFL);

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



bug in sshd - signal during free()

2001-09-17 Thread Matt Dillon

sshd died on one of our machines today.  The traceback seems to 
indicate that a signal is interrupting a free().  I'm going to 
play with the code a bit to see if there's an easy fix.

This bug can't occur very often... the key regeneration signal
has to occur *just* as sshd is trying to free() something.

-Matt

(gdb) back
#0  0x28231e34 in kill () from /usr/lib/libc.so.4
#1  0x2826dd8a in abort () from /usr/lib/libc.so.4
#2  0x2826c899 in isatty () from /usr/lib/libc.so.4
#3  0x2826c8cf in isatty () from /usr/lib/libc.so.4
#4  0x2826d907 in malloc () from /usr/lib/libc.so.4
#5  0x2826be58 in __smakebuf () from /usr/lib/libc.so.4
#6  0x2826bdec in __swsetup () from /usr/lib/libc.so.4
#7  0x282663ef in vfprintf () from /usr/lib/libc.so.4
#8  0x28266059 in fprintf () from /usr/lib/libc.so.4
#9  0x2824e0ed in vsyslog () from /usr/lib/libc.so.4
#10 0x2824e009 in syslog () from /usr/lib/libc.so.4
#11 0x804feb3 in do_log ()
#12 0x806ade3 in log ()
#13 0x804c742 in key_regeneration_alarm ()
#14 0xbfbfffac in ?? ()
#15 0x2826da35 in free () from /usr/lib/libc.so.4
#16 0x805f087 in xfree ()
#17 0x804d8be in main ()
#18 0x804c50d in _start ()
(gdb) 


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: cvs commit: src/lib/libatm atm_addr.c cache_key.c ioctl_subr.c ip_addr.c ip_checksum.c timer.c

2001-09-15 Thread Matt Dillon


:What about changing this to __FBSD(), which is what I was using in a
:prototype to reduce the number of characters in the macro name (and thus
:reduce the wrap around).
:
:-- 
:-- David  ([EMAIL PROTECTED])

__FBSD() is too generic for a #define name in what is essentially a 
global header file.  These are rcs id's, so it should be something like
__FBSDID() or __FBSD_ID() or something like that, not just __FBSD().
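
To make the intent concrete, a simplified sketch of how such an RCS-id macro
could look in use; this is only an illustration, not the actual definition
that would go into the global header:

    /* simplified illustration of an RCS-id macro */
    #define __FBSDID(s)     static const char __fbsdid[] = s

    /* at the top of a source file: */
    __FBSDID("$FreeBSD$");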

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: PLEASE REVIEW: loader fix for gzipped kernels

2001-08-29 Thread Matt Dillon


:In article <[EMAIL PROTECTED]>,
:John Baldwin  <[EMAIL PROTECTED]> wrote:
:> 
:> Looks good to me, but I'm only somewhat familiar with libstand. :)
:
:Thanks for taking a look at it.  Matt Dillon also reviewed it and gave
:it a clean bill of health.  He made a suggestion for making the code a
:bit smaller.  I'll incorporate that and then commit it to -current.
:
:John

   I'll give it a quick test after you commit it (I can combine the test
   with some other work I'm doing).

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Possible race in i386/i386/pmap.c:pmap_copy()

2001-08-25 Thread Matt Dillon

:
:On Sat, 25 Aug 2001, Julian Elischer wrote:
:
:> AH yes, it's a race for KSe, but we are 1:1 still so it's not a problem (yet :-)
:> ( at least, not the one that's hitteng me at the moment)
:
:Well, don't get frustrated, look on the bright side of things:  Even if
:you don't have KSE running in the next few weeks, you will have fixed
:*many* bugs in the kernel. :)
:
:Actually, wouldn't this bug also affect linuxthreads operation?  I'm not
:sure if anyone actually uses that, but if they do and were experiencing
:problems, this could explain it.
:
:Mike "Silby" Silbersack

Yes, it could affect linuxthreads.

I think it will be fairly easy to fix - I just have to add a check
after it gets past the blocking condition to see if it needs to reload
the alternative page table.

-

I found the KSE bug that was stumping us last night.  Julian will be
a happy camper when he wakes up!  I found a bunch of minor issues as
well, things like buildworld erroring out in libkvm due to the structural
changes, but I expect he will be able to commit the KSE patchset very
soon now that we've found the bug that was making such a huge mess
of the machine.

It turns out that some assembly in swtch.s got munged and pmap->pm_active
was not getting set, having the side effect of causing most of the TLB
invalidation routines to believe that they didn't have to do anything.
You can imagine the result!

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Possible race in i386/i386/pmap.c:pmap_copy()

2001-08-24 Thread Matt Dillon

:> Hmm.  Ok, I think you are right.  APTDpde is what is being loaded
:> and that points into the user page table directory page, which is
:> per-process.  So APTDpde should be per-process.
:
:But it is!  (sort-of)  APTDpde was per-process but is now per-address-space
:with the advent of fork and RFMEM sharing (and KSE).
:
:When we context switch, PTD goes with the process^H^H^H^Haddress space, and
:APTD is merely mapped by the last entry in the per-process PTD
:(PTD[APTDPDTI] if memory serves correctly).
:
:Cheers,
:-Peter
:--
:Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]

Oh !@#$#@$.. you're right!  That means there *IS* a race, just that it
is a race in the case where you use rfork.  APTDpde can be ripped out
from under one thread by another thread.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: function calls/rets in assembly

2001-08-24 Thread Matt Dillon

You guys are forgetting about the stack-boundary crap some idiot added
to GCC to optimize floating point ops, which gets stuffed in there even
if there are no floating point ops.

I really wish someone would rip it out.  It is SOOO fraggin annoying.

-Matt

cc -S -O -fomit-frame-pointer -mpreferred-stack-boundary=2 x.c

printasint:
pushl 4(%esp)
pushl $.LC0
call printf
addl $8,%esp
ret


cc -S -fomit-frame-pointer -mpreferred-stack-boundary=2 x.c 

printasint:
movl 4(%esp),%eax
pushl %eax
pushl $.LC0
call printf
addl $8,%esp
.L2:
ret


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Possible race in i386/i386/pmap.c:pmap_copy()

2001-08-24 Thread Matt Dillon


:
:Thinking about this a bit more
:doesn't each process have its own PTD? So a process could sleep and
:another could run, but it would have a different PTD,
:so they could change that PTDE with impunity
:because when the current process runs again it gets its own
:PTD back again..

Hmm.  Ok, I think you are right.  APTDpde is what is being loaded
and that points into the user page table directory page, which is
per-process.  So APTDpde should be per-process.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Possible race in i386/i386/pmap.c:pmap_copy()

2001-08-24 Thread Matt Dillon

Alfred, DG, could you take a look at pmap_copy() in i386/i386/pmap.c
and tell me if what I think I'm seeing is what I'm seeing?

My read of this code is that a global, APTDpde, is being set, and
then that pointer is being used in a loop later on in the routine.
the problem is that the pmap_allocpte() call can block and, by my
read, that means some other process can go in and change APTDpde out
from under the loop.

This could also be related to problem Julian has been seeing with his
KSE patch set.

There is a comment:

/*
 * We have to check after allocpte for the
 * pte still being around...  allocpte can
 * block.
 */
dstmpte = pmap_allocpte(dst_pmap, addr);
if ((*dst_pte == 0) && (ptetemp = *src_pte)) {
/*
 * Clear the modified and
...

But I do not believe this check is sufficient if APTDpde gets ripped
out from under the loop.  Is this race real or am I blowing smoke?

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: mmap MAP_INHERIT question.

2001-08-24 Thread Matt Dillon


:>
:> MAP_INHERIT   This is supposed to permit regions to be
:>   inherited across execve(2) system calls,
:>   but is currently broken.
:
:   Support for the flag and reference to it in the manpage should just be
:removed.
:
:-DG
:
:David Greenman

Yah, I agree.  Even if we implemented it, it would be a massive security
hole.  A MAP_SHARED mmap() is easier.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: mmap MAP_INHERIT question.

2001-08-23 Thread Matt Dillon

   MAP_INHERIT is broken and always has been.

-Matt

:exec gives you an new vm space..
:inherrit only applies to forks
:
:
:On Thu, 23 Aug 2001, Alfred Perlstein wrote:
:
:> * Bernd Walter <[EMAIL PROTECTED]> [010823 06:16] wrote:
:> > I do the following:
:> > buf = (char*)mmap(NULL, BUFSIZE, PROT_READ | PROT_WRITE,
:> >MAP_ANON | MAP_INHERIT | MAP_SHARED, -1, 0);
:> > 
:> > Now I vfork/execve a child.
:> > But the child can't access the mmaped memory.
:> > It was my understanding that MAP_INHERIT | MAP_SHARED keep the memory
:> > over the execve.
:> 
:> Without sample code this is impossible to explain.
:> 
:> -- 
:> -Alfred Perlstein [[EMAIL PROTECTED]]
:> Ok, who wrote this damn function called '??'?
:> And why do my programs keep crashing in it?
:> 
:> To Unsubscribe: send mail to [EMAIL PROTECTED]
:> with "unsubscribe freebsd-hackers" in the body of the message
:> 
:
:
:To Unsubscribe: send mail to [EMAIL PROTECTED]
:with "unsubscribe freebsd-hackers" in the body of the message
:


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: ssh password cracker - now this *is* cool!

2001-08-22 Thread Matt Dillon


:
:* Matt Dillon <[EMAIL PROTECTED]> [010822 18:30] wrote:
:> This gets an 'A' on my cool-o-meter.
:> 
:>  http://www.vnunet.com/News/1124839
:
:Interesting, I guess one could work around it by periodically
:sending bogus empty packets in the middle of activity.
:
:-- 
:-Alfred Perlstein [[EMAIL PROTECTED]]

Yah, and typing backspaces also ought to work.  12345bb45bb45678b8

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



ssh password cracker - now this *is* cool!

2001-08-22 Thread Matt Dillon

This gets an 'A' on my cool-o-meter.

http://www.vnunet.com/News/1124839

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: long-term kernel locks

2001-08-20 Thread Matt Dillon


:Matt
:
:Ok I see..the interlock is a lock on a collection (e.g
:on vfs mount list) and it can be released once the simple
:lock within the to-be-locked object has been acquired.
:These are really spin locks, now that I saw simplelock.s
:
:One more clarification if you will.. :-)
:
:What is the purpose of the "splhigh" in acquire() ?
:
:Is it this that prevents an involuntary context switch in
:a UP system , while the lock variables are being modified 
:by acquire() ?
:
:-Sandeep

In -stable there are no involuntary context switches in kernel mode
and only one cpu can truly be running kernel code at any given
moment, but interrupts can still preempt (temporarily) the mainline
code.  So all you need to protect against are interrupts.  splhigh()
effectively disables interrupts - because an interrupt might call code
that conflicts with the mainline code in question.  There are places in
the mainline code that interrupts never touch where the mainline code thus
does not bother to use splhigh() or any spl*() stuff at all.  There
are other places in the mainline code that are also called from
interrupts and that is where the spl*() protection is required.

In -current the situation is vastly different.  Involuntary 
context switches can occur at almost any time and more than one
cpu may be running kernel code at any given instance, so structures
must be protected by mutexes at all times. 
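
A simplified kernel-context sketch of the difference, using a queue touched
by both mainline and interrupt code (the structure and names are made up for
the example; the -current mutex version is only sketched in a comment):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/queue.h>

    struct pkt {
            TAILQ_ENTRY(pkt) p_link;
    };
    static TAILQ_HEAD(, pkt) pktq = TAILQ_HEAD_INITIALIZER(pktq);

    /* -stable: only interrupts can preempt us, so spl protection suffices */
    static void
    pktq_insert(struct pkt *p)
    {
            int s;

            s = splimp();           /* block the interrupts that touch pktq */
            TAILQ_INSERT_TAIL(&pktq, p, p_link);
            splx(s);
    }

    /*
     * -current: another cpu (or a preempting thread) may be in the kernel,
     * so the same critical section must be covered by a mutex instead:
     *
     *      mtx_lock(&pktq_mtx);
     *      TAILQ_INSERT_TAIL(&pktq, p, p_link);
     *      mtx_unlock(&pktq_mtx);
     */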

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: long-term kernel locks

2001-08-20 Thread Matt Dillon


:Hi there,
:
:I need some mechanism to hold long-term locks (across
:context switches) while using kernel threads (kthread_*)
:and lockmgr() looked like the right thing to use.
:
:I am running FreeBSD 4.1 on a uniprocessor (..the questions
:are similar with 4.3)
:
:Looking at kern_lock.c, I see that lockmgr() uses simple
:locks.  On a UP system, simple locks are turned off.
:I dont see any way to prevent a context switch while the
:kernel thread is in the lockmgr code - after going
:thru a simple_lock() call.  Is this correct ?
:
:So my questions are :
:1) Is lockmgr safe ?
:2) Are there other sync primitives that can be used 
:   between two kernel entities (i can move to 4.3)
:3) What is the use of simplelock_recurse in
:   kern_lock.c ?
:
:TIA,
:-Sandeep

Ah, the wonderful world of lockmgr().  This function actually
implements long term locks.  It uses simplelock() internally
to force consistent state for short periods of time - which really
only applies to SMP environments, which is why simplelock() is a
NOP on UP systems.

For an example of long-term in-kernel locks, take a look at 
kern/vfs_subr.c.  Search for 'lockinit' and 'lockmgr' calls.
lockmgr() is in fact what you want to use.

You should be able to safely ignore the interlock stuff for your
purposes.  interlock is passed in order to allow lockmgr() to
release the simplelock that the caller was holding in order for
lockmgr() to be able to sleep, and then gain it back later.  

i.e.  the caller may simplelock() something it is managing and then
want to call lockmgr() to get a real lock 'atomically'.  The atomicity
is achieved by the caller maintaining its hold on the simplelock()
through the call to lockmgr().  But simplelock()'s cannot survive
context switches so the caller must pass the simplelock to lockmgr()
so lockmgr() can release it temporarily when it decides it needs to
block.  Weird, but it works.
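
A minimal kernel-context sketch of that usage (the structure and function
names are invented for the example; the lockinit()/lockmgr() calls follow the
4.x-era interface described above, ignoring the interlock):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/proc.h>

    struct mystate {
            struct lock ms_lock;            /* long-term lock */
            /* ... data shared with a kernel thread ... */
    };

    static void
    mystate_init(struct mystate *ms)
    {
            lockinit(&ms->ms_lock, PVM, "mslock", 0, 0);
    }

    static void
    mystate_update(struct mystate *ms, struct proc *p)
    {
            /* may sleep while waiting - safe across context switches */
            lockmgr(&ms->ms_lock, LK_EXCLUSIVE, NULL, p);
            /* ... modify ms, possibly sleeping (e.g. waiting on I/O) ... */
            lockmgr(&ms->ms_lock, LK_RELEASE, NULL, p);
    }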

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Recommendation for minor KVM adjustments for the release

2001-08-19 Thread Matt Dillon


:> Yes, and the buffer cache determines how much dirty file-backed data
:> (via write() or mmap()) the system is allowed to accumulate before
:> it forces it out, which should probably be the greater concern here.
:
:How hard would it be to allow dirty data in the file
:cache, without buffer mappings ?
:
:regards,
:
:Rik

This is already supported via MAP_NOSYNC.  The problem is that the
dirty data is not tracked under light memory loads (tracking dirty 
data is a function of the buffer cache), so unless something 
forces the page out or fsync()s the file explicitly, the dirty data
might not be written out for days.   It might look good in a benchmark,
but it would play havoc on system stability in the event of a crash.
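
For illustration, a small userland sketch of how a program opts into that
behavior and then takes responsibility for flushing; the file name and size
are placeholders:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            size_t len = 1024 * 1024;
            int fd = open("scratch.dat", O_RDWR);
            char *p;

            /* dirty pages of this mapping are not tracked by the buffer cache */
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NOSYNC, fd, 0);
            if (p == MAP_FAILED)
                    return (1);

            p[0] = 1;                       /* dirty the mapping */

            /* the application must force the data out itself */
            msync(p, len, MS_SYNC);         /* or fsync(fd) */

            munmap(p, len);
            close(fd);
            return (0);
    }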

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Proposed patch (-stable) for minor KVM adjustments for the release

2001-08-19 Thread Matt Dillon

Here is my proposed patch.  It adds two options (overrides) to the kernel
config, two boot-time tunables for same, and sets reasonable defaults.
I also changed the default auto-sizing calculation for swap-meta from
32x physical pages to 16x physical pages on top of it being capped.
I only made a small change there since we cannot really afford to run
out of swap-meta structures (the system will probably lockup if we do).

This patch is against -stable.  I will commit it to -current today and
await the release engineers' permission to MFC it to -stable for the
release.  I did some simple testing under -stable, including bumping up
maxusers and NMBCLUSTERS.

-Matt

Index: conf/options
===
RCS file: /home/ncvs/src/sys/conf/options,v
retrieving revision 1.191.2.35
diff -u -r1.191.2.35 options
--- conf/options2001/08/03 00:47:27 1.191.2.35
+++ conf/options2001/08/19 20:00:29
@@ -163,6 +163,8 @@
 NBUF                   opt_param.h
 NMBCLUSTERS            opt_param.h
 NSFBUFS                opt_param.h
+VM_BCACHE_SIZE_MAX     opt_param.h
+VM_SWZONE_SIZE_MAX     opt_param.h
 MAXUSERS
 
 # Generic SCSI options.
Index: i386/conf/LINT
===
RCS file: /home/ncvs/src/sys/i386/conf/Attic/LINT,v
retrieving revision 1.749.2.77
diff -u -r1.749.2.77 LINT
--- i386/conf/LINT  2001/08/15 01:23:49 1.749.2.77
+++ i386/conf/LINT  2001/08/19 20:27:02
@@ -2315,7 +2315,44 @@
 #
 options        NSFBUFS=1024
 
+# Set the size of the buffer cache KVM reservation, in buffers.  This is
+# scaled by approximately 16384 bytes.  The system will auto-size the buffer
+# cache if this option is not specified or set to 0.
 #
+options        NBUF=512
+
+# Set the size of the mbuf KVM reservation, in clusters.  This is scaled
+# by approximately 2048 bytes.  The system will auto-size the mbuf area
+# if this options is not specified or set to 0.
+#
+options        NMBCLUSTERS=1024
+
+# Tune the kernel malloc area parameters.  VM_KMEM_SIZE represents the 
+# minimum, in bytes, and is typically (12*1024*1024) (12MB). 
+# VM_KMEM_SIZE_MAX represents the maximum, typically 200 megabytes.
+# VM_KMEM_SIZE_SCALE can be set to adjust the auto-tuning factor, which
+# typically defaults to 4 (kernel malloc area size is physical memory 
+# divided by the scale factor).
+#
+options        VM_KMEM_SIZE="(10*1024*1024)"
+options        VM_KMEM_SIZE_MAX="(100*1024*1024)"
+options        VM_KMEM_SIZE_SCALE="4"
+
+# Tune the buffer cache maximum KVA reservation, in bytes.  The maximum is
+# usually capped at 200 MB, effecting machines with > 1GB of ram.  Note
+# that the buffer cache only really governs write buffering and disk block
+# translations.  The VM page cache is our primary disk cache and is not
+# effected by the size of the buffer cache.
+#
+options        VM_BCACHE_SIZE_MAX="(100*1024*1024)"
+
+# Tune the swap zone KVA reservation, in bytes.  The default is typically
+# 70 MB, giving the system the ability to manage a maximum of 28GB worth
+# of swapped out data.  
+#
+options        VM_SWZONE_SIZE_MAX="(50*1024*1024)"
+
+#
 # Enable extra debugging code for locks.  This stores the filename and
 # line of whatever acquired the lock in the lock itself, and change a
 # number of function calls to pass around the relevant data.  This is
@@ -2500,9 +2537,7 @@
 options        KEY
 options        LOCKF_DEBUG
 options        LOUTB
-options        NBUF=512
 options        NETATALKDEBUG
-options        NMBCLUSTERS=1024
#options   OLTR_NO_BULLSEYE_MAC
#options   OLTR_NO_HAWKEYE_MAC
#options   OLTR_NO_TMS_MAC
@@ -2521,7 +2556,5 @@
 options        SPX_HACK
 options        TIMER_FREQ="((14318182+6)/12)"
 options        VFS_BIO_DEBUG
-options        VM_KMEM_SIZE
-options        VM_KMEM_SIZE_MAX
-options        VM_KMEM_SIZE_SCALE
 options        XBONEHACK
+
Index: i386/i386/machdep.c
===
RCS file: /home/ncvs/src/sys/i386/i386/machdep.c,v
retrieving revision 1.385.2.15
diff -u -r1.385.2.15 machdep.c
--- i386/i386/machdep.c 2001/07/30 23:27:59 1.385.2.15
+++ i386/i386/machdep.c 2001/08/19 20:36:19
@@ -320,7 +320,9 @@
 * The nominal buffer size (and minimum KVA allocation) is BKVASIZE.
 * For the first 64MB of ram nominally allocate sufficient buffers to
 * cover 1/4 of our ram.  Beyond the first 64MB allocate additional
-* buffers to cover 1/20 of our ram over 64MB.
+* buffers to cover 1/20 of our ram over 64MB.  When auto-sizing
+* the buffer cache we limit the eventual kva reservation to
+* maxbcache bytes.
 *
 * factor represents the 1/4 x ram conversion.
 */
@@ -332,6 +334,8 @@
nbuf += min((physmem - 1024) / factor, 16384 / factor)

Re: Recommendation for minor KVM adjustments for the release

2001-08-19 Thread Matt Dillon


:
:>There are two things I would like to commit for the release:
:>
:>  - I would like to cap the SWAPMETA zone reservation to 70MB,
:>which allows us to manage a maximum of 29GB worth of swapped
:>out data.
:>
:>This is plenty and saves us 94MB of KVM which is roughly
:>equivalent to 30,000 nmbclusters/mbufs.
:
:   It seems really hard to justify even that much SWAPMETA. A more
:reasonable amount would be more like 20MB.
:
:>  - I would like to cap the size of the buffer cache at 200MB,
:>giving us another 70MB or so of KVM which is equivalent to
:>another 30,000 or so nmbclusters.
:
:   That also seems like overkill for the vast majority of systems.
:
:
:-DG
:
:David Greenman

It's not a 1:1 mapping.  There is some sparseness to the way SWAPMETA
structures are used so 29GB worth of swap-meta supported VM may wind
up only being 15GB of actually swapped-out data in reality.  It
depends very heavily on the applications being run.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Recommendation for minor KVM adjustments for the release

2001-08-19 Thread Matt Dillon


:
:David Greenman wrote:
:> 
:> >   - I would like to cap the size of the buffer cache at 200MB,
:> > giving us another 70MB or so of KVM which is equivalent to
:> > another 30,000 or so nmbclusters.
:> 
:>That also seems like overkill for the vast majority of systems.
:
:But probably not for the large-memory systems (and on the machines
:with small memory the limit will be smaller anyway).
:
:-SB

I should also say that even in the Linux and Solaris worlds, systems
with > 4GB of ram wind up being very specific-use systems.  Typically
such systems are used almost solely to run large databases.  For
example, so that something like Oracle can manage a multi-gigabyte cache.
These applications do not actually require the memory to be 
swap-backed, or file-backed, or really managed at all.

In FreeBSD land the use-case would simply be our physical-backed-shared-
memory feature.  We could implement the 8-byte MMU extensions in the
PMAP code as a kernel option to be able to access ram > 4GB without
having to change anything else in the kernel (not even vm_page_t or
the pmap supporting structures) *IF* we only use the ram > 4GB in
physical-backed SysV shared memory mappings.  This would actually
cover 99% of the needs of people who need to run systems with this much
ram.
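
The user-visible side of that use case is just the ordinary SysV API; a
minimal sketch follows (the segment size is a placeholder, and whether the
segment ends up physically backed is kernel policy, e.g. the
kern.ipc.shm_use_phys sysctl, rather than anything the application asks for):

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int
    main(void)
    {
            size_t cache_size = 64 * 1024 * 1024;   /* placeholder size */
            int id;
            void *base;

            /* one large segment, e.g. a database buffer cache */
            id = shmget(IPC_PRIVATE, cache_size, IPC_CREAT | 0600);
            if (id == -1)
                    return (1);

            base = shmat(id, NULL, 0);
            if (base == (void *)-1)
                    return (1);

            /* ... use the cache; these pages need no swap or file backing ... */

            shmdt(base);
            shmctl(id, IPC_RMID, NULL);
            return (0);
    }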

There are lots of issues on IA32 in regards to memory > 4GB... for
example, many PCI cards cannot DMA beyond 4GB.  We would avoid these
issues as well by only using the memory as physical backing store
for SysV shared memory segments.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Recommendation for minor KVM adjustments for the release

2001-08-19 Thread Matt Dillon


:
:>David Greenman wrote:
:>> 
:>> >   - I would like to cap the size of the buffer cache at 200MB,
:>> > giving us another 70MB or so of KVM which is equivalent to
:>> > another 30,000 or so nmbclusters.
:>> 
:>>That also seems like overkill for the vast majority of systems.
:>
:>But probably not for the large-memory systems (and on the machines
:>with small memory the limit will be smaller anyway). Having a
:>machine with a few gigs of memory and being able to use only 200MB
:>for the buffer cache seems to be quite bad for a general-purpose
:>machine. 
:
:   Uh, I don't think you understand what this limit is about. It's
:essentially the limit on the amount of filesystem directory data that
:can be cached. It does not limit the amount of file data that can
:be cached - that is only limited by the amount of RAM in the machine.
:
:-DG
:
:David Greenman
:Co-founder, The FreeBSD Project - http://www.freebsd.org
:President, TeraSolutions, Inc. - http://www.terasolutions.com

Yes, and the buffer cache determines how much dirty file-backed data
(via write() or mmap()) the system is allowed to accumulate before
it forces it out, which should probably be the greater concern here.

The issue we face in regards to these limitations is simply that
the kernel only has 1 GB of KVA space available no matter how much
physical ram is in the box.   We currently scale a number of things
based on physical ram - sendfile() buffer space, buffer cache, swap
meta space, kernel malloc area, PV Entry space, and so forth.

Machines with large amounts of ram wind up eating up so much KVA that
simple tuning elements such as increasing the number of NMBCLUSTERS
or increasing 'maxusers' can cause the machine to run out of KVA space
during the boot process, resulting in a panic.

All I am suggesting here is that we throw in some reasonable limits
temporarily (until we can come up with a permanent solution), limits
which really only affect machines with greater than 1GB of real memory
and which can be overriden by kernel options, in order to avoid common
tuning configurations (e.g. large number of NMBCLUSTERS, high 'maxusers'
specification) from causing unexpected crashes at boot time.  This will
give us time to come up with a more permanent solution.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Recommendation for minor KVM adjustments for the release

2001-08-18 Thread Matt Dillon

I just got through diagnosing some problems Mohan Kokal has been
having with not being able to specify a large enough NMBCLUSTERS
on large-memory (2G) machine.  The symptoms were that he was able
to specify 65536 clusters on a 1G machine, but the same parameter
panic'd a 2G machine on boot.

Mohan graciously gave me a login on the system so I could gdb a
live kernel (with 61000 clusters, which worked) to figure out what
was going on.   This is what I found:

SYSTEM CONFIG: 2G physical ram, 61000 NMBCLUSTERS, 512 maxusers.

kernel_map      (1G)    bfeff000 - ff80

kmem_map        397MB   c347a000 - daf3  (mb_map is here - 187MB)
clean_map       267MB   db474000 - eb35  (buffer_map is here)

sf_buf's         35MB
zone allocator  299MB   (breakdown: 75MB for PVENTRY, 164MB for SWAPMETA)
                -
                998 MB  oops!

In other words, he actually ran out of KVM!

The problem we face is that KVM does not scale with real memory.  So on
a 1G machine the various maps are smaller, allowing more mbufs to
be specified in the kernel config.  On a 2G machine the various maps
are larger, allowing fewer mbufs to be specified.  On a 4G machine it
is even worse.

There are two things I would like to commit for the release:

- I would like to cap the SWAPMETA zone reservation to 70MB,
  which allows us to manage a maximum of 29GB worth of swapped
  out data.

  This is plenty and saves us 94MB of KVM which is roughly
  equivalent to 30,000 nmbclusters/mbufs.

- I would like to cap the size of the buffer cache at 200MB,
  giving us another 70MB or so of KVM which is equivalent to
  another 30,000 or so nmbclusters.

I would have kernel options to override the caps.  The already existing
NBUF option would override the buffer cache cap, and I would add
a kernel option called SWAPMAX which would override the swapmeta
cap.

These changes will allow large-memory machines to scale KVM a bit
better and reduce unexpected panic-at-boot problems.  Swap performance
will not be affected at all because my original SWAPMETA calculation
was overkill.  The buffer cache will be sized as if the machine had
about 1.5GB of ram and so the change only caps it when physmem is 
larger than that, and should have a minimal impact since the meat of
our caching is the VM page cache, not the buffer cache.

If the release engineer(s) give the OK, I will stage these changes
into -current this weekend and -stable on monday or tuesday.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Allocate a page at interrupt time

2001-08-07 Thread Matt Dillon


:
:Matt Dillon wrote:
:> :What "this", exactly?
:> :
:> :That "virtual wire" mode is actually a bad idea for some
:> :applications -- specifically, high speed networking with
:> :multiple gigabit ethernet cards?
:> 
:> All the cpu's don't get the interrupt, only one does.
:
:I think that you will end up taking an IPI (Inter Processor
:Interrupt) to shoot down the cache line during an invalidate
:cycle, when moving an interrupt processing thread from one
:CPU to another.  For multiple high speed interfaces (disk or
:network; doesn't matter), you will end up burining a *lot*
:of time, without a lockdown.

Cache line invalidation does not require an IPI.  TLB
shootdowns require IPIs.  TLB shootdowns are unrelated to
interrupt threads, they only occur when shared mmu mappings
change.  Cache line invalidation can waste cpu cycles --
when cache mastership changes occur between cpus due to
threads being switched between cpus.  I consider this a
serious problem in -current.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Allocate a page at interrupt time

2001-08-07 Thread Matt Dillon

:I'd agree that is a specialized situation, one which wouldn't
:be critical to many freebsd users.  Is Terry right that the
:current strategy will "lock us into virtual wire mode", in
:some way which means that this specialized situation CANNOT
:be handled?
:
:(it would be fine if it were "handled" via some specialized
:kernel option, imo.  I'm just wondering what the limitations
:are.  I do not mean to imply we should follow some different
:strategy here, I'm just wondering...)
:
:-- 
:Garance Alistair Drosehn=   [EMAIL PROTECTED]

In -current there is nothing preventing us from wiring
interrupt *threads* to cpus.  Wiring the actual interrupts
themselves might or might not yield a performance improvement
beyond that.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Allocate a page at interrupt time

2001-08-07 Thread Matt Dillon


:> > It also has the unfortunate property of locking us into virtual
:> > wire mode, when in fact Microsoft demonstrated that wiring down
:> > interrupts to particular CPUs was good practice, in terms of
:> > assuring best performance.  Specifically, running in virtual
:> > wire mode means that all your CPUs get hit with the interrupt,
:> > whereas running with the interrupt bound to a particular CPU
:> > reduces the overall overhead.  Even what we have today, with
:> > the big giant lock and redirecting interrupts to "the CPU in
:> > the kernel" is better than that...
:> 
:> Terry, this is *total* garbage.
:> 
:> Just so you know, ok?
:
:What "this", exactly?
:
:That "virtual wire" mode is actually a bad idea for some
:applications -- specifically, high speed networking with
:multiple gigabit ethernet cards?

All the cpu's don't get the interrupt, only one does.

:That Microsoft demonstrated that wiring down interrupts
:to a particular CPU was a good idea, and kicked both Linux'
:and FreeBSD's butt in the test at ZD Labs?

Well, if you happen to have four NICs and four CPUs, and
you are running them all full bore, I would say that
wiring the NICs to the CPUs would be a good idea.  That
seems like a rather specialized situation, though.

-Matt

:That taking interrupts on a single directed CPU is better
:than taking an IPI on all your CPUs, and then sorting out
:who's going to handle the interrupt?
:...
:
:-- Terry




To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Page Coloring

2001-08-06 Thread Matt Dillon


:If I remember correctly from reading a thesis (can't remember its
:author) on the page coloring which I believe widely introduced this
:concept, page coloring adds a lot of efficiency to the directly 
:mapped caches but even for the 2-way caches is nearly pointless.
:
:-SB

For the most part, yes.  2-way set associative caches handle standard
compiled programs reasonably well.  4-way set associative 
caches handle standard interpreted programs reasonably well.
Page-coloring helps keep things consistent between program runs
but typically has very little effect on machines which already have
set-associative caches.  The main thing is the consistency - for
example, if you have a medium-sized buffer in memory which you are
accessing randomly, page coloring will prevent degenerate cache cases where
the VM system (without coloring) happens to assign the same cache page to
every page of the buffer.  But you wouldn't notice unless your buffer had
more than N pages (N = set associativity).
For the same reason, 'random' also works fairly well (just in a less
consistent way), which is why page coloring doesn't add much when doing
a performance comparison on a system with a 2-way or better associative
cache.
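
A small worked example (sizes picked for illustration): with 4K pages, a 128K
direct-mapped cache has 128K/4K = 32 page colors, and two physical pages whose
addresses differ by a multiple of 128K share a color and evict each other:

    #include <stdio.h>

    #define PAGE_SHIFT      12              /* 4K pages */
    #define CACHE_COLORS    32              /* 128K cache / 4K page */

    /* low bits of the physical page number select the color, just like
     * the (pa >> PAGE_SHIFT) & PQ_L2_MASK calculation in the VM system */
    static unsigned int
    page_color(unsigned long pa)
    {
            return ((pa >> PAGE_SHIFT) & (CACHE_COLORS - 1));
    }

    int
    main(void)
    {
            /* 0x20000 = 128K apart: same color, so these two pages would
             * keep evicting each other in a 128K direct-mapped cache */
            printf("%u %u\n", page_color(0x4000), page_color(0x4000 + 0x20000));
            return (0);
    }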

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Allocate a page at interrupt time

2001-08-05 Thread Matt Dillon

:I should have guessed the reason. Matthew Dillon answered this question on 
:Fri, 2 Jun 2000 as follows:
:
:
:The VM routines that manage pages associated with objects are not
:protected against interrupts, so interrupts aren't allowed to change
:page-object associations.  Otherwise an interrupt at just the wrong
:time could corrupt the mainline kernel VM code.
:
:
:On Thu, 2 Aug 2001, Zhihui Zhang wrote:
:
:> 
:> FreeBSD can not allocate from the PQ_CACHE queue in an interrupt context.
:> Can anyone explain it to me why this is the case?
:> 
:> 
:> Thanks,

Yes, that is precisely the reason.  In -current this all changes, though,
since interrupts are now threads.  *But*, that said, interrupts cannot
really afford to hold mutexes that might end up blocking them for 
long periods of time so I would still recommend that interrupt code not
attempt to allocate pages out of PQ_CACHE.
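
A kernel-context sketch of the distinction (4.x-era interface; the object and
index arguments are placeholders):

    #include <sys/param.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>

    static vm_page_t
    grab_page_at_interrupt(vm_object_t obj, vm_pindex_t pindex)
    {
            /*
             * VM_ALLOC_INTERRUPT draws only from the free queues, never
             * from PQ_CACHE, and the caller must tolerate failure.
             * Non-interrupt code would use VM_ALLOC_NORMAL or
             * VM_ALLOC_SYSTEM instead.
             */
            return (vm_page_alloc(obj, pindex, VM_ALLOC_INTERRUPT));
    }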

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Page Coloring

2001-08-05 Thread Matt Dillon


:> Since most L1 caches these days are at least 16K and most L2 caches
:> these days are at least 64K (and often much higher, such as on the IA32),
:> our hardwired page coloring constants wind up being about 95% effective
:> across the entire range of chips our OS currently runs on.
:
:Yes, I understand that.  I'm just trying to find out why Mike keeps
:saying we cannot determine the processor cache characteristics at
:runtime.
:
:John
:-- 
:  John Polstra   [EMAIL PROTECTED]

You can find out from the cpuid or something like that, but it's
probably easier to simply do it programmatically, or not bother at
all.  It's not worth the effort.  We would not reap any additional
benefit from knowing.

It is interesting to note that one effect of the page-coloring
code is that the VM CACHE and FREE VM page queues are actually
multi-queues, which means that when I extend the SMP locking down into
them we will wind up with fine-grained locking for memory allocations.
But before I can even begin contemplating that I have to change the way
the vm_page BUSY'ing stuff works so page operations (such as allocations)
can occur without having to hold long term mutexes.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Page Coloring

2001-08-05 Thread Matt Dillon


:
:If I added this to a man page would I be telling the truth :).
:
:Note, these are my notes and not the exact text that I would
:add, and I have not bother with anything to do with object
:coloring etc.  I just want to make sure I've got this part
:down.
:
:Chad

It's a good description but it might be better to simplify it a bit.
You don't need to go into that level of detail.  There is a short
page coloring explanation at the end of my VM article which might
be more suitable to a man page:

http://www.daemonnews.org/21/freebsd_vm.html

-Matt

(quoted from the article):

Ok, so now onto page coloring: All modern memory caches are what are 
known as *physical* caches. They cache physical memory addresses, not
virtual memory addresses. This allows the cache to be left alone across
a process context switch, which is very important. 

But in the UNIX world you are dealing with virtual address spaces, not
physical address spaces. Any program you write will see the virtual
address space given to it. The actual *physical* pages underlying that
virtual address space are not necessarily physically contiguous! In
fact, you might have two pages that are side by side in a process's
address space which wind up being at offset 0 and offset 128K in
*physical* memory. 

A program normally assumes that two side-by-side pages will be 
optimally cached. That is, that you can access data objects in both
pages without having them blow away each other's cache entry. But this
is only true if the physical pages underlying the virtual address
space are contiguous (insofar as the cache is concerned). 

This is what Page coloring does. Instead of assigning *random*
physical pages to virtual addresses, which may result in non-optimal
cache performance , Page coloring assigns *reasonably-contiguous*
physical pages to virtual addresses. Thus programs can be written
under the assumption that the characteristics of the underlying
hardware cache are the same for their virtual address space as they
would be if the program had been run directly in a physical address
space. 

Note that I say 'reasonably' contiguous rather than simply 'contiguous'.
From the point of view of a 128K direct mapped cache, the physical
address 0 is the same as the physical address 128K. So two side-by-side
pages in your virtual address space may wind up being offset 128K and
offset 132K in physical memory, but could also easily be offset 128K
and offset 4K in physical memory and still retain the same cache
performance characteristics. So page-coloring does *NOT* have to assign
truly contiguous pages of physical memory to contiguous pages of
virtual memory, it just needs to make sure it assigns contiguous pages
from the point of view of cache performance and operation. 


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Page Coloring

2001-08-05 Thread Matt Dillon


:In article <[EMAIL PROTECTED]>,
:Mike Smith  <[EMAIL PROTECTED]> wrote:
:> 
:> It looks about right, but page colouring is pointless unless and until we 
:> can determine the processor cache characteristics at runtime.
:> 
:> Which we can't.
:
:Why can't we do this at least on the i386 with the CPUID instruction,
:initial %eax == 2?  It returns cache size, associativity, and line
:size for both the L1 and L2 caches.  As far as I can tell, it works
:for the Pentium Pro and subsequent processors.
:
:John
:-- 
:  John Polstra   [EMAIL PROTECTED]

Well, first of all the page coloring is not pointless with the
sizes hardwired.  The cache characteristics do not have to
match exactly for page coloring to work.  The effectiveness is
like a log-graph, and you don't lose a lot by guessing wrong.
Once you get past a designated cache size of 4-pages or so you've
already reaped 90% of the benefit on systems which use N-way (2, 4, 8)
associative caches (which is most systems these days).  For systems with
direct-mapped caches you reap 90% of the benefits once you get past
16 pages or so.

Since most L1 caches these days are at least 16K and most L2 caches
these days are at least 64K (and often much higher, such as on the IA32),
our hardwired page coloring constants wind up being about 95% effective
across the entire range of chips our OS currently runs on.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Why two cards on the same segment...

2001-07-26 Thread Matt Dillon


:..
:> You have to explicitly bind to the correct source IP if you care.
:>
:> For our machines I bind our external services specifically to the
:> external IP.  Beyond that I usually don't care because I NAT-out our
:> internal IP space anyway, so any packets sent 'from' an internal IP
:> to the internet wind up going through the NAT, which hides the fact
:> that the source machine chose the wrong IP.
:
:
:Hmm.. That hasn't been my experience at all.  I have _always_ seen
:outgoing connections use a source address of the closest interface
:address that exists on the same IP network as the destination, OR, if
:it is a non-local destination, then the source is whatever IP address
:is on the same IP network as the next-hop gateway.  If your next-hop
:gateway is an RFC1918 address, then your source address will be your
:RFC1918 address on the same subnet, unless you specify otherwise of
:course.  Maybe if you set net.inet.ip.subnets_are_local to 1, then
:maybe the system will use the primary non-alias address of the closest
:physical interface, be it a public address or whatever, but I've not
:tried that.
:
:-- Chris Dillon - [EMAIL PROTECTED] - [EMAIL PROTECTED]

Huh... you're right!  How odd.  I think someone may have fixed something
since I last played with this.  I swear it wasn't doing that before!  I
would set up a bunch of IP aliases and it was pot-luck.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Why two cards on the same segment...

2001-07-26 Thread Matt Dillon

:Not really. The private IP space probably never leaves that LAN segment so
:the source IP would get set properly and the default route is irrelevant.
:Whenever he communicates with a block that is not directly attached, the
:code has to choose a source address and then send the packet to the next
:hop (usually the default route unless you have a dynamic protocol daemon
:(routed/gated/etc) running). As long as you're just communicating with
:directly attached subnets, everything will work peachy regardless of
:public/private/quantity/netmask.
:
:-Steve

I wish it were that easy.  If you have two interfaces on the same LAN
segment, but one is configured with an internal IP and one is 
configured with an external IP, and the default route points out the
interface configured with the external IP, then you are ok.

But if you have one interface with *two* IP addresses, things are murkier.
For example (a real-life case):

ash:/home/dillon> ifconfig
fxp0: flags=8843 mtu 1500
inet 208.161.114.66 netmask 0xffc0 broadcast 208.161.114.127
inet 10.0.0.3 netmask 0xff00 broadcast 10.0.0.255
ether 00:b0:d0:49:3b:fd 
media: Ethernet autoselect (100baseTX )
status: active

Then the 'source IP' address the machine uses is completely up in the
air.   It could be the external IP, or the internal IP, and it could
change out from under you if you manipulate the interface with ifconfig.
You have to explicitly bind to the correct source IP if you care.

For our machines I bind our external services specifically to the
external IP.  Beyond that I usually don't care because I NAT-out our
internal IP space anyway, so any packets sent 'from' an internal IP
to the internet wind up going through the NAT, which hides the fact
that the source machine chose the wrong IP.
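
A small sketch of what "bind ... specifically to the external IP" looks like,
using the external address from the ifconfig output above (error checking
trimmed):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>

    int
    make_external_socket(void)
    {
            struct sockaddr_in src;
            int s;

            s = socket(AF_INET, SOCK_STREAM, 0);

            memset(&src, 0, sizeof(src));
            src.sin_family = AF_INET;
            src.sin_addr.s_addr = inet_addr("208.161.114.66"); /* external IP */
            src.sin_port = 0;               /* any local port */

            /* pin the source address before connect()/listen() */
            bind(s, (struct sockaddr *)&src, sizeof(src));
            return (s);
    }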

-Matt

:> Yes, but what that snippet showed from ifconfig showed 2 networks, 2 from
:> public IP space and 1 from private IP space, and since it's working the
:> networking code must know/care about something that it's being fed. --
:> Jonathan
:>
:> --
:> Jonathan M. Slivko <[EMAIL PROTECTED]>
:> Blinx Networks
:> http://www.blinx.net/
:>
:> - Original Message -
:> From: "Steven Ames" <[EMAIL PROTECTED]>
:> To: "Jonathan M. Slivko" <[EMAIL PROTECTED]>; "Chris Dillon"
:> <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
:> Sent: Thursday, July 26, 2001 4:56 PM
:> Subject: Re: Why two cards on the same segment...
:>
:>
:> > > Yes, but, I think the issue with the 2 IP classes working is because
:one
:> > is
:> > > not routable, and therefore it's not a real
:> > >  IP address, and the router knows this, hence it's not reacting to it
:by
:> > > stopping to work. As long as you use virtual
:> > > ip's (192.168.*.*) then there should be no reason why it wouldn't
:work.
:> > > However, if your talking about a routable
:> > > IP address, then you might have a problem, as there is a difference
:> > between
:> > > a virtual IP address and a real (routable)
:> > > IP address. Just my 0.02 cents. -- Jonathan
:> >
:> > I don't think the networking code knows/cares if something is private or
:> > public IP space. I might be off here but I think the real problem with
:> > two seperate networks on one card (or even on two cards) would be
:> > the default route (can't have two right?) and which IP address gets
:> > used as the 'source IP' on packets leaving the system.
:> >
:> > -Steve

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: qestion about vm page coloring

2001-07-26 Thread Matt Dillon


:
:  
:
:  yes, I mean vm_page_t, and understand what you said. I will try to print the 
:value of PQ_L2_SIZE in my kernel. Do you know what kernel options influence 
:this value? I saw it is decided by PQ_CACHESIZE which is decided by different 
:PQ_HUGE[LARGE/MEDIUM/...]CACHE setting. The default PQ_CACHESIZE is 128,
:and the corresponding PQ_L2_SIZE is 32. Am I right so far, or is something wrong?
:I use FreeBSD 4.1 release kernel to build my kernel.
:
:Anyway, thanks for your explaination.
:
:Rex Luo

Yes, 32 is what you should see with the defaults.  

I would also upgrade to the latest -stable.  4.1 was a fairly good release,
but a lot of bugs have been fixed between it and 4.3, including fixes for a
couple of root exploits.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: qestion about vm page coloring

2001-07-25 Thread Matt Dillon


:Dear all,
:
:   I have been studying FreeBSD vm management recently; however, I am a little confused 
:with vm_page's page color. When you call vm_add_new_page() in vm_startup(), 
:you will set each map entry's page color according to its physical addr.
:
:   m->pc = (pa >> PAGE_SHIFT)&PQ_L2_MASK;
:
:However, I found that almost every map entry's page color is zero, which means
:PQ_L2_SIZE is 1 and the page coloring option is disabled. Maybe I can do some 
:modification to dump PQ_L2_SIZE's value, but I think my guess is right.
:Can someone please tell me the principle of page coloring, and why it's disabled
:now?
:
:Thanks,
:
:Rex Luo

I'm not sure what you mean by 'map entry'... vm_page_t's have color, and
vm_object's have a base color to randomly offset the color of the
vm_page_t's associated with the object, but vm_map_entry's do not have
a page color associated with them.

The page coloring works fine on my box, you may be looking at the wrong
thing. PQ_L2_SIZE is definitely not 1 unless you've specified some weird
kernel options in the kernel config.

-

Page coloring basically ensures that pages which are adjacent in 
virtual memory also wind up being adjacent in the L1 and L2
cpu caches in order to get more consistent cpu cache behavior.  Without
page coloring it is quite possible to have several adjacent pages in
virtual memory wind up utilizing the same cpu cache page, which can
affect performance with certain types of applications or certain cpu
cache topologies.   On IA32 pentium architectures the effect would
probably not be all that noticeable, but getting consistent behavior
is still a good thing.

-Matt


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message


