Re: HEADS UP! KSE needs more attention

2004-06-07 Thread Thomas Moestl
On Sun, 2004/06/06 at 14:59:21 -0700, Kris Kennaway wrote:
> On Sun, Jun 06, 2004 at 03:49:13PM -0600, Scott Long wrote:
> > amd64 is approaching critical mass for tier-1.  There are a number of
> > developers that own amd64 hardware now, and a number of users who are
> > asking about it on the mailing lists.  Peter is finishing up the last
> > blocking item for it (kld's) not including the observed KSE problems.
> > It's very close and I _will_ hold up the release for it to get done.
> > amd64 is the future of commodity computing and we aren't going to
> > ignore it for 5-STABLE.
> 
> amd64 has a bug with swapping - when something begins to access swap,
> the entire system becomes almost entirely unresponsive (e.g. no mouse
> response for up to 10 seconds) until it stops.  Peter has some ideas
> about it, but it's a serious enough bug that it forced me to stop
> using amd64 as my desktop machine (hello, kde!).

Hmmm, I have encountered a similar problem on sparc64 once; the
reason was that vm_pageout_map_deactivate_pages() calls
pmap_remove() for the range from the start to the end of the
process's vm_map when a process is swapped out. Start and end
are VM_MIN_ADDRESS and VM_MAXUSER_ADDRESS respectively, and on
64-bit architectures, that range is very large (128TB on ia64
if I'm not mistaken), so the iteration in pmap_remove() must
be carefully designed to make as large steps as possible to
avoid long run times (or to not iterate over the range at all
if it becomes too large, which we did on sparc64).

It seems that the amd64 version of pmap_remove() will essentially
always iterate in 2MB (level 2 page table) steps, regardless of
whether there is mapping for the respective level 2 table in the
table levels above; that means that in the previously mentioned case,
the outer loop will usually run for about 67 million iterations (the
resident count guard may not be of much use here if a stack page is
left at the very end of the address space). Since there are a few
memory accesses needed in each iterations, that may already be the
cause of such a delay.

I have no hardware to test this, so all of the above is just a wild-
assed guess; but maybe it is of use (and sorry for the spam if it
is not).

- Thomas

-- 
Thomas Moestl   <[EMAIL PROTECTED]> http://www.tu-bs.de/~y0015675/
<[EMAIL PROTECTED]> http://people.FreeBSD.org/~tmm/
"I realized that the purpose of writing is to inflate weak ideas,
 obscure poor reasoning, and inhibit clarity."  -- Calvin and Hobbes
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: FreeBSD 5.2 v/s FreeBSD 4.9 MFLOPS performance (gcc3.3.3 v/s gcc2.9.5)

2004-02-16 Thread Thomas Moestl
On Mon, 2004/02/16 at 19:11:16 +0100, Dag-Erling Smørgrav wrote:
> Kris Kennaway <[EMAIL PROTECTED]> writes:
> > On Mon, Feb 16, 2004 at 03:52:16AM -0800, Wes Peters wrote:
> > > Should I commit this?
> > What effect does it have on non-i386 architectures?
> 
> It can't possibly hurt.  If the stack is already aligned on a "better"
> boundary (64 or 128 bytes), it is also aligned on a 32-byte boundary
> since 64 and 128 are multiples of 32, and the patch is a no-op.  If
> only a 16-byte alignment is required, a 32-byte alignment wastes a
> small amount of memory but does not hurt performance.  I believe that
> less-than-16 (and possibly even less-than-32) alignment is pessimal on
> all platforms we support.

Well, it misaligns stack_base on 64-bit architectures, for example
(notice the "- 4", which is there to compensate for the fixup in
kern_execve() that will subtract another sizeof(register_t)):

vectp = (char **)(((vm_offset_t)vectp & ~(vm_offset_t)0x1F) - 4);

It would by much better to be able to align the stack in
exec_setregs(), like amd64, ia64, powerpc and sparc64 do.
Unfortunately that would require changes to crt1, so it would pose a
compatibility problem.

- Thomas

-- 
Thomas Moestl   <[EMAIL PROTECTED]> http://www.tu-bs.de/~y0015675/
<[EMAIL PROTECTED]> http://people.FreeBSD.org/~tmm/
"Oh, great altar of passive entertainment... Bestow upon me thy
 discordant images at such speed as to render linear thought
 impossible!"   -- Calvin and Hobbes
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: PLEASE REVIEW: Adding a pci_if method to facilitate specializedPCI bus drivers

2003-06-27 Thread Thomas Moestl
On Fri, 2003/06/27 at 14:50:08 -0400, John Baldwin wrote:
> On 27-Jun-2003 Thomas Moestl wrote:
> > On Fri, 2003/06/27 at 14:01:45 -0400, John Baldwin wrote:
> >> On 27-Jun-2003 Thomas Moestl wrote:
> >> > On Fri, 2003/06/27 at 13:37:00 -0400, John Baldwin wrote:
> >> >> On 13-Jun-2003 Thomas Moestl wrote:
> >> >> > This requires us to get this firmware property in the OFW PCI bus
> >> >> > driver before routing the interrupt; that can't be done in the pcib
> >> >> > route_interrupt method, since we don't know whether we are routing for
> >> >> > another bridge (where we use whichever index we get passed) or for a
> >> >> > child device (in which case we would need to look at the firmware
> >> >> > property).
> >> >> 
> >> >> Actually, can't you tell this by doing:
> >> >> 
> >> >> if (device_get_parent(device_get_parent(dev)) == pcib)
> >> >> /* Routing direct child. */
> >> >> else
> >> >> /* Routing descedent of a child bridge. */
> >> > 
> >> > No, pcib will always be a grandparent of dev. When routing a
> > ^ for 
> >> > descendant child bridge, dev will itself be the device_t of a bridge,
> >   ^ of a (oops)
> >> > otherwise it is that of the device we are routing to.
> >> Doh, yes. :(  Hmm, can you try something like this maybe:
> >> 
> >> if (pci_get_class(dev) == PCIC_BRIDGE &&
> >> pci_get_subclass(dev) == PCIS_BRIDGE_PCI)
> >> /* Routing across a child bridge. */
> >> else
> >> /* Routing a direct child that is not a bridge. */
> > 
> > This leaves two possible problems: first, there are other types of
> > bridges (we currently support PCI-ISA and PCI-EBus ones, cardbus might
> > also work) for which we need to use PCIB_ROUTE_INTERRUPT(); that could
> > likely be dealt with by not testing the subclass at all.
> > More importantly, however, a bridge might want to allocate an
> > interrupt for itself; for example, cardbus bridge drivers do this to
> > handle insertion/ejection/etc events.
> > 
> > I think that handling this at the bus driver level is a much cleaner
> > solution.
> 
> Then should we eliminate pcib_route_interrupt() and make it a PCI bus
> method?  I just think it is rather odd to require this interface to be
> duplicated in two different places.

This is not necessarily duplication; in the OFW_PCI code, the work to
be done is divided up between the two.
Where the routing has to happen depends a lot on the information
available. If a mapping from device to interrupt is available, it
can make sense to do the routing completely in the bus driver. If
there are any bridges for which we lack such information, however, and
we need to fall back to other routing methods (like swizzling), there
might be a need to support the bridge routing methods. In such a
scenario, the bus and bridge methods usually would be quite different,
so there would be not much duplication.

To give an example, OpenFirmware contains an 'interrupt' property for
each PCI device which uses interrupts. This property can contain a
full interrupt number, or an index which has to be mapped using the
'interrupt-map' properties of higher bridges. In some cases (but not
always), these indices are actually intpins, and on some bridges we
need to do a swizzle instead of a map lookup.
We look up the 'interrupt' property of the device in the bus driver,
and decide whether it is full interrupt number already, in which case
we just use it. Otherwise, we let the parent bridge do the mapping as
required (which in turn might need to resort to it's parent bridge,
and so on).
This scheme also allows us to interface with PCI bridge drivers which
use their intpins, and for which we do not have (or need)
OpenFirmware-specific drivers.
Doing all of that in the bus method would be a layering violation, and
would have ugly consequences (for example, different bridges need to
handle different quirks).

> Doing it in the bus might actually make life simpler.  Right now
> when using different tables for routing PCI interrupts on x86, I
> have to write two different bridge drivers all 
> the time, one for host bridges and one for PCI-PCI bridges, thus we have
> ACPI host-PCI and PCI-PCI bridge drivers, PCIBIOS $PIR host-PCI and PCI-PCI
> bridge drivers and MPTable host-PCI and PCI-PCI bridge drivers.  Being
> able to just have an ACPI

Re: PLEASE REVIEW: Adding a pci_if method to facilitate specializedPCI bus drivers

2003-06-27 Thread Thomas Moestl
On Fri, 2003/06/27 at 14:01:45 -0400, John Baldwin wrote:
> On 27-Jun-2003 Thomas Moestl wrote:
> > On Fri, 2003/06/27 at 13:37:00 -0400, John Baldwin wrote:
> >> On 13-Jun-2003 Thomas Moestl wrote:
> >> > This requires us to get this firmware property in the OFW PCI bus
> >> > driver before routing the interrupt; that can't be done in the pcib
> >> > route_interrupt method, since we don't know whether we are routing for
> >> > another bridge (where we use whichever index we get passed) or for a
> >> > child device (in which case we would need to look at the firmware
> >> > property).
> >> 
> >> Actually, can't you tell this by doing:
> >> 
> >> if (device_get_parent(device_get_parent(dev)) == pcib)
> >> /* Routing direct child. */
> >> else
> >> /* Routing descedent of a child bridge. */
> > 
> > No, pcib will always be a grandparent of dev. When routing a
  ^ for 
> > descendant child bridge, dev will itself be the device_t of a bridge,
  ^ of a (oops)
> > otherwise it is that of the device we are routing to.
> Doh, yes. :(  Hmm, can you try something like this maybe:
> 
> if (pci_get_class(dev) == PCIC_BRIDGE &&
> pci_get_subclass(dev) == PCIS_BRIDGE_PCI)
> /* Routing across a child bridge. */
> else
> /* Routing a direct child that is not a bridge. */

This leaves two possible problems: first, there are other types of
bridges (we currently support PCI-ISA and PCI-EBus ones, cardbus might
also work) for which we need to use PCIB_ROUTE_INTERRUPT(); that could
likely be dealt with by not testing the subclass at all.
More importantly, however, a bridge might want to allocate an
interrupt for itself; for example, cardbus bridge drivers do this to
handle insertion/ejection/etc events.

I think that handling this at the bus driver level is a much cleaner
solution.

- Thomas

-- 
Thomas Moestl <[EMAIL PROTECTED]>   http://www.tu-bs.de/~y0015675/
  <[EMAIL PROTECTED]>   http://people.FreeBSD.org/~tmm/
PGP fingerprint: 1C97 A604 2BD0 E492 51D0  9C0F 1FE6 4F1D 419C 776C
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: PLEASE REVIEW: Adding a pci_if method to facilitate specializedPCI bus drivers

2003-06-27 Thread Thomas Moestl
On Fri, 2003/06/27 at 13:37:00 -0400, John Baldwin wrote:
> On 13-Jun-2003 Thomas Moestl wrote:
> > This requires us to get this firmware property in the OFW PCI bus
> > driver before routing the interrupt; that can't be done in the pcib
> > route_interrupt method, since we don't know whether we are routing for
> > another bridge (where we use whichever index we get passed) or for a
> > child device (in which case we would need to look at the firmware
> > property).
> 
> Actually, can't you tell this by doing:
> 
> if (device_get_parent(device_get_parent(dev)) == pcib)
> /* Routing direct child. */
> else
> /* Routing descedent of a child bridge. */

No, pcib will always be a grandparent of dev. When routing a
descendant child bridge, dev will itself be the device_t of a bridge,
otherwise it is that of the device we are routing to.

- Thomas

-- 
Thomas Moestl <[EMAIL PROTECTED]>   http://www.tu-bs.de/~y0015675/
  <[EMAIL PROTECTED]>   http://people.FreeBSD.org/~tmm/
PGP fingerprint: 1C97 A604 2BD0 E492 51D0  9C0F 1FE6 4F1D 419C 776C
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


PLEASE REVIEW: Adding a pci_if method to facilitate specialized PCIbus drivers

2003-06-13 Thread Thomas Moestl
Hi,

I've attached a patch that adds a new pci_if method,
assign_interrupt, and makes the PCI code implement and use
it. This is private to the PCI bus and is only used internally (to
find an interrupt of a child device); it is a method so that derived
PCI bus drivers can override it.

This is very useful for the sparc64 OFW PCI bus driver which I will
commit soon, hopefully. On sparc64, there are some on-board devices
which have special interrupt lines. To route their interrupts, we need
not only know the device to route for, but also an interrupt index
which is stored in the firmware for this device, and which is used to
route on bridges instead of the intpin (in other cases, there's even
the complete interrupt number stored there; for devices in PCI slots,
we (usually) can use the intpin).

This requires us to get this firmware property in the OFW PCI bus
driver before routing the interrupt; that can't be done in the pcib
route_interrupt method, since we don't know whether we are routing for
another bridge (where we use whichever index we get passed) or for a
child device (in which case we would need to look at the firmware
property). Currently, the OFW PCI bus driver solves this by providing
a stub version of the alloc_resource method, trapping the cases where
rerouting is required and doing it before calling the generic PCI bus
routine, but that leads to code duplication and is generally a bit
ugly, as we are second-guessing where the PCI bus routines will want
to route an interrupt.

By moving the actual routing in a method which can be overridden, this
can be solved cleanly, while not complicating the generic PCI bus code
by much.

If there are no objections, I would like to commit this soonish.

Thoughts?

Thanks,
    - Thomas

-- 
Thomas Moestl <[EMAIL PROTECTED]>   http://www.tu-bs.de/~y0015675/
  <[EMAIL PROTECTED]>   http://people.FreeBSD.org/~tmm/
PGP fingerprint: 1C97 A604 2BD0 E492 51D0  9C0F 1FE6 4F1D 419C 776C
Index: dev/acpica/acpi_pci.c
===
RCS file: /vol/ncvs/src/sys/dev/acpica/acpi_pci.c,v
retrieving revision 1.3
diff -u -r1.3 acpi_pci.c
--- dev/acpica/acpi_pci.c   17 Feb 2003 21:20:34 -  1.3
+++ dev/acpica/acpi_pci.c   13 Jun 2003 20:57:13 -
@@ -111,6 +111,7 @@
/* XXX: We should override these two. */
DEVMETHOD(pci_get_powerstate,   pci_get_powerstate_method),
DEVMETHOD(pci_set_powerstate,   pci_set_powerstate_method),
+   DEVMETHOD(pci_assign_interrupt, pci_assign_interrupt_method),
 
{ 0, 0 }
 };
Index: dev/cardbus/cardbus.c
===
RCS file: /vol/ncvs/src/sys/dev/cardbus/cardbus.c,v
retrieving revision 1.38
diff -u -r1.38 cardbus.c
--- dev/cardbus/cardbus.c   18 Feb 2003 21:24:00 -  1.38
+++ dev/cardbus/cardbus.c   13 Jun 2003 21:51:29 -
@@ -387,6 +387,7 @@
DEVMETHOD(pci_disable_io,   pci_disable_io_method),
DEVMETHOD(pci_get_powerstate,   pci_get_powerstate_method),
DEVMETHOD(pci_set_powerstate,   pci_set_powerstate_method),
+   DEVMETHOD(pci_assign_interrupt, pci_assign_interrupt_method),
 
{0,0}
 };
Index: dev/pci/pci.c
===
RCS file: /vol/ncvs/src/sys/dev/pci/pci.c,v
retrieving revision 1.219
diff -u -r1.219 pci.c
--- dev/pci/pci.c   9 Jun 2003 18:08:46 -   1.219
+++ dev/pci/pci.c   13 Jun 2003 21:36:01 -
@@ -70,7 +70,8 @@
 static int pci_memen(device_t pcib, int b, int s, int f);
 static int pci_add_map(device_t pcib, int b, int s, int f, int reg, 
struct resource_list *rl);
-static voidpci_add_resources(device_t pcib, device_t dev);
+static voidpci_add_resources(device_t pcib, device_t bus,
+ device_t dev);
 static int pci_probe(device_t dev);
 static int pci_attach(device_t dev);
 static voidpci_load_vendor_data(void);
@@ -119,6 +120,7 @@
DEVMETHOD(pci_disable_io,   pci_disable_io_method),
DEVMETHOD(pci_get_powerstate,   pci_get_powerstate_method),
DEVMETHOD(pci_set_powerstate,   pci_set_powerstate_method),
+   DEVMETHOD(pci_assign_interrupt, pci_assign_interrupt_method),
 
{ 0, 0 }
 };
@@ -776,7 +778,7 @@
 }
 
 static void
-pci_add_resources(device_t pcib, device_t dev)
+pci_add_resources(device_t pcib, device_t bus, device_t dev)
 {
struct pci_devinfo *dinfo = device_get_ivars(dev);
pcicfgregs *cfg = &dinfo->cfg;
@@ -805,7 +807,7 @@
 * If the re-route fails, then just stick with what we
 * have.
 */
-   irq = PCIB_ROUTE_INTERRUPT(pcib, dev, cfg->intpin);
+   irq = PCI_ASSIGN_INTERRUPT(bus, dev);
 

Re: Locale errors on 5.0-CURRENT (sparc)

2003-01-03 Thread Thomas Moestl
On Fri, 2003/01/03 at 15:29:27 +0200, Peter Pentchev wrote:
> On Fri, Jan 03, 2003 at 02:31:41PM +0200, Peter Pentchev wrote:
> > FWIW, I can reproduce this on panther.FreeBSD.org, a sparc64 running
> > 5.0-CURRENT as of December 6, 2001.  All the other machines in the
> > FreeBSD cluster that I could test - ref5 (i386), pluto1 (IA-64), and
> > beast (alpha) - are able to set both the is_IS.ISO8859-1 and the
> > bg_BG.CP1251 locales correctly for both LC_ALL and LC_TIME.  Only the
> > sparc64 machine is having trouble setting LC_ALL for any locale I tried.
> 
> Actually, I seem to have found the reason: on panther.FreeBSD.org, there
> are no LC_CTYPE files for any locale, and setlocale(LC_ALL) attempts to
> load, well, *all* locale type definitions, including LC_CTYPE.  Thus, it
> would (and does) return NULL with errno set to ENOENT for all locales...
> 
> The LC_CTYPE files are not there, because there is an explicit test for
> the build architecture in src/share/Makefile, which excludes the
> mklocale directory for sparc64; it was introduced in rev. 1.27 by David
> O'Brien about 7 months ago.  David, what exactly was the 'bad juju'
> mentioned in the commit message, and is it still there?  A test run of
> 'cvs up share/mklocale && cd share/mklocale && make' on panther seemed
> to complete OK; I have not yet tested the resulting locale files though.

Yeah, I just noticed that, too. As far as I can tell, this is purely
historical, if I remember correctly mklocale did crash back then. I
will re-enable it.

- Thomas

-- 
Thomas Moestl <[EMAIL PROTECTED]> http://www.tu-bs.de/~y0015675/
  <[EMAIL PROTECTED]> http://people.FreeBSD.org/~tmm/
PGP fingerprint: 1C97 A604 2BD0 E492 51D0  9C0F 1FE6 4F1D 419C 776C

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: uiomove busted for the past 3 years

2002-01-30 Thread Thomas Moestl

On Wed, 2002/01/30 at 07:02:25 -0800, User JHB wrote:
> When the DEADLKTREAT flag was added, uiomove() was broken.  :)  The
> problem is that a return() inside of a switch nested inside of a while
> loop was converted to a break leading to the following rather silly code:
> 
> if (error)
> break;
> break;
> 
> What is supposed to happen is that if there is an error, we break out of
> the while loop, but all we do is break out of the switch ignoring the
> error and continuing to loop.  Thus, in the best case, if copyin or
> copyout failed on the last iteration of the loop, we would return an
> error, but would bogusly update the counts in the iovec and uio
> structures.  In the worst case, if we kept looping and later copyin's or
> copyout's succeeded, then we wouldn't return an error at all.  A quick fix
> would be to add a goto to jump to the error return code at the end of the
> loop rather than the bogus break if (error).  However, I can't test this
> at the moment.  Someone please verify and fix this. 

The attached test program triggers this bug (if the test file is
located on a file system that does not split up the uiomove() in that
case, e.g. FFS with a block size >= 2 * page size). When NIOV is changed
from 2 to 1, the writev() will result in EFAULT, however with two
iovecs the write returns the sum of the sizes of both iovecs although
the buffer supplied to the first is unaccessible.

The patch you proposed:

--- kern_subr.c~Fri Jan 25 02:14:45 2002
+++ kern_subr.c Wed Jan 30 20:39:07 2002
@@ -104,7 +104,7 @@
else
error = copyin(iov->iov_base, cp,
cnt);
if (error)
-   break;
+   goto out;
break;
 
case UIO_SYSSPACE:
@@ -123,6 +123,7 @@
cp += cnt;
n -= cnt;
}
+out:
if (td != curthread) printf("uiomove: IT CHANGED!");
td = curthread; /* Might things have changed in
copyin/copyout? */
if (td) {
--

fixes it for me.

- thomas

#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 

#define NIOV2

int
main()
{
struct iovec iov[2];
size_t psz;
char *a, *b;
int fd, rv, en;

psz = getpagesize();
a = mmap(NULL, psz, PROT_NONE, MAP_ANON, -1, 0);
if (a == MAP_FAILED)
err(1, "mmap 1");
b = mmap(a + psz, psz, PROT_READ, MAP_ANON | MAP_FIXED, -1, 0);
if (b == MAP_FAILED)
err(1, "mmap 2");
fd = open("uiob.foo", O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
if (fd == -1)
err(1, "open");
iov[0].iov_base = a;
iov[0].iov_len = psz;
iov[1].iov_base = b;
iov[1].iov_len = psz;
rv = writev(fd, iov, NIOV);
en = errno;
printf("writev returned %d\n", rv);
if (rv == -1)
printf("error: %s\n", strerror(en));
close(fd);
return (0);
}

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: contigfree, free what?

2001-10-12 Thread Thomas Moestl

On Fri, 2001/10/12 at 10:07:10 -0700, Matt Dillon wrote:
> 
> :Mark,
> :
> :> I also placed some checks on vm_map_delete
> :
> :I did that also, and as far as I understand everything works fine.
> :The only thing I found was the fact that when contigmalloc() grabs the
> :contig pages it sets the value of pga[i] (for i in allocated pages)
> :note that: vm_page_t pga = vm_page_array;
> :
> :Then contigfree() does a pretty good job, but does not reset the values
> :of pga[i] to pqtype == PQ_FREE (pqtype = pga[i].queue - pga[i].pc)
> :
> :So the next contigmalloc() requiring the same number of pages fails on
> :the previously released pages because they are not PQ_FREE
> :
> :The other thing that puzzled me is the fact that in vm_map_delete()
> :called by contgigfree() has a variable
> :...

I have also looked into this a while ago, but got stuck at some
point. I have just looked at it again, and I think I have found a solution.

> I think what is going on is that contigmalloc() is wiring the pages
> but placing them in a pageable container (entry->wired_count == 0),
> so when contigfree() kmem_free()'s the block the system does not know
> that it must unwire the pages.  This leaves the pages wired and prevents
> them from being freed.
> 
> I haven't found a quick and easy solution to the problem yet.  kmem_alloc()
> doesn't do what we want either.  I tried calling vm_map_pageable() in
> contigmalloc1() but it crashed the machine, so there might be something
> else going on as well.

This is probably because the map entries do have a NULL object
pointer. vm_map_pageable() calls vm_fault_wire(), so this will fail.

I have attached a patch which works for me. It duplicates most of the
logic of kmem_alloc in that it calls vm_map_findspace() first, then
vm_map_insert() (which basically is what is done in
kmem_alloc_pageable() too, but here, kernel_object is passed instead
of a NULL pointer, so that the map entry will have a valid object
pointer). Then, the pages are inserted into the object as before, and
finally, the map entries are marked as wired by using
vm_map_pageable(). Because this will also call vm_fault_wire(), which
will among other things do a vm_page_wire(), contigmalloc does not
need to wire the pages itself. 

The pmap_kenter() calls can also be reomved, since the pages will be
mapped in any case by vm_fault(). 

- thomas


--- vm_contig.c.origFri Oct 12 20:05:09 2001
+++ vm_contig.c Fri Oct 12 20:44:03 2001
@@ -76,6 +76,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -232,7 +234,6 @@
m->busy = 0;
m->queue = PQ_NONE;
m->object = NULL;
-   vm_page_wire(m);
}
 
/*
@@ -240,24 +241,31 @@
 * Allocate kernel VM, unfree and assign the physical pages to it and
 * return kernel VM pointer.
 */
-   tmp_addr = addr = kmem_alloc_pageable(map, size);
-   if (addr == 0) {
+   vm_map_lock(map);
+   if (vm_map_findspace(map, vm_map_min(map), size, &addr) !=
+   KERN_SUCCESS) {
/*
 * XXX We almost never run out of kernel virtual
 * space, so we don't make the allocated memory
 * above available.
 */
+   vm_map_unlock(map);
splx(s);
return (NULL);
}
+   vm_object_reference(kernel_object);
+   vm_map_insert(map, kernel_object, addr - VM_MIN_KERNEL_ADDRESS,
+   addr, addr + size, VM_PROT_ALL, VM_PROT_ALL, 0);
+   vm_map_unlock(map);
 
+   tmp_addr = addr;
for (i = start; i < (start + size / PAGE_SIZE); i++) {
vm_page_t m = &pga[i];
vm_page_insert(m, kernel_object,
OFF_TO_IDX(tmp_addr - VM_MIN_KERNEL_ADDRESS));
-   pmap_kenter(tmp_addr, VM_PAGE_TO_PHYS(m));
tmp_addr += PAGE_SIZE;
}
+   vm_map_pageable(map, addr, addr + size, FALSE);
 
splx(s);
return ((void *)addr);



[PATCH REVIEW] zdestroy() for the zone allocator (and small nfs patch)

2001-07-22 Thread Thomas Moestl

Hi,

I've attached a patch that adds a functions for zone destruction to
the kernel zone allocator, which is needed to properly support the
unload case for modules that create zones (such as nfs.ko).
As a first application of this, it also patches the relevant nfs code
to destroy the internal nfs zones on module unload (failing to do so 
can cause substantial resource leaks and crashes).

I'd like to commit this, but I think it would be good to get some
additional review before, especially on the part that deals with
ZONE_INTERRUPT zones (which need somewhat lower level vm handling
than the rest).
The handling of regular zones involves some magic (and is probably a
little gross), but it works without having to maintain additional
state information.

So, to all who are interested in this, could you please review and
comment it? 

Thanks,
- thomas


Index: nfs/nfs.h
===
RCS file: /home/ncvs/src/sys/nfs/nfs.h,v
retrieving revision 1.59
diff -u -r1.59 nfs.h
--- nfs/nfs.h   2001/04/17 20:45:21 1.59
+++ nfs/nfs.h   2001/07/21 14:40:59
@@ -633,6 +633,7 @@
  struct mbuf *));
 intnfs_adv __P((struct mbuf **, caddr_t *, int, int));
 void   nfs_nhinit __P((void));
+void   nfs_nhuninit __P((void));
 void   nfs_timer __P((void*));
 intnfsrv_dorec __P((struct nfssvc_sock *, struct nfsd *, 
 struct nfsrv_descript **));
Index: nfs/nfs_node.c
===
RCS file: /home/ncvs/src/sys/nfs/nfs_node.c,v
retrieving revision 1.49
diff -u -r1.49 nfs_node.c
--- nfs/nfs_node.c  2001/05/01 08:13:14 1.49
+++ nfs/nfs_node.c  2001/07/22 12:57:46
@@ -154,6 +154,12 @@
nfsnodehashtbl = hashinit(desiredvnodes, M_NFSHASH, &nfsnodehash);
 }
 
+void
+nfs_nhuninit()
+{
+   zdestroy(nfsnode_zone);
+}
+
 /*
  * Look up a vnode/nfsnode by file handle.
  * Callers must check for mount points!!
Index: nfs/nfs_subs.c
===
RCS file: /home/ncvs/src/sys/nfs/nfs_subs.c,v
retrieving revision 1.103
diff -u -r1.103 nfs_subs.c
--- nfs/nfs_subs.c  2001/07/04 16:20:16 1.103
+++ nfs/nfs_subs.c  2001/07/22 01:48:05
@@ -1185,6 +1185,8 @@
lease_updatetime = nfs_prev_lease_updatetime;
sysent[SYS_nfssvc].sy_narg = nfs_prev_nfssvc_sy_narg;
sysent[SYS_nfssvc].sy_call = nfs_prev_nfssvc_sy_call;
+   nfs_nhuninit();
+   zdestroy(nfsmount_zone);
return (0);
 }
 
Index: vm/vm_zone.c
===
RCS file: /home/ncvs/src/sys/vm/vm_zone.c,v
retrieving revision 1.46
diff -u -r1.46 vm_zone.c
--- vm/vm_zone.c2001/07/09 03:37:33 1.46
+++ vm/vm_zone.c2001/07/22 16:18:49
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -253,6 +254,111 @@
 }
 
 /*
+ * Destroy a zone, freeing the allocated memory.
+ * This does not do any locking for the zone; make sure it is not used
+ * any more before calling. All zalloc()'ated memory in the zone must have
+ * been zfree()'d.
+ * zdestroy() may not be used with zbootinit()'ed zones.
+ */
+void
+zdestroy(vm_zone_t z)
+{
+   int i, nitems, nbytes;
+   void *item, *min, **itp;
+   vm_map_t map;
+   vm_map_entry_t entry;
+   vm_object_t obj;
+   vm_pindex_t pindex;
+   vm_prot_t prot;
+   boolean_t wired;
+
+   GIANT_REQUIRED;
+   KASSERT(z != NULL, ("invalid zone"));
+   /*
+* This is needed, or the algorithm used for non-interrupt zones will
+* blow up badly.
+*/
+   KASSERT(z->ztotal == z->zfreecnt,
+   ("zdestroy() used with an active zone"));
+   KASSERT((z->zflags & ZONE_BOOT) == 0,
+   ("zdestroy() used with a zbootinit()'ed zone"))
+
+   if (z->zflags & ZONE_INTERRUPT) {
+   kmem_free(kernel_map, z->zkva, z->zpagemax * PAGE_SIZE);
+   vm_object_deallocate(z->zobj);
+   atomic_subtract_int(&zone_kmem_kvaspace,
+   z->zpagemax * PAGE_SIZE);
+   atomic_subtract_int(&zone_kmem_pages,
+   z->zpagecount);
+   } else {
+   /*
+* This is evil h0h0 magic:
+* The items may be in z->zitems in a random oder; we have to
+* free the start of an allocated area, but do not want to save
+* extra information. Additionally, we may not access items that
+* were in a freed area.
+* This is achieved in the following way: the smallest address
+* is selected, and, after removing all items that are in a
+* range of z->zalloc * PAGE_SIZE (one allocation unit) from
+* it, kmem_free is called on it (since it is the smallest one,
+* it must be the start of an area). This is rep

libdevstat interface changes

2001-07-18 Thread Thomas Moestl

Hi,

I want to add support for reading crash dumps to libdevstat. This will
allow iostat and vmstat to fully work on crashdumps (with some
additional patches).
To give this a reasonable interface, a kvm handle needs to be passed
to getnumdevs, getgeneration, getversion, checkversion and getdevs.
A NULL argument will cause the old behaviour (sysctls will be used to
get the data).
Additionally, programs that link with libkdevstat will need to link
against libkvm.
Because of that, the library version number will need to be bumped.
While we are doing that, Kenneth and I think that the library
function names should all get a descriptive prefix ("devstat_"), so
that devstat library calls can easily be recognized in the sources.

Compatability functions will be provided under the old names.
Because the interface of the library is relatively small, this will
hopefully not add much overhead. The old interface will be marked as
deprecated and may be removed some day.
A library dependency to libkvm will be added, so that dynamically
linked executables can use the new library without having to
explicitely link to libkvm.
This way, we should hopefully be able to avoid breakage of
libdevstat-using programs outside of src/ (there are some in the
ports, for example).

Does anyone spot style or interface issues or compatability pitfalls
with this that I have overlooked? If not, this will probably be
committed relatively soon together with the new functions that Kenneth
has posted on -audit. 

Thanks,
- thomas

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Can someone verify this?

2001-07-11 Thread Thomas Moestl

On Wed, 2001/07/11 at 11:04:56 -0700, Matt Dillon wrote:
> 
> :
> :Thomas Moestl wrote:
> :> 
> :> If it was an audio CD you were trying to mount: this is a known
> :> problem. The attached patch fixes it for me by disallowing reading of
> :...
> 
> Thomas, that looks like a good thing to commit to me, are you going
> to commit it?  Maybe also MFC it to stable?

I've passed it to sos for review, since it's his driver. If he has no
objections, I intend to commit it and MFC it later.

- thomas

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Please review: bugfix for vinvalbuf()

2001-07-10 Thread Thomas Moestl

Hi,

I've just tripped over an obviously long-standing (since about
Jan. 1998) bug in vinvalbuf while looking into PR kern/26224. The
problematic code looks like (on -CURRENT):

/*
 * Destroy the copy in the VM cache, too.
 */
mtx_lock(&vp->v_interlock);
if (VOP_GETVOBJECT(vp, &object) == 0) {
vm_object_page_remove(object, 0, 0,
(flags & V_SAVE) ? TRUE : FALSE);
}
mtx_unlock(&vp->v_interlock);

The locks seem to be needed for file systems that don't perform real
locking (on -STABLE, they are simplelocks).
This, however, is incorrect because vm_object_page_remove may sleep.
I've attached a patch that passes the interlock to
vm_object_page_remove, which in turn passes it to a modified version
of vm_page_sleep, which unlocks it around the sleep.
I think that this is correct, because the object should be in a valid
state when we sleep (and all checks are reexecuted in that case).

Since I'm not very experienced with vfs and vm stuff, I'd be glad if
this patch could get some review. In particular, is the lock/unlock
pair really still needed, and are there notable differeces in -STABLE
(because the fix would need the be MFC'ed)?

Thanks,
- thomas


Index: kern/vfs_subr.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.315
diff -u -r1.315 vfs_subr.c
--- kern/vfs_subr.c 2001/07/04 16:20:13 1.315
+++ kern/vfs_subr.c 2001/07/10 19:45:41
@@ -800,7 +800,7 @@
mtx_lock(&vp->v_interlock);
if (VOP_GETVOBJECT(vp, &object) == 0) {
vm_object_page_remove(object, 0, 0,
-   (flags & V_SAVE) ? TRUE : FALSE);
+   (flags & V_SAVE) ? TRUE : FALSE, &vp->v_interlock);
}
mtx_unlock(&vp->v_interlock);
 
Index: vm/vm_map.c
===
RCS file: /home/ncvs/src/sys/vm/vm_map.c,v
retrieving revision 1.206
diff -u -r1.206 vm_map.c
--- vm/vm_map.c 2001/07/04 20:15:16 1.206
+++ vm/vm_map.c 2001/07/10 20:02:41
@@ -1875,7 +1875,7 @@
vm_object_page_remove(object,
OFF_TO_IDX(offset),
OFF_TO_IDX(offset + size + PAGE_MASK),
-   FALSE);
+   FALSE, NULL);
}
VOP_UNLOCK(object->handle, 0, curproc);
vm_object_deallocate(object);
@@ -1991,7 +1991,7 @@
offidxend = offidxstart + count;
 
if ((object == kernel_object) || (object == kmem_object)) {
-   vm_object_page_remove(object, offidxstart, offidxend, FALSE);
+   vm_object_page_remove(object, offidxstart, offidxend, FALSE, 
+NULL);
} else {
pmap_remove(map->pmap, s, e);
if (object != NULL &&
@@ -1999,7 +1999,7 @@
(object->flags & (OBJ_NOSPLIT|OBJ_ONEMAPPING)) == 
OBJ_ONEMAPPING &&
(object->type == OBJT_DEFAULT || object->type == 
OBJT_SWAP)) {
vm_object_collapse(object);
-   vm_object_page_remove(object, offidxstart, offidxend, 
FALSE);
+   vm_object_page_remove(object, offidxstart, offidxend, 
+FALSE, NULL);
if (object->type == OBJT_SWAP) {
swap_pager_freespace(object, offidxstart, 
count);
}
@@ -2994,7 +2994,7 @@
/*
 * Remove unneeded old pages
 */
-   vm_object_page_remove(first_object, 0, 0, 0);
+   vm_object_page_remove(first_object, 0, 0, 0, NULL);
 
/*
 * Invalidate swap space
Index: vm/vm_object.c
===
RCS file: /home/ncvs/src/sys/vm/vm_object.c,v
retrieving revision 1.196
diff -u -r1.196 vm_object.c
--- vm/vm_object.c  2001/07/04 20:15:16 1.196
+++ vm/vm_object.c  2001/07/10 20:03:23
@@ -1438,7 +1438,7 @@
  * The object must be locked.
  */
 void
-vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end, 
boolean_t clean_only)
+vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end, 
+boolean_t clean_only, struct mtx *interlk)
 {
vm_page_t p, next;
unsigned int size;
@@ -1478,7 +1478,7 @@
 * interrupt -- minimize the spl transitions
 */
 
-   if (vm_page_sleep_busy(p, TRUE, "vmopar"))
+  

Re: Can someone verify this?

2001-07-10 Thread Thomas Moestl

On Sat, 2001/07/07 at 18:57:15 -0400, Paul Halliday wrote:
> FreeBSD dissent.p450.box 4.3-RC FreeBSD 4.3-RC #3: Sun Jun 10 22:27:47
> EDT 2001 [EMAIL PROTECTED]:/usr/src/sys/compile/workstation 
> i386
> 
> FreeBSD useless.dell.box 4.3-STABLE FreeBSD 4.3-STABLE #6: Fri Jul  6
> 18:57:08 EDT 2001
> [EMAIL PROTECTED]:/usr/src/sys/compile/useless  i386
> 
> mount /dev/acd0c /cdrom > should obviously fail, yet causes...
> 
> panic: vm -fault  on nofault entry, addr: c3e1e000
> 
> reboot.

> any ideas?

If it was an audio CD you were trying to mount: this is a known
problem. The attached patch fixes it for me by disallowing reading of
partial blocks; this could also be fixed by setting the buffer size
different from the transfer size in such a case.

- thomas


Index: dev/ata/atapi-cd.c
===
RCS file: /home/ncvs/src/sys/dev/ata/atapi-cd.c,v
retrieving revision 1.48.2.10
diff -u -r1.48.2.10 atapi-cd.c
--- dev/ata/atapi-cd.c  2001/02/25 21:35:20 1.48.2.10
+++ dev/ata/atapi-cd.c  2001/07/09 21:48:58
@@ -1126,9 +1126,7 @@
 /* reject all queued entries if media changed */
 if (cdp->atp->flags & ATAPI_F_MEDIA_CHANGED) {
bp->b_error = EIO;
-   bp->b_flags |= B_ERROR;
-   biodone(bp);
-   return;
+   goto failure;
 }
 
 bzero(ccb, sizeof(ccb));
@@ -1149,7 +1147,11 @@
lastlba = cdp->info.volsize;
 }
 
-count = (bp->b_bcount + (blocksize - 1)) / blocksize;
+if (bp->b_bcount % blocksize != 0) {
+   bp->b_error = EINVAL;
+   goto failure;
+}
+count = bp->b_bcount / blocksize;
 
 if (bp->b_flags & B_READ) {
/* if transfer goes beyond range adjust it to be within limits */
@@ -1191,6 +1193,11 @@
 
 atapi_queue_cmd(cdp->atp, ccb, bp->b_data, count * blocksize,
bp->b_flags & B_READ ? ATPR_F_READ : 0, 30, acd_done,bp);
+return;
+
+failure:
+bp->b_flags |= B_ERROR;
+biodone(bp);
 }
 
 static int 



robustness fix for SYSCTL_OUT

2001-02-13 Thread Thomas Moestl

Hi,

the following is from sys/kern/kern_sysctl.c:

static int
sysctl_old_kernel(struct sysctl_req *req, const void *p, size_t l)
{
size_t i = 0;

if (req->oldptr) {
i = l;
if (i > req->oldlen - req->oldidx)
i = req->oldlen - req->oldidx;
if (i > 0)
bcopy(p, (char *)req->oldptr + req->oldidx, i);
}
req->oldidx += l;
if (req->oldptr && i != l)
return (ENOMEM);
return (0);
}

oldidx and oldlen are both size_t (unsigned). If l happens to be larger
than (req->oldlen - req->oldidx), ENOMEM is returned correctly, but
req->oldidx is increased by the full length. If a buggy caller does
not react on the error and calls SYSCTL_OUT again (SYSCTL_OUT normally
causes sysctl_old_kernel() or sysctl_old_user, which has a similar bug,
to be called), oldidx will be greater than oldlen, and since both are
unsigned, the if test will fail, so we will bcopy outside of the buffer
and no longer return ENOMEM.

Not that this does not matter if SYSCTL_OUT is used correctly, but for
the sake of robustness, I think it should be fixed.
Currently, there is one place in the -CURRENT kernel (that I know of)
that actually gets hit by this bug. -STABLE seems fine.

I propose the attached fix. Could it please be reviewed and commited if
correct?

- thomas




*** sys.3/kern/kern_sysctl.cTue Feb 13 16:15:52 2001
--- sys/kern/kern_sysctl.c  Tue Feb 13 20:06:37 2001
***
*** 817,824 
  
if (req->oldptr) {
i = l;
!   if (i > req->oldlen - req->oldidx)
!   i = req->oldlen - req->oldidx;
if (i > 0)
bcopy(p, (char *)req->oldptr + req->oldidx, i);
}
--- 817,827 
  
if (req->oldptr) {
i = l;
!   if (req->oldlen <= req->oldidx)
!   i = 0;
!   else
!   if (i > req->oldlen - req->oldidx)
!   i = req->oldlen - req->oldidx;
if (i > 0)
bcopy(p, (char *)req->oldptr + req->oldidx, i);
}
***
*** 914,921 
}
if (req->oldptr) {
i = l;
!   if (i > req->oldlen - req->oldidx)
!   i = req->oldlen - req->oldidx;
if (i > 0)
error = copyout(p, (char *)req->oldptr + req->oldidx,
i);
--- 917,927 
}
if (req->oldptr) {
i = l;
!   if (req->oldlen <= req->oldidx)
!   i = 0;
!   else
!   if (i > req->oldlen - req->oldidx)
!   i = req->oldlen - req->oldidx;
if (i > 0)
error = copyout(p, (char *)req->oldptr + req->oldidx,
i);



Re: removing setgid kmem from top, collecting per-device swap stats

2001-02-01 Thread Thomas Moestl

On Fri, Feb 02, 2001 at 02:28:29AM +0900, Hajimu UMEMOTO wrote:
> >>>>> On Thu, 1 Feb 2001 17:11:35 +
> >>>>> Tony Finch <[EMAIL PROTECTED]> said:
> 
> dot> Thomas Moestl <[EMAIL PROTECTED]> wrote:
> >
> >Most kmem_read calls are easy to replace (the variables are already 
> >exported as sysctls), the only exception is nextproc (for which I might
> >add a sysctl, or just leave it out [anyone out there who needs the 
> >lastpid display?]).
> 
> dot> It's useful for seeing how fast the machine is forking.
> I beleive it's not meaningful if randompid is enabled.  
It is. It gives the greatest pid used up to now. The next pid is 
determined by adding 1 (and randomness in the randompid case) to it.
This is why I can export it without defeating randompid.

Of course, it would still be better to track the sysctls you named for
this, but people seem to like the lastpid dispay, so I won't remove it
for now.

- thomas




To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: removing setgid kmem from top, collecting per-device swap stats

2001-01-30 Thread Thomas Moestl

On Tue, Jan 30, 2001 at 04:17:55PM -0800, Matt Dillon wrote:
> 
> : VOP_SWAPACCOUNT(nbp->b_vp, nbp);
> : BUF_KERNPROC(nbp);
> : BUF_STRATEGY(nbp);
> :
> :Now, I have to define the vop in vm_swap.c, where I can get the area
> :index from the block number in a clean way and frob my counters.
> :The same method could be used in swap_pager_putpages(). 
> :swap_pager_reserve() would need to construct a buf and act in a similar
> :way as e.g. swap_pager_strategy(), only that the buffer is not flushed
> :afterwards, only VOP_SWAPACCOUNT is called. Of course, this can be
> :done much easier by just passing the block number to VOP_SWAPACCOUNT,
> :and calling it for each blk returned by swp_pager_getswapspace()
> :(it costs virtually nothing, so it should be OK).
> :So, VOP_SWAPACCOUNT would become:
> : VOP_SWAPACCOUNT(nbp->b_vp, nbp->b_blkno, nbp->bcount);
> :
> :Or did I get things wrong?
> :
> : - thomas
> 
> Hmm.  I think it would be easier to figure it out in the swap
> allocation and free code.  Specifically, look in vm/swap_pager.c:
> 
> * The swp_pager_getswapspace() function
> * The swp_pager_freeswapspace() function
> 
> I think these are the best places to keep track of per-swap-area
> allocation and frees.  Note that the system tracks overall swap
> useage in these routines as well (the vm_swap_size global).
Yep. I only thought I should keep the layering by using VOPs. But you 
are right, swap_pager and vm_swap are dependent anyway, so this can
be justified, and things are not duplicated over the whole file. 
So, is it OK with you if I do it the way you described?

- thomas



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: removing setgid kmem from top, collecting per-device swap stats

2001-01-30 Thread Thomas Moestl

On Tue, Jan 30, 2001 at 03:21:38PM -0800, Matt Dillon wrote:
> :vm.swapdev1.total  (this is the one that is currently hard to get)
> You can't move swapinfo into the kernel as a sysctl unless you
> solve this problem.  Traversing the radix tree is expensive enough
> that the entire system will stall for a short period of time if you run
> the loop in the kernel.
Yes. I did not have that in mind.

> The issue here is that swapinfo tries to break the useage down by
> swap area, whereas the kernel has no real concept of swap areas
> in the allocation map -- it just sees one big contiguous allocation
> map.  So the kernel does not track allocations on a per-swap-area
> basis.
My thought was the following: add a swapaccount vnode operation, and
then do (in flushchainbuf):
VOP_SWAPACCOUNT(nbp->b_vp, nbp);
BUF_KERNPROC(nbp);
BUF_STRATEGY(nbp);

Now, I have to define the vop in vm_swap.c, where I can get the area
index from the block number in a clean way and frob my counters.
The same method could be used in swap_pager_putpages(). 
swap_pager_reserve() would need to construct a buf and act in a similar
way as e.g. swap_pager_strategy(), only that the buffer is not flushed
afterwards, only VOP_SWAPACCOUNT is called. Of course, this can be
done much easier by just passing the block number to VOP_SWAPACCOUNT,
and calling it for each blk returned by swp_pager_getswapspace()
(it costs virtually nothing, so it should be OK).
So, VOP_SWAPACCOUNT would become:
VOP_SWAPACCOUNT(nbp->b_vp, nbp->b_blkno, nbp->bcount);

Or did I get things wrong?

- thomas



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



removing setgid kmem from top, collecting per-device swap stats

2001-01-30 Thread Thomas Moestl

Hi,

I'm going to start working on making top work without setgid kmem. 

Is anyone already doing this? I don't want to waste my time...

Most kmem_read calls are easy to replace (the variables are already 
exported as sysctls), the only exception is nextproc (for which I might
add a sysctl, or just leave it out [anyone out there who needs the 
lastpid display?]).

Another thing I will need to do is to make kvm_getswapinfo use sysctls
when using on a "live" kernel (using a similar hack as in kvm_getprocs).
This will also allow to take the setgid kmem bit from swapinfo, while
it can still be used (along with pstat) on crash dumps.
This conversion is more complex. Is there any reason for going through
significant pains to collect swap statistics in userland instead of
doing this in the kernel?

>From what I can see in vm_swap.c and swap_pager.c, this seems not
to be too hard. To correctly handle layering, I would need to define a
new vnode op for statistics. 

I would propose as sysctl hierarchy like
vm.nswapdev(number of devices)
vm.swapdev1.dev
vm.swapdev1.total  (this is the one that is currently hard to get)
vm.swapdev1.used
vm.swapdev1.flags
vm.swapdev2...

This would also be much easier for the "non-live" case because we
would keep the statistics in memory, right for kvm to read.

Please correct me if I'm totally wrong somewhere... otherwise, I will
just start doing it like this, so don't let me run into my doom ;-)

- thomas




To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: natd bug

2000-12-01 Thread Thomas Moestl

On Thu, Nov 30, 2000 at 08:25:15PM +0100, Frederik Meerwaldt wrote:
> I compiled my kernel with IPDIVERT IPFIREWALL and
> IPFIREWALL_DEFAULT_TO_ACCEPT and I set up only one rule:
> ipfw add divert natd all from any to any via isp0
> Then I started natd (at boot time):
> natd -unregistered_only -dynamic -n isp0
> But when a package arrives (doesn't matter from localhost or another
> host), natd gives out a kernel message:
> 
> Nov 30 15:03:06 server natd[195]: failed to write packet back (Permission
> denied)
Is your link up at that time? The usual setup for a sppp device using dynamic
ip's is an invalid ip (0.0.0.0) that is changed once an ip was assigned. So, if 
you are not dialled in, the invalid ip will be put in by natd, and that usually
causes this error message.

- Thomas


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: [PATCH] Fix for panics in lookup() after forced unmount

2000-11-29 Thread Thomas Moestl

X
On Mon, Nov 27, 2000 at 11:04:37AM +0600, Boris Popov wrote:
> On Sun, 26 Nov 2000, Thomas Moestl wrote:
> 
> > Actually, the panic will occur after a simple forced unmount of the current
> > working directory and subsequent try to access "..". This is because the
> > vnode of the cwd was cleared and it's v_mount member was set to NULL. This
> > member is however dereferenced in the handling for the ".." special case in
> > lookup(), causing a panic.
> > [...]
> 
>   Good work Thomas! I think patch can be committed as is.
Could someone please commit this? Or shall I rather file a PR?

Thanks,
- Thomas


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: [PATCH] Fix for panics in lookup() after forced unmount

2000-11-26 Thread Thomas Moestl

> :I think I have a sufficient fix for PR kern/19572. Could somebody please
> :Review/Comment this?
> :To quote:
> 
> Looks reasonable to me!  It certainly can't make things worse :-)
Could someone then please look into committing this?

TIA,
- Thomas


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



[PATCH] Fix for panics in lookup() after forced unmount

2000-11-26 Thread Thomas Moestl

Hi,

I think I have a sufficient fix for PR kern/19572. Could somebody please
Review/Comment this?
To quote:

> Description
>  Executing cd ../cdrom from /cdrom directory after cycle of mount-umount-
>  mount cycle causes trap 12 (page fault while in kernel mode) and hence
>  causes kernel panic.
> How-To-Repeat
>  chdir to /cdrom (which) is default mount point for cdrom as per fstab.
>  mount /cdrom and do ls
>  now umount -f /cdrom (force as we will be /cdrom)
>  doing ls will give error( . not a directory)
>  do mount /cdrom to mount the cdrom once again (we are still in /cdrom dir)
>  now do ls will give error once again
>  now do cd ../ (tab in case of bash or esc in case of csh) to do file
>  completion.
>  This will result in trap 12 (page fault in kernel mode) and thus
>  results in kernel panic.

Actually, the panic will occur after a simple forced unmount of the current
working directory and subsequent try to access "..". This is because the
vnode of the cwd was cleared and it's v_mount member was set to NULL. This
member is however dereferenced in the handling for the ".." special case in
lookup(), causing a panic.
The fix is rather trivial, just check the member before using it and return
an appropriate error. In the following patch, I use EBADF. 
Btw, after taking a look into the OpenBSD and NetBSD repos, I think they
might have the same problem. Is there any standard channel to pass bug
reports to them from FreeBSD, or should I just use the normal submit 
procedure?

- Thomas

The patch:

*** old/kern/vfs_lookup.c   Sat Nov 25 23:55:39 2000
--- new/kern/vfs_lookup.c   Sun Nov 26 00:58:15 2000
***
*** 403,408 
--- 403,412 
if ((dp->v_flag & VROOT) == 0 ||
(cnp->cn_flags & NOCROSSMOUNT))
break;
+   if (dp->v_mount == NULL) {  /* forced unmount */
+   error = EBADF;
+   goto bad;
+   }
tdp = dp;
dp = dp->v_mount->mnt_vnodecovered;
vput(tdp);
***
*** 424,430 
printf("not found\n");
  #endif
if ((error == ENOENT) &&
!   (dp->v_flag & VROOT) &&
(dp->v_mount->mnt_flag & MNT_UNION)) {
tdp = dp;
dp = dp->v_mount->mnt_vnodecovered;
--- 428,434 
printf("not found\n");
  #endif
if ((error == ENOENT) &&
!   (dp->v_flag & VROOT) && (dp->v_mount != NULL) &&
(dp->v_mount->mnt_flag & MNT_UNION)) {
tdp = dp;
dp = dp->v_mount->mnt_vnodecovered;
***
*** 502,507 
--- 506,517 
((cnp->cn_flags & FOLLOW) || trailing_slash ||
 *ndp->ni_next == '/')) {
cnp->cn_flags |= ISSYMLINK;
+   if (dp->v_mount == NULL) {
+   /* We can't know whether the directory was mounted with
+* NOSYMFOLLOW, so we can't follow safely. */
+   error = EBADF;
+   goto bad2;
+   }
if (dp->v_mount->mnt_flag & MNT_NOSYMFOLLOW) {
error = EACCES;
goto bad2;


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: Byte order?

2000-11-20 Thread Thomas Moestl

> I  know,  that  x86  is  big  endian  architecture
> but  simple  programm  like  this:
> 
> #include  
> #include  
> main  ()  {
>  /*  Are  we  little  or  big  endian?   From  Harbison&Steele.   */
>  union
>  {
>  long  l;
>  char  c[sizeof  (long)];
>  }  u;
>  u.l  =  1;
>  printf  ("Little  endian?  %s\n",  (u.c[sizeof  (long)  -  1]  ==  1)  ?  "yes"  :  
>"no");
> #if  BYTE_ORDER  ==  BIG_ENDIAN
>  printf("Big  endian\n");
> #elif  BYTE_ORDER  ==  LITTLE_ENDIAN
>  printf("Little  endian\n");
> #else
>  printf("Unknown\n");
> #endif
> }
> 
> Give  me  a  strange  result:
> Little  endian?  no
> Little  endian
This program gets it wrong. When the last byte of a long is set after the long was
set to 1, we have a big endian architecture (the "little" end is at the 4th byte,
so the "big end" is at the 1st byte).

> On  my  FreeBSD  4.2-BETA  BYTE_ORDER  =  LITTLE_ENDIAN!
> I`m  very  confused  and  some  programms  detect  my  machine  as  Little  Endian,  
>by
> example  freetds.
The x86 architecture _is_ little endian.

- Thomas


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: [PATCH] per-packet UDP source address selection for sendmsg

2000-11-11 Thread Thomas Moestl

> As this is my first kernel patch, could someone please give it a quick
> look before I submit it, so that I can be sure that it contains no stupid
> glitches? 
> [...]
> + bzero(&src, sizeof(src));
> + src.sin_family = AF_INET;
> + src.sin_port = inp->inp_lport;
> + src.sin_addr = *(struct in_addr 
>*)CMSG_DATA(cm);
> + bound = 1;
Except of course forgetting to set src.sin_len :(

Thomas


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



[PATCH] per-packet UDP source address selection for sendmsg

2000-11-11 Thread Thomas Moestl

Hi,

recently, I have written a kernel patch that makes sendmsg(2) honor 
IP_RECVDSTADDR ancillary data on UDP sockets, so that the source address can
be set on a per-packet basis. This is very useful for servers that need to
use UDP and want to answer with the same source address a query went to.
This is not always the case with vanilla sendto(2) in certain routing layouts.

As this is my first kernel patch, could someone please give it a quick
look before I submit it, so that I can be sure that it contains no stupid
glitches? Unfortunately, I had no chance to test this on -current (I'm just
building a -current box).

Thanks,
Thomas



*** netinet/udp_usrreq.c.oldThu Nov  9 19:05:13 2000
--- netinet/udp_usrreq.cThu Nov  9 19:23:54 2000
***
*** 642,680 
  {
register struct udpiphdr *ui;
register int len = m->m_pkthdr.len;
struct in_addr laddr;
struct sockaddr_in *sin;
!   int s = 0, error = 0;
! 
!   if (control)
!   m_freem(control);   /* XXX */
! 
if (len + sizeof(struct udpiphdr) > IP_MAXPACKET) {
error = EMSGSIZE;
goto release;
}
  
if (addr) {
sin = (struct sockaddr_in *)addr;
prison_remote_ip(p, 0, &sin->sin_addr.s_addr);
!   laddr = inp->inp_laddr;
if (inp->inp_faddr.s_addr != INADDR_ANY) {
error = EISCONN;
!   goto release;
}
/*
 * Must block input while temporarily connected.
 */
!   s = splnet();
error = in_pcbconnect(inp, addr, p);
!   if (error) {
!   splx(s);
!   goto release;
!   }
} else {
if (inp->inp_faddr.s_addr == INADDR_ANY) {
error = ENOTCONN;
!   goto release;
}
}
/*
--- 642,741 
  {
register struct udpiphdr *ui;
register int len = m->m_pkthdr.len;
+   register struct cmsghdr *cm = 0;
struct in_addr laddr;
struct sockaddr_in *sin;
!   struct sockaddr_in src;
!   int s = 0, error = 0, bound = 0, addrset = 0, fmbuf = 0;
!   
if (len + sizeof(struct udpiphdr) > IP_MAXPACKET) {
error = EMSGSIZE;
+   if (control)
+   m_freem(control);
goto release;
}
  
+   if (control) {
+   /*
+* XXX: Currently, we assume all the optional information is stored
+* in a single mbuf.
+*/
+   if (control->m_next) 
+   error = EINVAL;
+   else {
+   for (; control->m_len; control->m_data += ALIGN(cm->cmsg_len),
+control->m_len -= ALIGN(cm->cmsg_len)) {
+   cm = mtod(control, struct cmsghdr *);
+   if (cm->cmsg_len == 0 || cm->cmsg_len > 
+control->m_len) {
+   error = EINVAL;
+   break;
+   }
+   if (cm->cmsg_level != IPPROTO_IP)
+   continue;
+   
+   switch(cm->cmsg_type) {
+   case IP_RECVDSTADDR:
+   if (cm->cmsg_len != CMSG_LEN(sizeof(struct 
+in_addr))) {
+   error = EINVAL;
+   break;
+   }
+   laddr = inp->inp_laddr;
+   bzero(&src, sizeof(src));
+   src.sin_family = AF_INET;
+   src.sin_port = inp->inp_lport;
+   src.sin_addr = *(struct in_addr 
+*)CMSG_DATA(cm);
+   bound = 1;
+   s = splnet();
+   if (inp->inp_laddr.s_addr == INADDR_ANY && 
+inp->inp_lport == 0) {
+   /* This will check the address */
+   error = in_pcbbind(inp, (struct 
+sockaddr *)&src, p);
+   } else {
+   if (prison_ip(p, 0, 
+&src.sin_addr.s_addr)) {
+   error = EINVAL;
+   break;
+   }
+   if (ifa_ifwithaddr((struct sockaddr 
+*)&src) == 0) {
+