Re: HEADS UP! KSE needs more attention
On Sun, 2004/06/06 at 14:59:21 -0700, Kris Kennaway wrote: > On Sun, Jun 06, 2004 at 03:49:13PM -0600, Scott Long wrote: > > amd64 is approaching critical mass for tier-1. There are a number of > > developers that own amd64 hardware now, and a number of users who are > > asking about it on the mailing lists. Peter is finishing up the last > > blocking item for it (kld's) not including the observed KSE problems. > > It's very close and I _will_ hold up the release for it to get done. > > amd64 is the future of commodity computing and we aren't going to > > ignore it for 5-STABLE. > > amd64 has a bug with swapping - when something begins to access swap, > the entire system becomes almost entirely unresponsive (e.g. no mouse > response for up to 10 seconds) until it stops. Peter has some ideas > about it, but it's a serious enough bug that it forced me to stop > using amd64 as my desktop machine (hello, kde!). Hmmm, I have encountered a similar problem on sparc64 once; the reason was that vm_pageout_map_deactivate_pages() calls pmap_remove() for the range from the start to the end of the process's vm_map when a process is swapped out. Start and end are VM_MIN_ADDRESS and VM_MAXUSER_ADDRESS respectively, and on 64-bit architectures, that range is very large (128TB on ia64 if I'm not mistaken), so the iteration in pmap_remove() must be carefully designed to make as large steps as possible to avoid long run times (or to not iterate over the range at all if it becomes too large, which we did on sparc64). It seems that the amd64 version of pmap_remove() will essentially always iterate in 2MB (level 2 page table) steps, regardless of whether there is mapping for the respective level 2 table in the table levels above; that means that in the previously mentioned case, the outer loop will usually run for about 67 million iterations (the resident count guard may not be of much use here if a stack page is left at the very end of the address space). Since there are a few memory accesses needed in each iterations, that may already be the cause of such a delay. I have no hardware to test this, so all of the above is just a wild- assed guess; but maybe it is of use (and sorry for the spam if it is not). - Thomas -- Thomas Moestl <[EMAIL PROTECTED]> http://www.tu-bs.de/~y0015675/ <[EMAIL PROTECTED]> http://people.FreeBSD.org/~tmm/ "I realized that the purpose of writing is to inflate weak ideas, obscure poor reasoning, and inhibit clarity." -- Calvin and Hobbes ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: FreeBSD 5.2 v/s FreeBSD 4.9 MFLOPS performance (gcc3.3.3 v/s gcc2.9.5)
On Mon, 2004/02/16 at 19:11:16 +0100, Dag-Erling Smørgrav wrote: > Kris Kennaway <[EMAIL PROTECTED]> writes: > > On Mon, Feb 16, 2004 at 03:52:16AM -0800, Wes Peters wrote: > > > Should I commit this? > > What effect does it have on non-i386 architectures? > > It can't possibly hurt. If the stack is already aligned on a "better" > boundary (64 or 128 bytes), it is also aligned on a 32-byte boundary > since 64 and 128 are multiples of 32, and the patch is a no-op. If > only a 16-byte alignment is required, a 32-byte alignment wastes a > small amount of memory but does not hurt performance. I believe that > less-than-16 (and possibly even less-than-32) alignment is pessimal on > all platforms we support. Well, it misaligns stack_base on 64-bit architectures, for example (notice the "- 4", which is there to compensate for the fixup in kern_execve() that will subtract another sizeof(register_t)): vectp = (char **)(((vm_offset_t)vectp & ~(vm_offset_t)0x1F) - 4); It would by much better to be able to align the stack in exec_setregs(), like amd64, ia64, powerpc and sparc64 do. Unfortunately that would require changes to crt1, so it would pose a compatibility problem. - Thomas -- Thomas Moestl <[EMAIL PROTECTED]> http://www.tu-bs.de/~y0015675/ <[EMAIL PROTECTED]> http://people.FreeBSD.org/~tmm/ "Oh, great altar of passive entertainment... Bestow upon me thy discordant images at such speed as to render linear thought impossible!" -- Calvin and Hobbes ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: PLEASE REVIEW: Adding a pci_if method to facilitate specializedPCI bus drivers
On Fri, 2003/06/27 at 14:50:08 -0400, John Baldwin wrote: > On 27-Jun-2003 Thomas Moestl wrote: > > On Fri, 2003/06/27 at 14:01:45 -0400, John Baldwin wrote: > >> On 27-Jun-2003 Thomas Moestl wrote: > >> > On Fri, 2003/06/27 at 13:37:00 -0400, John Baldwin wrote: > >> >> On 13-Jun-2003 Thomas Moestl wrote: > >> >> > This requires us to get this firmware property in the OFW PCI bus > >> >> > driver before routing the interrupt; that can't be done in the pcib > >> >> > route_interrupt method, since we don't know whether we are routing for > >> >> > another bridge (where we use whichever index we get passed) or for a > >> >> > child device (in which case we would need to look at the firmware > >> >> > property). > >> >> > >> >> Actually, can't you tell this by doing: > >> >> > >> >> if (device_get_parent(device_get_parent(dev)) == pcib) > >> >> /* Routing direct child. */ > >> >> else > >> >> /* Routing descedent of a child bridge. */ > >> > > >> > No, pcib will always be a grandparent of dev. When routing a > > ^ for > >> > descendant child bridge, dev will itself be the device_t of a bridge, > > ^ of a (oops) > >> > otherwise it is that of the device we are routing to. > >> Doh, yes. :( Hmm, can you try something like this maybe: > >> > >> if (pci_get_class(dev) == PCIC_BRIDGE && > >> pci_get_subclass(dev) == PCIS_BRIDGE_PCI) > >> /* Routing across a child bridge. */ > >> else > >> /* Routing a direct child that is not a bridge. */ > > > > This leaves two possible problems: first, there are other types of > > bridges (we currently support PCI-ISA and PCI-EBus ones, cardbus might > > also work) for which we need to use PCIB_ROUTE_INTERRUPT(); that could > > likely be dealt with by not testing the subclass at all. > > More importantly, however, a bridge might want to allocate an > > interrupt for itself; for example, cardbus bridge drivers do this to > > handle insertion/ejection/etc events. > > > > I think that handling this at the bus driver level is a much cleaner > > solution. > > Then should we eliminate pcib_route_interrupt() and make it a PCI bus > method? I just think it is rather odd to require this interface to be > duplicated in two different places. This is not necessarily duplication; in the OFW_PCI code, the work to be done is divided up between the two. Where the routing has to happen depends a lot on the information available. If a mapping from device to interrupt is available, it can make sense to do the routing completely in the bus driver. If there are any bridges for which we lack such information, however, and we need to fall back to other routing methods (like swizzling), there might be a need to support the bridge routing methods. In such a scenario, the bus and bridge methods usually would be quite different, so there would be not much duplication. To give an example, OpenFirmware contains an 'interrupt' property for each PCI device which uses interrupts. This property can contain a full interrupt number, or an index which has to be mapped using the 'interrupt-map' properties of higher bridges. In some cases (but not always), these indices are actually intpins, and on some bridges we need to do a swizzle instead of a map lookup. We look up the 'interrupt' property of the device in the bus driver, and decide whether it is full interrupt number already, in which case we just use it. Otherwise, we let the parent bridge do the mapping as required (which in turn might need to resort to it's parent bridge, and so on). This scheme also allows us to interface with PCI bridge drivers which use their intpins, and for which we do not have (or need) OpenFirmware-specific drivers. Doing all of that in the bus method would be a layering violation, and would have ugly consequences (for example, different bridges need to handle different quirks). > Doing it in the bus might actually make life simpler. Right now > when using different tables for routing PCI interrupts on x86, I > have to write two different bridge drivers all > the time, one for host bridges and one for PCI-PCI bridges, thus we have > ACPI host-PCI and PCI-PCI bridge drivers, PCIBIOS $PIR host-PCI and PCI-PCI > bridge drivers and MPTable host-PCI and PCI-PCI bridge drivers. Being > able to just have an ACPI
Re: PLEASE REVIEW: Adding a pci_if method to facilitate specializedPCI bus drivers
On Fri, 2003/06/27 at 14:01:45 -0400, John Baldwin wrote: > On 27-Jun-2003 Thomas Moestl wrote: > > On Fri, 2003/06/27 at 13:37:00 -0400, John Baldwin wrote: > >> On 13-Jun-2003 Thomas Moestl wrote: > >> > This requires us to get this firmware property in the OFW PCI bus > >> > driver before routing the interrupt; that can't be done in the pcib > >> > route_interrupt method, since we don't know whether we are routing for > >> > another bridge (where we use whichever index we get passed) or for a > >> > child device (in which case we would need to look at the firmware > >> > property). > >> > >> Actually, can't you tell this by doing: > >> > >> if (device_get_parent(device_get_parent(dev)) == pcib) > >> /* Routing direct child. */ > >> else > >> /* Routing descedent of a child bridge. */ > > > > No, pcib will always be a grandparent of dev. When routing a ^ for > > descendant child bridge, dev will itself be the device_t of a bridge, ^ of a (oops) > > otherwise it is that of the device we are routing to. > Doh, yes. :( Hmm, can you try something like this maybe: > > if (pci_get_class(dev) == PCIC_BRIDGE && > pci_get_subclass(dev) == PCIS_BRIDGE_PCI) > /* Routing across a child bridge. */ > else > /* Routing a direct child that is not a bridge. */ This leaves two possible problems: first, there are other types of bridges (we currently support PCI-ISA and PCI-EBus ones, cardbus might also work) for which we need to use PCIB_ROUTE_INTERRUPT(); that could likely be dealt with by not testing the subclass at all. More importantly, however, a bridge might want to allocate an interrupt for itself; for example, cardbus bridge drivers do this to handle insertion/ejection/etc events. I think that handling this at the bus driver level is a much cleaner solution. - Thomas -- Thomas Moestl <[EMAIL PROTECTED]> http://www.tu-bs.de/~y0015675/ <[EMAIL PROTECTED]> http://people.FreeBSD.org/~tmm/ PGP fingerprint: 1C97 A604 2BD0 E492 51D0 9C0F 1FE6 4F1D 419C 776C ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: PLEASE REVIEW: Adding a pci_if method to facilitate specializedPCI bus drivers
On Fri, 2003/06/27 at 13:37:00 -0400, John Baldwin wrote: > On 13-Jun-2003 Thomas Moestl wrote: > > This requires us to get this firmware property in the OFW PCI bus > > driver before routing the interrupt; that can't be done in the pcib > > route_interrupt method, since we don't know whether we are routing for > > another bridge (where we use whichever index we get passed) or for a > > child device (in which case we would need to look at the firmware > > property). > > Actually, can't you tell this by doing: > > if (device_get_parent(device_get_parent(dev)) == pcib) > /* Routing direct child. */ > else > /* Routing descedent of a child bridge. */ No, pcib will always be a grandparent of dev. When routing a descendant child bridge, dev will itself be the device_t of a bridge, otherwise it is that of the device we are routing to. - Thomas -- Thomas Moestl <[EMAIL PROTECTED]> http://www.tu-bs.de/~y0015675/ <[EMAIL PROTECTED]> http://people.FreeBSD.org/~tmm/ PGP fingerprint: 1C97 A604 2BD0 E492 51D0 9C0F 1FE6 4F1D 419C 776C ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "[EMAIL PROTECTED]"
PLEASE REVIEW: Adding a pci_if method to facilitate specialized PCIbus drivers
Hi, I've attached a patch that adds a new pci_if method, assign_interrupt, and makes the PCI code implement and use it. This is private to the PCI bus and is only used internally (to find an interrupt of a child device); it is a method so that derived PCI bus drivers can override it. This is very useful for the sparc64 OFW PCI bus driver which I will commit soon, hopefully. On sparc64, there are some on-board devices which have special interrupt lines. To route their interrupts, we need not only know the device to route for, but also an interrupt index which is stored in the firmware for this device, and which is used to route on bridges instead of the intpin (in other cases, there's even the complete interrupt number stored there; for devices in PCI slots, we (usually) can use the intpin). This requires us to get this firmware property in the OFW PCI bus driver before routing the interrupt; that can't be done in the pcib route_interrupt method, since we don't know whether we are routing for another bridge (where we use whichever index we get passed) or for a child device (in which case we would need to look at the firmware property). Currently, the OFW PCI bus driver solves this by providing a stub version of the alloc_resource method, trapping the cases where rerouting is required and doing it before calling the generic PCI bus routine, but that leads to code duplication and is generally a bit ugly, as we are second-guessing where the PCI bus routines will want to route an interrupt. By moving the actual routing in a method which can be overridden, this can be solved cleanly, while not complicating the generic PCI bus code by much. If there are no objections, I would like to commit this soonish. Thoughts? Thanks, - Thomas -- Thomas Moestl <[EMAIL PROTECTED]> http://www.tu-bs.de/~y0015675/ <[EMAIL PROTECTED]> http://people.FreeBSD.org/~tmm/ PGP fingerprint: 1C97 A604 2BD0 E492 51D0 9C0F 1FE6 4F1D 419C 776C Index: dev/acpica/acpi_pci.c === RCS file: /vol/ncvs/src/sys/dev/acpica/acpi_pci.c,v retrieving revision 1.3 diff -u -r1.3 acpi_pci.c --- dev/acpica/acpi_pci.c 17 Feb 2003 21:20:34 - 1.3 +++ dev/acpica/acpi_pci.c 13 Jun 2003 20:57:13 - @@ -111,6 +111,7 @@ /* XXX: We should override these two. */ DEVMETHOD(pci_get_powerstate, pci_get_powerstate_method), DEVMETHOD(pci_set_powerstate, pci_set_powerstate_method), + DEVMETHOD(pci_assign_interrupt, pci_assign_interrupt_method), { 0, 0 } }; Index: dev/cardbus/cardbus.c === RCS file: /vol/ncvs/src/sys/dev/cardbus/cardbus.c,v retrieving revision 1.38 diff -u -r1.38 cardbus.c --- dev/cardbus/cardbus.c 18 Feb 2003 21:24:00 - 1.38 +++ dev/cardbus/cardbus.c 13 Jun 2003 21:51:29 - @@ -387,6 +387,7 @@ DEVMETHOD(pci_disable_io, pci_disable_io_method), DEVMETHOD(pci_get_powerstate, pci_get_powerstate_method), DEVMETHOD(pci_set_powerstate, pci_set_powerstate_method), + DEVMETHOD(pci_assign_interrupt, pci_assign_interrupt_method), {0,0} }; Index: dev/pci/pci.c === RCS file: /vol/ncvs/src/sys/dev/pci/pci.c,v retrieving revision 1.219 diff -u -r1.219 pci.c --- dev/pci/pci.c 9 Jun 2003 18:08:46 - 1.219 +++ dev/pci/pci.c 13 Jun 2003 21:36:01 - @@ -70,7 +70,8 @@ static int pci_memen(device_t pcib, int b, int s, int f); static int pci_add_map(device_t pcib, int b, int s, int f, int reg, struct resource_list *rl); -static voidpci_add_resources(device_t pcib, device_t dev); +static voidpci_add_resources(device_t pcib, device_t bus, + device_t dev); static int pci_probe(device_t dev); static int pci_attach(device_t dev); static voidpci_load_vendor_data(void); @@ -119,6 +120,7 @@ DEVMETHOD(pci_disable_io, pci_disable_io_method), DEVMETHOD(pci_get_powerstate, pci_get_powerstate_method), DEVMETHOD(pci_set_powerstate, pci_set_powerstate_method), + DEVMETHOD(pci_assign_interrupt, pci_assign_interrupt_method), { 0, 0 } }; @@ -776,7 +778,7 @@ } static void -pci_add_resources(device_t pcib, device_t dev) +pci_add_resources(device_t pcib, device_t bus, device_t dev) { struct pci_devinfo *dinfo = device_get_ivars(dev); pcicfgregs *cfg = &dinfo->cfg; @@ -805,7 +807,7 @@ * If the re-route fails, then just stick with what we * have. */ - irq = PCIB_ROUTE_INTERRUPT(pcib, dev, cfg->intpin); + irq = PCI_ASSIGN_INTERRUPT(bus, dev);
Re: Locale errors on 5.0-CURRENT (sparc)
On Fri, 2003/01/03 at 15:29:27 +0200, Peter Pentchev wrote: > On Fri, Jan 03, 2003 at 02:31:41PM +0200, Peter Pentchev wrote: > > FWIW, I can reproduce this on panther.FreeBSD.org, a sparc64 running > > 5.0-CURRENT as of December 6, 2001. All the other machines in the > > FreeBSD cluster that I could test - ref5 (i386), pluto1 (IA-64), and > > beast (alpha) - are able to set both the is_IS.ISO8859-1 and the > > bg_BG.CP1251 locales correctly for both LC_ALL and LC_TIME. Only the > > sparc64 machine is having trouble setting LC_ALL for any locale I tried. > > Actually, I seem to have found the reason: on panther.FreeBSD.org, there > are no LC_CTYPE files for any locale, and setlocale(LC_ALL) attempts to > load, well, *all* locale type definitions, including LC_CTYPE. Thus, it > would (and does) return NULL with errno set to ENOENT for all locales... > > The LC_CTYPE files are not there, because there is an explicit test for > the build architecture in src/share/Makefile, which excludes the > mklocale directory for sparc64; it was introduced in rev. 1.27 by David > O'Brien about 7 months ago. David, what exactly was the 'bad juju' > mentioned in the commit message, and is it still there? A test run of > 'cvs up share/mklocale && cd share/mklocale && make' on panther seemed > to complete OK; I have not yet tested the resulting locale files though. Yeah, I just noticed that, too. As far as I can tell, this is purely historical, if I remember correctly mklocale did crash back then. I will re-enable it. - Thomas -- Thomas Moestl <[EMAIL PROTECTED]> http://www.tu-bs.de/~y0015675/ <[EMAIL PROTECTED]> http://people.FreeBSD.org/~tmm/ PGP fingerprint: 1C97 A604 2BD0 E492 51D0 9C0F 1FE6 4F1D 419C 776C To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: uiomove busted for the past 3 years
On Wed, 2002/01/30 at 07:02:25 -0800, User JHB wrote: > When the DEADLKTREAT flag was added, uiomove() was broken. :) The > problem is that a return() inside of a switch nested inside of a while > loop was converted to a break leading to the following rather silly code: > > if (error) > break; > break; > > What is supposed to happen is that if there is an error, we break out of > the while loop, but all we do is break out of the switch ignoring the > error and continuing to loop. Thus, in the best case, if copyin or > copyout failed on the last iteration of the loop, we would return an > error, but would bogusly update the counts in the iovec and uio > structures. In the worst case, if we kept looping and later copyin's or > copyout's succeeded, then we wouldn't return an error at all. A quick fix > would be to add a goto to jump to the error return code at the end of the > loop rather than the bogus break if (error). However, I can't test this > at the moment. Someone please verify and fix this. The attached test program triggers this bug (if the test file is located on a file system that does not split up the uiomove() in that case, e.g. FFS with a block size >= 2 * page size). When NIOV is changed from 2 to 1, the writev() will result in EFAULT, however with two iovecs the write returns the sum of the sizes of both iovecs although the buffer supplied to the first is unaccessible. The patch you proposed: --- kern_subr.c~Fri Jan 25 02:14:45 2002 +++ kern_subr.c Wed Jan 30 20:39:07 2002 @@ -104,7 +104,7 @@ else error = copyin(iov->iov_base, cp, cnt); if (error) - break; + goto out; break; case UIO_SYSSPACE: @@ -123,6 +123,7 @@ cp += cnt; n -= cnt; } +out: if (td != curthread) printf("uiomove: IT CHANGED!"); td = curthread; /* Might things have changed in copyin/copyout? */ if (td) { -- fixes it for me. - thomas #include #include #include #include #include #include #include #include #include #include #define NIOV2 int main() { struct iovec iov[2]; size_t psz; char *a, *b; int fd, rv, en; psz = getpagesize(); a = mmap(NULL, psz, PROT_NONE, MAP_ANON, -1, 0); if (a == MAP_FAILED) err(1, "mmap 1"); b = mmap(a + psz, psz, PROT_READ, MAP_ANON | MAP_FIXED, -1, 0); if (b == MAP_FAILED) err(1, "mmap 2"); fd = open("uiob.foo", O_RDWR | O_CREAT, S_IRUSR | S_IWUSR); if (fd == -1) err(1, "open"); iov[0].iov_base = a; iov[0].iov_len = psz; iov[1].iov_base = b; iov[1].iov_len = psz; rv = writev(fd, iov, NIOV); en = errno; printf("writev returned %d\n", rv); if (rv == -1) printf("error: %s\n", strerror(en)); close(fd); return (0); } To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: contigfree, free what?
On Fri, 2001/10/12 at 10:07:10 -0700, Matt Dillon wrote: > > :Mark, > : > :> I also placed some checks on vm_map_delete > : > :I did that also, and as far as I understand everything works fine. > :The only thing I found was the fact that when contigmalloc() grabs the > :contig pages it sets the value of pga[i] (for i in allocated pages) > :note that: vm_page_t pga = vm_page_array; > : > :Then contigfree() does a pretty good job, but does not reset the values > :of pga[i] to pqtype == PQ_FREE (pqtype = pga[i].queue - pga[i].pc) > : > :So the next contigmalloc() requiring the same number of pages fails on > :the previously released pages because they are not PQ_FREE > : > :The other thing that puzzled me is the fact that in vm_map_delete() > :called by contgigfree() has a variable > :... I have also looked into this a while ago, but got stuck at some point. I have just looked at it again, and I think I have found a solution. > I think what is going on is that contigmalloc() is wiring the pages > but placing them in a pageable container (entry->wired_count == 0), > so when contigfree() kmem_free()'s the block the system does not know > that it must unwire the pages. This leaves the pages wired and prevents > them from being freed. > > I haven't found a quick and easy solution to the problem yet. kmem_alloc() > doesn't do what we want either. I tried calling vm_map_pageable() in > contigmalloc1() but it crashed the machine, so there might be something > else going on as well. This is probably because the map entries do have a NULL object pointer. vm_map_pageable() calls vm_fault_wire(), so this will fail. I have attached a patch which works for me. It duplicates most of the logic of kmem_alloc in that it calls vm_map_findspace() first, then vm_map_insert() (which basically is what is done in kmem_alloc_pageable() too, but here, kernel_object is passed instead of a NULL pointer, so that the map entry will have a valid object pointer). Then, the pages are inserted into the object as before, and finally, the map entries are marked as wired by using vm_map_pageable(). Because this will also call vm_fault_wire(), which will among other things do a vm_page_wire(), contigmalloc does not need to wire the pages itself. The pmap_kenter() calls can also be reomved, since the pages will be mapped in any case by vm_fault(). - thomas --- vm_contig.c.origFri Oct 12 20:05:09 2001 +++ vm_contig.c Fri Oct 12 20:44:03 2001 @@ -76,6 +76,8 @@ #include #include #include +#include +#include #include #include #include @@ -232,7 +234,6 @@ m->busy = 0; m->queue = PQ_NONE; m->object = NULL; - vm_page_wire(m); } /* @@ -240,24 +241,31 @@ * Allocate kernel VM, unfree and assign the physical pages to it and * return kernel VM pointer. */ - tmp_addr = addr = kmem_alloc_pageable(map, size); - if (addr == 0) { + vm_map_lock(map); + if (vm_map_findspace(map, vm_map_min(map), size, &addr) != + KERN_SUCCESS) { /* * XXX We almost never run out of kernel virtual * space, so we don't make the allocated memory * above available. */ + vm_map_unlock(map); splx(s); return (NULL); } + vm_object_reference(kernel_object); + vm_map_insert(map, kernel_object, addr - VM_MIN_KERNEL_ADDRESS, + addr, addr + size, VM_PROT_ALL, VM_PROT_ALL, 0); + vm_map_unlock(map); + tmp_addr = addr; for (i = start; i < (start + size / PAGE_SIZE); i++) { vm_page_t m = &pga[i]; vm_page_insert(m, kernel_object, OFF_TO_IDX(tmp_addr - VM_MIN_KERNEL_ADDRESS)); - pmap_kenter(tmp_addr, VM_PAGE_TO_PHYS(m)); tmp_addr += PAGE_SIZE; } + vm_map_pageable(map, addr, addr + size, FALSE); splx(s); return ((void *)addr);
[PATCH REVIEW] zdestroy() for the zone allocator (and small nfs patch)
Hi, I've attached a patch that adds a functions for zone destruction to the kernel zone allocator, which is needed to properly support the unload case for modules that create zones (such as nfs.ko). As a first application of this, it also patches the relevant nfs code to destroy the internal nfs zones on module unload (failing to do so can cause substantial resource leaks and crashes). I'd like to commit this, but I think it would be good to get some additional review before, especially on the part that deals with ZONE_INTERRUPT zones (which need somewhat lower level vm handling than the rest). The handling of regular zones involves some magic (and is probably a little gross), but it works without having to maintain additional state information. So, to all who are interested in this, could you please review and comment it? Thanks, - thomas Index: nfs/nfs.h === RCS file: /home/ncvs/src/sys/nfs/nfs.h,v retrieving revision 1.59 diff -u -r1.59 nfs.h --- nfs/nfs.h 2001/04/17 20:45:21 1.59 +++ nfs/nfs.h 2001/07/21 14:40:59 @@ -633,6 +633,7 @@ struct mbuf *)); intnfs_adv __P((struct mbuf **, caddr_t *, int, int)); void nfs_nhinit __P((void)); +void nfs_nhuninit __P((void)); void nfs_timer __P((void*)); intnfsrv_dorec __P((struct nfssvc_sock *, struct nfsd *, struct nfsrv_descript **)); Index: nfs/nfs_node.c === RCS file: /home/ncvs/src/sys/nfs/nfs_node.c,v retrieving revision 1.49 diff -u -r1.49 nfs_node.c --- nfs/nfs_node.c 2001/05/01 08:13:14 1.49 +++ nfs/nfs_node.c 2001/07/22 12:57:46 @@ -154,6 +154,12 @@ nfsnodehashtbl = hashinit(desiredvnodes, M_NFSHASH, &nfsnodehash); } +void +nfs_nhuninit() +{ + zdestroy(nfsnode_zone); +} + /* * Look up a vnode/nfsnode by file handle. * Callers must check for mount points!! Index: nfs/nfs_subs.c === RCS file: /home/ncvs/src/sys/nfs/nfs_subs.c,v retrieving revision 1.103 diff -u -r1.103 nfs_subs.c --- nfs/nfs_subs.c 2001/07/04 16:20:16 1.103 +++ nfs/nfs_subs.c 2001/07/22 01:48:05 @@ -1185,6 +1185,8 @@ lease_updatetime = nfs_prev_lease_updatetime; sysent[SYS_nfssvc].sy_narg = nfs_prev_nfssvc_sy_narg; sysent[SYS_nfssvc].sy_call = nfs_prev_nfssvc_sy_call; + nfs_nhuninit(); + zdestroy(nfsmount_zone); return (0); } Index: vm/vm_zone.c === RCS file: /home/ncvs/src/sys/vm/vm_zone.c,v retrieving revision 1.46 diff -u -r1.46 vm_zone.c --- vm/vm_zone.c2001/07/09 03:37:33 1.46 +++ vm/vm_zone.c2001/07/22 16:18:49 @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -253,6 +254,111 @@ } /* + * Destroy a zone, freeing the allocated memory. + * This does not do any locking for the zone; make sure it is not used + * any more before calling. All zalloc()'ated memory in the zone must have + * been zfree()'d. + * zdestroy() may not be used with zbootinit()'ed zones. + */ +void +zdestroy(vm_zone_t z) +{ + int i, nitems, nbytes; + void *item, *min, **itp; + vm_map_t map; + vm_map_entry_t entry; + vm_object_t obj; + vm_pindex_t pindex; + vm_prot_t prot; + boolean_t wired; + + GIANT_REQUIRED; + KASSERT(z != NULL, ("invalid zone")); + /* +* This is needed, or the algorithm used for non-interrupt zones will +* blow up badly. +*/ + KASSERT(z->ztotal == z->zfreecnt, + ("zdestroy() used with an active zone")); + KASSERT((z->zflags & ZONE_BOOT) == 0, + ("zdestroy() used with a zbootinit()'ed zone")) + + if (z->zflags & ZONE_INTERRUPT) { + kmem_free(kernel_map, z->zkva, z->zpagemax * PAGE_SIZE); + vm_object_deallocate(z->zobj); + atomic_subtract_int(&zone_kmem_kvaspace, + z->zpagemax * PAGE_SIZE); + atomic_subtract_int(&zone_kmem_pages, + z->zpagecount); + } else { + /* +* This is evil h0h0 magic: +* The items may be in z->zitems in a random oder; we have to +* free the start of an allocated area, but do not want to save +* extra information. Additionally, we may not access items that +* were in a freed area. +* This is achieved in the following way: the smallest address +* is selected, and, after removing all items that are in a +* range of z->zalloc * PAGE_SIZE (one allocation unit) from +* it, kmem_free is called on it (since it is the smallest one, +* it must be the start of an area). This is rep
libdevstat interface changes
Hi, I want to add support for reading crash dumps to libdevstat. This will allow iostat and vmstat to fully work on crashdumps (with some additional patches). To give this a reasonable interface, a kvm handle needs to be passed to getnumdevs, getgeneration, getversion, checkversion and getdevs. A NULL argument will cause the old behaviour (sysctls will be used to get the data). Additionally, programs that link with libkdevstat will need to link against libkvm. Because of that, the library version number will need to be bumped. While we are doing that, Kenneth and I think that the library function names should all get a descriptive prefix ("devstat_"), so that devstat library calls can easily be recognized in the sources. Compatability functions will be provided under the old names. Because the interface of the library is relatively small, this will hopefully not add much overhead. The old interface will be marked as deprecated and may be removed some day. A library dependency to libkvm will be added, so that dynamically linked executables can use the new library without having to explicitely link to libkvm. This way, we should hopefully be able to avoid breakage of libdevstat-using programs outside of src/ (there are some in the ports, for example). Does anyone spot style or interface issues or compatability pitfalls with this that I have overlooked? If not, this will probably be committed relatively soon together with the new functions that Kenneth has posted on -audit. Thanks, - thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Can someone verify this?
On Wed, 2001/07/11 at 11:04:56 -0700, Matt Dillon wrote: > > : > :Thomas Moestl wrote: > :> > :> If it was an audio CD you were trying to mount: this is a known > :> problem. The attached patch fixes it for me by disallowing reading of > :... > > Thomas, that looks like a good thing to commit to me, are you going > to commit it? Maybe also MFC it to stable? I've passed it to sos for review, since it's his driver. If he has no objections, I intend to commit it and MFC it later. - thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Please review: bugfix for vinvalbuf()
Hi, I've just tripped over an obviously long-standing (since about Jan. 1998) bug in vinvalbuf while looking into PR kern/26224. The problematic code looks like (on -CURRENT): /* * Destroy the copy in the VM cache, too. */ mtx_lock(&vp->v_interlock); if (VOP_GETVOBJECT(vp, &object) == 0) { vm_object_page_remove(object, 0, 0, (flags & V_SAVE) ? TRUE : FALSE); } mtx_unlock(&vp->v_interlock); The locks seem to be needed for file systems that don't perform real locking (on -STABLE, they are simplelocks). This, however, is incorrect because vm_object_page_remove may sleep. I've attached a patch that passes the interlock to vm_object_page_remove, which in turn passes it to a modified version of vm_page_sleep, which unlocks it around the sleep. I think that this is correct, because the object should be in a valid state when we sleep (and all checks are reexecuted in that case). Since I'm not very experienced with vfs and vm stuff, I'd be glad if this patch could get some review. In particular, is the lock/unlock pair really still needed, and are there notable differeces in -STABLE (because the fix would need the be MFC'ed)? Thanks, - thomas Index: kern/vfs_subr.c === RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v retrieving revision 1.315 diff -u -r1.315 vfs_subr.c --- kern/vfs_subr.c 2001/07/04 16:20:13 1.315 +++ kern/vfs_subr.c 2001/07/10 19:45:41 @@ -800,7 +800,7 @@ mtx_lock(&vp->v_interlock); if (VOP_GETVOBJECT(vp, &object) == 0) { vm_object_page_remove(object, 0, 0, - (flags & V_SAVE) ? TRUE : FALSE); + (flags & V_SAVE) ? TRUE : FALSE, &vp->v_interlock); } mtx_unlock(&vp->v_interlock); Index: vm/vm_map.c === RCS file: /home/ncvs/src/sys/vm/vm_map.c,v retrieving revision 1.206 diff -u -r1.206 vm_map.c --- vm/vm_map.c 2001/07/04 20:15:16 1.206 +++ vm/vm_map.c 2001/07/10 20:02:41 @@ -1875,7 +1875,7 @@ vm_object_page_remove(object, OFF_TO_IDX(offset), OFF_TO_IDX(offset + size + PAGE_MASK), - FALSE); + FALSE, NULL); } VOP_UNLOCK(object->handle, 0, curproc); vm_object_deallocate(object); @@ -1991,7 +1991,7 @@ offidxend = offidxstart + count; if ((object == kernel_object) || (object == kmem_object)) { - vm_object_page_remove(object, offidxstart, offidxend, FALSE); + vm_object_page_remove(object, offidxstart, offidxend, FALSE, +NULL); } else { pmap_remove(map->pmap, s, e); if (object != NULL && @@ -1999,7 +1999,7 @@ (object->flags & (OBJ_NOSPLIT|OBJ_ONEMAPPING)) == OBJ_ONEMAPPING && (object->type == OBJT_DEFAULT || object->type == OBJT_SWAP)) { vm_object_collapse(object); - vm_object_page_remove(object, offidxstart, offidxend, FALSE); + vm_object_page_remove(object, offidxstart, offidxend, +FALSE, NULL); if (object->type == OBJT_SWAP) { swap_pager_freespace(object, offidxstart, count); } @@ -2994,7 +2994,7 @@ /* * Remove unneeded old pages */ - vm_object_page_remove(first_object, 0, 0, 0); + vm_object_page_remove(first_object, 0, 0, 0, NULL); /* * Invalidate swap space Index: vm/vm_object.c === RCS file: /home/ncvs/src/sys/vm/vm_object.c,v retrieving revision 1.196 diff -u -r1.196 vm_object.c --- vm/vm_object.c 2001/07/04 20:15:16 1.196 +++ vm/vm_object.c 2001/07/10 20:03:23 @@ -1438,7 +1438,7 @@ * The object must be locked. */ void -vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end, boolean_t clean_only) +vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end, +boolean_t clean_only, struct mtx *interlk) { vm_page_t p, next; unsigned int size; @@ -1478,7 +1478,7 @@ * interrupt -- minimize the spl transitions */ - if (vm_page_sleep_busy(p, TRUE, "vmopar")) +
Re: Can someone verify this?
On Sat, 2001/07/07 at 18:57:15 -0400, Paul Halliday wrote: > FreeBSD dissent.p450.box 4.3-RC FreeBSD 4.3-RC #3: Sun Jun 10 22:27:47 > EDT 2001 [EMAIL PROTECTED]:/usr/src/sys/compile/workstation > i386 > > FreeBSD useless.dell.box 4.3-STABLE FreeBSD 4.3-STABLE #6: Fri Jul 6 > 18:57:08 EDT 2001 > [EMAIL PROTECTED]:/usr/src/sys/compile/useless i386 > > mount /dev/acd0c /cdrom > should obviously fail, yet causes... > > panic: vm -fault on nofault entry, addr: c3e1e000 > > reboot. > any ideas? If it was an audio CD you were trying to mount: this is a known problem. The attached patch fixes it for me by disallowing reading of partial blocks; this could also be fixed by setting the buffer size different from the transfer size in such a case. - thomas Index: dev/ata/atapi-cd.c === RCS file: /home/ncvs/src/sys/dev/ata/atapi-cd.c,v retrieving revision 1.48.2.10 diff -u -r1.48.2.10 atapi-cd.c --- dev/ata/atapi-cd.c 2001/02/25 21:35:20 1.48.2.10 +++ dev/ata/atapi-cd.c 2001/07/09 21:48:58 @@ -1126,9 +1126,7 @@ /* reject all queued entries if media changed */ if (cdp->atp->flags & ATAPI_F_MEDIA_CHANGED) { bp->b_error = EIO; - bp->b_flags |= B_ERROR; - biodone(bp); - return; + goto failure; } bzero(ccb, sizeof(ccb)); @@ -1149,7 +1147,11 @@ lastlba = cdp->info.volsize; } -count = (bp->b_bcount + (blocksize - 1)) / blocksize; +if (bp->b_bcount % blocksize != 0) { + bp->b_error = EINVAL; + goto failure; +} +count = bp->b_bcount / blocksize; if (bp->b_flags & B_READ) { /* if transfer goes beyond range adjust it to be within limits */ @@ -1191,6 +1193,11 @@ atapi_queue_cmd(cdp->atp, ccb, bp->b_data, count * blocksize, bp->b_flags & B_READ ? ATPR_F_READ : 0, 30, acd_done,bp); +return; + +failure: +bp->b_flags |= B_ERROR; +biodone(bp); } static int
robustness fix for SYSCTL_OUT
Hi, the following is from sys/kern/kern_sysctl.c: static int sysctl_old_kernel(struct sysctl_req *req, const void *p, size_t l) { size_t i = 0; if (req->oldptr) { i = l; if (i > req->oldlen - req->oldidx) i = req->oldlen - req->oldidx; if (i > 0) bcopy(p, (char *)req->oldptr + req->oldidx, i); } req->oldidx += l; if (req->oldptr && i != l) return (ENOMEM); return (0); } oldidx and oldlen are both size_t (unsigned). If l happens to be larger than (req->oldlen - req->oldidx), ENOMEM is returned correctly, but req->oldidx is increased by the full length. If a buggy caller does not react on the error and calls SYSCTL_OUT again (SYSCTL_OUT normally causes sysctl_old_kernel() or sysctl_old_user, which has a similar bug, to be called), oldidx will be greater than oldlen, and since both are unsigned, the if test will fail, so we will bcopy outside of the buffer and no longer return ENOMEM. Not that this does not matter if SYSCTL_OUT is used correctly, but for the sake of robustness, I think it should be fixed. Currently, there is one place in the -CURRENT kernel (that I know of) that actually gets hit by this bug. -STABLE seems fine. I propose the attached fix. Could it please be reviewed and commited if correct? - thomas *** sys.3/kern/kern_sysctl.cTue Feb 13 16:15:52 2001 --- sys/kern/kern_sysctl.c Tue Feb 13 20:06:37 2001 *** *** 817,824 if (req->oldptr) { i = l; ! if (i > req->oldlen - req->oldidx) ! i = req->oldlen - req->oldidx; if (i > 0) bcopy(p, (char *)req->oldptr + req->oldidx, i); } --- 817,827 if (req->oldptr) { i = l; ! if (req->oldlen <= req->oldidx) ! i = 0; ! else ! if (i > req->oldlen - req->oldidx) ! i = req->oldlen - req->oldidx; if (i > 0) bcopy(p, (char *)req->oldptr + req->oldidx, i); } *** *** 914,921 } if (req->oldptr) { i = l; ! if (i > req->oldlen - req->oldidx) ! i = req->oldlen - req->oldidx; if (i > 0) error = copyout(p, (char *)req->oldptr + req->oldidx, i); --- 917,927 } if (req->oldptr) { i = l; ! if (req->oldlen <= req->oldidx) ! i = 0; ! else ! if (i > req->oldlen - req->oldidx) ! i = req->oldlen - req->oldidx; if (i > 0) error = copyout(p, (char *)req->oldptr + req->oldidx, i);
Re: removing setgid kmem from top, collecting per-device swap stats
On Fri, Feb 02, 2001 at 02:28:29AM +0900, Hajimu UMEMOTO wrote: > >>>>> On Thu, 1 Feb 2001 17:11:35 + > >>>>> Tony Finch <[EMAIL PROTECTED]> said: > > dot> Thomas Moestl <[EMAIL PROTECTED]> wrote: > > > >Most kmem_read calls are easy to replace (the variables are already > >exported as sysctls), the only exception is nextproc (for which I might > >add a sysctl, or just leave it out [anyone out there who needs the > >lastpid display?]). > > dot> It's useful for seeing how fast the machine is forking. > I beleive it's not meaningful if randompid is enabled. It is. It gives the greatest pid used up to now. The next pid is determined by adding 1 (and randomness in the randompid case) to it. This is why I can export it without defeating randompid. Of course, it would still be better to track the sysctls you named for this, but people seem to like the lastpid dispay, so I won't remove it for now. - thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: removing setgid kmem from top, collecting per-device swap stats
On Tue, Jan 30, 2001 at 04:17:55PM -0800, Matt Dillon wrote: > > : VOP_SWAPACCOUNT(nbp->b_vp, nbp); > : BUF_KERNPROC(nbp); > : BUF_STRATEGY(nbp); > : > :Now, I have to define the vop in vm_swap.c, where I can get the area > :index from the block number in a clean way and frob my counters. > :The same method could be used in swap_pager_putpages(). > :swap_pager_reserve() would need to construct a buf and act in a similar > :way as e.g. swap_pager_strategy(), only that the buffer is not flushed > :afterwards, only VOP_SWAPACCOUNT is called. Of course, this can be > :done much easier by just passing the block number to VOP_SWAPACCOUNT, > :and calling it for each blk returned by swp_pager_getswapspace() > :(it costs virtually nothing, so it should be OK). > :So, VOP_SWAPACCOUNT would become: > : VOP_SWAPACCOUNT(nbp->b_vp, nbp->b_blkno, nbp->bcount); > : > :Or did I get things wrong? > : > : - thomas > > Hmm. I think it would be easier to figure it out in the swap > allocation and free code. Specifically, look in vm/swap_pager.c: > > * The swp_pager_getswapspace() function > * The swp_pager_freeswapspace() function > > I think these are the best places to keep track of per-swap-area > allocation and frees. Note that the system tracks overall swap > useage in these routines as well (the vm_swap_size global). Yep. I only thought I should keep the layering by using VOPs. But you are right, swap_pager and vm_swap are dependent anyway, so this can be justified, and things are not duplicated over the whole file. So, is it OK with you if I do it the way you described? - thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: removing setgid kmem from top, collecting per-device swap stats
On Tue, Jan 30, 2001 at 03:21:38PM -0800, Matt Dillon wrote: > :vm.swapdev1.total (this is the one that is currently hard to get) > You can't move swapinfo into the kernel as a sysctl unless you > solve this problem. Traversing the radix tree is expensive enough > that the entire system will stall for a short period of time if you run > the loop in the kernel. Yes. I did not have that in mind. > The issue here is that swapinfo tries to break the useage down by > swap area, whereas the kernel has no real concept of swap areas > in the allocation map -- it just sees one big contiguous allocation > map. So the kernel does not track allocations on a per-swap-area > basis. My thought was the following: add a swapaccount vnode operation, and then do (in flushchainbuf): VOP_SWAPACCOUNT(nbp->b_vp, nbp); BUF_KERNPROC(nbp); BUF_STRATEGY(nbp); Now, I have to define the vop in vm_swap.c, where I can get the area index from the block number in a clean way and frob my counters. The same method could be used in swap_pager_putpages(). swap_pager_reserve() would need to construct a buf and act in a similar way as e.g. swap_pager_strategy(), only that the buffer is not flushed afterwards, only VOP_SWAPACCOUNT is called. Of course, this can be done much easier by just passing the block number to VOP_SWAPACCOUNT, and calling it for each blk returned by swp_pager_getswapspace() (it costs virtually nothing, so it should be OK). So, VOP_SWAPACCOUNT would become: VOP_SWAPACCOUNT(nbp->b_vp, nbp->b_blkno, nbp->bcount); Or did I get things wrong? - thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
removing setgid kmem from top, collecting per-device swap stats
Hi, I'm going to start working on making top work without setgid kmem. Is anyone already doing this? I don't want to waste my time... Most kmem_read calls are easy to replace (the variables are already exported as sysctls), the only exception is nextproc (for which I might add a sysctl, or just leave it out [anyone out there who needs the lastpid display?]). Another thing I will need to do is to make kvm_getswapinfo use sysctls when using on a "live" kernel (using a similar hack as in kvm_getprocs). This will also allow to take the setgid kmem bit from swapinfo, while it can still be used (along with pstat) on crash dumps. This conversion is more complex. Is there any reason for going through significant pains to collect swap statistics in userland instead of doing this in the kernel? >From what I can see in vm_swap.c and swap_pager.c, this seems not to be too hard. To correctly handle layering, I would need to define a new vnode op for statistics. I would propose as sysctl hierarchy like vm.nswapdev(number of devices) vm.swapdev1.dev vm.swapdev1.total (this is the one that is currently hard to get) vm.swapdev1.used vm.swapdev1.flags vm.swapdev2... This would also be much easier for the "non-live" case because we would keep the statistics in memory, right for kvm to read. Please correct me if I'm totally wrong somewhere... otherwise, I will just start doing it like this, so don't let me run into my doom ;-) - thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: natd bug
On Thu, Nov 30, 2000 at 08:25:15PM +0100, Frederik Meerwaldt wrote: > I compiled my kernel with IPDIVERT IPFIREWALL and > IPFIREWALL_DEFAULT_TO_ACCEPT and I set up only one rule: > ipfw add divert natd all from any to any via isp0 > Then I started natd (at boot time): > natd -unregistered_only -dynamic -n isp0 > But when a package arrives (doesn't matter from localhost or another > host), natd gives out a kernel message: > > Nov 30 15:03:06 server natd[195]: failed to write packet back (Permission > denied) Is your link up at that time? The usual setup for a sppp device using dynamic ip's is an invalid ip (0.0.0.0) that is changed once an ip was assigned. So, if you are not dialled in, the invalid ip will be put in by natd, and that usually causes this error message. - Thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: [PATCH] Fix for panics in lookup() after forced unmount
X On Mon, Nov 27, 2000 at 11:04:37AM +0600, Boris Popov wrote: > On Sun, 26 Nov 2000, Thomas Moestl wrote: > > > Actually, the panic will occur after a simple forced unmount of the current > > working directory and subsequent try to access "..". This is because the > > vnode of the cwd was cleared and it's v_mount member was set to NULL. This > > member is however dereferenced in the handling for the ".." special case in > > lookup(), causing a panic. > > [...] > > Good work Thomas! I think patch can be committed as is. Could someone please commit this? Or shall I rather file a PR? Thanks, - Thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: [PATCH] Fix for panics in lookup() after forced unmount
> :I think I have a sufficient fix for PR kern/19572. Could somebody please > :Review/Comment this? > :To quote: > > Looks reasonable to me! It certainly can't make things worse :-) Could someone then please look into committing this? TIA, - Thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
[PATCH] Fix for panics in lookup() after forced unmount
Hi, I think I have a sufficient fix for PR kern/19572. Could somebody please Review/Comment this? To quote: > Description > Executing cd ../cdrom from /cdrom directory after cycle of mount-umount- > mount cycle causes trap 12 (page fault while in kernel mode) and hence > causes kernel panic. > How-To-Repeat > chdir to /cdrom (which) is default mount point for cdrom as per fstab. > mount /cdrom and do ls > now umount -f /cdrom (force as we will be /cdrom) > doing ls will give error( . not a directory) > do mount /cdrom to mount the cdrom once again (we are still in /cdrom dir) > now do ls will give error once again > now do cd ../ (tab in case of bash or esc in case of csh) to do file > completion. > This will result in trap 12 (page fault in kernel mode) and thus > results in kernel panic. Actually, the panic will occur after a simple forced unmount of the current working directory and subsequent try to access "..". This is because the vnode of the cwd was cleared and it's v_mount member was set to NULL. This member is however dereferenced in the handling for the ".." special case in lookup(), causing a panic. The fix is rather trivial, just check the member before using it and return an appropriate error. In the following patch, I use EBADF. Btw, after taking a look into the OpenBSD and NetBSD repos, I think they might have the same problem. Is there any standard channel to pass bug reports to them from FreeBSD, or should I just use the normal submit procedure? - Thomas The patch: *** old/kern/vfs_lookup.c Sat Nov 25 23:55:39 2000 --- new/kern/vfs_lookup.c Sun Nov 26 00:58:15 2000 *** *** 403,408 --- 403,412 if ((dp->v_flag & VROOT) == 0 || (cnp->cn_flags & NOCROSSMOUNT)) break; + if (dp->v_mount == NULL) { /* forced unmount */ + error = EBADF; + goto bad; + } tdp = dp; dp = dp->v_mount->mnt_vnodecovered; vput(tdp); *** *** 424,430 printf("not found\n"); #endif if ((error == ENOENT) && ! (dp->v_flag & VROOT) && (dp->v_mount->mnt_flag & MNT_UNION)) { tdp = dp; dp = dp->v_mount->mnt_vnodecovered; --- 428,434 printf("not found\n"); #endif if ((error == ENOENT) && ! (dp->v_flag & VROOT) && (dp->v_mount != NULL) && (dp->v_mount->mnt_flag & MNT_UNION)) { tdp = dp; dp = dp->v_mount->mnt_vnodecovered; *** *** 502,507 --- 506,517 ((cnp->cn_flags & FOLLOW) || trailing_slash || *ndp->ni_next == '/')) { cnp->cn_flags |= ISSYMLINK; + if (dp->v_mount == NULL) { + /* We can't know whether the directory was mounted with +* NOSYMFOLLOW, so we can't follow safely. */ + error = EBADF; + goto bad2; + } if (dp->v_mount->mnt_flag & MNT_NOSYMFOLLOW) { error = EACCES; goto bad2; To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Byte order?
> I know, that x86 is big endian architecture > but simple programm like this: > > #include > #include > main () { > /* Are we little or big endian? From Harbison&Steele. */ > union > { > long l; > char c[sizeof (long)]; > } u; > u.l = 1; > printf ("Little endian? %s\n", (u.c[sizeof (long) - 1] == 1) ? "yes" : >"no"); > #if BYTE_ORDER == BIG_ENDIAN > printf("Big endian\n"); > #elif BYTE_ORDER == LITTLE_ENDIAN > printf("Little endian\n"); > #else > printf("Unknown\n"); > #endif > } > > Give me a strange result: > Little endian? no > Little endian This program gets it wrong. When the last byte of a long is set after the long was set to 1, we have a big endian architecture (the "little" end is at the 4th byte, so the "big end" is at the 1st byte). > On my FreeBSD 4.2-BETA BYTE_ORDER = LITTLE_ENDIAN! > I`m very confused and some programms detect my machine as Little Endian, >by > example freetds. The x86 architecture _is_ little endian. - Thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: [PATCH] per-packet UDP source address selection for sendmsg
> As this is my first kernel patch, could someone please give it a quick > look before I submit it, so that I can be sure that it contains no stupid > glitches? > [...] > + bzero(&src, sizeof(src)); > + src.sin_family = AF_INET; > + src.sin_port = inp->inp_lport; > + src.sin_addr = *(struct in_addr >*)CMSG_DATA(cm); > + bound = 1; Except of course forgetting to set src.sin_len :( Thomas To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
[PATCH] per-packet UDP source address selection for sendmsg
Hi, recently, I have written a kernel patch that makes sendmsg(2) honor IP_RECVDSTADDR ancillary data on UDP sockets, so that the source address can be set on a per-packet basis. This is very useful for servers that need to use UDP and want to answer with the same source address a query went to. This is not always the case with vanilla sendto(2) in certain routing layouts. As this is my first kernel patch, could someone please give it a quick look before I submit it, so that I can be sure that it contains no stupid glitches? Unfortunately, I had no chance to test this on -current (I'm just building a -current box). Thanks, Thomas *** netinet/udp_usrreq.c.oldThu Nov 9 19:05:13 2000 --- netinet/udp_usrreq.cThu Nov 9 19:23:54 2000 *** *** 642,680 { register struct udpiphdr *ui; register int len = m->m_pkthdr.len; struct in_addr laddr; struct sockaddr_in *sin; ! int s = 0, error = 0; ! ! if (control) ! m_freem(control); /* XXX */ ! if (len + sizeof(struct udpiphdr) > IP_MAXPACKET) { error = EMSGSIZE; goto release; } if (addr) { sin = (struct sockaddr_in *)addr; prison_remote_ip(p, 0, &sin->sin_addr.s_addr); ! laddr = inp->inp_laddr; if (inp->inp_faddr.s_addr != INADDR_ANY) { error = EISCONN; ! goto release; } /* * Must block input while temporarily connected. */ ! s = splnet(); error = in_pcbconnect(inp, addr, p); ! if (error) { ! splx(s); ! goto release; ! } } else { if (inp->inp_faddr.s_addr == INADDR_ANY) { error = ENOTCONN; ! goto release; } } /* --- 642,741 { register struct udpiphdr *ui; register int len = m->m_pkthdr.len; + register struct cmsghdr *cm = 0; struct in_addr laddr; struct sockaddr_in *sin; ! struct sockaddr_in src; ! int s = 0, error = 0, bound = 0, addrset = 0, fmbuf = 0; ! if (len + sizeof(struct udpiphdr) > IP_MAXPACKET) { error = EMSGSIZE; + if (control) + m_freem(control); goto release; } + if (control) { + /* +* XXX: Currently, we assume all the optional information is stored +* in a single mbuf. +*/ + if (control->m_next) + error = EINVAL; + else { + for (; control->m_len; control->m_data += ALIGN(cm->cmsg_len), +control->m_len -= ALIGN(cm->cmsg_len)) { + cm = mtod(control, struct cmsghdr *); + if (cm->cmsg_len == 0 || cm->cmsg_len > +control->m_len) { + error = EINVAL; + break; + } + if (cm->cmsg_level != IPPROTO_IP) + continue; + + switch(cm->cmsg_type) { + case IP_RECVDSTADDR: + if (cm->cmsg_len != CMSG_LEN(sizeof(struct +in_addr))) { + error = EINVAL; + break; + } + laddr = inp->inp_laddr; + bzero(&src, sizeof(src)); + src.sin_family = AF_INET; + src.sin_port = inp->inp_lport; + src.sin_addr = *(struct in_addr +*)CMSG_DATA(cm); + bound = 1; + s = splnet(); + if (inp->inp_laddr.s_addr == INADDR_ANY && +inp->inp_lport == 0) { + /* This will check the address */ + error = in_pcbbind(inp, (struct +sockaddr *)&src, p); + } else { + if (prison_ip(p, 0, +&src.sin_addr.s_addr)) { + error = EINVAL; + break; + } + if (ifa_ifwithaddr((struct sockaddr +*)&src) == 0) { +