Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?
On 11 Sep 2007 at 17:04, Al Viro wrote:

> On Tue, Sep 11, 2007 at 05:54:38PM +0200, Ulrich Windl wrote:
>
> > If not, any clues on debugging/tracing? There's a
> > /usr/src/linux/Documentation/oops-tracing.txt, but no "segfault-tracing".
>
> That would be because it has fsck-all to do with the kernel. Get the
> coredump, then use gdb to deal with it.

Ok, but why is the message there at all? I think in Windows/XP the offending code and the registers are shown on such occasions. I'd say either drop the message, or improve it. It's also difficult to find the code after the program is gone due to mapping of shared libraries.

I managed to get a core dump of the application however, and I did modify some code. I'll report once I have results. Maybe it's "mea culpa" for my program, but powersaved and slapd are still to be examined.

Regards,
Ulrich

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
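The "get the coredump, then use gdb" advice quoted above only works if the crashing process is allowed to write a core file. A minimal sketch, assuming the program's source can be modified and the hard limit permits core files: raise the soft RLIMIT_CORE to the hard limit early at startup. The helper name enable_core_dumps() is made up for illustration and is not taken from the thread.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: raise the soft core-file size limit to the hard
 * limit so that a later crash can leave a core file behind.
 * Returns 0 on success, -1 on failure.
 */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;  /* an unprivileged process may raise soft up to hard */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Once a core file exists, it can be loaded together with the binary (gdb <program> <corefile>) and the crash site inspected with "bt".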
Re: [announce] CFS-devel, performance improvements
On Tue, 2007-09-11 at 22:04 +0200, Ingo Molnar wrote:

> fresh back from the Kernel Summit, Peter Zijlstra and me are pleased to
> announce the latest iteration of the CFS scheduler development tree. Our
> main focus has been on simplifications and performance - and as part of
> that we've also picked up some ideas from Roman Zippel's 'Really Fair
> Scheduler' patch as well and integrated them into CFS. We'd like to ask
> people go give these patches a good workout, especially with an eye on
> any interactivity regressions.

Initial test-drive looks good here, but I do see a regression.

First the good news. fairtest2 is perfect, more perfect than ever seen before in fact. Mixed interval sleepers/hog looks fine as well (can't say perfect due to startup differences with the various proggies, but cpu% looks perfect). Amarok song switch time under hefty kbuild load is fine as well. I haven't done heavy multimedia testing yet, but will give it a more thorough workout later (errands).

The regression: I see some GUI lurch, easily reproducible by running a make -j5 and moving the mouse in a circle... perceptible (100ms or so) lurches not present in rc5.

-Mike
Re: [BUGFIX] x86_64: NX bit handling in change_page_attr
On Tue, 2007-09-11 at 20:23 -0700, Andrew Morton wrote:
> On Fri, 17 Aug 2007 13:28:38 +0800 "Huang, Ying" <[EMAIL PROTECTED]> wrote:
>
> > This patch fixes a bug of change_page_attr/change_page_attr_addr on
> > Intel x86_64 CPU. After changing page attribute to be executable with
> > these functions, the page remains un-executable on Intel x86_64
> > CPU. Because on Intel x86_64 CPU, only if the "NX" bits of all four
> > level page tables are cleared, the corresponding page is executable
> > (refer to section 4.13.2 of Intel 64 and IA-32 Architectures Software
> > Developer's Manual). So, the bug is fixed through clearing the "NX"
> > bit of PMD when splitting the huge PMD.
> >
> > Signed-off-by: Huang Ying <[EMAIL PROTECTED]>
> >
> > ---
> >
> > Index: linux-2.6.23-rc2-mm2/arch/x86_64/mm/pageattr.c
> > ===
> > --- linux-2.6.23-rc2-mm2.orig/arch/x86_64/mm/pageattr.c 2007-08-17 12:50:25.0 +0800
> > +++ linux-2.6.23-rc2-mm2/arch/x86_64/mm/pageattr.c 2007-08-17 12:50:48.0 +0800
> > @@ -147,6 +147,7 @@
> >  		split = split_large_page(address, prot, ref_prot2);
> >  		if (!split)
> >  			return -ENOMEM;
> > +		pgprot_val(ref_prot2) &= ~_PAGE_NX;
> >  		set_pte(kpte, mk_pte(split, ref_prot2));
> >  		kpte_page = split;
> >  	}
>
> What happened with this? Still valid?

I am waiting for reviewing or merging. And I think it is still valid.

Best Regards,
Huang Ying
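The rule the patch description relies on (a page is executable only if the NX bit is clear at every level of the page-table walk) can be modelled in a few lines. This is a user-space sketch under simplifying assumptions: it looks only at bit 63 of each entry and ignores present bits, EFER.NXE and everything else; the SKETCH_/sketch_ names are invented here and are not kernel identifiers.

```c
#include <stdint.h>

/* NX (execute-disable) is bit 63 of an x86-64 page-table entry. */
#define SKETCH_PAGE_NX (UINT64_C(1) << 63)

/*
 * entries[] holds the entries walked for one address, top level first.
 * Returns 1 if no level sets NX (page executable), 0 otherwise.
 */
static int sketch_page_executable(const uint64_t *entries, int levels)
{
    for (int i = 0; i < levels; i++)
        if (entries[i] & SKETCH_PAGE_NX)
            return 0;   /* one NX bit anywhere in the walk is enough */
    return 1;
}
```

In this model, a walk whose third entry (standing in for the PMD) still carries NX stays non-executable no matter what the final PTE says, which is the situation the one-line fix addresses.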
Re: Problem charging blackberry 8700c with berry_charge (2.6.22.6)
On Mon, 10 Sep 2007 23:35:02 -0700 Greg KH <[EMAIL PROTECTED]> wrote:

> > Sep 9 13:49:01 prizm kernel: [ 584.407498] drivers/usb/core/inode.c: creating file '003'
> > Sep 9 13:49:01 prizm kernel: [ 584.407509] hub 5-0:1.0: state 7 ports 8 chg evt 0004
> > Sep 9 13:49:01 prizm kernel: [ 584.407520] hub 1-0:1.0: state 7 ports 2 chg evt 0004
> > Sep 9 13:49:03 prizm kernel: [ 586.405512] usb 1-2: usb auto-suspend
> > Sep 9 13:49:03 prizm kernel: [ 586.421471] hub 5-0:1.0: hub_suspend
> > Sep 9 13:49:03 prizm kernel: [ 586.421481] ehci_hcd :00:10.4: suspend root hub
> > Sep 9 13:49:03 prizm kernel: [ 586.421496] usb usb5: usb auto-suspend
> > Sep 9 13:49:05 prizm kernel: [ 588.421351] hub 1-0:1.0: hub_suspend
> > Sep 9 13:49:05 prizm kernel: [ 588.421361] usb usb1: suspend_rh
> > Sep 9 13:49:05 prizm kernel: [ 588.421481] usb usb1: usb auto-suspend
>
> Ah, oh wait, now we just turned the power off.
>
> Try disabling CONFIG_USB_SUSPEND and see if that fixes this issue. Or
> you can manually turn the power back on to your blackberry by writing to
> the autosuspend file for the usb device in sysfs, but that can be a
> pain.
>
> Let me know if just changing that config option works for you.
>

And now for the dramatic conclusion...

To begin, I have no access to the original machine at the moment, as I'm now out of that area for a couple weeks. I built a similar kernel (same version) on another box that I have at my current location. The new machine is different hardware, so some kernel re-configuring was required, but I kept with the same USB settings (and similar overall design).

Interestingly, this machine didn't reproduce the "magic command failed" error, but it did fail very similarly to the original at charging the device. I disabled CONFIG_USB_SUSPEND as suggested, and lo and behold, it now charges the berry.

Looks like an excellent diagnosis to me, doctor. Thanks! :)

> thanks,
>
> greg k-h

--
Matt
Re: [PATCH/RFC] doc: about email clients for Linux kernel patches
On Wed, Sep 12, 2007 at 01:24:13PM +0800, WANG Cong wrote:
> On Tue, Sep 11, 2007 at 08:29:26PM +0200, Adrian Bunk wrote:
> >On Tue, Sep 11, 2007 at 10:16:44AM -0700, Randy Dunlap wrote:
> >>...
> >> +~~
> >> +Mutt (TUI)
> >> +
> >> +Plenty of Linux developers use mutt, so it must work pretty well.
> >> +
> >> +Are there any special config options that are needed??
> >>...
> >
> >It should work with default settings.
>
> I can't agree with this.
>
> It took me lots of time to configure mutt to work well for me in the first
> time. Just default settings are far _not_ enough, especially for us
> non-english-speakers. One common setting is the encoding, of course, lkml
> prefers UTF-8, so I must set my mutt with `set send_charset="us-ascii:utf-8"`.

This makes sense, but it's not really a mutt specific issue and problems because mutt prefers iso-8859-1 over UTF-8 by default are quite rare.

> Manuals of mutt told me to add "subscribe linux-kernel@vger.kernel.org" if I
> subscribed lkml, but in fact, we'd better _not_ add this, or it will drop
> myself from cc list.
>
> Or other things like these.
>...

Whether or not people want to get personal copies of answers to mailing list posts is a religious issue being second only to the vi<->emacs wars...

But as far as I understand it, this documentation is intended to help people to get sending patches right (no line wrap etc.), not as a generic documentation for mail clients.

> Regards.

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: [-mm patch] remove ide_get_error_location()
On Tue, Sep 11 2007, Bartlomiej Zolnierkiewicz wrote:
> On Sunday 09 September 2007, Adrian Bunk wrote:
> > On Fri, Aug 31, 2007 at 09:58:22PM -0700, Andrew Morton wrote:
> > >...
> > > Changes since 2.6.23-rc3-mm1:
> > >...
> > > git-block.patch
> > >...
> > > git trees
> > >...
> >
> > ide_get_error_location() is no longer used.
> >
> > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
>
> Signed-off-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
>
> Since git-block contains the patch which removes the only user of
> ide_get_error_location() I think that this patch should be also merged
> through block tree. Jens?

Yeah, I'll add it there.

> PS none of the blkdev_issue_flush() users uses *error_sector argument
> so it can be probably removed as well

I had hoped that the existence was enough incentive, but it didn't happen. I'll make a note to kill that again.

--
Jens Axboe
Do not deprecate binary semaphore or do allow mutex in software interrupt contexts
The following code seems to me to be a valid example of a binary semaphore (mutex) in a timer: //timer called 10 times a second static void status_timer(unsigned long device) { struct etp_device_private *dp = (struct etp_device_private *)device; if (unlikely(dp->status_interface == 0)) dp->status_interface = INTERFACES_PER_DEVICE - 1; else dp->status_interface--; //DBG_PRINT ("%s: In status timer, interface:0x%x.\n",etp_NAME, dp->status_interface); idt_los_interrupt_1(dp, dp->status_interface); if (likely(!dp->reset)) // reset the timer: mod_timer(&dp->status_timer, jiffies + HZ / 10); } static inline void read_idt_register_interrupt(struct etp_device_private *dp, unsigned reg) { DBG_PRINT("read_idt_register_interrupt to mutex_lock.\n"); if (unlikely(down_trylock(&dp->semaphore))) return;/* Do not read because failed to lock. */ if (likely (!dp->status && !(inl((void *)(dp->ioaddr + REG_E1_CTRL)) & E1_ACCESS_ON))) { outl(((reg << E1_REGISTER_SHIFT) & E1_REGISTER_MASK) | E1_DIR_READ | E1_ACCESS_ON, (void *)(dp->ioaddr + REG_E1_CTRL)); dp->status = 1; DBG_PRINT("read_idt_register_interrupt set status read.\n"); } else DBG_PRINT ("read_idt_register_interrupt did not set status %u read.\n", dp->status); DBG_PRINT ("read_idt_register_interrupt do not wait for result here, read in tasklet.\n"); } //for getting los information with interrupt: void idt_los_interrupt_1(struct etp_device_private *dp, unsigned interface) { read_idt_register_if_interrupt(dp, E1_TRNCVR_LINE_STATUS0_REG, interface); } static void e1_access_task(unsigned long data)//called after e1_access_interrupt { struct etp_device_private *dp = (struct etp_device_private *)data; struct etp_interface_private *ip; unsigned int interface, error; bool los; //check if los status was read: if (unlikely(!dp->status)) { DBG_PRINT("e1_access_task wakes up user.\n"); wake_up(&dp->e1_access_q); return; } error = idt_los_interrupt_2(dp->ioaddr, &interface, &los, dp->pci_dev->device); //DBG_PRINT ("%s: In e1 task, 
error:0x%x, interface:0x%x, los:0x%x.\n", // etp_NAME, error, interface, los); dp->status = 0; up(&dp->semaphore); DBG_PRINT("e1_access_task got error %u.\n", error); if (unlikely(error)) return; //update los status: ip = &dp->interface_privates[interface]; ip->los = los; //update status: if ((ip->if_mode == IF_MODE_CLOSED) ||//interface closed or (ip->los)) {//link down set_led(LED_CTRL_OFF, ip); if (netif_carrier_ok(ip->ch_priv.this_netdev)) netif_carrier_off(ip->ch_priv.this_netdev); } else {//link up and interface opened if (!netif_carrier_ok(ip->ch_priv.this_netdev)) netif_carrier_on(ip->ch_priv.this_netdev); if (ip->if_mode == IF_MODE_HDLC) { set_led(LED_CTRL_TRAFFIC, ip); } else {//ip->if_mode == IF_MODE_TIMESLOT set_led(LED_CTRL_ON, ip); } } } int idt_los_interrupt_2(u8 * ioaddr, unsigned *interface, bool * los, unsigned pci_device_id) {//returns 0 in success unsigned int value = inl((void *)(ioaddr + REG_E1_CTRL)); //if access not ended: if (value & E1_ACCESS_ON) { return 1; } //if access not to los status register if ((value & E1_REGISTER_MASK_NO_IF) != (E1_TRNCVR_LINE_STATUS0_REG << E1_REGISTER_SHIFT)) { return 1; } //get interface *interface = idt_if_to_if((value & E1_REGISTER_MASK_IF) >> E1_REGISTER_SHIFT_IF, pci_device_id); *los = value & 0x1; return 0; } int write_idt_register_lock(unsigned device, unsigned reg, u32 value) { struct etp_device_private *etp = get_dev_priv(device); unsigned ctrl; DBG_PRINT("write_idt_register_lock to mutex lock device %u.\n", device); down(&etp->semaphore); if (unlikely(etp->reset)) { up(&etp->semaphore); DBG_PRINT ("write_idt_register_lock device %u unusable.\n", device); return -ENXIO; } DBG_PRINT("write_idt_register_lock mutex locked device %u.\n", device); do { DBG_PRINT ("write_idt_register_lock to wait_event_timeout device %u.\n", device); wait_event_timeout(etp->e1_access_q, !((ctrl = inl((void *)(etp->ioaddr + REG_E1_CTRL))) & E1_ACCESS_ON), HZ / 500); } while (ctrl & E1_ACCESS_ON); 
DBG_PRINT("write_idt_register_lock to outl device %u.\n", device); outl(((reg << E1_REGISTER_SHIFT) & E1_REGISTER_MASK) | E1_DIR_WRITE | E1_ACCESS_ON | (value & E1_DATA_MASK
Re: SYSFS: need a noncaching read
On Tue, Sep 11, 2007 at 11:43:17AM +0200, Heiko Schocher wrote:
> I have developed a device driver and use the sysFS to export some
> registers to userspace.

Uuuh, ugly. Don't do that. Device drivers are there to abstract things, not to play around with registers from userspace.

> I opened the sysFS File for one register and did some reads from this
> File, but I alwas becoming the same value from the register, whats not
> OK, because they are changing. So I found out that the sysFS caches
> the reads ... :-(

Yes, it does. What you can do is close() the file handle between accesses, which makes it work but is slow.

> Is there a way to retrigger the reads (in that way, that the sysFS
> rereads the values from the driver), without closing and opening the
> sysFS Files? Or must I better use the ioctl () Driver-interface for
> exporting these registers?

What kind of problem do you want to solve? Userspace is for applications, and applications usually don't have to know about hardware details like registers. So if you have to do something every 10 ms from userspace, your design is probably wrong.

If you absolutely need to do such things from userspace, have a look at uio. But in most cases the answer is: make a proper abstraction for the problem you wanna solve and write a proper driver.

Robert

--
Pengutronix - Linux Solutions for Science and Industry
Entwicklungszentrum Nord
http://www.pengutronix.de
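The workaround mentioned in the reply (closing the file between accesses) might look roughly like the sketch below from user space. It assumes nothing about the driver in question; read_attr_fresh() is a made-up helper name, and the path is whatever attribute file the driver exports. As the reply says, this is slow, and it does not address the design objection.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open, read and close the attribute on every call,
 * so nothing cached on a previously opened file descriptor is reused.
 * Returns the number of bytes read (buf is NUL-terminated), or -1 on error.
 */
static ssize_t read_attr_fresh(const char *path, char *buf, size_t len)
{
    ssize_t n;
    int fd;

    if (len == 0)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, len - 1);     /* leave room for the terminator */
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

A polling loop would call read_attr_fresh() once per sample instead of keeping one descriptor open across reads.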
Re: [PATCH/RFC] doc: about email clients for Linux kernel patches
On Tue, Sep 11, 2007 at 08:29:26PM +0200, Adrian Bunk wrote:
>On Tue, Sep 11, 2007 at 10:16:44AM -0700, Randy Dunlap wrote:
>>...
>> +~~
>> +Mutt (TUI)
>> +
>> +Plenty of Linux developers use mutt, so it must work pretty well.
>> +
>> +Are there any special config options that are needed??
>>...
>
>It should work with default settings.

I can't agree with this.

It took me lots of time to configure mutt to work well for me in the first time. Just default settings are far _not_ enough, especially for us non-english-speakers. One common setting is the encoding, of course, lkml prefers UTF-8, so I must set my mutt with `set send_charset="us-ascii:utf-8"`.

Manuals of mutt told me to add "subscribe linux-kernel@vger.kernel.org" if I subscribed lkml, but in fact, we'd better _not_ add this, or it will drop myself from cc list.

Or other things like these.

>
>mutt doesn't come with an editor, so whatever editor you use should be
>used in a way that there are no automatic linebreaks. Most editors have
>an "insert file" option that inserts the contents of a file unaltered.
>

Yes, you can `set editor="vi"` or other editors you prefer.

Regards.

--
"Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step."
Re: Calling PnP bios routines like get device node from x86_84 arch
Hi Alan,

To be specific, I want to call the functions 0x60, 0x61 and a few more specified in the BBS specification (attached). These functions are alive in 16 bit mode (0xf000 segment). We can call these functions on i386 using the pnpbios driver (bioscalls.c).

I must call these functions to change the boot order and reboot from linux x86_64 using my driver.

I have seen in some forum that we can turn off ACPI during the linux boot.

Can you help on this?

-mkr

Alan Cox wrote:
>> Actually I want to call the BIOS run time functions as per the
>> PNPBIOSSpecification-v1.0a (attached).
>
> We use ACPI for x86_64, which means you need to use the ACPI methods not
> the PnPBIOS ones. PnPBIOS isn't valid when ACPI is in use.
>
> Alan
Query on Real Time Signalling in Linux 2.6.16 kernel
Hi,

I would like to use the Real time signalling mechanism. When I try to use "F_SETAUXFL", compilation fails. While looking through the archives I found a patch one-sig-perfd-2.4.4.patch.gz at http://www.uwsg.iu.edu/hypermail/linux/kernel/0105.2/0642.html which needs to be applied for the support.

I am using the Linux 2.6.16 kernel, so I can't apply this patch directly. Is there a different patch for this kernel, or is the support built in? If the support is already present, then any pointers on usage (RT Signalling) would be of great help.

Thanks and Regards,
Sreenath.
Re: [PATCH] powerpc: add new required termio functions
Michael Neuling wrote:
> The "tty: termios locking functions break with new termios type" patch
> (f629307c857c030d5a3dd777fee37c8bb395e171) breaks the powerpc compile.
> This adds the required API to asm-powerpc.
>
> Signed-off-by: Michael Neuling <[EMAIL PROTECTED]>
> --
> This needs to go up for 2.6.23.
>
> Should we really put these definitions in asm-generic/termios.h as I'm
> guessing other architectures are broken too?

I think it would be better to do so, as that is where we pickup the defs for the original kernel_termios_to_user_termios and user_termios_to_kernel_termios.

-Geoff
Re: [PATCH/RFC] doc: about email clients for Linux kernel patches
On Tue, Sep 11, 2007 at 08:52:14PM +0200, Peter Zijlstra wrote:
>
>On Tue, 2007-09-11 at 14:38 -0400, Lee Revell wrote:
>> On 9/11/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
>> > On Tue, 2007-09-11 at 10:16 -0700, Randy Dunlap wrote:
>> >
>> > > +~~
>> > > +Evolutions (GUI)
>> >
>> > I take it you mean: Evolution
>> >
>> > > +Some people seem to use this successfully for patches.
>> > > +
>> > > +What config options are needed?
>> >
>> > When composing mail select: Preformat
>> >   from Format->Heading->Preformatted (Ctrl-7)
>> >   or the toolbar
>> >
>> > Then use:
>> >   Insert->Text File... (Alt-n x)
>> >
>> > to insert the patch.
>>
>> You can also diff -Nru old.c new.c | xclip, select Preformat, then
>> paste with the middle button.
>
>Ah, I shall try:
>
>  cat `quilt top` | xclip
>
>next time I have a single patch to send.
>

Oh, great! Thank you for this hint.

--
"Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step."
[git pull] Input updates for 2.6.23-rc6
Hi Linus,

Please consider pulling from:

	git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus

or

	master.kernel.org:/pub/scm/linux/kernel/git/dtor/input.git for-linus

to receive updates for input subsystem.

Changelog:
--

Elvis Pranskevichus (1):
      Input: i8042 - add HP Pavilion DV4270ca to the MUX blacklist

Ralf Baechle (1):
      Input: i8042 - fix modpost warning

Samuel Thibault (1):
      Input: add more Braille keycodes

Vladimir Shebordaev (1):
      Input: usbtouchscreen - correctly set 'phys'

Diffstat:
-

 drivers/input/serio/i8042-x86ia64io.h      |   10 ++
 drivers/input/serio/i8042.c                |    2 +-
 drivers/input/touchscreen/usbtouchscreen.c |    2 +-
 include/linux/input.h                      |    2 ++
 include/linux/keyboard.h                   |    4 +++-
 5 files changed, 17 insertions(+), 3 deletions(-)

--
Dmitry
Re: r8169: can't send magic packet for Wake-On-Lan
On Tuesday 11 September 2007 at 23:30 +0200, Francois Romieu wrote:
> Xavier Bestel <[EMAIL PROTECTED]> :
> [...]
> > with the r8169 I can't send a magic packet anymore. I'm using ethtool
> > for that, with the previous one (an rtl8139b) it was working very well.
> > ethtool -D apparently says it could send the packet ok.
>
> I see no "-D" option in the sources from the git repository of ethtool.
>
> Where did you find it ?

Err sorry, I mixed up everything ... I'm using *etherwake* to make the WOL magic packet, and ethtool to check the interface options.

Xav
[PATCH] bfin_twi: Remove useless twi_lock mutex
This patch removes this unneeded mutex. Indeed it was used to serialize
access to the hardware, but this is already done by the i2c-core layer,
see 'bus_lock' mutex used by i2c_transfer().

Signed-off-by: Francis Moreau <[EMAIL PROTECTED]>
Acked-by: Bryan Wu <[EMAIL PROTECTED]>
Acked-by: Sonic Zhang <[EMAIL PROTECTED]>
---
 drivers/i2c/busses/i2c-bfin-twi.c |   16 ----------------
 1 files changed, 0 insertions(+), 16 deletions(-)

diff --git a/drivers/i2c/busses/i2c-bfin-twi.c b/drivers/i2c/busses/i2c-bfin-twi.c
index 6311039..67224a4 100644
--- a/drivers/i2c/busses/i2c-bfin-twi.c
+++ b/drivers/i2c/busses/i2c-bfin-twi.c
@@ -44,7 +44,6 @@
 #define TWI_I2C_MODE_COMBINED 0x04
 
 struct bfin_twi_iface {
-	struct mutex		twi_lock;
 	int			irq;
 	spinlock_t		lock;
 	char			read_write;
@@ -228,12 +227,8 @@ static int bfin_twi_master_xfer(struct i2c_adapter *adap,
 	if (!(bfin_read_TWI_CONTROL() & TWI_ENA))
 		return -ENXIO;
 
-	mutex_lock(&iface->twi_lock);
-
 	while (bfin_read_TWI_MASTER_STAT() & BUSBUSY) {
-		mutex_unlock(&iface->twi_lock);
 		yield();
-		mutex_lock(&iface->twi_lock);
 	}
 
 	ret = 0;
@@ -310,9 +305,6 @@ static int bfin_twi_master_xfer(struct i2c_adapter *adap,
 			break;
 	}
 
-	/* Release mutex */
-	mutex_unlock(&iface->twi_lock);
-
 	return ret;
 }
 
@@ -330,12 +322,8 @@ int bfin_twi_smbus_xfer(struct i2c_adapter *adap, u16 addr,
 	if (!(bfin_read_TWI_CONTROL() & TWI_ENA))
 		return -ENXIO;
 
-	mutex_lock(&iface->twi_lock);
-
 	while (bfin_read_TWI_MASTER_STAT() & BUSBUSY) {
-		mutex_unlock(&iface->twi_lock);
 		yield();
-		mutex_lock(&iface->twi_lock);
 	}
 
 	iface->writeNum = 0;
@@ -502,9 +490,6 @@ int bfin_twi_smbus_xfer(struct i2c_adapter *adap, u16 addr,
 
 	rc = (iface->result >= 0) ? 0 : -1;
 
-	/* Release mutex */
-	mutex_unlock(&iface->twi_lock);
-
 	return rc;
 }
 
@@ -555,7 +540,6 @@ static int i2c_bfin_twi_probe(struct platform_device *dev)
 	struct i2c_adapter *p_adap;
 	int rc;
 
-	mutex_init(&(iface->twi_lock));
 	spin_lock_init(&(iface->lock));
 	init_completion(&(iface->complete));
 	iface->irq = IRQ_TWI;
Re: [PATCH -mm] add-a-rounddown_pow_of_two-routine-to-log2h.patch fix
On Sat, 1 Sep 2007 07:55:36 +0200 Mariusz Kozlowski <[EMAIL PROTECTED]> wrote:

> Hello,
>
> This patch fixes the unbalanced parenthesis inroduced by
> add-a-rounddown_pow_of_two-routine-to-log2h.patch.
>
> Signed-off-by: Mariusz Kozlowski <[EMAIL PROTECTED]>
>
>  include/linux/log2.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- linux-2.6.23-rc4-mm1-a/include/linux/log2.h 2007-09-01 07:23:28.0 +0200
> +++ linux-2.6.23-rc4-mm1-b/include/linux/log2.h 2007-09-01 07:29:27.0 +0200
> @@ -186,7 +186,7 @@ unsigned long __rounddown_pow_of_two(uns
>  (						\
>  	__builtin_constant_p(n) ? (		\
>  		(n == 1) ? 0 :			\
> -		(1UL << ilog2(n)) :		\
> +		(1UL << ilog2(n))) :		\
>  	__rounddown_pow_of_two(n)		\
>   )

umm, could we get some users of this thing, preferably in some code path which people use?
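For readers who have not met the macro being patched: rounding down to a power of two means finding the largest power of two that does not exceed n, which is what the quoted `1UL << ilog2(n)` expression computes. The following is a user-space sketch of that result only. It is not the kernel implementation, the name rounddown_pow2_sketch() is invented here, and it does not reproduce the special case for n == 1 in the quoted constant-expression branch.

```c
#include <assert.h>

/*
 * Hypothetical helper: largest power of two that is <= n.
 * n must be nonzero.
 */
static unsigned long rounddown_pow2_sketch(unsigned long n)
{
    unsigned long p = 1;

    assert(n != 0);
    /* Double p while the doubled value still fits in n; comparing
     * against n / 2 avoids overflowing p near ULONG_MAX. */
    while (p <= n / 2)
        p *= 2;
    return p;
}
```

With this helper, 5 rounds down to 4, 8 stays 8, and 1000 rounds down to 512.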
[PATCH v2] doc: about email clients for Linux patches
From: Randy Dunlap <[EMAIL PROTECTED]> Requested by Jeff Garzik. v2, updated from lkml comments. Add info about various email clients and their applicability in being used to send Linux kernel patches. Some notes takes from http://mbligh.org/linuxdocs/Email/Clients Portions used with permission. Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]> --- Documentation/email-clients.txt | 210 1 file changed, 210 insertions(+) --- /dev/null +++ linux-2.6.23-rc5-git1/Documentation/email-clients.txt @@ -0,0 +1,210 @@ +Email clients info for Linux +== + +General Preferences +-- +Patches for the Linux kernel are submitted via email, preferably as +inline text in the body of the email. Some maintainers accept +attachments, but then the attachments should have content-type +"text/plain". However, attachments are generally frowned upon because +it makes quoting portions of the patch more difficult in the patch +review process. + +Email clients that are used for Linux kernel patches should send the +patch text untouched. For example, they should not modify or delete tabs +or spaces, even at the beginning or end of lines. + +Don't send patches with "format=flowed". This can cause unexpected +and unwanted line breaks. + +Don't let your email client do automatic word wrapping for you. +This can also corrupt your patch. + + +They also should not modify the character set encoding of the text. + +Email clients should generate and maintain References: or In-Reply-To: +headers so that mail threading is not broken. + +Copy-and-paste (or cut-and-paste) usually does not work for patches +because tabs are converted to spaces. I have seen comments that +xclipboard, xclip, and/or xcutsel do work, but I cannot confirm this. + +Don't use PGP/GPG signatures in mail that contains patches. +This breaks many scripts that read and apply the patches. +(This should be fixable. ??) 
+ +It's a good idea to send a patch to yourself, save the received message, +and successfully apply it with 'patch' before sending patches to Linux +mailing lists. + + +Some email client (MUA) hints +-- +Legend: +TUI = text-based user interface +GUI = graphical user interface + +~~ +Alpine (TUI) + +Config options: +In the "Sending Preferences" section: + +- "Do Not Send Flowed Text" must be enabled +- "Strip Whitespace Before Sending" must be disabled + +When composing the message, the cursor should be placed where the patch +should appear, and then pressing CTRL-R let you specify the patch file +to insert into the message. + +~~ +Evolution (GUI) + +Some people use this successfully for patches. + +When composing mail select: Preformat + from Format->Heading->Preformatted (Ctrl-7) + or the toolbar + +Then use: + Insert->Text File... (Alt-n x) +to insert the patch. + +You can also "diff -Nru old.c new.c | xclip", select Preformat, then +paste with the middle button. + +~~ +Kmail (GUI) + +Some people use Kmail successfully for patches. + +The default setting of not composing in HTML is appropriate; do not +enable it. + +When composing an email, under options, uncheck "word wrap". The only +disadvantage is any text you type in the email will not be word-wrapped +so you will have to manually word wrap text before the patch. The easiest +way around this is to compose your email with word wrap enabled, then save +it as a draft. Once you pull it up again from your drafts it is now hard +word-wrapped and you can uncheck "word wrap" without losing the existing +wrapping. + +At the bottom of your email, put the commonly-used patch delimiter before +inserting your patch: three hyphens (---). + +Then from the "Message" menu item, select insert file and choose your patch. +As an added bonus I recommend customising the message creation toolbar menu +and putting the "insert file" icon there. 
+ +You can safely GPG sign attachments, but inlined text is preferred for +patches so do not GPG sign them. Signing patches that have been inserted +as inlined text will make them tricky to extract from their 7-bit encoding. + +If you absolutely must send patches as attachments instead of inlining +them as text, right click on the attachment and select properties, and +highlight "Suggest automatic display" to make the attachment inlined to +make it more viewable. + +When saving patches that are sent as inlined text, select the email that +contains the patch from the message list pane, right click and select +"save as". You can use the whole email unmodified as a patch if it was +properly composed. There is no option currently to save the email when +you are actually viewing it in its own window -
Re: [PATCH/RFC] doc: about email clients for Linux kernel patches
On Tue, 11 Sep 2007 19:36:42 +0200 Peter Zijlstra wrote:

> On Tue, 2007-09-11 at 10:16 -0700, Randy Dunlap wrote:
>
> > +~~
> > +Evolutions (GUI)
>
> I take it you mean: Evolution

Yep, lousy keyboard. ;)

I've updated the text file and will resend it shortly. Thanks for everyone's comments. (not replying to each one individually)

---
~Randy
Re: [PATCH] Configurable tap interface MTU
Ed Swierk <[EMAIL PROTECTED]> wrote:
>
> The patch caps the MTU somewhat arbitrarily at 16000 bytes. This is
> slightly lower than the value used by the e1000 driver, so it seems
> like a safe upper limit.

Please make it 65535 without an Ethernet header and 65521 with an Ethernet header.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
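The two limits requested above differ by 14 bytes, which matches the size of an untagged Ethernet header (two 6-byte MAC addresses plus a 2-byte EtherType). A sketch of the corresponding bounds check follows; it is not the patch under discussion, the sketch_ names are invented, and the lower bound of 1 is an assumption, since the thread only discusses the upper cap.

```c
/* Upper limit requested when frames carry no Ethernet header. */
#define SKETCH_MAX_FRAME 65535
/* Untagged Ethernet header: dst MAC (6) + src MAC (6) + EtherType (2). */
#define SKETCH_ETH_HLEN  14

/* Largest MTU accepted, depending on whether frames carry an Ethernet header. */
static int sketch_max_mtu(int has_eth_header)
{
    return has_eth_header ? SKETCH_MAX_FRAME - SKETCH_ETH_HLEN
                          : SKETCH_MAX_FRAME;
}

/* Returns 1 if the requested MTU is within bounds, 0 otherwise. */
static int sketch_mtu_ok(int mtu, int has_eth_header)
{
    return mtu > 0 && mtu <= sketch_max_mtu(has_eth_header);
}
```

Under these limits the originally proposed 16000-byte cap is still a valid MTU, just no longer the maximum.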
Re: [PATCH/RFC] doc: about email clients for Linux kernel patches
Jeff Garzik wrote:
> Chris Friesen wrote:
>> Can someone describe the problems with just attaching the patch in
>> Thunderbird? It's what Martin says he does on the linked document...
>
> Email clients don't like to quote attachments, even text/plain ones,
> which then makes attached patches much more difficult to review and
> comment on (i.e. you greatly reduce the number of reviewers).

Thunderbird, at least, will automatically inline a single text/plain
attachment when replying. (At least with my current settings, it does.)

Chris
Re: [PATCH] mm: fix blkdev size calculation in generic_write_checks
On Wed, 15 Aug 2007 17:52:28 +0400 Dmitry Monakhov <[EMAIL PROTECTED]> wrote:

> Currently the block device size is calculated regardless of its
> bd_block_size. This may result in an attempt to write outside the
> block device if i_size is not aligned to bdev->bd_block_size,
> and result in EIO.
>
> TEST_CASE_BEGIN
> # fdisk -l /dev/sdc
> Disk /dev/sdc: 36.7 GB, 36703918080 bytes
> 255 heads, 63 sectors/track, 4462 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sdc1   *           1         254     2040223+  83  Linux
> /dev/sdc2             255         379     1004062+  83  Linux
>
> /dev/sdc2 size not aligned to 4K
>
> at this time bd_block_size == 512 so generic_write_check
> performed correctly
> # dd if=/dev/zero of=/dev/sdc2 bs=1k count=7 seek=1004058
> dd: writing `/dev/sdc2': No space left on device
> 5+0 records in
> 4+0 records out
>
> this bdev contains ext4fs with blksize = 4K
> # mount /dev/sdc2 /mnt/
> after we mounted this bdev bd_block_size == fsblksize == 4K
>
> the same write operation failed with EIO
> # dd if=/dev/zero of=/dev/sdc2 bs=1k count=7 seek=1004058
> dd: writing `/dev/sdc2': Input/output error
> 3+0 records in
> 2+0 records out
> An attempt to write a whole fsblock results in write access outside
> the blkdevice and causes -EIO (returned by blkdev_get_block)
> TEST_CASE_END
>
> Signed-off-by: Dmitry Monakhov <[EMAIL PROTECTED]>
> ---
>  mm/filemap.c |4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2c8776b..a23ee8a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1867,9 +1867,11 @@ inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, i
>  	} else {
>  #ifdef CONFIG_BLOCK
>  		loff_t isize;
> +		unsigned int blksize;
>  		if (bdev_read_only(I_BDEV(inode)))
>  			return -EPERM;
> -		isize = i_size_read(inode);
> +		blksize = block_size(I_BDEV(inode));
> +		isize = i_size_read(inode) & ~(blksize - 1);
>  		if (*pos >= isize) {
>  			if (*count || *pos > isize)
>  				return -ENOSPC;

Can't say I really like the idea of adding additional overhead in this
hotpath for such an odd case. Is there a faster way of doing it? Maybe
adjust i_size, perhaps when the blocksize gets changed?
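The masking step in the quoted patch rounds the device size down to a multiple of the block size. The sketch below reproduces that arithmetic in userspace with the numbers from the test case (125 cylinders of 8225280 bytes is 1028160000 bytes for /dev/sdc2). The function name is hypothetical. One detail is worth noting as a C language fact rather than a statement about the kernel tree: if the mask is computed in a 32-bit unsigned type and then combined with a 64-bit size, it is zero-extended and clears the upper 32 bits, so this sketch widens the block size before inverting it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round `size` down to a multiple of `blksize`.
 * `blksize` must be a power of two. The cast to a 64-bit type happens
 * before the bitwise NOT so that the mask keeps the high bits set. */
static uint64_t round_down_to_block(uint64_t size, uint32_t blksize)
{
    assert(blksize != 0 && (blksize & (blksize - 1)) == 0);
    return size & ~((uint64_t)blksize - 1);
}

/* For comparison only: the same operation with a 32-bit mask, which
 * silently drops everything above 4 GiB. */
static uint64_t round_down_narrow_mask(uint64_t size, uint32_t blksize)
{
    uint32_t mask = ~(blksize - 1);
    return size & mask;
}
```

For the partition in the test case, a 512-byte block size leaves the size unchanged, while a 4K block size trims the last 2560 bytes, which is why the write that used to end in ENOSPC would otherwise run past the last whole block.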
Re: [BUGFIX] x86_64: NX bit handling in change_page_attr
On Fri, 17 Aug 2007 13:28:38 +0800 "Huang, Ying" <[EMAIL PROTECTED]> wrote:

> This patch fixes a bug of change_page_attr/change_page_attr_addr on
> Intel x86_64 CPU. After changing page attribute to be executable with
> these functions, the page remains un-executable on Intel x86_64
> CPU. Because on Intel x86_64 CPU, only if the "NX" bits of all four
> level page tables are cleared, the corresponding page is executable
> (refer to section 4.13.2 of Intel 64 and IA-32 Architectures Software
> Developer's Manual). So, the bug is fixed through clearing the "NX"
> bit of PMD when splitting the huge PMD.
>
> Signed-off-by: Huang Ying <[EMAIL PROTECTED]>
>
> ---
>
> Index: linux-2.6.23-rc2-mm2/arch/x86_64/mm/pageattr.c
> ===
> --- linux-2.6.23-rc2-mm2.orig/arch/x86_64/mm/pageattr.c	2007-08-17 12:50:25.0 +0800
> +++ linux-2.6.23-rc2-mm2/arch/x86_64/mm/pageattr.c	2007-08-17 12:50:48.0 +0800
> @@ -147,6 +147,7 @@
>  		split = split_large_page(address, prot, ref_prot2);
>  		if (!split)
>  			return -ENOMEM;
> +		pgprot_val(ref_prot2) &= ~_PAGE_NX;
>  		set_pte(kpte, mk_pte(split, ref_prot2));
>  		kpte_page = split;
>  	}

What happened with this? Still valid?
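The rule described in the quoted changelog can be stated compactly: a page is executable only if the NX bit is clear in the entry at every one of the four paging levels. The sketch below models that rule in userspace. It is an illustration of the rule only, not kernel code; the array layout and function name are assumptions, and the only hardware fact relied on is that NX is bit 63 of an x86-64 paging-structure entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRY_NX (1ULL << 63) /* execute-disable bit of a paging entry */
#define LEVELS   4            /* PGD, PUD, PMD, PTE in Linux terms */

/* Hypothetical model: entries[0..3] are the entries used to translate
 * one address, from the top level down to the PTE. */
static bool page_is_executable(const uint64_t entries[LEVELS])
{
    for (int level = 0; level < LEVELS; level++) {
        if (entries[level] & ENTRY_NX)
            return false; /* one NX bit anywhere is enough to block execution */
    }
    return true;
}
```

In this model, clearing NX in the PTE alone has no effect while the PMD entry still carries it, which is the situation the one-line fix addresses for the PMD entry written after a large page is split.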
Re: [PATCH] Moxa: Fix tiny compiler warning when building without CONFIG_PCI
On Fri, 17 Aug 2007 00:08:58 +0200 Jesper Juhl <[EMAIL PROTECTED]> wrote:

>
> Fix this tiny compiler warning in Moxa driver :
>   drivers/char/mxser.c:386: warning: 'mxser_get_PCI_conf' declared 'static' but never defined
> when building without CONFIG_PCI.
>
>
> Signed-off-by: Jesper Juhl <[EMAIL PROTECTED]>
> ---
>
>  drivers/char/mxser.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/char/mxser.c b/drivers/char/mxser.c
> index 2aee3fe..83b15b5 100644
> --- a/drivers/char/mxser.c
> +++ b/drivers/char/mxser.c
> @@ -383,7 +383,9 @@ static int mxser_init(void);
>
>  /* static void mxser_poll(unsigned long); */
>  static int mxser_get_ISA_conf(int, struct mxser_hwconf *);
> +#ifdef CONFIG_PCI
>  static int mxser_get_PCI_conf(int, int, int, struct mxser_hwconf *);
> +#endif
>  static void mxser_do_softint(struct work_struct *);
>  static int mxser_open(struct tty_struct *, struct file *);
>  static void mxser_close(struct tty_struct *, struct file *);
>

mxser_get_PCI_conf() is defined before it is used anyway. So that prototype
is a stupid waste of space and just adds problems.

--- a/drivers/char/mxser.c~mxser-fix-compiler-warning-when-building-withoug-config_pci
+++ a/drivers/char/mxser.c
@@ -383,7 +383,6 @@ static int mxser_init(void);

 /* static void mxser_poll(unsigned long); */
 static int mxser_get_ISA_conf(int, struct mxser_hwconf *);
-static int mxser_get_PCI_conf(int, int, int, struct mxser_hwconf *);
 static void mxser_do_softint(struct work_struct *);
 static int mxser_open(struct tty_struct *, struct file *);
 static void mxser_close(struct tty_struct *, struct file *);
_
Re: SYSFS: need a noncaching read
On Wed, 2007-09-12 at 12:05 +1000, David Gibson wrote:
> On Tue, Sep 11, 2007 at 11:43:17AM +0200, Heiko Schocher wrote:
> > Hello,
> >
> > I have developed a device driver and use sysfs to export some
> > registers to userspace. I opened the sysfs file for one register and did
> > some reads from this file, but I always get the same value from the
> > register, which is not OK, because the values are changing. So I found
> > out that sysfs caches the reads ... :-(
> >
> > Is there a way to retrigger the reads (so that sysfs rereads the
> > values from the driver), without closing and opening the
> > sysfs files? Or would it be better to use the ioctl() driver interface
> > for exporting these registers?
> >
> > I am asking this because I must read 2 registers every 10 ms, so
> > doing an open/read/close for reading one register is a little bit too
> > much overhead.
> >
> > I made a sysfs seek function, which retriggers the read, and that works
> > fine, but I again have 2 syscalls, which is also not optimal.
> >
> > Or can we make an open() with a (new?) flag that informs sysfs to
> > always reread the values from the underlying driver?
> >
> > Or a new flag in the "struct attribute_group" in include/linux/sysfs.h,
> > which lets sysfs reread the values?
>
> This sounds more like sysfs is really not the right interface for
> polling your registers. You would probably be better off having your
> driver export a character device from which the register values could
> be read.

I thought relay(fs) was the trendy way to do this these days?

Documentation/filesystems/relay.txt

cheers

--
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
Re: [PATCH 21/23] mm: per device dirty threshold
Peter> Scale writeback cache per backing device, proportional to its
Peter> writeout speed. By decoupling the BDI dirty thresholds a
Peter> number of problems we currently have will go away, namely:

Ah, this clarifies my questions! Thanks!

Peter>  - mutual interference starvation (for any number of BDIs);
Peter>  - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

Peter> It might be that all dirty pages are for a single BDI while
Peter> other BDIs are idling. By giving each BDI a 'fair' share of the
Peter> dirty limit, each one can have dirty pages outstanding and make
Peter> progress.

Question, can you change (shrink) the limit on a BDI while it has IO
in flight? And what will that do to the system? I.e. if you have one
device doing IO, so that it has a majority of the dirty limit. Then
another device starts IO, and it's a *faster* device, how
quickly/slowly do the BDI dirty limits change for both the old and new
device?

Peter> A global threshold also creates a deadlock for stacked BDIs;
Peter> when A writes to B, and A generates enough dirty pages to get
Peter> throttled, B will never start writeback until the dirty pages
Peter> go away. Again, by giving each BDI its own 'independent' dirty
Peter> limit, this problem is avoided.

Peter> So the problem is to determine how to distribute the total
Peter> dirty limit across the BDIs fairly and efficiently. A DBI that

You mean BDI here, not DBI.

Peter> has a large dirty limit but does not have any dirty pages
Peter> outstanding is a waste.

Peter> What is done is to keep a floating proportion between the DBIs
Peter> based on writeback completions. This way faster/more active
Peter> devices get a larger share than slower/idle devices.

Does a slower device get a BDI limit which is calculated to keep its
limit under a certain number of seconds of outstanding IO? This way no
device can build up more than say 15 seconds of outstanding IO to
flush at any one time.

Thanks!
John
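To make the idea of a share "proportional to writeback completions" concrete, here is a minimal userspace sketch. It is not the algorithm from the patch series, which keeps a floating (aging) proportion; it only shows a static split of a total dirty limit according to per-device completion counts. All names are hypothetical, and the multiplication is assumed not to overflow 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: share of `total_limit` (in pages) given to device
 * `i`, proportional to its recent writeback completions. If nothing has
 * completed on any device yet, fall back to an equal split. */
static uint64_t bdi_dirty_share(uint64_t total_limit,
                                const uint64_t *completions,
                                size_t ndev, size_t i)
{
    uint64_t sum = 0;

    assert(ndev > 0 && i < ndev);
    for (size_t d = 0; d < ndev; d++)
        sum += completions[d];

    if (sum == 0)
        return total_limit / ndev;

    /* Assumption: total_limit * completions[i] fits in 64 bits. */
    return total_limit * completions[i] / sum;
}
```

In this simplified model a device that completes three times as many writeback pages as another receives three quarters of the limit, and the shares move only as fast as the completion counts are updated, which is the behaviour the questions about shrinking limits and fast versus slow devices are probing.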
Re: [PATCH] powerpc: add new required termio functions
On Tue, Sep 11, 2007 at 07:17:42PM -0700, Linus Torvalds wrote:

> Really?
>
> It shouldn't. The use of kernel_termios_to_user_termios_1() is conditional
> on the architecture having a define for TCGETS2, and I think they match
> up. I see:
>
> 	[EMAIL PROTECTED] linux]$ git grep -l kernel_termios_to_user_termios_1 include | wc -l
> 	10
> 	[EMAIL PROTECTED] linux]$ git grep -l TCGETS2 include | wc -l
> 	10
>
> and in neither case is ppc in that list of architectures.
>
> So maybe you just read the patch without actually testing whether it
> actually broke powerpc?
>
> Or is something subtler going on?

As far as I can see TIOCSLCKTRMIOS and TIOCGLCKTRMIOS aren't protected
by TCGETS2 guards. Do they need to be ...

Perhaps

From: Tony Breeds <[EMAIL PROTECTED]>

Add Guards around TIOCSLCKTRMIOS and TIOCGLCKTRMIOS.

Signed-off-by: Tony Breeds <[EMAIL PROTECTED]>

---

 drivers/char/tty_ioctl.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/char/tty_ioctl.c b/drivers/char/tty_ioctl.c
index 4a8969c..3ee73cf 100644
--- a/drivers/char/tty_ioctl.c
+++ b/drivers/char/tty_ioctl.c
@@ -795,6 +795,19 @@ int n_tty_ioctl(struct tty_struct * tty, struct file * file,
 			if (L_ICANON(tty))
 				retval = inq_canon(tty);
 			return put_user(retval, (unsigned int __user *) arg);
+#ifndef TCGETS2
+		case TIOCGLCKTRMIOS:
+			if (kernel_termios_to_user_termios((struct termios __user *)arg, real_tty->termios_locked))
+				return -EFAULT;
+			return 0;
+
+		case TIOCSLCKTRMIOS:
+			if (!capable(CAP_SYS_ADMIN))
+				return -EPERM;
+			if (user_termios_to_kernel_termios(real_tty->termios_locked, (struct termios __user *) arg))
+				return -EFAULT;
+			return 0;
+#else
 		case TIOCGLCKTRMIOS:
 			if (kernel_termios_to_user_termios_1((struct termios __user *)arg, real_tty->termios_locked))
 				return -EFAULT;
@@ -806,6 +819,7 @@ int n_tty_ioctl(struct tty_struct * tty, struct file * file,
 			if (user_termios_to_kernel_termios_1(real_tty->termios_locked, (struct termios __user *) arg))
 				return -EFAULT;
 			return 0;
+#endif
 		case TIOCPKT:
 		{

Yours Tony

  linux.conf.au        http://linux.conf.au/ || http://lca2008.linux.org.au/
  Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!
Re: [PATCH] powerpc: add new required termio functions
> On Wed, 12 Sep 2007, Michael Neuling wrote:
> >
> > The "tty: termios locking functions break with new termios type" patch
> > (f629307c857c030d5a3dd777fee37c8bb395e171) breaks the powerpc compile.
>
> Really?
>
> It shouldn't. The use of kernel_termios_to_user_termios_1() is conditional
> on the architecture having a define for TCGETS2, and I think they match
> up. I see:
>
> 	[EMAIL PROTECTED] linux]$ git grep -l kernel_termios_to_user_termios_1 include | wc -l
> 	10
> 	[EMAIL PROTECTED] linux]$ git grep -l TCGETS2 include | wc -l
> 	10
>
> and in neither case is ppc in that list of architectures.
>
> So maybe you just read the patch without actually testing whether it
> actually broke powerpc?

No, I actually compiled it.

> Or is something subtler going on?

Looks like those new calls are not protected by the TCGETS2 define.
Adding those ifdefs seems like the correct fix.

Mikey
Re: [PATCH 00/23] per device dirty throttling -v10
Peter> Per device dirty throttling patches

Peter> These patches aim to improve balance_dirty_pages() and directly
Peter> address three issues:
Peter>   1) inter device starvation
Peter>   2) stacked device deadlocks
Peter>   3) inter process starvation

Peter> 1 and 2 are a direct result from removing the global dirty
Peter> limit and using per device dirty limits. By giving each device
Peter> its own dirty limit it will no longer starve another device,
Peter> and the cyclic dependency on the dirty limit is broken.

Ye haa! This should be a big improvement.

Peter> In order to efficiently distribute the dirty limit across the
Peter> independent devices a floating proportion is used, this will
Peter> allocate a share of the total limit proportional to the
Peter> device's recent activity.

I'm not sure I like or agree with this. Shouldn't we be limiting
based on the device's capability to sustain traffic? So if I have a
RAID device which can read/write a total of 100Mb/sec, while at the
same time I've got a CF device which can do 5Mb/sec, shouldn't we be
more strongly limiting the CF device, even if it is the only device
being written to?

Of course, I haven't read the patches yet, nor am I qualified to
comment on them in any meaningful way I think. Hopefully I'm just
missing something key here in the explanation.

Peter> 3 is done by also scaling the dirty limit proportional to the
Peter> current task's recent dirty rate.

Do you mean task or device here? I'm just wondering how well this
works with a bunch of devices with wildly varying speeds.

John
Re: [PATCH] powerpc: add new required termio functions
On Wed, 12 Sep 2007, Michael Neuling wrote:
>
> The "tty: termios locking functions break with new termios type" patch
> (f629307c857c030d5a3dd777fee37c8bb395e171) breaks the powerpc compile.

Really?

It shouldn't. The use of kernel_termios_to_user_termios_1() is conditional
on the architecture having a define for TCGETS2, and I think they match
up. I see:

	[EMAIL PROTECTED] linux]$ git grep -l kernel_termios_to_user_termios_1 include | wc -l
	10
	[EMAIL PROTECTED] linux]$ git grep -l TCGETS2 include | wc -l
	10

and in neither case is ppc in that list of architectures.

So maybe you just read the patch without actually testing whether it
actually broke powerpc?

Or is something subtler going on?

		Linus
[PATCH -mm] ssb: Make pcmciahost depend on PCMCIA=y
SSB uses a bool (SSB_PCMCIAHOST_POSSIBLE) to determine whether to build in
PCMCIA support or not. As the PCMCIA host code itself is also only a bool,
make SSB_PCMCIAHOST_POSSIBLE depend on PCMCIA=y.

Without this, SSB_PCMCIAHOST_POSSIBLE evaluates to y when PCMCIA is
built as a module, which results in link errors due to the
pcmcia_access_configuration_register() accesses, where the symbol is only
defined in a module.

Signed-off-by: Paul Mundt <[EMAIL PROTECTED]>

--

 drivers/ssb/Kconfig |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.23-rc4-mm1.orig/drivers/ssb/Kconfig	2007-09-11 15:15:52.0 +0900
+++ linux-2.6.23-rc4-mm1/drivers/ssb/Kconfig	2007-09-12 10:51:53.0 +0900
@@ -37,7 +37,7 @@

 config SSB_PCMCIAHOST_POSSIBLE
 	bool
-	depends on SSB && PCMCIA && EXPERIMENTAL
+	depends on SSB && PCMCIA=y && EXPERIMENTAL
 	default y

 config SSB_PCMCIAHOST
Re: [RFC] Union Mount: Readdir approaches
"Josef 'Jeff' Sipek": > So, if I understand correctly, you create the entire block as if you were > going to write to disk? Unionfs keeps the data in a linked list. Basically yes. But the dir block in cache has no hole which is contiguous memory. Junjiro Okajima - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm] fs: define file_fsync() even for CONFIG_BLOCK=n
There's nothing that is problematic for file_fsync() with CONFIG_BLOCK=n,
and it's built in unconditionally anyways, so move the prototype out to
reflect that. Without this, the unionfs build bails out.

  CC      fs/unionfs/file.o
fs/unionfs/file.c:148: error: 'file_fsync' undeclared here (not in a function)
make[2]: *** [fs/unionfs/file.o] Error 1
make[2]: *** Waiting for unfinished jobs
make[1]: *** [fs/unionfs] Error 2

Signed-off-by: Paul Mundt <[EMAIL PROTECTED]>

--

 include/linux/buffer_head.h |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux-2.6.23-rc4-mm1.orig/include/linux/buffer_head.h	2007-09-11 15:15:56.0 +0900
+++ linux-2.6.23-rc4-mm1/include/linux/buffer_head.h	2007-09-12 10:18:57.0 +0900
@@ -14,6 +14,8 @@
 #include
 #include

+int file_fsync(struct file *, struct dentry *, int);
+
 #ifdef CONFIG_BLOCK

 enum bh_state_bits {
@@ -225,7 +227,6 @@
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
 int generic_commit_write(struct file *, struct page *, unsigned, unsigned);
 int block_truncate_page(struct address_space *, loff_t, get_block_t *);
-int file_fsync(struct file *, struct dentry *, int);
 int nobh_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 int nobh_commit_write(struct file *, struct page *, unsigned, unsigned);
 int nobh_truncate_page(struct address_space *, loff_t);
[PATCH] powerpc: add new required termio functions
The "tty: termios locking functions break with new termios type" patch
(f629307c857c030d5a3dd777fee37c8bb395e171) breaks the powerpc compile.

This adds the required API to asm-powerpc.

Signed-off-by: Michael Neuling <[EMAIL PROTECTED]>
--
This needs to go up for 2.6.23.

Should we really put these definitions in asm-generic/termios.h as I'm
guessing other architectures are broken too?

[EMAIL PROTECTED]/ % git grep kernel_termios_to_user_termios_1
asm-arm/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-cris/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-h8300/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-i386/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-ia64/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-m32r/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-m68k/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-mips/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-v850/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-x86_64/termios.h:#define kernel_termios_to_user_termios_1(u, k)

 include/asm-powerpc/termios.h |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6-ozlabs/include/asm-powerpc/termios.h
===
--- linux-2.6-ozlabs.orig/include/asm-powerpc/termios.h
+++ linux-2.6-ozlabs/include/asm-powerpc/termios.h
@@ -80,6 +80,9 @@ struct termio {

 #include

+#define user_termios_to_kernel_termios_1(k, u) copy_from_user(k, u, sizeof(struct termios))
+#define kernel_termios_to_user_termios_1(u, k) copy_to_user(u, k, sizeof(struct termios))
+
 #endif	/* __KERNEL__ */

 #endif	/* _ASM_POWERPC_TERMIOS_H */
Re: SYSFS: need a noncaching read
On Tue, Sep 11, 2007 at 11:43:17AM +0200, Heiko Schocher wrote:
> Hello,
>
> I have developed a device driver and use sysfs to export some
> registers to userspace. I opened the sysfs file for one register and did
> some reads from this file, but I always get the same value from the
> register, which is not OK, because the values are changing. So I found
> out that sysfs caches the reads ... :-(
>
> Is there a way to retrigger the reads (so that sysfs rereads the
> values from the driver), without closing and opening the
> sysfs files? Or would it be better to use the ioctl() driver interface
> for exporting these registers?
>
> I am asking this because I must read 2 registers every 10 ms, so
> doing an open/read/close for reading one register is a little bit too
> much overhead.
>
> I made a sysfs seek function, which retriggers the read, and that works
> fine, but I again have 2 syscalls, which is also not optimal.
>
> Or can we make an open() with a (new?) flag that informs sysfs to
> always reread the values from the underlying driver?
>
> Or a new flag in the "struct attribute_group" in include/linux/sysfs.h,
> which lets sysfs reread the values?

This sounds more like sysfs is really not the right interface for
polling your registers. You would probably be better off having your
driver export a character device from which the register values could
be read.

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
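On the "two syscalls" point in the quoted question: from userspace, a seek to offset 0 followed by a read can be folded into a single pread() call on a file descriptor that stays open. The sketch below shows only that userspace pattern; the helper name is hypothetical. Whether a particular sysfs implementation regenerates the attribute value on a read from offset 0, rather than returning a cached buffer, is an assumption that has to be checked against the kernel in use, and it does not change the advice above about using a character device for polling.

```c
#define _XOPEN_SOURCE 700 /* for pread(), pwrite() and mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: re-read a small text attribute from offset 0 of an
 * already-open file descriptor with one syscall and parse it as a number.
 * Returns 0 on success, -1 if nothing could be read. */
static int read_attr_long(int fd, long *out)
{
    char buf[64];
    ssize_t n;

    assert(out != NULL);
    n = pread(fd, buf, sizeof(buf) - 1, 0); /* offset 0, file position untouched */
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    *out = strtol(buf, NULL, 0); /* base 0 accepts decimal and 0x-prefixed hex */
    return 0;
}
```

A polling loop would open the attribute once and call the helper every period, so each sample costs one syscall instead of an open/read/close or lseek/read sequence.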
[PATCH 08/10] ia64: Convert cpu_sibling_map to a per_cpu data array (v3)
Convert cpu_sibling_map to a per_cpu cpumask_t array for the ia64
architecture. This fixes build errors in block/blktrace.c and
kernel/sched.c when CONFIG_SCHED_SMT is defined.

There was one access to cpu_sibling_map before the per_cpu data
area was created, so that step was moved to after the per_cpu
area is set up.

Tested and verified on an A4700.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/ia64/kernel/setup.c    |4
 arch/ia64/kernel/smpboot.c  | 18 ++
 arch/ia64/mm/contig.c       |6 ++
 include/asm-ia64/smp.h      |2 +-
 include/asm-ia64/topology.h |2 +-
 5 files changed, 18 insertions(+), 14 deletions(-)

--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -528,10 +528,6 @@
 #ifdef CONFIG_SMP
 	cpu_physical_id(0) = hard_smp_processor_id();
-
-	cpu_set(0, cpu_sibling_map[0]);
-	cpu_set(0, cpu_core_map[0]);
-
 	check_for_logical_procs();
 	if (smp_num_cpucores > 1)
 		printk(KERN_INFO
--- a/arch/ia64/kernel/smpboot.c
+++ b/arch/ia64/kernel/smpboot.c
@@ -138,7 +138,9 @@
 EXPORT_SYMBOL(cpu_possible_map);

 cpumask_t cpu_core_map[NR_CPUS] __cacheline_aligned;
-cpumask_t cpu_sibling_map[NR_CPUS] __cacheline_aligned;
+DEFINE_PER_CPU_SHARED_ALIGNED(cpumask_t, cpu_sibling_map);
+EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
+
 int smp_num_siblings = 1;
 int smp_num_cpucores = 1;

@@ -650,12 +652,12 @@
 {
 	int i;

-	for_each_cpu_mask(i, cpu_sibling_map[cpu])
-		cpu_clear(cpu, cpu_sibling_map[i]);
+	for_each_cpu_mask(i, per_cpu(cpu_sibling_map, cpu))
+		cpu_clear(cpu, per_cpu(cpu_sibling_map, i));
 	for_each_cpu_mask(i, cpu_core_map[cpu])
 		cpu_clear(cpu, cpu_core_map[i]);

-	cpu_sibling_map[cpu] = cpu_core_map[cpu] = CPU_MASK_NONE;
+	per_cpu(cpu_sibling_map, cpu) = cpu_core_map[cpu] = CPU_MASK_NONE;
 }

 static void
@@ -666,7 +668,7 @@
 	if (cpu_data(cpu)->threads_per_core == 1 &&
 	    cpu_data(cpu)->cores_per_socket == 1) {
 		cpu_clear(cpu, cpu_core_map[cpu]);
-		cpu_clear(cpu, cpu_sibling_map[cpu]);
+		cpu_clear(cpu, per_cpu(cpu_sibling_map, cpu));
 		return;
 	}

@@ -807,8 +809,8 @@
 			cpu_set(i, cpu_core_map[cpu]);
 			cpu_set(cpu, cpu_core_map[i]);
 			if (cpu_data(cpu)->core_id == cpu_data(i)->core_id) {
-				cpu_set(i, cpu_sibling_map[cpu]);
-				cpu_set(cpu, cpu_sibling_map[i]);
+				cpu_set(i, per_cpu(cpu_sibling_map, cpu));
+				cpu_set(cpu, per_cpu(cpu_sibling_map, i));
 			}
 		}
 	}
@@ -839,7 +841,7 @@
 	if (cpu_data(cpu)->threads_per_core == 1 &&
 	    cpu_data(cpu)->cores_per_socket == 1) {
-		cpu_set(cpu, cpu_sibling_map[cpu]);
+		cpu_set(cpu, per_cpu(cpu_sibling_map, cpu));
 		cpu_set(cpu, cpu_core_map[cpu]);
 		return 0;
 	}
--- a/include/asm-ia64/smp.h
+++ b/include/asm-ia64/smp.h
@@ -58,7 +58,7 @@
 extern cpumask_t cpu_online_map;
 extern cpumask_t cpu_core_map[NR_CPUS];
-extern cpumask_t cpu_sibling_map[NR_CPUS];
+DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 extern int smp_num_siblings;
 extern int smp_num_cpucores;
 extern void __iomem *ipi_base_addr;
--- a/include/asm-ia64/topology.h
+++ b/include/asm-ia64/topology.h
@@ -112,7 +112,7 @@
 #define topology_physical_package_id(cpu)	(cpu_data(cpu)->socket_id)
 #define topology_core_id(cpu)			(cpu_data(cpu)->core_id)
 #define topology_core_siblings(cpu)		(cpu_core_map[cpu])
-#define topology_thread_siblings(cpu)		(cpu_sibling_map[cpu])
+#define topology_thread_siblings(cpu)		(per_cpu(cpu_sibling_map, cpu))
 #define smt_capable()				(smp_num_siblings > 1)
 #endif

--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -212,6 +212,12 @@
 			cpu_data += PERCPU_PAGE_SIZE;
 			per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
 		}
+		/*
+		 * cpu_sibling_map is now a per_cpu variable - it needs to
+		 * be accessed after per_cpu_init() sets up the per_cpu area.
+		 */
+		cpu_set(0, per_cpu(cpu_sibling_map, 0));
+		cpu_set(0, cpu_core_map[0]);
 	}
 	return __per_cpu_start + __per_cpu_offset[smp_processor_id()];
 }

--
[PATCH 09/10] ppc64: Convert cpu_sibling_map to a per_cpu data array (v3)
Convert cpu_sibling_map to a per_cpu cpumask_t array for the ppc64
architecture. This fixes build errors in block/blktrace.c and
kernel/sched.c when CONFIG_SCHED_SMT is defined.

Note: these changes have not been built nor tested.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/powerpc/kernel/setup-common.c        |4 ++--
 arch/powerpc/kernel/smp.c                 |4 ++--
 arch/powerpc/platforms/cell/cbe_cpufreq.c |2 +-
 include/asm-powerpc/smp.h                 |4 +++-
 include/asm-powerpc/topology.h            |2 +-
 5 files changed, 9 insertions(+), 7 deletions(-)

--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -415,9 +415,9 @@
 	 * Do the sibling map; assume only two threads per processor.
 	 */
 	for_each_possible_cpu(cpu) {
-		cpu_set(cpu, cpu_sibling_map[cpu]);
+		cpu_set(cpu, cpu_sibling_map(cpu));
 		if (cpu_has_feature(CPU_FTR_SMT))
-			cpu_set(cpu ^ 0x1, cpu_sibling_map[cpu]);
+			cpu_set(cpu ^ 0x1, cpu_sibling_map(cpu));
 	}

 	vdso_data->processorCount = num_present_cpus();
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -61,11 +61,11 @@
 cpumask_t cpu_possible_map = CPU_MASK_NONE;
 cpumask_t cpu_online_map = CPU_MASK_NONE;
-cpumask_t cpu_sibling_map[NR_CPUS] = { [0 ... NR_CPUS-1] = CPU_MASK_NONE };
+DEFINE_PER_CPU(cpumask_t, cpu_sibling_map) = CPU_MASK_NONE;

 EXPORT_SYMBOL(cpu_online_map);
 EXPORT_SYMBOL(cpu_possible_map);
-EXPORT_SYMBOL(cpu_sibling_map);
+EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);

 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
--- a/arch/powerpc/platforms/cell/cbe_cpufreq.c
+++ b/arch/powerpc/platforms/cell/cbe_cpufreq.c
@@ -119,7 +119,7 @@
 	policy->cur = cbe_freqs[cur_pmode].frequency;

 #ifdef CONFIG_SMP
-	policy->cpus = cpu_sibling_map[policy->cpu];
+	policy->cpus = cpu_sibling_map(policy->cpu);
 #endif

 	cpufreq_frequency_table_get_attr(cbe_freqs, policy->cpu);
--- a/include/asm-powerpc/smp.h
+++ b/include/asm-powerpc/smp.h
@@ -25,6 +25,7 @@

 #ifdef CONFIG_PPC64
 #include
+#include
 #endif

 extern int boot_cpuid;
@@ -58,7 +59,8 @@
 	(smp_hw_index[(cpu)] = (phys))
 #endif

-extern cpumask_t cpu_sibling_map[NR_CPUS];
+DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
+#define cpu_sibling_map(cpu) per_cpu(cpu_sibling_map, cpu)

 /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers.
  *
--- a/include/asm-powerpc/topology.h
+++ b/include/asm-powerpc/topology.h
@@ -108,7 +108,7 @@
 #ifdef CONFIG_PPC64
 #include

-#define topology_thread_siblings(cpu)	(cpu_sibling_map[cpu])
+#define topology_thread_siblings(cpu)	(cpu_sibling_map(cpu))
 #endif
 #endif

--
[PATCH 07/10] x86: acpi-use-cpu_physical_id (v3)
This is from an earlier message from Christoph Lameter:

    processor_core.c currently tries to determine the apicid by special
    casing for IA64 and x86. The desired information is readily available via

	    cpu_physical_id()

    on IA64, i386 and x86_64.

    Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Additionally, boot_cpu_id needed to be exported to fix compile errors in
dma code when !CONFIG_SMP.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/mpparse.c  |2 ++
 drivers/acpi/processor_core.c |8 +---
 2 files changed, 3 insertions(+), 7 deletions(-)

--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -419,12 +419,6 @@
 	return 0;
 }

-#ifdef CONFIG_IA64
-#define arch_cpu_to_apicid	ia64_cpu_to_sapicid
-#else
-#define arch_cpu_to_apicid	x86_cpu_to_apicid
-#endif
-
 static int map_madt_entry(u32 acpi_id)
 {
 	unsigned long madt_end, entry;
@@ -498,7 +492,7 @@
 		return apic_id;

 	for (i = 0; i < NR_CPUS; ++i) {
-		if (arch_cpu_to_apicid[i] == apic_id)
+		if (cpu_physical_id(i) == apic_id)
 			return i;
 	}
 	return -1;
--- a/arch/x86_64/kernel/mpparse.c
+++ b/arch/x86_64/kernel/mpparse.c
@@ -57,6 +57,8 @@

 /* Processor that is doing the boot up */
 unsigned int boot_cpu_id = -1U;
+EXPORT_SYMBOL(boot_cpu_id);
+
 /* Internal processor count */
 unsigned int num_processors __cpuinitdata = 0;

--
[PATCH 10/10] sparc64: Convert cpu_sibling_map to a per_cpu data array (v3)
Convert cpu_sibling_map to a per_cpu cpumask_t array for the sparc64 architecture. This fixes build errors in block/blktrace.c and kernel/sched.c when CONFIG_SCHED_SMT is defined. Note: these changes have not been built nor tested. Signed-off-by: Mike Travis <[EMAIL PROTECTED]> --- arch/sparc64/kernel/smp.c | 17 - include/asm-sparc64/smp.h |3 ++- include/asm-sparc64/topology.h |2 +- 3 files changed, 11 insertions(+), 11 deletions(-) --- a/arch/sparc64/kernel/smp.c +++ b/arch/sparc64/kernel/smp.c @@ -52,14 +52,13 @@ cpumask_t cpu_possible_map __read_mostly = CPU_MASK_NONE; cpumask_t cpu_online_map __read_mostly = CPU_MASK_NONE; -cpumask_t cpu_sibling_map[NR_CPUS] __read_mostly = - { [0 ... NR_CPUS-1] = CPU_MASK_NONE }; +DEFINE_PER_CPU(cpumask_t, cpu_sibling_map) = CPU_MASK_NONE; cpumask_t cpu_core_map[NR_CPUS] __read_mostly = { [0 ... NR_CPUS-1] = CPU_MASK_NONE }; EXPORT_SYMBOL(cpu_possible_map); EXPORT_SYMBOL(cpu_online_map); -EXPORT_SYMBOL(cpu_sibling_map); +EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); EXPORT_SYMBOL(cpu_core_map); static cpumask_t smp_commenced_mask; @@ -1259,16 +1258,16 @@ for_each_present_cpu(i) { unsigned int j; - cpus_clear(cpu_sibling_map[i]); + cpus_clear(per_cpu(cpu_sibling_map, i)); if (cpu_data(i).proc_id == -1) { - cpu_set(i, cpu_sibling_map[i]); + cpu_set(i, per_cpu(cpu_sibling_map, i)); continue; } for_each_present_cpu(j) { if (cpu_data(i).proc_id == cpu_data(j).proc_id) - cpu_set(j, cpu_sibling_map[i]); + cpu_set(j, per_cpu(cpu_sibling_map, i)); } } } @@ -1340,9 +1339,9 @@ cpu_clear(cpu, cpu_core_map[i]); cpus_clear(cpu_core_map[cpu]); - for_each_cpu_mask(i, cpu_sibling_map[cpu]) - cpu_clear(cpu, cpu_sibling_map[i]); - cpus_clear(cpu_sibling_map[cpu]); + for_each_cpu_mask(i, per_cpu(cpu_sibling_map, cpu)) + cpu_clear(cpu, per_cpu(cpu_sibling_map, i)); + cpus_clear(per_cpu(cpu_sibling_map, cpu)); c = &cpu_data(cpu); --- a/include/asm-sparc64/smp.h +++ b/include/asm-sparc64/smp.h @@ -28,8 +28,9 @@ #include #include +#include -extern 
cpumask_t cpu_sibling_map[NR_CPUS]; +DECLARE_PER_CPU(cpumask_t, cpu_sibling_map); extern cpumask_t cpu_core_map[NR_CPUS]; extern int sparc64_multi_core; --- a/include/asm-sparc64/topology.h +++ b/include/asm-sparc64/topology.h @@ -5,7 +5,7 @@ #define topology_physical_package_id(cpu) (cpu_data(cpu).proc_id) #define topology_core_id(cpu) (cpu_data(cpu).core_id) #define topology_core_siblings(cpu)(cpu_core_map[cpu]) -#define topology_thread_siblings(cpu) (cpu_sibling_map[cpu]) +#define topology_thread_siblings(cpu) (per_cpu(cpu_sibling_map, cpu)) #define mc_capable() (sparc64_multi_core) #define smt_capable() (sparc64_multi_core) #endif /* CONFIG_SMP */ -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 04/10] x86: Convert cpu_sibling_map to be a per cpu variable (v3)
Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu variable. This saves sizeof(cpumask_t) * NR unused cpus. Access is mostly from startup and CPU HOTPLUG functions. Signed-off-by: Mike Travis <[EMAIL PROTECTED]> --- arch/i386/kernel/cpu/cpufreq/p4-clockmod.c |2 - arch/i386/kernel/cpu/cpufreq/speedstep-ich.c |2 - arch/i386/kernel/io_apic.c |4 +-- arch/i386/kernel/smpboot.c | 36 +-- arch/i386/oprofile/op_model_p4.c |2 - arch/i386/xen/smp.c |4 +-- arch/x86_64/kernel/smpboot.c | 26 +-- block/blktrace.c |2 - include/asm-i386/smp.h |2 - include/asm-i386/topology.h |2 - include/asm-x86_64/smp.h |6 +++- include/asm-x86_64/topology.h|2 - kernel/sched.c |8 +++--- 13 files changed, 50 insertions(+), 48 deletions(-) --- a/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c +++ b/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c @@ -200,7 +200,7 @@ unsigned int i; #ifdef CONFIG_SMP - policy->cpus = cpu_sibling_map[policy->cpu]; + policy->cpus = per_cpu(cpu_sibling_map, policy->cpu); #endif /* Errata workaround */ --- a/arch/i386/kernel/cpu/cpufreq/speedstep-ich.c +++ b/arch/i386/kernel/cpu/cpufreq/speedstep-ich.c @@ -322,7 +322,7 @@ /* only run on CPU to be set, or on its sibling */ #ifdef CONFIG_SMP - policy->cpus = cpu_sibling_map[policy->cpu]; + policy->cpus = per_cpu(cpu_sibling_map, policy->cpu); #endif cpus_allowed = current->cpus_allowed; --- a/arch/i386/kernel/io_apic.c +++ b/arch/i386/kernel/io_apic.c @@ -378,7 +378,7 @@ #define IRQ_ALLOWED(cpu, allowed_mask) cpu_isset(cpu, allowed_mask) -#define CPU_TO_PACKAGEINDEX(i) (first_cpu(cpu_sibling_map[i])) +#define CPU_TO_PACKAGEINDEX(i) (first_cpu(per_cpu(cpu_sibling_map, i))) static cpumask_t balance_irq_affinity[NR_IRQS] = { [0 ... 
NR_IRQS-1] = CPU_MASK_ALL @@ -598,7 +598,7 @@ * (A+B)/2 vs B */ load = CPU_IRQ(min_loaded) >> 1; - for_each_cpu_mask(j, cpu_sibling_map[min_loaded]) { + for_each_cpu_mask(j, per_cpu(cpu_sibling_map, min_loaded)) { if (load > CPU_IRQ(j)) { /* This won't change cpu_sibling_map[min_loaded] */ load = CPU_IRQ(j); --- a/arch/i386/kernel/smpboot.c +++ b/arch/i386/kernel/smpboot.c @@ -70,8 +70,8 @@ int cpu_llc_id[NR_CPUS] __cpuinitdata = {[0 ... NR_CPUS-1] = BAD_APICID}; /* representing HT siblings of each logical CPU */ -cpumask_t cpu_sibling_map[NR_CPUS] __read_mostly; -EXPORT_SYMBOL(cpu_sibling_map); +DEFINE_PER_CPU(cpumask_t, cpu_sibling_map); +EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); /* representing HT and core siblings of each logical CPU */ DEFINE_PER_CPU(cpumask_t, cpu_core_map); @@ -319,8 +319,8 @@ for_each_cpu_mask(i, cpu_sibling_setup_map) { if (c[cpu].phys_proc_id == c[i].phys_proc_id && c[cpu].cpu_core_id == c[i].cpu_core_id) { - cpu_set(i, cpu_sibling_map[cpu]); - cpu_set(cpu, cpu_sibling_map[i]); + cpu_set(i, per_cpu(cpu_sibling_map, cpu)); + cpu_set(cpu, per_cpu(cpu_sibling_map, i)); cpu_set(i, per_cpu(cpu_core_map, cpu)); cpu_set(cpu, per_cpu(cpu_core_map, i)); cpu_set(i, c[cpu].llc_shared_map); @@ -328,13 +328,13 @@ } } } else { - cpu_set(cpu, cpu_sibling_map[cpu]); + cpu_set(cpu, per_cpu(cpu_sibling_map, cpu)); } cpu_set(cpu, c[cpu].llc_shared_map); if (current_cpu_data.x86_max_cores == 1) { - per_cpu(cpu_core_map, cpu) = cpu_sibling_map[cpu]; + per_cpu(cpu_core_map, cpu) = per_cpu(cpu_sibling_map, cpu); c[cpu].booted_cores = 1; return; } @@ -351,12 +351,12 @@ /* * Does this new cpu bringup a new core? */ - if (cpus_weight(cpu_sibling_map[cpu]) == 1) { + if (cpus_weight(per_cpu(cpu_sibling_map, cpu)) == 1) { /* * for each core in package, increment * the booted_cores for this new cpu */ - if (first_cpu(cpu_sibling_map[i]) == i) + if (first_cpu(per_cpu(cpu_sibling_map, i)) == i) c[cpu].booted_cores++;
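To put a rough number on "sizeof(cpumask_t) * NR unused cpus", here is a small userspace sketch in C. It assumes a cpumask_t is a bitmap with one bit per NR_CPUS slot, rounded up to whole unsigned longs, and that per-CPU areas exist only for CPUs that can actually be present. The helper names mask_bytes(), static_array_bytes() and per_cpu_bytes() are made up for illustration and are not kernel functions.

```c
#include <stddef.h>

/*
 * Bytes in one cpumask_t-like bitmap: one bit per CPU slot, rounded up
 * to a whole number of unsigned longs (assumption about the layout).
 */
static size_t mask_bytes(size_t nr_cpus)
{
	size_t bits_per_long = 8 * sizeof(unsigned long);
	size_t longs = (nr_cpus + bits_per_long - 1) / bits_per_long;

	return longs * sizeof(unsigned long);
}

/* Static layout: cpumask_t map[NR_CPUS], paid in full regardless of usage. */
static size_t static_array_bytes(size_t nr_cpus)
{
	return nr_cpus * mask_bytes(nr_cpus);
}

/* Per-CPU layout: one mask for each CPU that actually gets a per-CPU area. */
static size_t per_cpu_bytes(size_t nr_cpus, size_t cpus_with_area)
{
	return cpus_with_area * mask_bytes(nr_cpus);
}
```

Under those assumptions, NR_CPUS = 4096 on an 8-cpu machine costs 4096 * 512 bytes = 2 MiB for the static array, against 8 * 512 bytes = 4 KiB for the per-CPU copies.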
[PATCH 05/10] x86: Convert x86_cpu_to_apicid to be a per cpu variable (v3)
This patch converts the x86_cpu_to_apicid array to be a per cpu variable. This saves sizeof(apicid) * NR unused cpus. Access is mostly from startup and CPU HOTPLUG functions. MP_processor_info() is one of the functions that require access to the x86_cpu_to_apicid array before the per_cpu data area is setup. For this case, a pointer to the __initdata array is initialized in setup_arch() and removed in smp_prepare_cpus() after the per_cpu data area is initialized. A second change is included to change the initial array value of ARCH i386 from 0xff to BAD_APICID to be consistent with ARCH x86_64. Signed-off-by: Mike Travis <[EMAIL PROTECTED]> --- arch/i386/kernel/acpi/boot.c |2 +- arch/i386/kernel/smp.c|2 +- arch/i386/kernel/smpboot.c| 22 +++--- arch/x86_64/kernel/genapic.c | 15 --- arch/x86_64/kernel/genapic_flat.c |2 +- arch/x86_64/kernel/mpparse.c | 15 +-- arch/x86_64/kernel/setup.c|5 + arch/x86_64/kernel/smpboot.c | 23 ++- arch/x86_64/mm/numa.c |2 +- include/asm-i386/smp.h|6 -- include/asm-x86_64/ipi.h |2 +- include/asm-x86_64/smp.h |6 -- 12 files changed, 80 insertions(+), 22 deletions(-) --- a/arch/i386/kernel/acpi/boot.c +++ b/arch/i386/kernel/acpi/boot.c @@ -555,7 +555,7 @@ int acpi_unmap_lsapic(int cpu) { - x86_cpu_to_apicid[cpu] = -1; + per_cpu(x86_cpu_to_apicid, cpu) = -1; cpu_clear(cpu, cpu_present_map); num_processors--; --- a/arch/i386/kernel/smp.c +++ b/arch/i386/kernel/smp.c @@ -673,7 +673,7 @@ int i; for (i = 0; i < NR_CPUS; i++) { - if (x86_cpu_to_apicid[i] == apic_id) + if (per_cpu(x86_cpu_to_apicid, i) == apic_id) return i; } return -1; --- a/arch/i386/kernel/smpboot.c +++ b/arch/i386/kernel/smpboot.c @@ -92,9 +92,17 @@ struct cpuinfo_x86 cpu_data[NR_CPUS] __cacheline_aligned; EXPORT_SYMBOL(cpu_data); -u8 x86_cpu_to_apicid[NR_CPUS] __read_mostly = - { [0 ... 
NR_CPUS-1] = 0xff }; -EXPORT_SYMBOL(x86_cpu_to_apicid); +/* + * The following static array is used during kernel startup + * and the x86_cpu_to_apicid_ptr contains the address of the + * array during this time. Is it zeroed when the per_cpu + * data area is removed. + */ +u8 x86_cpu_to_apicid_init[NR_CPUS] __initdata = + { [0 ... NR_CPUS-1] = BAD_APICID }; +void *x86_cpu_to_apicid_ptr; +DEFINE_PER_CPU(u8, x86_cpu_to_apicid) = BAD_APICID; +EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid); u8 apicid_2_node[MAX_APICID]; @@ -804,7 +812,7 @@ irq_ctx_init(cpu); - x86_cpu_to_apicid[cpu] = apicid; + per_cpu(x86_cpu_to_apicid, cpu) = apicid; /* * This grunge runs the startup process for * the targeted processor. @@ -866,7 +874,7 @@ cpu_clear(cpu, cpu_initialized); /* was set by cpu_init() */ cpucount--; } else { - x86_cpu_to_apicid[cpu] = apicid; + per_cpu(x86_cpu_to_apicid, cpu) = apicid; cpu_set(cpu, cpu_present_map); } @@ -915,7 +923,7 @@ struct warm_boot_cpu_info info; int apicid, ret; - apicid = x86_cpu_to_apicid[cpu]; + apicid = per_cpu(x86_cpu_to_apicid, cpu); if (apicid == BAD_APICID) { ret = -ENODEV; goto exit; @@ -965,7 +973,7 @@ boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID)); boot_cpu_logical_apicid = logical_smp_processor_id(); - x86_cpu_to_apicid[0] = boot_cpu_physical_apicid; + per_cpu(x86_cpu_to_apicid, 0) = boot_cpu_physical_apicid; current_thread_info()->cpu = 0; --- a/arch/x86_64/kernel/mpparse.c +++ b/arch/x86_64/kernel/mpparse.c @@ -86,7 +86,7 @@ return sum & 0xFF; } -static void __cpuinit MP_processor_info (struct mpc_config_processor *m) +static void __cpuinit MP_processor_info(struct mpc_config_processor *m) { int cpu; cpumask_t tmp_map; @@ -123,7 +123,18 @@ cpu = 0; } bios_cpu_apicid[cpu] = m->mpc_apicid; - x86_cpu_to_apicid[cpu] = m->mpc_apicid; + /* +* We get called early in the the start_kernel initialization +* process when the per_cpu data area is not yet setup, so we +* use a static array that is removed after the per_cpu data +* area is 
created. +*/ + if (x86_cpu_to_apicid_ptr) { + u8 *x86_cpu_to_apicid = (u8 *)x86_cpu_to_apicid_ptr; + x86_cpu_to_apicid[cpu] = m->mpc_apicid; + } else { + per_cpu(x86_cpu_to_apicid, cpu) = m->mpc_apicid; + } cpu_set(cpu, cpu_possible_map); cpu_set(cpu, cpu_present_map); --- a/arch/x86_64/kernel/smpboot.c +++ b/arch/x86_64/kernel/smpboot.c @@ -701,7 +
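The early-boot fallback described in this patch (write to a static array while the pointer is set, switch to the per_cpu copy afterwards) can be modelled in userspace roughly as follows. This is only a sketch: percpu_apicid[] stands in for per_cpu(x86_cpu_to_apicid, cpu), and early_init(), set_apicid(), get_apicid() and percpu_areas_ready() are hypothetical names, not functions from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS    8      /* hypothetical build-time maximum */
#define BAD_APICID 0xFFu

/* Stand-in for the __initdata array used before per-CPU areas exist. */
static uint8_t early_apicid[NR_CPUS];
/* Non-NULL only while the early array is the valid backing store. */
static uint8_t *early_apicid_ptr;
/* Stand-in for per_cpu(x86_cpu_to_apicid, cpu). */
static uint8_t percpu_apicid[NR_CPUS];

/* Roughly what setup_arch() would do: mark all slots bad, arm the pointer. */
static void early_init(void)
{
	memset(early_apicid, BAD_APICID, sizeof(early_apicid));
	memset(percpu_apicid, BAD_APICID, sizeof(percpu_apicid));
	early_apicid_ptr = early_apicid;
}

/* Writers pick the backing store depending on the boot phase. */
static void set_apicid(int cpu, uint8_t id)
{
	if (early_apicid_ptr)
		early_apicid_ptr[cpu] = id;
	else
		percpu_apicid[cpu] = id;
}

static uint8_t get_apicid(int cpu)
{
	return early_apicid_ptr ? early_apicid_ptr[cpu] : percpu_apicid[cpu];
}

/* Once per-CPU areas exist: copy the early values over, disarm the pointer. */
static void percpu_areas_ready(void)
{
	memcpy(percpu_apicid, early_apicid, sizeof(percpu_apicid));
	early_apicid_ptr = NULL;
}
```

The point of the pointer test is that callers such as MP_processor_info() do not need to know which boot phase they run in; values recorded early survive the switch because they are copied before the pointer is cleared.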
[PATCH 06/10] x86: Convert cpu_llc_id to be a per cpu variable (v3)
Convert cpu_llc_id from a static array sized by NR_CPUS to a per_cpu
variable. This saves sizeof(cpu_llc_id) * NR unused cpus. Access is
mostly from startup and CPU HOTPLUG functions.

Note there's an additional change of the type of cpu_llc_id from int to u8
for ARCH i386 to correspond with the same type in ARCH x86_64.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/i386/kernel/cpu/intel_cacheinfo.c |    4 ++--
 arch/i386/kernel/smpboot.c             |    6 +++---
 arch/x86_64/kernel/smpboot.c           |    6 +++---
 include/asm-i386/processor.h           |    6 +-
 include/asm-x86_64/smp.h               |    9 -
 5 files changed, 17 insertions(+), 14 deletions(-)

--- a/arch/i386/kernel/cpu/intel_cacheinfo.c
+++ b/arch/i386/kernel/cpu/intel_cacheinfo.c
@@ -417,14 +417,14 @@
 	if (new_l2) {
 		l2 = new_l2;
 #ifdef CONFIG_X86_HT
-		cpu_llc_id[cpu] = l2_id;
+		per_cpu(cpu_llc_id, cpu) = l2_id;
 #endif
 	}

 	if (new_l3) {
 		l3 = new_l3;
 #ifdef CONFIG_X86_HT
-		cpu_llc_id[cpu] = l3_id;
+		per_cpu(cpu_llc_id, cpu) = l3_id;
 #endif
 	}

--- a/arch/i386/kernel/smpboot.c
+++ b/arch/i386/kernel/smpboot.c
@@ -67,7 +67,7 @@
 EXPORT_SYMBOL(smp_num_siblings);

 /* Last level cache ID of each logical CPU */
-int cpu_llc_id[NR_CPUS] __cpuinitdata = {[0 ... NR_CPUS-1] = BAD_APICID};
+DEFINE_PER_CPU(u8, cpu_llc_id) = BAD_APICID;

 /* representing HT siblings of each logical CPU */
 DEFINE_PER_CPU(cpumask_t, cpu_sibling_map);
@@ -348,8 +348,8 @@
 	}

 	for_each_cpu_mask(i, cpu_sibling_setup_map) {
-		if (cpu_llc_id[cpu] != BAD_APICID &&
-		    cpu_llc_id[cpu] == cpu_llc_id[i]) {
+		if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
+		    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
 			cpu_set(i, c[cpu].llc_shared_map);
 			cpu_set(cpu, c[i].llc_shared_map);
 		}
--- a/arch/x86_64/kernel/smpboot.c
+++ b/arch/x86_64/kernel/smpboot.c
@@ -65,7 +65,7 @@
 EXPORT_SYMBOL(smp_num_siblings);

 /* Last level cache ID of each logical CPU */
-u8 cpu_llc_id[NR_CPUS] __cpuinitdata = {[0 ... NR_CPUS-1] = BAD_APICID};
+DEFINE_PER_CPU(u8, cpu_llc_id) = BAD_APICID;

 /* Bitmask of currently online CPUs */
 cpumask_t cpu_online_map __read_mostly;
@@ -285,8 +285,8 @@
 	}

 	for_each_cpu_mask(i, cpu_sibling_setup_map) {
-		if (cpu_llc_id[cpu] != BAD_APICID &&
-		    cpu_llc_id[cpu] == cpu_llc_id[i]) {
+		if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
+		    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
 			cpu_set(i, c[cpu].llc_shared_map);
 			cpu_set(cpu, c[i].llc_shared_map);
 		}
--- a/include/asm-i386/processor.h
+++ b/include/asm-i386/processor.h
@@ -110,7 +110,11 @@
 #define current_cpu_data boot_cpu_data
 #endif

-extern int cpu_llc_id[NR_CPUS];
+/*
+ * the following now lives in the per cpu area:
+ * extern int cpu_llc_id[NR_CPUS];
+ */
+DECLARE_PER_CPU(u8, cpu_llc_id);
 extern char ignore_fpu_irq;

 void __init cpu_detect(struct cpuinfo_x86 *c);
--- a/include/asm-x86_64/smp.h
+++ b/include/asm-x86_64/smp.h
@@ -39,16 +39,14 @@
 extern void smp_send_reschedule(int cpu);

 /*
- * cpu_sibling_map and cpu_core_map now live
- * in the per cpu area
- *
+ * the following now live in the per cpu area:
  * extern cpumask_t cpu_sibling_map[NR_CPUS];
  * extern cpumask_t cpu_core_map[NR_CPUS];
+ * extern u8 cpu_llc_id[NR_CPUS];
  */
 DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 DECLARE_PER_CPU(cpumask_t, cpu_core_map);
-
-extern u8 cpu_llc_id[NR_CPUS];
+DECLARE_PER_CPU(u8, cpu_llc_id);

 #define SMP_TRAMPOLINE_BASE 0x6000

@@ -120,6 +118,7 @@
 #ifdef CONFIG_SMP
 #define cpu_physical_id(cpu)		per_cpu(x86_cpu_to_apicid, cpu)
 #else
+extern unsigned int boot_cpu_id;
 #define cpu_physical_id(cpu)		boot_cpu_id
 #endif /* !CONFIG_SMP */
 #endif
[PATCH 02/10] x86: fix cpu_to_node references (v3)
Fix four instances where cpu_to_node is referenced by array instead of via
the cpu_to_node macro. This is in preparation for moving it to the per_cpu
data area.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/vsyscall.c |    2 +-
 arch/x86_64/mm/numa.c         |    4 ++--
 arch/x86_64/mm/srat.c         |    4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

--- a/arch/x86_64/kernel/vsyscall.c
+++ b/arch/x86_64/kernel/vsyscall.c
@@ -291,7 +291,7 @@
 	unsigned long *d;
 	unsigned long node = 0;
 #ifdef CONFIG_NUMA
-	node = cpu_to_node[cpu];
+	node = cpu_to_node(cpu);
 #endif
 	if (cpu_has(&cpu_data[cpu], X86_FEATURE_RDTSCP))
 		write_rdtscp_aux((node << 12) | cpu);
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -261,7 +261,7 @@
 	   We round robin the existing nodes. */
 	rr = first_node(node_online_map);
 	for (i = 0; i < NR_CPUS; i++) {
-		if (cpu_to_node[i] != NUMA_NO_NODE)
+		if (cpu_to_node(i) != NUMA_NO_NODE)
 			continue;
 		numa_set_node(i, rr);
 		rr = next_node(rr, node_online_map);
@@ -543,7 +543,7 @@
 void __cpuinit numa_set_node(int cpu, int node)
 {
 	cpu_pda(cpu)->nodenumber = node;
-	cpu_to_node[cpu] = node;
+	cpu_to_node(cpu) = node;
 }

 unsigned long __init numa_free_all_bootmem(void)
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -431,9 +431,9 @@
 		setup_node_bootmem(i, nodes[i].start, nodes[i].end);

 	for (i = 0; i < NR_CPUS; i++) {
-		if (cpu_to_node[i] == NUMA_NO_NODE)
+		if (cpu_to_node(i) == NUMA_NO_NODE)
 			continue;
-		if (!node_isset(cpu_to_node[i], node_possible_map))
+		if (!node_isset(cpu_to_node(i), node_possible_map))
 			numa_set_node(i, NUMA_NO_NODE);
 	}
 	numa_init_array();
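As a side note on why going through the macro matters: an accessor macro that expands to an lvalue lets callers both read and assign without knowing where the data lives, so the backing store can later be changed in one place. A minimal userspace sketch, assuming a plain array as the backing store; node_table[], reset_nodes() and count_unassigned() are illustrative names only, and the NUMA_NO_NODE value here is just a placeholder.

```c
#define NR_CPUS      4     /* hypothetical maximum for the sketch */
#define NUMA_NO_NODE 0xFF  /* placeholder "no node" marker */

/* Hypothetical backing store; only the macro below knows about it. */
static unsigned char node_table[NR_CPUS];

/* Expands to an lvalue, so cpu_to_node(cpu) = node works as well as reads. */
#define cpu_to_node(cpu) (node_table[(cpu)])

static void reset_nodes(void)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		cpu_to_node(i) = NUMA_NO_NODE;
}

/* Same shape as numa_set_node() in the patch: assignment through the macro. */
static void numa_set_node(int cpu, int node)
{
	cpu_to_node(cpu) = (unsigned char)node;
}

/* Readers use the identical spelling. */
static int count_unassigned(void)
{
	int i, n = 0;

	for (i = 0; i < NR_CPUS; i++)
		if (cpu_to_node(i) == NUMA_NO_NODE)
			n++;
	return n;
}
```

Replacing the macro body with a per-CPU lookup would then leave numa_set_node() and count_unassigned() untouched, which is the property the later patches in this series rely on.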
[PATCH 03/10] x86: Convert cpu_core_map to be a per cpu variable (v3)
This is from an earlier message from 'Christoph Lameter': cpu_core_map is currently an array defined using NR_CPUS. This means that we overallocate since we will rarely really use maximum configured cpu. If we put the cpu_core_map into the per cpu area then it will be allocated for each processor as it comes online. This means that the core map cannot be accessed until the per cpu area has been allocated. Xen does a weird thing here looping over all processors and zeroing the masks that are not yet allocated and that will be zeroed when they are allocated. I commented the code out. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Mike Travis <[EMAIL PROTECTED]> --- arch/i386/kernel/cpu/cpufreq/acpi-cpufreq.c |2 - arch/i386/kernel/cpu/cpufreq/powernow-k8.c | 10 arch/i386/kernel/cpu/proc.c |3 +- arch/i386/kernel/smpboot.c | 34 ++-- arch/i386/xen/smp.c | 14 +-- arch/x86_64/kernel/mce_amd.c|6 ++-- arch/x86_64/kernel/setup.c |3 +- arch/x86_64/kernel/smpboot.c| 24 +-- include/asm-i386/smp.h |2 - include/asm-i386/topology.h |2 - include/asm-x86_64/smp.h|8 +- include/asm-x86_64/topology.h |2 - 12 files changed, 64 insertions(+), 46 deletions(-) --- a/include/asm-x86_64/smp.h +++ b/include/asm-x86_64/smp.h @@ -39,7 +39,13 @@ extern void smp_send_reschedule(int cpu); extern cpumask_t cpu_sibling_map[NR_CPUS]; -extern cpumask_t cpu_core_map[NR_CPUS]; +/* + * cpu_core_map lives in a per cpu area + * + * extern cpumask_t cpu_core_map[NR_CPUS]; + */ +DECLARE_PER_CPU(cpumask_t, cpu_core_map); + extern u8 cpu_llc_id[NR_CPUS]; #define SMP_TRAMPOLINE_BASE 0x6000 --- a/arch/i386/kernel/cpu/cpufreq/acpi-cpufreq.c +++ b/arch/i386/kernel/cpu/cpufreq/acpi-cpufreq.c @@ -595,7 +595,7 @@ dmi_check_system(sw_any_bug_dmi_table); if (bios_with_sw_any_bug && cpus_weight(policy->cpus) == 1) { policy->shared_type = CPUFREQ_SHARED_TYPE_ALL; - policy->cpus = cpu_core_map[cpu]; + policy->cpus = per_cpu(cpu_core_map, cpu); } #endif --- a/arch/i386/kernel/cpu/cpufreq/powernow-k8.c 
+++ b/arch/i386/kernel/cpu/cpufreq/powernow-k8.c @@ -57,7 +57,7 @@ static int cpu_family = CPU_OPTERON; #ifndef CONFIG_SMP -static cpumask_t cpu_core_map[1]; +DEFINE_PER_CPU(cpumask_t, cpu_core_map); #endif /* Return a frequency in MHz, given an input fid */ @@ -664,7 +664,7 @@ dprintk("cfid 0x%x, cvid 0x%x\n", data->currfid, data->currvid); data->powernow_table = powernow_table; - if (first_cpu(cpu_core_map[data->cpu]) == data->cpu) + if (first_cpu(per_cpu(cpu_core_map, data->cpu)) == data->cpu) print_basics(data); for (j = 0; j < data->numps; j++) @@ -818,7 +818,7 @@ /* fill in data */ data->numps = data->acpi_data.state_count; - if (first_cpu(cpu_core_map[data->cpu]) == data->cpu) + if (first_cpu(per_cpu(cpu_core_map, data->cpu)) == data->cpu) print_basics(data); powernow_k8_acpi_pst_values(data, 0); @@ -1212,7 +1212,7 @@ if (cpu_family == CPU_HW_PSTATE) pol->cpus = cpumask_of_cpu(pol->cpu); else - pol->cpus = cpu_core_map[pol->cpu]; + pol->cpus = per_cpu(cpu_core_map, pol->cpu); data->available_cores = &(pol->cpus); /* Take a crude guess here. 
@@ -1279,7 +1279,7 @@ cpumask_t oldmask = current->cpus_allowed; unsigned int khz = 0; - data = powernow_data[first_cpu(cpu_core_map[cpu])]; + data = powernow_data[first_cpu(per_cpu(cpu_core_map, cpu))]; if (!data) return -EINVAL; --- a/arch/i386/kernel/cpu/proc.c +++ b/arch/i386/kernel/cpu/proc.c @@ -122,7 +122,8 @@ #ifdef CONFIG_X86_HT if (c->x86_max_cores * smp_num_siblings > 1) { seq_printf(m, "physical id\t: %d\n", c->phys_proc_id); - seq_printf(m, "siblings\t: %d\n", cpus_weight(cpu_core_map[n])); + seq_printf(m, "siblings\t: %d\n", + cpus_weight(per_cpu(cpu_core_map, n))); seq_printf(m, "core id\t\t: %d\n", c->cpu_core_id); seq_printf(m, "cpu cores\t: %d\n", c->booted_cores); } --- a/arch/i386/kernel/smpboot.c +++ b/arch/i386/kernel/smpboot.c @@ -74,8 +74,8 @@ EXPORT_SYMBOL(cpu_sibling_map); /* representing HT and core siblings of each logical CPU */ -cpumask_t cpu_core_map[NR_CPUS] __read_mostly; -EXPORT_SYMBOL(cpu_core_map); +DEFINE_PER_CPU(cpumask_t, cpu_core_map); +EXPORT_PER_CPU_SYMBOL(cpu_core_map); /* bitmap of online cpus */ cpumask_t cpu_online_map __read_mostly; @@ -300,7 +300,7 @@ * And for powe
[PATCH 00/10] x86: Reduce Memory Usage and Inter-Node message traffic (v3)
Note:

This patch consolidates all the previous patches regarding the conversion
of static arrays sized by NR_CPUS into per_cpu data arrays and is referenced
against 2.6.23-rc6.

v1 Intro:

In x86_64 and i386 architectures most arrays that are sized using NR_CPUS
lie in local memory on node 0. Not only will most (99%?) of the systems
not use all the slots in these arrays, particularly when NR_CPUS is
increased to accommodate future very high cpu count systems, but a number
of cache lines are passed unnecessarily on the system bus when these
arrays are referenced by cpus on other nodes.

Typically, the values in these arrays are referenced by the cpu accessing
its own values, though when passing IPI interrupts, the cpu does access
the data relevant to the targeted cpu/node. Of course, if the referencing
cpu is not on node 0, then the reference will still require cross node
exchanges of cache lines. A common use of this is for an interrupt service
routine to pass the interrupt to other cpus local to that node.

Ideally, all the elements in these arrays should be moved to the per_cpu
data area. In some cases (such as x86_cpu_to_apicid) the array is
referenced before the per_cpu data areas are setup. In this case, a static
array is declared in the __initdata area and initialized by the booting cpu
(BSP). The values are then moved to the per_cpu area after it is
initialized and the original static array is freed with the rest of the
__initdata.

This patch is referenced against 2.6.23-rc6.

--
Changes for version v2:

> > Note the additional change of the cpu_llc_id type from u8
> > to int for ARCH x86_64 to correspond with ARCH i386.
>
> At least currently it cannot be more than 8 bit. So why
> waste memory? It would be better to change i386

Done. (x86_64 type => u8).

> > Fix four instances where cpu_to_node is referenced
> > by array instead of via the cpu_to_node macro. This
> > is preparation to moving it to the per_cpu data area.
>
> Shouldn't this patch be logically before the per cpu
> conversion (which is 3). This way the result would
> be git bisectable.

Done. (Moved to PATCH 1).

> > processor_core.c currently tries to determine the apicid by special
> > casing for IA64 and x86. The desired information is readily available via
> >
> >     cpu_physical_id()
> >
> > on IA64, i386 and x86_64.
>
> Have you tried this with a !CONFIG_SMP build? The drivers/dma code was doing
> the same and running into problems because it wasn't defined there.

Fixed. (New export in PATCH 1).

--
Changes for version v3:

cpu_sibling_map has been converted to a per_cpu data array to fix build
errors on ia64, ppc64 and sparc64 to accommodate references in
block/blktrace.c and kernel/sched.c when CONFIG_SCHED_SMT is defined.

Warning: ppc64 and sparc64 have not yet been built or tested.
[PATCH 01/10] x86: remove x86_cpu_to_log_apicid array (v3)
This is a copy of an older patch that is in rc3-mm1. It's needed to allow
the remaining patches to integrate correctly.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/genapic.c      |    2 --
 arch/x86_64/kernel/genapic_flat.c |    1 -
 arch/x86_64/kernel/smpboot.c      |    1 -
 include/asm-x86_64/smp.h          |    1 -
 4 files changed, 5 deletions(-)

--- a/arch/x86_64/kernel/genapic.c
+++ b/arch/x86_64/kernel/genapic.c
@@ -29,8 +29,6 @@
 					= { [0 ... NR_CPUS-1] = BAD_APICID };
 EXPORT_SYMBOL(x86_cpu_to_apicid);

-u8 x86_cpu_to_log_apicid[NR_CPUS]	= { [0 ... NR_CPUS-1] = BAD_APICID };
-
 struct genapic __read_mostly *genapic = &apic_flat;

 /*
--- a/arch/x86_64/kernel/genapic_flat.c
+++ b/arch/x86_64/kernel/genapic_flat.c
@@ -52,7 +52,6 @@
 	num = smp_processor_id();
 	id = 1UL << num;
-	x86_cpu_to_log_apicid[num] = id;
 	apic_write(APIC_DFR, APIC_DFR_FLAT);
 	val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
 	val |= SET_APIC_LOGICAL_ID(id);
--- a/arch/x86_64/kernel/smpboot.c
+++ b/arch/x86_64/kernel/smpboot.c
@@ -702,7 +702,6 @@
 		cpu_clear(cpu, cpu_present_map);
 		cpu_clear(cpu, cpu_possible_map);
 		x86_cpu_to_apicid[cpu] = BAD_APICID;
-		x86_cpu_to_log_apicid[cpu] = BAD_APICID;
 		return -EIO;
 	}

--- a/include/asm-x86_64/smp.h
+++ b/include/asm-x86_64/smp.h
@@ -78,7 +78,6 @@
  * the real APIC ID <-> CPU # mapping.
  */
 extern u8 x86_cpu_to_apicid[NR_CPUS];	/* physical ID */
-extern u8 x86_cpu_to_log_apicid[NR_CPUS];
 extern u8 bios_cpu_apicid[];

 static inline int cpu_present_to_apicid(int mps_cpu)
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote:
> > > OTOH, I'm not sure how much buy-in there was from the filesystems guys.
> > > Particularly Christoph H and XFS (which is strange because they already
> > > do vmapping in places).
> >
> > I think they use vmapping because they have to, not because they want
> > to. They might be a lot happier with fsblock if it used contiguous pages
> > for large blocks whenever possible - I don't know for sure. The metadata
> > accessors they might be unhappy with because it's inconvenient but as
> > Christoph Hellwig pointed out at VM/FS, the filesystems who really care
> > will convert.
>
> Sure, they would rather not to. But there are also a lot of ways you can
> improve vmap more than what XFS does (or probably what darwin does)
> (more persistence for cached objects, and batched invalidates for example).

XFS already has persistence across the object life time (which can be many
tens of seconds for a frequently used buffer) and it also does batched
unmapping of objects.

> There are also a lot of trivial things you can do to make a lot of those
> accesses not require vmaps (and less trivial things, but even such things
> as binary searches over multiple pages should be quite possible with a bit
> of logic).

Yes, we already do many of these things (via xfs_buf_offset()), but
that is not good enough for something like a memcpy that spans multiple
pages in a large block (think btree block compaction, splits and recombines).

IOWs, we already play these vmap harm-minimisation games in the places
where we can, but still the overhead is high and something we'd prefer
to be able to avoid.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
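For readers unfamiliar with the problem being discussed, here is a rough userspace sketch of what a copy out of a block built from separately allocated pages looks like when there is no contiguous mapping. It is not XFS code: struct paged_buf, paged_read() and the 4096-byte PAGE_SIZE are assumptions made for illustration. Every access has to be split at page boundaries, which is the kind of bookkeeping a vmap (or physically contiguous pages) avoids.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical buffer made of separately allocated, non-contiguous pages. */
struct paged_buf {
	unsigned char **pages;  /* one pointer per page */
	size_t nr_pages;
};

/*
 * Copy len bytes starting at byte offset off into dst, one page-bounded
 * chunk at a time. Returns 0 on success, -1 if the range does not fit.
 */
static int paged_read(const struct paged_buf *b, size_t off,
		      void *dst, size_t len)
{
	size_t total = b->nr_pages * PAGE_SIZE;
	unsigned char *out = dst;

	if (off > total || len > total - off)
		return -1;

	while (len > 0) {
		size_t page = off / PAGE_SIZE;
		size_t in_page = off % PAGE_SIZE;
		size_t chunk = PAGE_SIZE - in_page;

		if (chunk > len)
			chunk = len;
		memcpy(out, b->pages[page] + in_page, chunk);
		out += chunk;
		off += chunk;
		len -= chunk;
	}
	return 0;
}
```

A single-offset accessor covers the case where the data sits inside one page; anything that moves a range across a page boundary needs a loop like the one above on both the source and destination side.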
[PATCH 6/6] cpuset dirty limits
Per cpuset dirty ratios

This implements dirty ratios per cpuset. Two new files are added
to the cpuset directories:

background_dirty_ratio	Percentage at which background writeback starts

throttle_dirty_ratio	Percentage at which the application is throttled
			and we start synchronous writeout.

Both variables are set to -1 by default which means that the global
limits (/proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio)
are used for a cpuset.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 5/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 5/include/linux/cpuset.h	2007-09-11 14:50:48.0 -0700
+++ 7/include/linux/cpuset.h	2007-09-11 14:51:12.0 -0700
@@ -77,6 +77,7 @@ extern void cpuset_track_online_nodes(vo

 extern int current_cpuset_is_being_rebound(void);

+extern void cpuset_get_current_ratios(int *background, int *ratio);
 /*
  * We need macros since struct address_space is not defined yet
  */
diff -uprN -X 0/Documentation/dontdiff 5/kernel/cpuset.c 7/kernel/cpuset.c
--- 5/kernel/cpuset.c	2007-09-11 14:50:49.0 -0700
+++ 7/kernel/cpuset.c	2007-09-11 14:56:18.0 -0700
@@ -51,6 +51,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -92,6 +93,9 @@ struct cpuset {
 	int mems_generation;

 	struct fmeter fmeter;		/* memory_pressure filter */
+
+	int background_dirty_ratio;
+	int throttle_dirty_ratio;
 };

 /* Retrieve the cpuset for a container */
@@ -169,6 +173,8 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.background_dirty_ratio = -1,
+	.throttle_dirty_ratio = -1,
 };

 /*
@@ -785,6 +791,21 @@ static int update_flag(cpuset_flagbits_t
 	return 0;
 }

+static int update_int(int *cs_int, char *buf, int min, int max)
+{
+	char *endp;
+	int val;
+
+	val = simple_strtol(buf, &endp, 10);
+	if (val < min || val > max)
+		return -EINVAL;
+
+	mutex_lock(&callback_mutex);
+	*cs_int = val;
+	mutex_unlock(&callback_mutex);
+	return 0;
+}
+
 /*
  * Frequency meter - How fast is some event occurring?
  *
@@ -933,6 +954,8 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_THROTTLE_DIRTY_RATIO,
+	FILE_BACKGROUND_DIRTY_RATIO,
 } cpuset_filetype_t;

 static ssize_t cpuset_common_file_write(struct container *cont,
@@ -997,6 +1020,12 @@ static ssize_t cpuset_common_file_write(
 		retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
 		cs->mems_generation = cpuset_mems_generation++;
 		break;
+	case FILE_BACKGROUND_DIRTY_RATIO:
+		retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100);
+		break;
+	case FILE_THROTTLE_DIRTY_RATIO:
+		retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out2;
@@ -1090,6 +1119,12 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_SPREAD_SLAB:
 		*s++ = is_spread_slab(cs) ? '1' : '0';
 		break;
+	case FILE_BACKGROUND_DIRTY_RATIO:
+		s += sprintf(s, "%d", cs->background_dirty_ratio);
+		break;
+	case FILE_THROTTLE_DIRTY_RATIO:
+		s += sprintf(s, "%d", cs->throttle_dirty_ratio);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1173,6 +1208,20 @@ static struct cftype cft_spread_slab = {
 	.private = FILE_SPREAD_SLAB,
 };

+static struct cftype cft_background_dirty_ratio = {
+	.name = "background_dirty_ratio",
+	.read = cpuset_common_file_read,
+	.write = cpuset_common_file_write,
+	.private = FILE_BACKGROUND_DIRTY_RATIO,
+};
+
+static struct cftype cft_throttle_dirty_ratio = {
+	.name = "throttle_dirty_ratio",
+	.read = cpuset_common_file_read,
+	.write = cpuset_common_file_write,
+	.private = FILE_THROTTLE_DIRTY_RATIO,
+};
+
 static int cpuset_populate(struct container_subsys *ss, struct container *cont)
 {
 	int err;
@@ -1193,6 +1242,10 @@ static int cpuset_populate(struct contai
 		return err;
 	if ((err = container_add_file(cont, ss, &cft_spread_slab)) < 0)
 		return err;
+	if ((err = container_add_file(cont, ss, &cft_background_dirty_ratio)) < 0)
+		return err;
+	if ((err = container_add_file(cont, ss, &cft_throttle_dirty_ratio)) < 0)
+		return err;
 	/* memory_pressure_enabled is in root cpuset only */
 	if (err == 0 && !cont->parent)
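The "-1 means use the global limit" convention and the range check in update_int() can be tried out in userspace with something like the sketch below. It is deliberately stricter than the patch: update_int() as posted ignores endp, whereas this version also rejects empty input and trailing garbage. parse_ratio() and effective_ratio() are hypothetical names, and strtol() stands in for the kernel's simple_strtol().

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a decimal ratio from buf into *out, accepting values in
 * [min, max]. A single trailing newline is tolerated since writes from
 * a shell usually carry one. Returns 0 on success, -EINVAL otherwise.
 */
static int parse_ratio(const char *buf, int min, int max, int *out)
{
	char *endp;
	long val;

	errno = 0;
	val = strtol(buf, &endp, 10);
	if (endp == buf || errno == ERANGE)
		return -EINVAL;           /* no digits, or out of long range */
	if (*endp == '\n')
		endp++;
	if (*endp != '\0')
		return -EINVAL;           /* trailing garbage */
	if (val < min || val > max)
		return -EINVAL;

	*out = (int)val;
	return 0;
}

/* A cpuset value of -1 falls back to the global limit. */
static int effective_ratio(int cpuset_ratio, int global_ratio)
{
	return cpuset_ratio == -1 ? global_ratio : cpuset_ratio;
}
```

With the limits from the patch (-1 to 100), "40" and "-1" are accepted while "101", "abc" and "12x" are rejected; whether the kernel side should be equally strict is a separate question from what the patch does today.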
[PATCH 5/6] cpuset write vm writeout
Throttle VM writeout in a cpuset aware way

This bases the vm throttling from the reclaim path on the dirty ratio of the cpuset. Note that a cpuset is only effective if shrink_zone is called from direct reclaim.

kswapd has a cpuset context that includes the whole machine. VM throttling will only work during synchronous reclaim and not from kswapd.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h 2007-09-11 14:49:47.0 -0700
+++ 5/include/linux/writeback.h 2007-09-11 14:50:52.0 -0700
@@ -94,7 +94,7 @@ static inline void inode_sync_wait(struc
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes,gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c 2007-09-11 14:49:47.0 -0700
+++ 5/mm/page-writeback.c 2007-09-11 14:50:52.0 -0700
@@ -386,7 +386,7 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
 {
 	struct dirty_limits dl;
 
@@ -401,7 +401,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 
 	for ( ; ; ) {
-		get_dirty_limits(&dl, NULL, &node_online_map);
+		get_dirty_limits(&dl, NULL, nodes);
 
 		/*
 		 * Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c 2007-09-11 14:50:41.0 -0700
+++ 5/mm/vmscan.c 2007-09-11 14:50:52.0 -0700
@@ -1185,7 +1185,7 @@ static unsigned long shrink_zone(int pri
 		}
 	}
 
-	throttle_vm_writeout(sc->gfp_mask);
+	throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask);
 
 	atomic_dec(&zone->reclaim_in_progress);
 	return nr_reclaimed;
[PATCH 3/6] cpuset write throttle
Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset.

If, e.g., a cpuset contains only 1/10th of available memory then all of the memory of a cpuset can be dirtied without any writes being triggered. If all of the cpuset's memory is dirty then only 10% of total memory is dirty. The background writeback threshold is usually set at 10% and the synchronous threshold at 40%. So we are still below the global limits while the dirty ratio in the cpuset is 100%! Writeback throttling and background writeout do not work at all in such scenarios.

This patch makes dirty writeout cpuset aware. When determining the dirty limits in get_dirty_limits() we calculate values based on the nodes that are reachable from the current process (that has been dirtying the page). Then we can trigger writeout based on the dirty ratio of the memory in the cpuset.

We trigger writeout in a cpuset specific way. We go through the dirty inodes and search for inodes that have dirty pages on the nodes of the active cpuset. If an inode fulfills that requirement then we begin writeout of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite an expensive operation (in particular for large cpusets with hundreds of nodes) compared to just accessing the global counters if we do not have a cpuset. However, please remember that the global counters were only introduced recently. Before 2.6.18 we did add up per processor counters for each processor on each invocation of get_dirty_limits(). We now add per node information which I think is equal or less effort since there are fewer nodes than processors.
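The arithmetic in the 1/10th example above can be checked with a small standalone C sketch. The helper names dirty_percent() and over_threshold() are illustrative assumptions and not kernel functions; the 10% and 40% thresholds are the usual defaults quoted in the description, and the strict "greater than" test mirrors the nr_reclaimable > background_thresh comparison used in balance_dirty_pages().

```c
/* Hypothetical helper: percentage of `total` pages that are dirty. */
static unsigned long dirty_percent(unsigned long dirty, unsigned long total)
{
	return total ? (dirty * 100) / total : 0;
}

/* Returns 1 if the dirty percentage is strictly above the threshold. */
static int over_threshold(unsigned long percent, unsigned long threshold)
{
	return percent > threshold;
}
```

With 1,000,000 pages in the machine and a cpuset owning 100,000 of them, dirtying every page of the cpuset gives dirty_percent(100000, 1000000) == 10 globally, which does not exceed the 10% background threshold, while dirty_percent(100000, 100000) == 100 inside the cpuset exceeds both thresholds.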
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c --- 2/mm/page-writeback.c 2007-09-11 14:39:22.0 -0700 +++ 3/mm/page-writeback.c 2007-09-11 14:49:35.0 -0700 @@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode); static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); +struct dirty_limits { + long thresh_background; + long thresh_dirty; + unsigned long nr_dirty; + unsigned long nr_unstable; + unsigned long nr_writeback; +}; + /* * Work out the current dirty-memory clamping and background writeout * thresholds. @@ -121,16 +129,20 @@ static void background_writeout(unsigned * clamping level. */ -static unsigned long highmem_dirtyable_memory(unsigned long total) +static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total) { #ifdef CONFIG_HIGHMEM int node; unsigned long x = 0; + if (nodes == NULL) + nodes = &node_online_mask; for_each_node_state(node, N_HIGH_MEMORY) { struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM]; + if (!node_isset(node, nodes)) + continue; x += zone_page_state(z, NR_FREE_PAGES) + zone_page_state(z, NR_INACTIVE) + zone_page_state(z, NR_ACTIVE); @@ -154,26 +166,74 @@ static unsigned long determine_dirtyable x = global_page_state(NR_FREE_PAGES) + global_page_state(NR_INACTIVE) + global_page_state(NR_ACTIVE); - x -= highmem_dirtyable_memory(x); + x -= highmem_dirtyable_memory(NULL, x); return x + 1; /* Ensure that we never return 0 */ } -static void -get_dirty_limits(long *pbackground, long *pdirty, - struct address_space *mapping) +static int +get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping, + nodemask_t *nodes) { int background_ratio; /* Percentages */ int dirty_ratio; int unmapped_ratio; long background; long dirty; - unsigned long available_memory = determine_dirtyable_memory(); + unsigned long 
available_memory; + unsigned long nr_mapped; struct task_struct *tsk; + int is_subset = 0; - unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) + - global_page_state(NR_ANON_PAGES)) * 100) / - available_memory; +#ifdef CONFIG_CPUSETS + if (unlikely(nodes && + !nodes_subset(node_online_map, *nodes))) { + int node; + /* +* Calculate the limits relative to the current cpuset. +* +* We do not disregard highmem because all nodes (except +* maybe node 0) have either all memory in HIGHMEM (32 bit) or +* all memory in non HIGHMEM (64 bit). If we would disregard +* highmem then cpuset throttl
[PATCH 4/6] cpuset write vmscan
Direct reclaim: cpuset aware writeout

During direct reclaim we traverse down a zonelist and are carefully checking each zone if it's a member of the active cpuset. But then we call pdflush without enforcing the same restrictions. In a larger system this may have the effect of a massive amount of pages being dirtied and then either

A. No writeout occurs because global dirty limits have not been reached

or

B. Writeout starts randomly for some dirty inode in the system. Pdflush may just write out data for nodes in another cpuset and miss doing proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c 2007-09-11 14:41:56.0 -0700
+++ 4/mm/vmscan.c 2007-09-11 14:50:41.0 -0700
@@ -1301,7 +1301,8 @@ unsigned long do_try_to_free_pages(struc
 		 */
 		if (total_scanned > sc->swap_cluster_max +
 					sc->swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+					&cpuset_current_mems_allowed);
 			sc->may_writepage = 1;
 		}
 
[PATCH 2/6] cpuset write pdflush nodemask
pdflush: Allow the passing of a nodemask parameter If we want to support nodeset specific writeout then we need a way to communicate the set of nodes that an operation should affect. So add a nodemask_t parameter to the pdflush functions and also store the nodemask in the pdflush control structure. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c --- 1/fs/buffer.c 2007-09-11 14:36:24.0 -0700 +++ 2/fs/buffer.c 2007-09-11 14:39:22.0 -0700 @@ -372,7 +372,7 @@ static void free_more_memory(void) struct zone **zones; pg_data_t *pgdat; - wakeup_pdflush(1024); + wakeup_pdflush(1024, NULL); yield(); for_each_online_pgdat(pgdat) { diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c --- 1/fs/super.c2007-09-11 14:36:05.0 -0700 +++ 2/fs/super.c2007-09-11 14:39:22.0 -0700 @@ -616,7 +616,7 @@ int do_remount_sb(struct super_block *sb return 0; } -static void do_emergency_remount(unsigned long foo) +static void do_emergency_remount(unsigned long foo, nodemask_t *bar) { struct super_block *sb; @@ -644,7 +644,7 @@ static void do_emergency_remount(unsigne void emergency_remount(void) { - pdflush_operation(do_emergency_remount, 0); + pdflush_operation(do_emergency_remount, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c --- 1/fs/sync.c 2007-09-11 14:36:05.0 -0700 +++ 2/fs/sync.c 2007-09-11 14:39:22.0 -0700 @@ -21,9 +21,9 @@ * sync everything. Start out by waking pdflush, because that writes back * all queues in parallel. 
*/ -static void do_sync(unsigned long wait) +static void do_sync(unsigned long wait, nodemask_t *unused) { - wakeup_pdflush(0); + wakeup_pdflush(0, NULL); sync_inodes(0); /* All mappings, inodes and their blockdevs */ DQUOT_SYNC(NULL); sync_supers(); /* Write the superblocks */ @@ -38,13 +38,13 @@ static void do_sync(unsigned long wait) asmlinkage long sys_sync(void) { - do_sync(1); + do_sync(1, NULL); return 0; } void emergency_sync(void) { - pdflush_operation(do_sync, 0); + pdflush_operation(do_sync, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h --- 1/include/linux/writeback.h 2007-09-11 14:37:46.0 -0700 +++ 2/include/linux/writeback.h 2007-09-11 14:39:22.0 -0700 @@ -91,7 +91,7 @@ static inline void inode_sync_wait(struc /* * mm/page-writeback.c */ -int wakeup_pdflush(long nr_pages); +int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); void throttle_vm_writeout(gfp_t gfp_mask); @@ -122,7 +122,8 @@ balance_dirty_pages_ratelimited(struct a typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, void *data); -int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); +int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes), + unsigned long arg0, nodemask_t *nodes); int generic_writepages(struct address_space *mapping, struct writeback_control *wbc); int write_cache_pages(struct address_space *mapping, diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c --- 1/mm/page-writeback.c 2007-09-11 14:36:24.0 -0700 +++ 2/mm/page-writeback.c 2007-09-11 14:39:22.0 -0700 @@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -static void background_writeout(unsigned long _min_pages); +static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); /* * Work out the current dirty-memory clamping and background writeout @@ -272,7 
+272,7 @@ static void balance_dirty_pages(struct a */ if ((laptop_mode && pages_written) || (!laptop_mode && (nr_reclaimable > background_thresh))) - pdflush_operation(background_writeout, 0); + pdflush_operation(background_writeout, 0, NULL); } void set_page_dirty_balance(struct page *page) @@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask * writeback at least _min_pages, and keep writing until the amount of dirty * memory is less than the background threshold, or until we're all clean. */ -static void background_writeout(unsigned long _min_pages) +static void background_writeout(unsigned long _min_pages, nodemask_t *unused) { long min_pages = _min_pages; struct writeback_control wbc = { @@ -402,12 +402,12 @@ static void background_writeout(unsigned * the whole world. Returns 0 if a pdflush thread was dispatched. Returns * -1 if all pdflush thre
[PATCH 1/6] cpuset write dirty map
Add a dirty map to struct address_space

In a NUMA system it is helpful to know where the dirty pages of a mapping are located. That way we will be able to implement writeout for applications that are constrained to a portion of the memory of the system as required by cpusets.

This patch implements the management of dirty node maps for an address space through the following functions:

cpuset_clear_dirty_nodes(mapping)          Clear the map of dirty nodes
cpuset_update_dirty_nodes(mapping, page)   Record a node in the dirty nodes map
cpuset_init_dirty_nodes(mapping)           First time init of the map

The dirty map may be stored either directly in the mapping (for NUMA systems with fewer than BITS_PER_LONG nodes) or separately allocated for systems with a large number of nodes (e.g. IA64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large configurations. Therefore we protect the allocation and setting of a node in the map through the tree_lock. The tree_lock is already taken when a page is dirtied so there is no additional locking overhead if we insert the updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared. At that point no pages are attached to the inode anymore and therefore it can be done without any locking. The dirty map therefore records all nodes that have been used for dirty pages by that inode until the inode is no longer used.
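For the small-configuration case described above, where the map fits directly in the mapping, the bookkeeping can be sketched in plain C as follows. This is only an illustration under stated assumptions: struct mapping_sketch and the *_sketch helpers are hypothetical stand-ins, a single unsigned long plays the role of the nodemask, and the locking via tree_lock is left out.

```c
#include <limits.h>

/* Number of node bits that fit in the inline map (assumption of this sketch). */
#define SKETCH_BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical stand-in for the relevant part of struct address_space. */
struct mapping_sketch {
	unsigned long dirty_nodes;	/* bit n set: node n has held dirty pages */
};

/* First time init of the map: no node has dirtied anything yet. */
static void init_dirty_nodes_sketch(struct mapping_sketch *m)
{
	m->dirty_nodes = 0;
}

/* Record that a page on `node` was dirtied; setting a bit twice is harmless. */
static void update_dirty_nodes_sketch(struct mapping_sketch *m, int node)
{
	if (node >= 0 && node < SKETCH_BITS_PER_LONG)
		m->dirty_nodes |= 1UL << node;
}

/* Returns 1 if `node` has been recorded in the map, 0 otherwise. */
static int node_is_dirty_sketch(const struct mapping_sketch *m, int node)
{
	if (node < 0 || node >= SKETCH_BITS_PER_LONG)
		return 0;
	return (int)((m->dirty_nodes >> node) & 1UL);
}

/* Called when the inode is cleared: forget all recorded nodes. */
static void clear_dirty_nodes_sketch(struct mapping_sketch *m)
{
	m->dirty_nodes = 0;
}
```

Note that bits are only ever added between init and clear, which matches the statement that the map records every node used for dirty pages over the lifetime of the inode.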
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c --- 0/fs/buffer.c 2007-09-11 14:35:58.0 -0700 +++ 1/fs/buffer.c 2007-09-11 14:36:24.0 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -723,6 +724,7 @@ static int __set_page_dirty(struct page radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } + cpuset_update_dirty_nodes(mapping, page); write_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c --- 0/fs/fs-writeback.c 2007-09-11 14:35:58.0 -0700 +++ 1/fs/fs-writeback.c 2007-09-11 14:36:24.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include #include "internal.h" int sysctl_inode_debug __read_mostly; @@ -476,6 +477,12 @@ int generic_sync_sb_inodes(struct super_ continue; /* blockdev has wrong queue */ } + if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) { + /* No pages on the nodes under writeback */ + list_move(&inode->i_list, &sb->s_dirty); + continue; + } + /* Was this inode dirtied after sync_sb_inodes was called? 
*/ if (time_after(inode->dirtied_when, start)) break; diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c --- 0/fs/inode.c2007-09-11 14:35:58.0 -0700 +++ 1/fs/inode.c2007-09-11 14:36:24.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include /* * This is needed for the following functions: @@ -157,6 +158,7 @@ static struct inode *alloc_inode(struct mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; + cpuset_init_dirty_nodes(mapping); /* * If the block_device provides a backing_dev_info for client @@ -264,6 +266,7 @@ void clear_inode(struct inode *inode) bd_forget(inode); if (S_ISCHR(inode->i_mode) && inode->i_cdev) cd_forget(inode); + cpuset_clear_dirty_nodes(inode->i_mapping); inode->i_state = I_CLEAR; } diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h --- 0/include/linux/cpuset.h2007-09-11 14:35:58.0 -0700 +++ 1/include/linux/cpuset.h2007-09-11 14:36:24.0 -0700 @@ -77,6 +77,45 @@ extern void cpuset_track_online_nodes(vo extern int current_cpuset_is_being_rebound(void); +/* + * We need macros since struct address_space is not defined yet + */ +#if MAX_NUMNODES <= BITS_PER_LONG +#define cpuset_update_dirty_nodes(__mapping, __page) \ + do {\ + int node = page_to_nid(__page); \ + if (!node_isset(node, (__mapping)->dirty_nodes))\ + node_set(node, (__mapping)->dirty_nodes); \ + } while (0) + +#define cpuse
Re: [PATCH 0/6] cpuset aware writeback
Perform writeback and dirty throttling with awareness of cpuset mem_allowed. The theory of operation has two primary elements: 1. Add a nodemask per mapping which indicates the nodes which have set PageDirty on any page of the mappings. 2. Add a nodemask argument to wakeup_pdflush() which is propagated down to sync_sb_inodes. This leaves sync_sb_inodes() with two nodemasks. One is passed to it and specifies the nodes the caller is interested in syncing, and will either be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the caller's context. The second nodemask is attached to the inode's mapping and shows who has modified data in the inode. sync_sb_inodes() will then skip syncing of inodes if the nodemask argument does not intersect with the mapping nodemask. cpuset_current_mems_allowed will be passed in to pdflush background_writeout by try_to_free_pages and balance_dirty_pages. balance_dirty_pages also passes the nodemask in to writeback_inodes directly when doing active reclaim. Other callers do not limit inode writeback, passing in a NULL nodemask pointer. A final change is to get_dirty_limits. It takes a nodemask argument, and when it is null there is no change in behavior. If the nodemask is set, page statistics are accumulated only for specified nodes, and the background and throttle dirty ratios will be read from a new per-cpuset ratio feature. For testing I did a variety of basic tests, verifying individual features of the test. To verify that it fixes the core problem, I created a stress test which involved using cpusets and mems_allowed to split memory so that all daemons had memory set aside for them, and my memory stress test had a separate set of memory. The stress test was mmaping 7GB of a very large file on disk. It then scans the entire 7GB of memory reading and modifying each byte. 7GB is more than the amount of physical memory made available to the stress test. 
Using iostat I can see the initial period of reading from disk, followed by a period of simultaneous reads and writes as dirty bytes are pushed to make room for new reads.

In a separate log-in, in the other cpuset, I am running:

    while `true`; do date | tee -a date.txt; sleep 5; done

date.txt resides on the same disk as the large file mentioned above. The above while-loop serves the dual purpose of providing me visual clues of progress along with the opportunity for the "tee" command to become throttled writing to the disk.

The effect of this patchset is straightforward. Without it there are long hangs between appearances of the date. With it the dates are all 5 (or sometimes 6) seconds apart.

I also added printks to the kernel to verify that, without these patches, the tee was being throttled (along with lots of other things), and with the patch only pdflush is being throttled.

These patches are mostly unchanged from Chris Lameter's original changelist posted previously to linux-mm.
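The skip rule for sync_sb_inodes() described in this cover letter (two nodemasks, with a NULL caller mask meaning "all nodes") can be sketched in a few lines of C. The type nodemask_sketch_t and the function should_sync_inode_sketch() are hypothetical names for illustration; the real patch uses nodemask_t and cpuset_intersects_dirty_nodes().

```c
#include <stddef.h>

/* Hypothetical one-word nodemask: bit n stands for NUMA node n. */
typedef unsigned long nodemask_sketch_t;

/*
 * Returns 1 if an inode whose mapping has dirtied pages on `dirty_nodes`
 * should be written back for a request limited to `*wanted`, 0 if it can
 * be skipped. A NULL `wanted` means the caller does not limit writeback.
 */
static int should_sync_inode_sketch(nodemask_sketch_t dirty_nodes,
				    const nodemask_sketch_t *wanted)
{
	if (wanted == NULL)
		return 1;
	return (dirty_nodes & *wanted) != 0;
}
```

For a cpuset that owns nodes 0 and 1 (mask 0x3), an inode dirtied only on node 2 (mask 0x4) is skipped, while one dirtied on nodes 1 and 2 (mask 0x6) is written back.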
Re: [patch][Intel-IOMMU] Fix for IOMMU early crash
On Wed, Sep 12, 2007 at 05:48:52AM +1000, Paul Mackerras wrote:
> Keshavamurthy, Anil S writes:
>
> > Subject: Fix IOMMU early crash
> >
> > This patch avoids copying pci_bus's->sysdata to pci_dev's->sysdata as one can easily obtain the same through pci_dev->bus->sysdata.
>
> At the moment this will cause ppc64 to crash, since we rely on pci_dev->sysdata pointing to some node in the firmware device tree, either the device's node or the node for a parent bus.
>
> We could change the ppc64 code to use pci_dev->bus->sysdata in the case when pci_dev->sysdata is NULL, which would fix the problem. I think that change should be incorporated as part of this patch so that we don't break git bisection.

Can I expect the ppc64 code changes from you? Once I get yours, I will merge them with mine and post it again.

> In other words I don't want to see this patch applied as it stands.

Yup, I agree with you.

-Anil
Re: [announce] CFS-devel, performance improvements
Hi Ingo,

When compiling, I get:

In file included from kernel/sched.c:794:
kernel/sched_fair.c: In function 'task_new_fair':
kernel/sched_fair.c:857: error: 'sysctl_sched_child_runs_first' undeclared (first use in this function)
kernel/sched_fair.c:857: error: (Each undeclared identifier is reported only once
kernel/sched_fair.c:857: error: for each function it appears in.)

Presumably because sched_fair.c is being included into sched.c before sysctl_sched_child_runs_first is defined.

Regards,
Rob
Re: [PATCH] v3 of IBM power meter driver
On Tue, Sep 11, 2007 at 09:23:35AM -0400, Mark M. Hoffman wrote: > I am not an IPMI expert, so I would appreciate getting an Acked-by from > someone who knows more about that subsystem. > > Anyway, some comments are below. This is nowhere near a complete review yet. Thank you for the review! Comments interspersed below, though for brevity the one-liners have been fixed. > > +config SENSORS_IBMPEX > > + tristate "IBM PowerExecutive temperature/power sensors" > > + depends on IPMI_SI > > Open question: can we use "select" here? As written, it took some hunting to > even get this driver to show up as an option in menuconfig. Changed, since it seems reasonable that someone looking for PEx support might not necessarily know that it is based upon IPMI. > > +struct ibmpex_bmc_data { > > + struct list_headlist; > > + struct class_device *class_dev; > > My current stack of patches includes one which requires that this be changed > to 'struct device *hwmon_dev', as 'struct class_device' is going away soon. > You may rebase on my testing tree[1], or else I will just follow up with a > patch to fix this up after I eventually merge yours. > > [1] > http://lm-sensors.org/kernel?p=kernel/mhoffman/hwmon-2.6.git;a=shortlog;h=testing Done. > > +static ssize_t ibmpex_show_sensor(struct device *dev, > > + struct device_attribute *devattr, > > + char *buf) > > +{ > > + struct sensor_device_attribute *attr = to_sensor_dev_attr(devattr); > > + int iface = PEX_INTERFACE(attr->index); > > + int sensor = PEX_SENSOR(attr->index); > > + int func = PEX_FUNC(attr->index); > > + struct ibmpex_bmc_data *data = get_bmc_data(iface); > > ... especially given how many times you're going to call it. Is there any > reason you can't use the driver_data field of struct device *dev for that? I can (and did) update the code to use dev_get/set_drvdata for the accessors. 
However, the "iface" field exists as a mechanism to map interface numbers to struct ibmpex_bmc_data/struct device data because the callback that IPMI uses to notify clients that BMCs are going away only passes the interface number, not the struct device itself. Unfortunately, this means that get_bmc_data() must remain, but now it is only used once at the end of life.

> E.g. i2c based hwmon drivers do this at some point during the probe:
>
> i2c_set_clientdata(new_client, data);
>
> (which becomes)
>
> dev_set_drvdata(&new_client->dev, data);
>
> If you could do that, then you no longer need 'iface' at all in the function above... *that* may allow you to use the SENSOR_ATTR_2 mechanism from hwmon-sysfs.h - much easier to read than the manual number packing for 'sensor' and 'func'.

Doesn't look too hard; I'll have a go at it and see how it does.

> > + err = ibmpex_query_sensor_count(data);
> > + if (err < 0)
> > + return -ENOENT;
> > + data->num_sensors = err;
> > +
>
> Did you mean 'if (err <= 0)' ?

Yes.

> > + /* Create attributes */
> > + for (j = 0; j < PEX_NUM_SENSOR_FUNCS; j++)
> > + if (create_sensor(data, sensor_type, sensor_counter,
> > + i, j))
>
> Why not 'err = create_sensor(...)' and propagate the actual error here?

Rough draft syndrome? 'tis fixed. :)

--D
Re: [PATCH -mm] uvesafb: Don't access VGA registers directly when running on non-x86
On Wed, Sep 12, 2007 at 01:09:59AM +0200, Michal Januszewski wrote:
> The VGA registers are only available at their legacy IO locations on x86. Don't try to access them when running on other arches.
>
> Note that the code accessing them directly is just an optimization (limits slow BIOS function calls). We don't lose any functionality by using BIOS calls instead of it on non-x86.

If you do that, then you also have to #ifdef CONFIG_X86 around video/vga.h, as that drags in asm/vga.h, which does not exist on all platforms.

I have little interest in adding a stub vga.h on my architectures to support a driver that in practice works on nothing but x86.
Re: [announce] CFS-devel, performance improvements
Hi,

Out of curiosity: will I ever get answers to my questions?

bye, Roman
Re: x86 merge - a little feedback
On Tue, Sep 11, 2007 at 10:34:23PM +0100, Andi Kleen wrote: > > People do not expect code under arch/i386/ to be used by code under > > arch/x86_64/ and vice versa. > > > > That regularly results in people sending patches that don't compile on > > the other architecture. > > > > With one architecture it's much more obvious that the code is shared. > > Will that cause people to compile test both? I have my doubts that > will really work. > > e.g. a similar example would be CONFIG_MMU=n. The code > is mostly shared and in the same directories, but people still > break the MMUless architectures all the time. > As I was the first one to do CONFIG_MMU=y/n in the same arch directory, since 2.5, I can tell you that that's simply crap. The only reason CONFIG_MMU=n gets broken all the time is because people don't think about it in generic code, it's rarely broken in the architecture code, and even with the most occasional of build tests most of that gets caught in a hurry. You do of course have to consider both cases when writing new code, but those things tend to be pretty obvious. It's a bit more work for the arch maintainer, but it's certainly far less confusing and problematic than separating things out. In fact, going the opposite route is what leads to endless trouble in the long run, since you brought up the MMUless example, m68knommu is a good example. Rather than beginning life in arch/m68k, it was forked off, mostly to deal with the ColdFire CPUs that weren't planned to have MMUs. Now that the product line has moved along, adding an MMU to it is in the roadmap, which means that inevitably they're both going to have to be combined anyways. Simply dealing with the initial trouble of having them combined initially would have solved a lot of that mess. You can ignore the added maintenance for as long as possible, but sooner or later it's going to be a problem. Procrastination is not something that bodes particularly well for divergent hardware support. 
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wed, 12 Sep 2007, Andrea Arcangeli wrote:

> On Tue, Sep 11, 2007 at 01:41:08PM -0700, Christoph Lameter wrote:
> > The advantages of this approach over Andrea's is basically that the 4k filesystems still can be used as is. 4k is useful for binaries and for
>
> If you mean that with my approach you can't use a 4k filesystem as is, that's not correct. I even run the (admittedly premature but promising) benchmarks on my patch on a 4k blocksized filesystem... Guess what, you can even still mount a 1k fs on a 2.6 kernel.

Right, you can use a 4k filesystem. The 4k blocks are buffers in a larger page then.

> The main advantage I can see in your patch is that distributions won't need to ship a 64k PAGE_SIZE kernel rpm (but your single rpm will be slower).

I would think that your approach would be slower since you always have to populate 1 << N ptes when mmapping a file? Plus there is a lot of wastage of memory because even a file with one character needs an order N page? So there are fewer pages available for the same workload.

Then you are breaking mmap assumptions of applications because the order N kernel will no longer be able to map 4k pages. You likely need a new binary format that has pages correctly aligned. I know that we would need one on IA64 if we go beyond the established page sizes.
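To put rough numbers on the "1 << N ptes" and memory wastage points raised above, here is a small C sketch. It assumes a 4k base page and treats an order N page as 4096 << N bytes; the function names are hypothetical and the figures are plain arithmetic, not measurements.

```c
/* Number of 4k PTEs that have to be populated to map one order-N page. */
static unsigned long ptes_per_page_sketch(unsigned int order)
{
	return 1UL << order;
}

/*
 * Bytes of page cache left unused when a file of `size` bytes is cached
 * in order-N pages built from 4k base pages (assumption of this sketch).
 */
static unsigned long wasted_bytes_sketch(unsigned long size, unsigned int order)
{
	unsigned long page = 4096UL << order;
	unsigned long rem = size % page;

	return rem ? page - rem : 0;
}
```

For the 64k case mentioned in the quoted text (order 4), one page corresponds to 16 PTEs, and a one-byte file leaves 65535 bytes of its page unused, compared with 4095 bytes at order 0.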
Re: time_after - what on earth???
On 12/09/2007, Björn Steinbrink <[EMAIL PROTECTED]> wrote:
>
> A fix would likely initialize "when" to jiffies.
>
> Björn
>

Thanks, I'll try that :)
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 11 Sep 2007, Nick Piggin wrote:

> > Well it seems that we have different interpretations of what was agreed on. My understanding was that the large blocksize patchset was okay provided that I supply an acceptable mmap implementation and put a warning in.
>
> Yes. I think we differ on our interpretations of "okay". In my interpretation, it is not OK to use this patch as a way to solve VM or FS or IO scalability issues, especially not while the alternative approaches that do _not_ have these problems have not been adequately compared or argued against.

We never talked about not being able to solve scalability issues with this patchset. The alternate approaches were discussed at the VM MiniSummit and at the VM/FS meeting. You organized the VM/FS summit. I know you were there and were arguing for your approach. That was not sufficient?

> > Well even without slab targeted reclaim: Mel's antifrag will sort the dentries into separate blocks of memory and so isolate the issue.
>
> So even after all this time you do not understand what the fundamental problem is with anti-frag and yet you are happy to waste both our time in endless flamewars telling me how wrong I am about it.

We obviously have discussed this before and the only point of asking this question by you seems to be to have me repeat the whole line of argument again?

> Forgive me if I'm starting to be rude, Christoph. This is really irritating.

Sorry but I have had too much exposure to philosophy. Talk about absolutes like guarantees (that do not exist even for order 0 allocs) and unlikely memory fragmentation scenarios to show that something does not work seems to be getting into some metaphysical realm where there is no data anymore to draw any firm conclusions.

Software reliability is inherently probabilistic, otherwise we would not have things like CRC sums and SHA1 algorithms. It's just a matter of reducing the failure rate sufficiently.
The failure rate for lower order allocations (0-3) seems to have been significantly reduced in 2.6.23 through lumpy reclaim. If antifrag measures are not successful (likely for 2M allocs) then other methods (like the large page pools that you skipped when reading my post) will need to be used. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: + git-net-broke-ixgbe.patch added to -mm tree
[EMAIL PROTECTED] wrote:
> The patch titled
>      git-net-broke-ixgbe
> has been added to the -mm tree. Its filename is
>      git-net-broke-ixgbe.patch
>
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
>
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
>
> --
> Subject: git-net-broke-ixgbe
> From: Andrew Morton <[EMAIL PROTECTED]>
>
> igiveup

Relax! Do not despair! I will have a patch for ixgbe to fix up the NAPI API stuff tomorrow! This assumes that you have the version that I sent out last week though (v4). It's running smoke tests right now, should be ready tomorrow.

Cheers,

Auke
Re: time_after - what on earth???
On 2007.09.12 00:10:19 +0100, Adrian McMenamin wrote: > On 12/09/2007, Björn Steinbrink <[EMAIL PROTECTED]> wrote: > > On 2007.09.12 00:19:09 +0200, Rene Herman wrote: > > > On 09/12/2007 12:15 AM, Adrian McMenamin wrote: > > > > > >> On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote: > > >>> On 09/12/2007 12:05 AM, Adrian McMenamin wrote: > > >>> > > OK, why does this line occasionally return true: > > > > What exactly is "occassionally"? Does it happen more than once per > > boot? If not, and it happens after a certain time after booting, it > > might be wrapping of the jiffie counter (see below). > > > > > > if ((maple_dev->interval > 0) && (jiffies >maple_dev->when)) > > > > while this one never does (no other changes made): > > > > if ((maple_dev->interval > 0) && (time_after(jiffies, > > maple_dev->when))) > > >>> Is maple_dev->when an unsigned long? > > >>> > > >> Yes. Does that make a difference? > > > > > > If it had been a signed type, it could've wrapped to something you didn't > > > expect, explaining the difference at least... > > > > > > With an unsigned long, the only diference should be that time_after() > > > deals > > > with jiffie wrapping which I assume is not an actual problem here. I'll > > > retreat into the shades again... ;-( > > > > If "occasionally" is limited to once per boot, it might be jiffie > > wrapping. IIRC jiffies are initialized so that they wrap after about 5 > > minutes of uptime to reveal such bugs without forcing you to wait for > > ages just to have the counter wrap for the first time. > > > > No, I mean "works properly" - ie occasionally evaluates as true Ehrm, yeah, I somehow parsed that as if it had a negation in there. Anyway, I looked up the patches you posted. "when" is initialized to 0 and only changed if the above condition evaluates to true. Now, time_after and "<" have different points at which "future" and "past" are separated. 
time_after splits (about) equally between future and past, so 0 can be either, depending on the value of jiffies. But for a plain comparison, 0 is almost always in the past, except for the rare case of jiffies being 0.

Now, given that jiffies start out at a huge value to make the counter wrap around early, 0 happens to be in the "future" for time_after until the wraparound occurs. So in your case, you just might have to wait those 5 minutes to get the working behaviour, instead of the common case in which it breaks after that time ;-)

A fix would likely initialize "when" to jiffies.

Björn
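The difference between the two comparisons can be reproduced in user space. What follows is only a minimal sketch, not kernel code: it assumes 32-bit jiffies and HZ = 1000, and the names sketch_time_after, sketch_plain_after, SKETCH_HZ, SKETCH_INITIAL_JIFFIES and sketch_demo are made up for illustration. The signed-difference test mirrors the idea behind the kernel's time_after(), and the start value mirrors a counter that begins roughly 5 minutes before it wraps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: HZ = 1000 and a 32-bit jiffies counter. */
#define SKETCH_HZ 1000u
/* Hypothetical start value: 300 seconds' worth of ticks before the wrap. */
#define SKETCH_INITIAL_JIFFIES ((uint32_t)(0u - 300u * SKETCH_HZ))

/* Wrap-aware comparison: true when a is later than b. The unsigned
 * difference is reinterpreted as signed, so each value has about half of
 * the counter range in its "future" and half in its "past". (The cast
 * relies on the usual two's-complement conversion, as the kernel does.) */
static bool sketch_time_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Plain comparison: ignores wrap-around entirely. */
static bool sketch_plain_after(uint32_t a, uint32_t b)
{
    return a > b;
}

/* Walks through the situation from the thread with "when" left at 0. */
static void sketch_demo(void)
{
    uint32_t when = 0;

    /* Early after boot: the counter is huge, so ">" fires at once,
     * while the wrap-aware test still sees 0 as lying in the future. */
    assert(sketch_plain_after(SKETCH_INITIAL_JIFFIES, when));
    assert(!sketch_time_after(SKETCH_INITIAL_JIFFIES, when));

    /* After the counter has wrapped (here: 1000 ticks past zero),
     * both tests agree that 0 is in the past. */
    assert(sketch_plain_after(1000u, when));
    assert(sketch_time_after(1000u, when));

    /* A deadline set just before the wrap and checked just after it:
     * only the wrap-aware test reports that the deadline has passed. */
    assert(sketch_time_after(5u, 0xFFFFFFF0u));
    assert(!sketch_plain_after(5u, 0xFFFFFFF0u));
}
```

Calling sketch_demo() exercises all three cases. In this model, initializing "when" from the current counter value instead of 0 removes the early-boot discrepancy, which is in line with the fix suggested above.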
Re: [git patches] IDE fixes for 2.6.23-rc6
Bartlomiej Zolnierkiewicz wrote: Please pull from: master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6.git/ to receive the following updates: drivers/ata/pata_ali.c |7 ++ drivers/ide/Kconfig|4 +- drivers/ide/ide-iops.c |3 +- drivers/ide/pci/alim15x3.c |7 ++ drivers/ide/pci/hpt366.c | 138 +++- drivers/ide/pci/pdc202xx_new.c |9 ++- drivers/ide/pci/via82cxxx.c| 15 +++- drivers/ide/ppc/mpc8xx.c |1 - drivers/ide/setup-pci.c| 41 +--- include/linux/ide.h| 13 10 files changed, 141 insertions(+), 97 deletions(-) Bartlomiej Zolnierkiewicz (1): via82cxxx: add Arima W730-K8 and other rebadgings to short cables list Daniel Exner (1): pata_ali/alim15x3: override 80-wire cable detection for Toshiba S1800-814 Kumar Gala (1): mpc8xx: Only build mpc8xx on arch/ppc Mikael Pettersson (1): pdc202xx_new: PLL detection fix Sergei Shtylyov (5): ide: fix PCI refcounting pdc202xx_new: fix PCI refcounting hpt366: fix PCI clock detection for HPT374 (take 4) ide: add ide_dev_is_sata() helper (take 2) hpt366: UltraDMA filter for SATA cards (take 2) Tony Breeds (1): pmac: build fix diff --git a/drivers/ata/pata_ali.c b/drivers/ata/pata_ali.c index 94e5edc..71bdc3b 100644 --- a/drivers/ata/pata_ali.c +++ b/drivers/ata/pata_ali.c @@ -48,6 +48,13 @@ static struct dmi_system_id cable_dmi_table[] = { DMI_MATCH(DMI_BOARD_VERSION, "OmniBook N32N-736"), }, }, + { + .ident = "Toshiba Satelite S1800-814", + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "TOSHIBA"), + DMI_MATCH(DMI_PRODUCT_NAME, "S1800-814"), + }, + }, { } }; ACK - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
irq load balancing
Most of the load in my system is triggered by a single ethernet IRQ. Essentially the IRQ handler schedules a tasklet, and most of the work is done in that tasklet. From what I read, it looks like a tasklet is executed on the same CPU on which it was scheduled. So this means that even in an SMP system, one processor will be overloaded.

So will using the user space IRQ load balancer really help? What I am doubtful about is that the user space load balancer only comes along and changes the affinity once in a while. What I really need is for every interrupt to go to a different CPU in round-robin fashion.

It looks like the APIC can distribute IRQs dynamically? Is this supported in the kernel, and is there any config or proc interface to turn this on/off?

Thx,
Venkat
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, Sep 11, 2007 at 01:41:08PM -0700, Christoph Lameter wrote:
> The advantages of this approach over Andreas is basically that the 4k
> filesystems still can be used as is. 4k is useful for binaries and for

If you mean that with my approach you can't use a 4k filesystem as is, that's not correct. I even ran the (admittedly premature but promising) benchmarks of my patch on a filesystem with a 4k blocksize... Guess what, you can even still mount a 1k fs on a 2.6 kernel.

The main advantage I can see in your patch is that distributions won't need to ship a 64k PAGE_SIZE kernel rpm (but your single rpm will be slower).
Re: time_after - what on earth???
On 09/12/2007 01:09 AM, Björn Steinbrink wrote: On 2007.09.12 00:19:09 +0200, Rene Herman wrote: On 09/12/2007 12:15 AM, Adrian McMenamin wrote: On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote: On 09/12/2007 12:05 AM, Adrian McMenamin wrote: OK, why does this line occasionally return true: What exactly is "occassionally"? Does it happen more than once per boot? If not, and it happens after a certain time after booting, it might be wrapping of the jiffie counter (see below). if ((maple_dev->interval > 0) && (jiffies >maple_dev->when)) while this one never does (no other changes made): if ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when))) Is maple_dev->when an unsigned long? Yes. Does that make a difference? If it had been a signed type, it could've wrapped to something you didn't expect, explaining the difference at least... With an unsigned long, the only diference should be that time_after() deals with jiffie wrapping which I assume is not an actual problem here. I'll retreat into the shades again... ;-( If "occasionally" is limited to once per boot, it might be jiffie wrapping. IIRC jiffies are initialized so that they wrap after about 5 minutes of uptime to reveal such bugs without forcing you to wait for ages just to have the counter wrap for the first time. Yes, but if jiifie wrapping was the problem, I'd expect the contrary behaviour with the time_after() one hitting while the > one does not. Rene. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm] uvesafb: Don't access VGA registers directly when running on non-x86
The VGA registers are only available at their legacy IO locations on x86. Don't try to access them when running on other arches. Note that the code accessing them directly is just an optimization (limits slow BIOS function calls). We don't lose any functionality by using BIOS calls instead of it on non-x86. Signed-off-by: Michal Januszewski <[EMAIL PROTECTED]> --- diff --git a/drivers/video/uvesafb.c b/drivers/video/uvesafb.c index 853323e..74fa7c7 100644 --- a/drivers/video/uvesafb.c +++ b/drivers/video/uvesafb.c @@ -935,6 +935,7 @@ static int uvesafb_setpalette(struct uvesafb_pal_entry *entries, int count, if (start + count > 256) return -EINVAL; +#ifdef CONFIG_X86 /* Use VGA registers if mode is VGA-compatible. */ if (i >= 0 && i < par->vbe_modes_cnt && par->vbe_modes[i].mode_attr & VBE_MODE_VGACOMPAT) { @@ -957,8 +958,10 @@ static int uvesafb_setpalette(struct uvesafb_pal_entry *entries, int count, "D" (entries),/* EDI */ "S" (&par->pmi_pal)); /* ESI */ } -#endif - else { +#endif /* CONFIG_X86_32 */ + else +#endif /* CONFIG_X86 */ + { task = uvesafb_prep(); if (!task) return -ENOMEM; @@ -1102,6 +1105,7 @@ static int uvesafb_blank(int blank, struct fb_info *info) struct uvesafb_ktask *task; int err = 1; +#ifdef CONFIG_X86 if (par->vbe_ib.capabilities & VBE_CAP_VGACOMPAT) { int loop = 1; u8 seq = 0, crtc17 = 0; @@ -1124,7 +1128,9 @@ static int uvesafb_blank(int blank, struct fb_info *info) while (loop--); vga_wcrt(NULL, 0x17, crtc17); vga_wseq(NULL, 0x00, 0x03); - } else { + } else +#endif /* CONFIG_X86 */ + { task = uvesafb_prep(); if (!task) return -ENOMEM; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: time_after - what on earth???
On 12/09/2007, Björn Steinbrink <[EMAIL PROTECTED]> wrote: > On 2007.09.12 00:19:09 +0200, Rene Herman wrote: > > On 09/12/2007 12:15 AM, Adrian McMenamin wrote: > > > >> On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote: > >>> On 09/12/2007 12:05 AM, Adrian McMenamin wrote: > >>> > OK, why does this line occasionally return true: > > What exactly is "occassionally"? Does it happen more than once per > boot? If not, and it happens after a certain time after booting, it > might be wrapping of the jiffie counter (see below). > > > if ((maple_dev->interval > 0) && (jiffies >maple_dev->when)) > > while this one never does (no other changes made): > > if ((maple_dev->interval > 0) && (time_after(jiffies, > maple_dev->when))) > >>> Is maple_dev->when an unsigned long? > >>> > >> Yes. Does that make a difference? > > > > If it had been a signed type, it could've wrapped to something you didn't > > expect, explaining the difference at least... > > > > With an unsigned long, the only diference should be that time_after() deals > > with jiffie wrapping which I assume is not an actual problem here. I'll > > retreat into the shades again... ;-( > > If "occasionally" is limited to once per boot, it might be jiffie > wrapping. IIRC jiffies are initialized so that they wrap after about 5 > minutes of uptime to reveal such bugs without forcing you to wait for > ages just to have the counter wrap for the first time. > No, I mean "works properly" - ie occasionally evaluates as true - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: time_after - what on earth???
On 2007.09.12 00:19:09 +0200, Rene Herman wrote: > On 09/12/2007 12:15 AM, Adrian McMenamin wrote: > >> On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote: >>> On 09/12/2007 12:05 AM, Adrian McMenamin wrote: >>> OK, why does this line occasionally return true: What exactly is "occassionally"? Does it happen more than once per boot? If not, and it happens after a certain time after booting, it might be wrapping of the jiffie counter (see below). if ((maple_dev->interval > 0) && (jiffies >maple_dev->when)) while this one never does (no other changes made): if ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when))) >>> Is maple_dev->when an unsigned long? >>> >> Yes. Does that make a difference? > > If it had been a signed type, it could've wrapped to something you didn't > expect, explaining the difference at least... > > With an unsigned long, the only diference should be that time_after() deals > with jiffie wrapping which I assume is not an actual problem here. I'll > retreat into the shades again... ;-( If "occasionally" is limited to once per boot, it might be jiffie wrapping. IIRC jiffies are initialized so that they wrap after about 5 minutes of uptime to reveal such bugs without forcing you to wait for ages just to have the counter wrap for the first time. Björn - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm] video: uvesafb: Add X86 dependency.
On Tue, Sep 11, 2007 at 09:31:59PM +0900, Paul Mundt wrote: > > Anyway, I think it is up to Michal to decide if we should remove the > > kernel support for other archs, or let it stay a bit more while working > > on solving the v86d side of things. So I'll just step aside now > > > Once v86d is fixed up to get at the ROM directly and the driver uses MMIO > directly, I don't see a problem with unrestricting it. For the time being > however, the build is both broken, and the emulator it uses won't run on > anything but x86, so I see no reason not to add a Kconfig dependency that > reflects this until such a time where it's no longer true. > > At least if there's a set of restrictions on something fairly generic, > they tend to be visible, and they also tend to get fixed up over time. We > should however not enable something generically which at the moment is > very much tied to a single platform. Later patches can remove the > dependency at such a time that that assertion no longer holds true. Just to clear things up: yes, at the moment v86d supports only x86 and amd64 (aka x86_64) and yes, supporting other arches is possible and planned. The main limiting factors as far as this is concerned are the little amount of my free time and the fact that I don't currently have access to non-x86 hardware. Please note that the kernel part (i.e. uvesafb) is meant to be generic (it currently uses VGA IO ports on non-x86, which is a bug and for which a patch will follow) and support or lack thereof for a specific arch should be dependent on v86d only. That being said, I think that having a kernel dependency tracking the development status of userspace code is generally a bad idea. 
Best regards,
-- 
Michal Januszewski                        JID: [EMAIL PROTECTED]
Gentoo Linux Developer                    http://people.gentoo.org/spock
[PATCH] eCryptfs: Use generic_file_splice_read()
eCryptfs is currently just passing through splice reads to the lower filesystem. This is obviously incorrect behavior; the decrypted data is what needs to be read, not the lower encrypted data. I cannot think of any good reason for eCryptfs to implement splice_read, so this patch points the eCryptfs fops splice_read to use generic_file_splice_read. Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]> --- linux-2.6.23-rc4-mm1.orig/fs/ecryptfs/file.c +++ linux-2.6.23-rc4-mm1/fs/ecryptfs/file.c @@ -338,21 +338,6 @@ static int ecryptfs_fasync(int fd, struc return rc; } -static ssize_t ecryptfs_splice_read(struct file *file, loff_t * ppos, - struct pipe_inode_info *pipe, size_t count, - unsigned int flags) -{ - struct file *lower_file = NULL; - int rc = -EINVAL; - - lower_file = ecryptfs_file_to_lower(file); - if (lower_file->f_op && lower_file->f_op->splice_read) - rc = lower_file->f_op->splice_read(lower_file, ppos, pipe, - count, flags); - - return rc; -} - static int ecryptfs_ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg); @@ -365,7 +350,7 @@ const struct file_operations ecryptfs_di .release = ecryptfs_release, .fsync = ecryptfs_fsync, .fasync = ecryptfs_fasync, - .splice_read = ecryptfs_splice_read, + .splice_read = generic_file_splice_read, }; const struct file_operations ecryptfs_main_fops = { @@ -382,7 +367,7 @@ const struct file_operations ecryptfs_ma .release = ecryptfs_release, .fsync = ecryptfs_fsync, .fasync = ecryptfs_fasync, - .splice_read = ecryptfs_splice_read, + .splice_read = generic_file_splice_read, }; static int - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: libata not working for sis5533
On Sun, 09 Sep 2007 13:35:26 +0200 Patrizio Bassi <[EMAIL PROTECTED]> wrote: > Patrizio Bassi ha scritto: > > Jan Engelhardt ha scritto: > >> On Sep 8 2007 11:38, Patrizio Bassi wrote: > >> > >>> Jan Engelhardt wrote: > >>> > I shall give this a spin too, since I happen to have sis5513. > > Just booted this fresh ata-enabled system (a matter of mkinitrd). > It has > not exploded yet. > > >>> don't you have the "irq 14" issue? > >>> > >> No, does not seem so. > >> > >> > >>> can you post here your .config? > >>> > >> http://rafb.net/p/vfTX0966.html > >> > >> Maybe it is solved in 2.6.22.3? (I don't remember what your version > >> was.) > >> > >> > >> Jan > >> > > > > For Alan, libata devs...hope can help debug... > > this is http://www.patriziobassi.it/downloads/libata_issue.jpg Looks more like a platform irq routing issue than an ata issue. Perhaps an x86 or an acpi person can help out with this. Probably nothing will happen, in which case I'll get back to you later and ask you to raise a bugzilla entry, not that this will get it fixed :( > > and this is the relative config i'm using > > http://www.patriziobassi.it/downloads/config > > > > Let me know > > > > Patrizio > > more debug: > > I tried as suggested with the irqpoll option, i just get a faster panic > as i don't have the 3 xfermode lines...but always impossibile to boot... > > Patrizio > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] timekeeping: Prevent time going backwards on resume
Patch below fixes the problem we were seeing (negative delta calculated in tick_do_update_jiffies64). Thanks again Thomas! On Wed, Sep 12, 2007 at 12:36:34AM +0200, Thomas Gleixner wrote: > Timekeeping resume adjusts xtime by adding the slept time in seconds and > resets the reference value of the clock source (clock->cycle_last). > clock->cycle last is used to calculate the delta between the last xtime > update and the readout of the clock source in __get_nsec_offset(). xtime > plus the offset is the current time. The resume code ignores the delta > which had already elapsed between the last xtime update and the actual > time of suspend. If the suspend time is short, then we can see time > going backwards on resume. > > Suspend: > offs_s = clock->read() - clock->cycle_last; > now = xtime + offs_s; > timekeeping_suspend_time = read_rtc(); > > Resume: > sleep_time = read_rtc() - timekeeping_suspend_time; > xtime.tv_sec += sleep_time; > clock->cycle_last = clock->read(); > offs_r = clock->read() - clock->cycle_last; > now = xtime + offs_r; > > if sleep_time_seconds == 0 and offs_r < offs_s, then time goes > backwards. > > Fix this by storing the offset from the last xtime update and add it to > xtime during resume, when we reset clock->cycle_last: > > sleep_time = read_rtc() - timekeeping_suspend_time; > xtime.tv_sec += sleep_time; > xtime += offs_s; /* Fixup xtime offset at suspend time */ > clock->cycle_last = clock->read(); > offs_r = clock->read() - clock->cycle_last; > now = xtime + offs_r; > > Thanks to Marcelo for tracking this down on the OLPC and providing the > necessary details to analyze the root cause. 
> > Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]> > > --- a/kernel/time/timekeeping.c > +++ b/kernel/time/timekeeping.c > @@ -280,6 +280,8 @@ void __init timekeeping_init(void) > static int timekeeping_suspended; > /* time in seconds when suspend began */ > static unsigned long timekeeping_suspend_time; > +/* xtime offset when we went into suspend */ > +static s64 timekeeping_suspend_offset; > > /** > * timekeeping_resume - Resumes the generic timekeeping subsystem. > @@ -305,6 +307,8 @@ static int timekeeping_resume(struct sys_device *dev) > wall_to_monotonic.tv_sec -= sleep_length; > total_sleep_time += sleep_length; > } > + /* Make sure that we have the correct xtime reference */ > + timespec_add_ns(&xtime, timekeeping_suspend_offset); > /* re-base the last cycle value */ > clock->cycle_last = clocksource_read(clock); > clock->error = 0; > @@ -326,6 +330,8 @@ static int timekeeping_suspend(struct sys_device *dev, > pm_message_t state) > unsigned long flags; > > write_seqlock_irqsave(&xtime_lock, flags); > timekeeping_suspended = 1; > + /* Get the current xtime offset */ > + timekeeping_suspend_offset = __get_nsec_offset(); > timekeeping_suspend_time = read_persistent_clock(); > write_sequnlock_irqrestore(&xtime_lock, flags); > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: sk98lin for 2.6.23-rc1
On Tue, Sep 11, 2007 at 05:03:57PM +0200, Adrian Bunk wrote: > On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote: > > So if you want people to try a new driver, I think it really has to have > > some benefits to the users, in terms of performance, reliability, or > > features. "Cleaner design" doesn't motivate, and it does raise the question > > of why the old driver wasn't just cleaned up. I've been doing software for > > decades, I appreciate why, but users in general just want to use their > > system. Which raises the question of why to delete drivers which work for > > many or even most users? > > As I already explained, there is a long term advantage for all users if > there is only one driver in the kernel. Not only that. You have to place the switch in its context with history. Stephen, please correct me if I'm wrong, but sk98lin has been randomly working for a very long time. Not 100% the driver's fault, because it has had to workaround a lot of chips bugs. The fact that this driver supports *all* chips in the family makes it harder to identify whether problems are caused by the hardware or by the driver because it is bloated with tons of if/else. I've personally encountered random data corruption on the receive path with PCI-E hardware with sk98lin, as well as random TX stops. Sometimes it would require one terabyte of data, sometimes just a few hundreds megs. On other hardware (skge now), UDP would simply stop being sent and some TCP traffic was necessary to restart UDP! One guy at Marvell once asked me for more information, but it was not easy to provide much more, given the randomness of the problems! Stephen has done an excellent (and thankless) job at restarting from scratch, and the idea to separate the two chips was a good one IMHO. 
The problem is that he might have thought that most of the bugs were in the driver, while most of them are in the hardware, and this requires a lot of workarounds, which do not always work the same for everybody (I remember having tried to disable flow control with sk98lin because it helped with sky2). In parallel, sk98lin has improved on the vendor's site. v8 exhibited all the problems I explained above, but v10 has fixed a lot of them, making the new sk98lin more reliable. In parallel, sky2 and skge had got wider acceptance and testing. The nastiest hardware bugs will slowly surface, a good deal of driver bugs have been detected too (and that's expected from any new driver). It is possible that after 2 or 3 patches, a lot of the remaining problems will suddenly vanish. But it's also possible that the driver will still not work for 1% of people for 1 or 2 years because of some obscure hardware combinations which trigger some obscure hardware bugs. > Therefore all users should > switch away from obsolete drivers to the replacement drivers, and the > obsolete driver will be removed at some point in time. The only question > is how to do it. Desktop users genreally have no problem experimenting with multiple kernels or drivers. They can report feedback too, but generally, they're not very good at downloading alternative drivers and patching their kernel with those. Server users cannot experiment for a long time. After 2 or 3 losses of service, they *have* to provide a definitive solution. For some of them when sky2 fails, it may very well be to switch over to sk98lin. Downloading from the vendor's site and patching is not a problem for those users, but it causes them the trouble of updating the kernel for security fixes, so the old driver must be shipped with the kernel. However, I remember something which might constitute a solution. In 2.4, there's a small bug in the kbuild process on alpha. One question is always asked during make oldconfig. 
Its saved value is ignored because of the way it is computed. I don't know if we could do this with 2.6 kbuild. It would then be nice to always set sk98lin to unset if it was set to "Y" or "M", so that at each build, the user has to explicitly state he wants it. It's annoying enough to give the other one a try once in a while, without causing too much trouble to people who really have no other choice right now. What we need with this driver is people being fed up with it, not them being unable to use it as a last resort. Also, given that it has improved over the last years (probably due to competition pressure from sky2/skge), users will even less understand why there is such incentive to remove it. Another trick for obsolete drivers would be to simply remove them from the usual build system, but have them being available for explicit build. Eg: make modules will not build them, but make obsolete-modules would do. > > Testing a new kernel is no longer a drop in a boot > > operation if modprobe.conf must be edited to get the network up, and the > > typical user isn't going to write that shell script to try one or the other > > driver. > > The typical user will let his distribution handl
[PATCH] timekeeping: Prevent time going backwards on resume
Timekeeping resume adjusts xtime by adding the slept time in seconds and
resets the reference value of the clock source (clock->cycle_last).
clock->cycle_last is used to calculate the delta between the last xtime
update and the readout of the clock source in __get_nsec_offset(). xtime
plus the offset is the current time. The resume code ignores the delta
which had already elapsed between the last xtime update and the actual
time of suspend. If the suspend time is short, then we can see time
going backwards on resume.

Suspend:
	offs_s = clock->read() - clock->cycle_last;
	now = xtime + offs_s;
	timekeeping_suspend_time = read_rtc();

Resume:
	sleep_time = read_rtc() - timekeeping_suspend_time;
	xtime.tv_sec += sleep_time;
	clock->cycle_last = clock->read();
	offs_r = clock->read() - clock->cycle_last;
	now = xtime + offs_r;

If sleep_time == 0 and offs_r < offs_s, then time goes backwards.

Fix this by storing the offset from the last xtime update and adding it to
xtime during resume, when we reset clock->cycle_last:

	sleep_time = read_rtc() - timekeeping_suspend_time;
	xtime.tv_sec += sleep_time;
	xtime += offs_s;	/* Fixup xtime offset at suspend time */
	clock->cycle_last = clock->read();
	offs_r = clock->read() - clock->cycle_last;
	now = xtime + offs_r;

Thanks to Marcelo for tracking this down on the OLPC and providing the
necessary details to analyze the root cause.

Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -280,6 +280,8 @@ void __init timekeeping_init(void)
 static int timekeeping_suspended;
 /* time in seconds when suspend began */
 static unsigned long timekeeping_suspend_time;
+/* xtime offset when we went into suspend */
+static s64 timekeeping_suspend_offset;
 
 /**
  * timekeeping_resume - Resumes the generic timekeeping subsystem.
@@ -305,6 +307,8 @@ static int timekeeping_resume(struct sys_device *dev)
 		wall_to_monotonic.tv_sec -= sleep_length;
 		total_sleep_time += sleep_length;
 	}
+	/* Make sure that we have the correct xtime reference */
+	timespec_add_ns(&xtime, timekeeping_suspend_offset);
 	/* re-base the last cycle value */
 	clock->cycle_last = clocksource_read(clock);
 	clock->error = 0;
@@ -326,6 +330,8 @@ static int timekeeping_suspend(struct sys_device *dev, pm_message_t state)
 	unsigned long flags;
 
 	write_seqlock_irqsave(&xtime_lock, flags);
 	timekeeping_suspended = 1;
+	/* Get the current xtime offset */
+	timekeeping_suspend_offset = __get_nsec_offset();
 	timekeeping_suspend_time = read_persistent_clock();
 	write_sequnlock_irqrestore(&xtime_lock, flags);
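The arithmetic in the changelog can be checked with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: one clocksource cycle is taken to be one nanosecond, the RTC is assumed to report a sleep of 0 seconds for a very short suspend, and all names (struct sketch_clock, sketch_now, sketch_resume_delta, SKETCH_NSEC_PER_SEC) and the concrete numbers are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000LL

/* Toy timekeeper; assumes 1 clocksource cycle == 1 ns. */
struct sketch_clock {
    int64_t xtime_ns;   /* wall time at the last xtime update */
    int64_t cycle_last; /* clocksource value at the last xtime update */
};

/* now = xtime + (clock->read() - clock->cycle_last) */
static int64_t sketch_now(const struct sketch_clock *c, int64_t counter)
{
    return c->xtime_ns + (counter - c->cycle_last);
}

/* Runs one short suspend/resume cycle and returns
 * (time read just after resume) - (time read at suspend), in ns.
 * With apply_fix == 0 the offset accumulated before suspend is dropped;
 * with apply_fix != 0 it is added back to xtime, as the patch does. */
static int64_t sketch_resume_delta(int apply_fix)
{
    struct sketch_clock c = { 10 * SKETCH_NSEC_PER_SEC, 1000 };

    /* Suspend: 4 ms have passed since the last xtime update. */
    int64_t suspend_counter = c.cycle_last + 4000000;
    int64_t offs_s = suspend_counter - c.cycle_last;
    int64_t before = sketch_now(&c, suspend_counter);

    /* Resume: the RTC has one-second resolution, so a short sleep
     * is assumed to read back as 0 seconds. */
    int64_t sleep_sec = 0;
    int64_t resume_counter = 500; /* arbitrary counter value at resume */

    c.xtime_ns += sleep_sec * SKETCH_NSEC_PER_SEC;
    if (apply_fix)
        c.xtime_ns += offs_s;     /* fix up the xtime offset from suspend */
    c.cycle_last = resume_counter; /* re-base the last cycle value */

    /* First time readout, 1 us after resume (offs_r = 1000 ns). */
    return sketch_now(&c, resume_counter + 1000) - before;
}

/* Checks both variants against the hand-computed values. */
static void sketch_demo(void)
{
    assert(sketch_resume_delta(0) == -3999000); /* time went backwards */
    assert(sketch_resume_delta(1) == 1000);     /* time moves forward */
}
```

Without the fix the model loses the 4 ms that had elapsed before suspend and gains only the 1 us since resume, so the delta is negative. With the offset folded into xtime, the delta equals offs_r.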
[git patches] ocfs2 fixes
This includes a small doc update, which I missed earlier. It doesn't change any code. The other three patches are real fixes. --Mark Please pull from 'upstream-linus' branch of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2.git upstream-linus to receive the following updates: Documentation/filesystems/ocfs2.txt | 13 -- fs/Kconfig |3 - fs/ocfs2/alloc.c|1 fs/ocfs2/aops.c |4 +- fs/ocfs2/file.c |1 fs/ocfs2/super.c| 69 +++- 6 files changed, 50 insertions(+), 41 deletions(-) Mark Fasheh (2): ocfs2: update docs for new features ocfs2: Fix calculation of i_blocks during truncate Tiger Yang (1): ocfs2: fix mount option parsing [EMAIL PROTECTED] (1): ocfs2: Fix a wrong cluster calculation. diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt index 8ccf0c1..ed55238 100644 --- a/Documentation/filesystems/ocfs2.txt +++ b/Documentation/filesystems/ocfs2.txt @@ -28,11 +28,7 @@ Manish Singh <[EMAIL PROTECTED]> Caveats === Features which OCFS2 does not support yet: - - sparse files - extended attributes - - shared writable mmap - - loopback is supported, but data written will not - be cluster coherent. - quotas - cluster aware flock - cluster aware lockf @@ -57,3 +53,12 @@ nointr Do not allow signals to interrupt cluster atime_quantum=60(*)OCFS2 will not update atime unless this number of seconds has passed since the last update. Set to zero to always update atime. +data=ordered (*) All data are forced directly out to the main file + system prior to its metadata being committed to the + journal. +data=writeback Data ordering is not preserved, data may be written + into the main file system after its metadata has been + committed to the journal. +preferred_slot=0(*)During mount, try to use this filesystem slot first. If + it is in use by another node, the first empty one found + will be chosen. Invalid values will be ignored. 
diff --git a/fs/Kconfig b/fs/Kconfig index 58a0650..f9eed6d 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -441,9 +441,6 @@ config OCFS2_FS Note: Features which OCFS2 does not support yet: - extended attributes - - shared writeable mmap - - loopback is supported, but data written will not - be cluster coherent. - quotas - cluster aware flock - Directory change notification (F_NOTIFY) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 4f51766..778a850 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -5602,6 +5602,7 @@ static int ocfs2_do_truncate(struct ocfs2_super *osb, clusters_to_del; spin_unlock(&OCFS2_I(inode)->ip_lock); le32_add_cpu(&fe->i_clusters, -clusters_to_del); + inode->i_blocks = ocfs2_inode_sector_count(inode); status = ocfs2_trim_tree(inode, path, handle, tc, clusters_to_del, &delete_blk); diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 460d440..50cd8a2 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -855,6 +855,7 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp, struct ocfs2_super *osb, loff_t pos, unsigned len, struct buffer_head *di_bh) { + u32 cend; struct ocfs2_write_ctxt *wc; wc = kzalloc(sizeof(struct ocfs2_write_ctxt), GFP_NOFS); @@ -862,7 +863,8 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp, return -ENOMEM; wc->w_cpos = pos >> osb->s_clustersize_bits; - wc->w_clen = ocfs2_clusters_for_bytes(osb->sb, len); + cend = (pos + len - 1) >> osb->s_clustersize_bits; + wc->w_clen = cend - wc->w_cpos + 1; get_bh(di_bh); wc->w_di_bh = di_bh; diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 4ffa715..7e34e66 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -314,7 +314,6 @@ static int ocfs2_orphan_for_truncate(struct ocfs2_super *osb, } i_size_write(inode, new_i_size); - inode->i_blocks = ocfs2_align_bytes_to_sectors(new_i_size); inode->i_ctime = inode->i_mtime = CURRENT_TIME; di = (struct ocfs2_dinode *) fe_bh->b_data; diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index f2fc9a7..c034b51 
100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -81,8 +81,15 @@ static struct dentry *ocfs2_debugfs_root = NULL; MODULE_AUTHOR("Oracle"); MODULE_LICENSE("GPL");
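A note on the aops.c hunk in the pull request above ("ocfs2: Fix a wrong cluster calculation."): the following is a minimal standalone sketch of why sizing a write from its length alone can undercount clusters. It is not the ocfs2 code. clusters_for_bytes() is a hypothetical stand-in that assumes ocfs2_clusters_for_bytes() rounds a byte count up to whole clusters, and clusters_spanned() only mirrors the arithmetic of the added lines.

```c
#include <stdint.h>

/* Hypothetical stand-in: round a byte count up to whole clusters.
 * Assumption: this is what ocfs2_clusters_for_bytes() computes. */
static uint32_t clusters_for_bytes(unsigned int bits, uint64_t bytes)
{
	uint64_t csize = 1ULL << bits;

	return (uint32_t)((bytes + csize - 1) >> bits);
}

/* Mirrors the patched arithmetic: count from the cluster holding the
 * first byte to the cluster holding the last byte. Requires len > 0. */
static uint32_t clusters_spanned(unsigned int bits, uint64_t pos, uint64_t len)
{
	uint32_t cpos = (uint32_t)(pos >> bits);
	uint32_t cend = (uint32_t)((pos + len - 1) >> bits);

	return cend - cpos + 1;
}
```

With 4K clusters (bits = 12), a 200-byte write at offset 4000 covers bytes 4000..4199 and so touches clusters 0 and 1. The length-only count gives 1, while the span count gives 2. The two agree whenever the write starts on a cluster boundary.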
Re: sk98lin for 2.6.23-rc1
--- Stephen Hemminger <[EMAIL PROTECTED]> wrote:

> On Sun, 9 Sep 2007 13:13:26 +0200
> Adrian Bunk <[EMAIL PROTECTED]> wrote:
>
> > On Sat, Sep 08, 2007 at 10:42:20PM -0400, Kyle Rose wrote:
> > >
> > > > You are a regular reader of linux-kernel, and therefore the sk98lin
> > > > removal can hardly be a surprise for you. If you prefer whining over
> > > > helping to improve the kernel that's your choice...
> > > >
> > > In my case the issue is simply one of practicality: I cannot go to the
> > > data center 5 times per day to reboot my colo box. Therefore, I run
> > > sk98lin. It's really that simple.
> >
> > When did you report this bug the first time?
> >
> > What we need is that people when testing a new kernel they plan to use
> > test the new drivers *and report the bugs if they run into any*.
> >
> > What could we have done so that you reported your bug without removing
> > the sk98lin driver?
> >
> > > Kyle
> >
> > cu
> > Adrian
>
> There are several different problems in this thread:
> 1. The removal of old sk98lin driver caused some users to be forced to use
>    skge. These users have uncovered issues with the dual port fiber based
>    versions of the board.
>    Short term: The sk98lin driver should be restored to previous state,
>    and the PCI table should be used to limit the usage to only fiber systems.
>    If Adrian doesn't do it, I'll do it when I return from Germany.
>    Long term: I have fiber based board (thanks ebay) on the way to resolve
>    skge bug.
>
> 2. Sky2 driver has its own fiber based problems. Solve these after skge fiber.
>
> 3. Sky2 doesn't have as many workarounds for hardware problems as vendor
>    sk98lin driver.

Hm, hope I didn't trigger a religious debate. When you get to the point of working on the SKY2 driver problem with DGE-550SX (Syskonnect SK-9S81), also known as the "hw csum failure" issue, I'll be glad to test a patch or take debug data. Til then, I'll stay out of the way.

-J
Re: time_after - what on earth???
On 09/12/2007 12:15 AM, Adrian McMenamin wrote:

> On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote:
> > On 09/12/2007 12:05 AM, Adrian McMenamin wrote:
> >
> > > OK, why does this line occasionally return true:
> > >
> > > if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
> > >
> > > while this one never does (no other changes made):
> > >
> > > if ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when)))
> >
> > Is maple_dev->when an unsigned long?
>
> Yes. Does that make a difference?

If it had been a signed type, it could've wrapped to something you didn't expect, explaining the difference at least...

With an unsigned long, the only difference should be that time_after() deals with jiffies wrapping, which I assume is not an actual problem here.

I'll retreat into the shades again... ;-(

Rene.
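For readers of this thread, here is a minimal userspace sketch of how a plain unsigned comparison and a wrap-safe one can disagree. It is only a model: plain_after() and wrap_safe_after() are hypothetical names, and wrap_safe_after() assumes the signed-difference form used by the kernel's time_after() macro, without its type checking.

```c
/* Plain unsigned comparison, as in the first if () quoted in the thread. */
static int plain_after(unsigned long a, unsigned long b)
{
	return a > b;
}

/* Userspace model of time_after(a, b): a is later than b if the wrapped
 * difference b - a is negative when viewed as a signed long. Relies on
 * the usual two's-complement conversion, as the kernel does. */
static int wrap_safe_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}
```

The two agree as long as no counter wrap lies between the values. They disagree across a wrap: with a counter value a little below the wrap point and a deadline that is a small number just past it, plain_after() reports the deadline as passed while wrap_safe_after() reports it as still ahead. Whether that is what happens in the maple code is not established in this thread.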
Re: time_after - what on earth???
On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote:
> On 09/12/2007 12:05 AM, Adrian McMenamin wrote:
>
> > OK, why does this line occasionally return true:
> >
> > if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
> >
> > while this one never does (no other changes made):
> >
> > if ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when)))
>
> Is maple_dev->when an unsigned long?
>

Yes. Does that make a difference?
Re: time_after - what on earth???
On 09/12/2007 12:05 AM, Adrian McMenamin wrote:

> OK, why does this line occasionally return true:
>
> if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
>
> while this one never does (no other changes made):
>
> if ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when)))

Is maple_dev->when an unsigned long?

Rene.
[BUG:] forcedeth: MCP55 not allowing DHCP
I have an Asus Striker Extreme motherboard with two built in MCP55 GigE interfaces. When I build with the original Fedora 7 release kernel ( ftp://ftp.belnet.be/linux/fedora/linux/releases/7/Fedora/i386/os/Fedora/kernel-2.6.21-1.3194.fc7.i686.rpm ) everything works fine. However, when I boot with any updated kernels or any other kernel (have tried building from several points in the linus git tree between 2.6.20 and .23-rc3, and 2.6.21.2 in -stable) I cannot get an IP address via dhcp. There is no error in dmesg. The card shows a link and otherwise appears to be working, but it is as if the dhcp server has been removed from the network. On a running system there is no indication that this is a kernel bug at all, however by varying only the kernel the bug appears and disappears. I've run all these tests repeatedly with no intervening updates of any other packages. As I said I attempted to build 2.6.21.2 ( the point of divergence between the Fedora kernel in question and -stable ) and still the card did not work. I will next attempt to manually build the rpm for the release kernel. If this works I will try experimenting with the included patches to narrow it down, but at this point I'm at a complete loss. -Casey Dahlin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] doc: about email clients for Linux kernel patches
On Tue, Sep 11, 2007 at 14:38:13 -0400, Lee Revell wrote:
> You can also diff -Nru old.c new.c | xclip, select Preformat, then
> paste with the middle button.

mutt does not come with a text editor, so I'd like to add a note about vim:

If using xclip, type the command
  :set paste
before pressing the middle button or shift-insert, or use
  :r filename
if you want to include the patch inline. (a)ttach works fine without "set paste".

--
Do what you love because life is too short for anything else.
time_after - what on earth???
OK, why does this line occasionally return true:

if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))

while this one never does (no other changes made):

if ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when)))

Is this a gcc issue or what?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 07:48, Christoph Lameter wrote: > On Tue, 11 Sep 2007, Nick Piggin wrote: > > But that's not my place to say, and I'm actually not arguing that high > > order pagecache does not have uses (especially as a practical, > > shorter-term solution which is unintrusive to filesystems). > > > > So no, I don't think I'm really going against the basics of what we > > agreed in Cambridge. But it sounds like it's still being billed as > > first-order support right off the bat here. > > Well its seems that we have different interpretations of what was agreed > on. My understanding was that the large blocksize patchset was okay > provided that I supply an acceptable mmap implementation and put a > warning in. Yes. I think we differ on our interpretations of "okay". In my interpretation, it is not OK to use this patch as a way to solve VM or FS or IO scalability issues, especially not while the alternative approaches that do _not_ have these problems have not been adequately compared or argued against. > > But even so, you can just hold an open fd in order to pin the dentry you > > want. My attack would go like this: get the page size and allocation > > group size for the machine, then get the number of dentries required to > > fill a slab. Then read in that many dentries and pin one of them. Repeat > > the process. Even if there is other activity on the system, it seems > > possible that such a thing will cause some headaches after not too long a > > time. Some sources of pinned memory are going to be better than others > > for this of course, so yeah maybe pagetables will be a bit easier (I > > don't know). > > Well even without slab targeted reclaim: Mel's antifrag will sort the > dentries into separate blocks of memory and so isolate the issue. So even after all this time you do not understand what the fundamental problem is with anti-frag and yet you are happy to waste both our time in endless flamewars telling me how wrong I am about it. 
Forgive me if I'm starting to be rude, Christoph. This is really irritating.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On (11/09/07 14:48), Christoph Lameter didst pronounce: > On Tue, 11 Sep 2007, Nick Piggin wrote: > > > But that's not my place to say, and I'm actually not arguing that high > > order pagecache does not have uses (especially as a practical, > > shorter-term solution which is unintrusive to filesystems). > > > > So no, I don't think I'm really going against the basics of what we agreed > > in Cambridge. But it sounds like it's still being billed as first-order > > support right off the bat here. > > Well its seems that we have different interpretations of what was agreed > on. My understanding was that the large blocksize patchset was okay > provided that I supply an acceptable mmap implementation and put a > warning in. > Warnings == #2 citizen in my mind with known potential failure cases. That was the point I thought. > > But even so, you can just hold an open fd in order to pin the dentry you > > want. My attack would go like this: get the page size and allocation group > > size for the machine, then get the number of dentries required to fill a > > slab. Then read in that many dentries and pin one of them. Repeat the > > process. Even if there is other activity on the system, it seems possible > > that such a thing will cause some headaches after not too long a time. > > Some sources of pinned memory are going to be better than others for > > this of course, so yeah maybe pagetables will be a bit easier (I don't > > know). > > Well even without slab targeted reclaim: Mel's antifrag will sort the > dentries into separate blocks of memory and so isolate the issue. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] drivers/firmware: const-ify DMI API and internals
On Saturday 01 September 2007, Jeff Garzik wrote: > > commit 457b6eb3bf3341d2e143518a0bb99ffbb8d754c4 > Author: Jeff Garzik <[EMAIL PROTECTED]> > Date: Sat Sep 1 10:16:45 2007 -0400 > > drivers/firmware: const-ify DMI API and internals > > Three main sets of changes: > > 1) dmi_get_system_info() return value should have been marked const, >since callers should not be changing that data. > > 2) const-ify DMI internals, since DMI firmware tables should, >whenever possible, be marked const to ensure we never ever write to >that data area. > > 3) const-ify DMI API, to enable marking tables const where possible >in low-level drivers. > > And if we're really lucky, this might enable some additional > optimizations on the part of the compiler. > > The bulk of the changes are #2 and #3, which are interrelated. #1 could > have been a separate patch, but it was so small compared to the others, > it was easier to roll it into this changeset. > > Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> [ a bit late ] Acked-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 11 Sep 2007, Nick Piggin wrote:

> > No you have not explained why the theoretical issues continue to exist
> > given even just considering Lumpy Reclaim in .23 nor what effect the
> > antifrag patchset would have.
>
> So how does lumpy reclaim, your slab patches, or anti-frag have
> much effect on the worst case situation? Or help much against a
> targetted fragmentation attack?

F.e. Lumpy reclaim reclaims neighboring pages and thus works against fragmentation. So your formula no longer works.

> > And you have used a 2M pagesize which is
> > irrelevant to this patchset that deals with blocksizes up to 64k. In my
> > experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
> > safe.
>
> I used EXACTLY the page sizes that you brought up in your patch
> description (ie. 64K and 2MB).

The patch currently only supports 64k. There is hope that it will support 2M at some point and, as mentioned, a special large page pool facility may also be required. Quoting from the post:

  I would like to increase the supported blocksize to very large pages in the
  future so that device drives will be capable of providing large contiguous
  mapping. For that purpose I think that we need a mechanism to reserve pools
  of varying large sizes at boot time. Such a mechanism can also be used to
  compensate in situations where one wants to use larger buffers but
  defragmentation support is not (yet?) capable to reliably provide pages of
  the desired sizes.
Re: x86 merge - a little feedback
On Tue, 11 Sep 2007, Andi Kleen wrote:
>
> Will that cause people to compile test both? I have my doubts that
> will really work.

If people don't compile-test both now, then why would they compile-test things when merged? So no, that's not the point.

But at least things like "grep" will work sanely, and people will be *aware* that "Oh, this touches a file that may be used by the other word-size".

Right now, we have people changing "i386-only" files that turn out to be used by x86-64 too - through very subtle Makefile things that the person who only looks into the i386 Makefile will never even *see*.

THAT is the problem (well, at least part of it).

Linus
Re: [PATCH] leds: add #include to include/linux/leds.h for rwlock_t
On Tue, 2007-09-11 at 17:48 +0900, Yoichi Yuasa wrote:
> This patch has added #include to include/linux/leds.h for
> rwlock_t.
>
> Signed-off-by: Yoichi Yuasa <[EMAIL PROTECTED]>

Added to the leds tree[1], thanks.

http://git.o-hand.com/?p=linux-rpurdie-leds;a=shortlog;h=for-mm

Richard
Re: x86 merge - a little feedback
On Tue, Sep 11, 2007 at 10:34:23PM +0100, Andi Kleen wrote: > > > > > People do not expect code under arch/i386/ to be used by code under > > arch/x86_64/ and vice versa. > > > > That regularly results in people sending patches that don't compile on > > the other architecture. > > > > With one architecture it's much more obvious that the code is shared. > > Will that cause people to compile test both? I have my doubts that > will really work. >... You will see that it could be shared, and it'll be much easier to see all configurations it's used in. Currently, there are 6 or 7 different ways how a function under arch/i386/ could be used by a function under arch/x86_64/ (and vice versa) and it's non-trivial to figure out all usages when grep'ing for users. > -Andi cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 11 Sep 2007, Nick Piggin wrote:

> But that's not my place to say, and I'm actually not arguing that high
> order pagecache does not have uses (especially as a practical,
> shorter-term solution which is unintrusive to filesystems).
>
> So no, I don't think I'm really going against the basics of what we agreed
> in Cambridge. But it sounds like it's still being billed as first-order
> support right off the bat here.

Well, it seems that we have different interpretations of what was agreed on. My understanding was that the large blocksize patchset was okay provided that I supply an acceptable mmap implementation and put a warning in.

> But even so, you can just hold an open fd in order to pin the dentry you
> want. My attack would go like this: get the page size and allocation group
> size for the machine, then get the number of dentries required to fill a
> slab. Then read in that many dentries and pin one of them. Repeat the
> process. Even if there is other activity on the system, it seems possible
> that such a thing will cause some headaches after not too long a time.
> Some sources of pinned memory are going to be better than others for
> this of course, so yeah maybe pagetables will be a bit easier (I don't know).

Well, even without slab targeted reclaim: Mel's antifrag will sort the dentries into separate blocks of memory and so isolate the issue.
Re: clockevents: fix resume logic
On Tue, 2007-09-11 at 21:52 +0200, Thomas Gleixner wrote:
> > C1: type[C1] promotion[C2] demotion[--] latency[001]
> > usage[0010] duration[]
> >*C2: type[C2] promotion[--] demotion[C1] latency[001]
> > usage[8316] duration[000170717293]
>
> Ok, here we are. The bad one uses C2 which stops the local apic on the
> VAIO. I suspect we end up in the suspend/resume with going into C2
> without the broadcast active.
>
> Can you try to get the output of SysRq-Q during the "it needs help from
> keyboard" period ?

Summary of the oddities we are seeing:

1.) disabling local apic timer makes the problem go away
2.) disabling nohz and highres makes the problem go away
3.) adding the cpuidle patches from the acpi tree makes the problem go away

The obvious conclusion is that in all other cases we run into a state where the local apic timer is not working.

1) we do not use it
2) it is used in periodic mode
3) the cpu does not enter C2 (which turns the lapic timer off on the VAIO)

While 1) and 3) are understandable, the reason why 2) is working is a mystery, because the periodic mode is affected by the C state as well.

Andrew, can you please provide the output of /proc/timer_list when you boot the kernel with "nohz=off highres=off"? Honestly, I do not expect a lot of enlightenment from it, though.

The Sysrq-Q output from the point where the box is stuck without keystrokes might give us more information.

tglx
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 07:41, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > I think I would have as good a shot as any to write a fragmentation
> > exploit, yes. I think I've given you enough info to do the same, so I'd
> > like to hear a reason why it is not a problem.
>
> No you have not explained why the theoretical issues continue to exist
> given even just considering Lumpy Reclaim in .23 nor what effect the
> antifrag patchset would have.

So how does lumpy reclaim, your slab patches, or anti-frag have much effect on the worst case situation? Or help much against a targeted fragmentation attack?

> And you have used a 2M pagesize which is
> irrelevant to this patchset that deals with blocksizes up to 64k. In my
> experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
> safe.

I used EXACTLY the page sizes that you brought up in your patch description (ie. 64K and 2MB).
Re: [PATCH 1/4] build system: section garbage collection for vmlinux
On Tue, 2007-09-11 at 21:07 +0100, Denys Vlasenko wrote:
> This patch is needed for --gc-sections to work, regardless
> of which final form that support will have.
>
> This patch renames .text.xxx and .data.xxx sections
> into .xxx.text and .xxx.data, respectively.

I think you'll have better luck with this if you focus on a single architecture (i386 would be best) ..

Daniel
Re: [patch][Intel-IOMMU] Fix for IOMMU early crash
On Wed, Sep 12, 2007 at 05:48:52AM +1000, Paul Mackerras wrote:
> Keshavamurthy, Anil S writes:
>
> > Subject: Fix IOMMU early crash
> >
> > This patch avoids copying pci_bus's->sysdata to
> > pci_dev's->sysdata as one can easily obtain
> > the same through pci_dev->bus->sysdata.
>
> At the moment this will cause ppc64 to crash, since we rely on
> pci_dev->sysdata pointing to some node in the firmware device tree,
> either the device's node or the node for a parent bus.
>
> We could change the ppc64 code to use pci_dev->bus->sysdata in the
> case when pci_dev->sysdata is NULL, which would fix the problem. I
> think that change should be incorporated as part of this patch so that
> we don't break git bisection.

Why do you want to check whether pci_dev->sysdata is NULL and only then use pci_dev->bus->sysdata, falling back to pci_dev->sysdata otherwise? If you change this to always use pci_dev->bus->sysdata, then you don't have to depend on my patch and your patch can get in independently of mine.

>
> In other words I don't want to see this patch applied as it stands.

Is it possible to post your patch anytime soon? Or feel free to combine mine with yours and post it with your signed-off-by.

Thanks,
Anil
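To make the two lookups discussed in this exchange concrete, here is a minimal sketch using mock types. struct mock_bus and struct mock_dev are hypothetical stand-ins, not the kernel's struct pci_bus and struct pci_dev, and the helper names are invented for illustration.

```c
/* Mock types for illustration only; the real structures carry far more. */
struct mock_bus { void *sysdata; };
struct mock_dev { struct mock_bus *bus; void *sysdata; };

/* Paul's suggestion: prefer the per-device pointer, and fall back to
 * the bus pointer only when the device's own pointer is unset. */
static void *sysdata_with_fallback(const struct mock_dev *dev)
{
	return dev->sysdata ? dev->sysdata : dev->bus->sysdata;
}

/* Anil's counter-proposal: always go through the bus. */
static void *sysdata_bus_only(const struct mock_dev *dev)
{
	return dev->bus->sysdata;
}
```

The two helpers return the same pointer when the device's sysdata is unset or equals the bus's. They differ when the device carries its own pointer, which, going by Paul's description, can happen on ppc64 where pci_dev->sysdata may point at the device's own firmware node rather than a parent bus node.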
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 11 Sep 2007, Nick Piggin wrote:

> I think I would have as good a shot as any to write a fragmentation
> exploit, yes. I think I've given you enough info to do the same, so I'd
> like to hear a reason why it is not a problem.

No, you have not explained why the theoretical issues continue to exist even considering just Lumpy Reclaim in .23, nor what effect the antifrag patchset would have. And you have used a 2M pagesize, which is irrelevant to this patchset that deals with blocksizes up to 64k. In my experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably safe.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 06:53, Mel Gorman wrote: > On (11/09/07 11:44), Nick Piggin didst pronounce: > However, this discussion belongs more with the non-existant-remove-slab > patch. Based on what we've seen since the summits, we need a thorough > analysis with benchmarks before making a final decision (kernbench, ebizzy, > tbench (netpipe if someone has the time/resources), hackbench and maybe > sysbench as well as something the filesystem people recommend to get good > coverage of the subsystems). True. Aside, it seems I might have been mistaken in saying Christoph is proposing to use higher order allocations to fix the SLUB regression. Anyway, I agree let's not get sidetracked about this here. > I'd rather not get side-tracked here. I regret you feel stream-rolled but I > think grouping pages by mobility is the right thing to do for better usage > of the TLB by the kernel and for improving hugepage support in userspace > minimally. We never really did see eye-to-eye but this way, if I'm wrong > you get to chuck eggs down the line. No it's a fair point, and even the hugepage allocations alone are a fair point. From the discussions I think it seems like quite probably the right thing to do pragmatically, which is what Linux is about and I hope will result in a better kernel in the end. So I don't have complaints except from little ivory tower ;) > > Sure. And some people run workloads where fragmentation is likely never > > going to be a problem, they are shipping this poorly configured hardware > > now or soon, so they don't have too much interest in doing it right at > > this point, rather than doing it *now*. OK, that's a valid reason which > > is why I don't use the argument that we should do it correctly or never > > at all. > > So are we saying the right thing to do is go with fs-block from day 1 once > we get it to optimistically use high-order pages? 
I think your concern > might be that if this goes in then it'll be harder to justify fsblock in > the future because it'll be solving a theoritical problem that takes months > to trigger if at all. i.e. The filesystem people will push because > apparently large block support as it is solves world peace. Is that > accurate? Heh. It's hard to say. I think fsblock could take a while to implement, regardless of high order pages or not. I actually would like to be able to pass down a mandate to say higher order pagecache will never get merged, simply so that these talented people would work on fsblock ;) But that's not my place to say, and I'm actually not arguing that high order pagecache does not have uses (especially as a practical, shorter-term solution which is unintrusive to filesystems). So no, I don't think I'm really going against the basics of what we agreed in Cambridge. But it sounds like it's still being billed as first-order support right off the bat here. > > OTOH, I'm not sure how much buy-in there was from the filesystems guys. > > Particularly Christoph H and XFS (which is strange because they already > > do vmapping in places). > > I think they use vmapping because they have to, not because they want > to. They might be a lot happier with fsblock if it used contiguous pages > for large blocks whenever possible - I don't know for sure. The metadata > accessors they might be unhappy with because it's inconvenient but as > Christoph Hellwig pointed out at VM/FS, the filesystems who really care > will convert. Sure, they would rather not to. But there are also a lot of ways you can improve vmap more than what XFS does (or probably what darwin does) (more persistence for cached objects, and batched invalidates for example). There are also a lot of trivial things you can do to make a lot of those accesses not require vmaps (and less trivial things, but even such things as binary searches over multiple pages should be quite possible with a bit of logic). 
> > It would be interesting to craft an attack. If you knew roughly the > > layout and size of your dentry slab for example... maybe you could stat a > > whole lot of files, then open one and keep it open (maybe post the fd to > > a unix socket or something crazy!) when you think you have filled up a > > couple of MB worth of them. > > I might regret saying this, but it would be easier to craft an attack > using pagetable pages. It's woefully difficult to do but it's probably > doable. I say pagetables because while slub targetted reclaim is on the > cards and memory compaction exists for page cache pages, pagetables are > currently pinned with no prototype patch existing to deal with them. But even so, you can just hold an open fd in order to pin the dentry you want. My attack would go like this: get the page size and allocation group size for the machine, then get the number of dentries required to fill a slab. Then read in that many dentries and pin one of them. Repeat the process. Even if there is other activity on the system, it seems possible that such a thing will