Re: High lock spin time for zone->lru_lock under extreme conditions
Ravikiran G Thirumalai wrote:
> On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
>> What is the "CS time"?
>
> Critical Section :). This is the maximal time interval I measured from
> t2 above to the time point we release the spin lock. This is the hold
> time I guess.
>
>> It would be interesting to know how long the maximal lru_lock *hold*
>> time is, which could give us a better indication of whether it is a
>> hardware problem.
>>
>> For example, if the maximum hold time is 10ms, then it might indicate
>> a hardware fairness problem.
>
> The maximal hold time was about 3s.

Well then it doesn't seem very surprising that this could cause a 30s wait time for one CPU in a 16 core system, regardless of fairness.

I guess most of the contention, and the lock hold times, are coming from vmscan? Do you know exactly which critical sections are the culprits?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
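The arithmetic behind that remark can be sketched as follows. This is only a back-of-the-envelope model, not kernel code: it assumes a perfectly fair (FIFO) lock on which every other CPU takes one turn before the waiter, each turn lasting the maximal hold time. The function name `worst_case_fair_wait_s` is made up for the illustration.

```c
#include <assert.h>

/*
 * Worst-case wait, in seconds, for one CPU on a hypothetical fair lock:
 * each of the other (ncpus - 1) CPUs holds the lock once, for at most
 * max_hold_s seconds, before the waiter gets it.
 */
double worst_case_fair_wait_s(unsigned int ncpus, double max_hold_s)
{
	assert(ncpus >= 1);
	return (double)(ncpus - 1) * max_hold_s;
}
```

With 16 cores and a 3s hold time this model gives 15 * 3s = 45s, so a 30s spin is within what even a fair lock could produce; an unfair lock could do worse.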
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 05:11:16PM -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 17:00:39 -0800
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> > But is lru_lock an issue is another question.
>
> I doubt it, although there might be changes we can make in there to
> work around it.

I tested with PAGEVEC_SIZE defined as 62 and 126 -- no difference. I still notice the atrociously high spin times.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> Ravikiran G Thirumalai wrote:
> >Hi,
> >We noticed high interrupt hold off times while running some memory
> >intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
> >softlockups,
>
> [...]
>
> >We did not use any lock debugging options and used plain old rdtsc to
> >measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
> >this:
> >
> >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> >{
> >        local_irq_disable();
> >>       rdtsc(t1);
> >        preempt_disable();
> >        spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> >        _raw_spin_lock(lock);
> >>       rdtsc(t2);
> >        if (lock->spin_time < (t2 - t1))
> >                lock->spin_time = t2 - t1;
> >}
> >
> >On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> >while the maximal CS time was 3 seconds or so.
>
> What is the "CS time"?

Critical Section :). This is the maximal time interval I measured from t2 above to the time point we release the spin lock. This is the hold time I guess.

> It would be interesting to know how long the maximal lru_lock *hold* time
> is, which could give us a better indication of whether it is a hardware
> problem.
>
> For example, if the maximum hold time is 10ms, then it might indicate a
> hardware fairness problem.

The maximal hold time was about 3s.
2.6.20-rc4-mm1: status of sn9c102_pas202bca?
On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote:
>...
> Changes since 2.6.20-rc3-mm1:
>...
> git-dvb.patch
>...
> git trees
>...

drivers/media/video/sn9c102/sn9c102_pas202bca.c is no longer used or built but still shipped.

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: tuning/tweaking VM settings for low memory (preventing OOM)
On Fri, Jan 12, 2007 at 03:58:08PM -0600, Kumar Gala wrote: > I'm working on an embedded PPC setup with 64M of memory and no swap. > I'm trying to figure out how best to tune the VM for an OOM situation > I'm running into. > > I'm running a 2.6.16.35 kernel and have a bittorrent app that appears > to be initializing a large file for it to download into. What I see > before running the app: > > /bigfoot/usb_disk # cat /proc/meminfo > MemTotal:62520 kB > MemFree: 49192 kB > Buffers: 8240 kB > Cached:740 kB > SwapCached: 0 kB > Active: 8196 kB > Inactive: 1236 kB > HighTotal: 0 kB > HighFree:0 kB > LowTotal:62520 kB > LowFree: 49192 kB > SwapTotal: 0 kB > SwapFree:0 kB > Dirty: 0 kB > Writeback: 0 kB > Mapped:916 kB > Slab: 2224 kB > CommitLimit: 31260 kB > Committed_AS: 1704 kB > PageTables: 88 kB > VmallocTotal: 933872 kB > VmallocUsed: 9416 kB > VmallocChunk: 923628 kB > > after the OOM: > > /bigfoot/usb_disk # cat /proc/meminfo > MemTotal:62520 kB > MemFree: 1608 kB > Buffers: 8212 kB > Cached: 42780 kB > SwapCached: 0 kB > Active: 6228 kB > Inactive:45176 kB > HighTotal: 0 kB > HighFree:0 kB > LowTotal:62520 kB > LowFree: 1608 kB > SwapTotal: 0 kB > SwapFree:0 kB > Dirty: 35208 kB > Writeback:5616 kB > Mapped:892 kB > Slab: 7788 kB > CommitLimit: 31260 kB > Committed_AS: 1704 kB > PageTables: 88 kB > VmallocTotal: 933872 kB > VmallocUsed: 9416 kB > VmallocChunk: 923628 kB > > Which makes me think that we aren't writing back fast enough. If I > mount the drive "sync" the issue clearly goes away. > > It appears from an strace we are doing ftruncate64(5, 178257920) when > we OOM. > > Any ideas on VM parameters to tweak so we throttle this from occurring? Take a look at /proc/sys/vm/bdflush. There are several useful parameters there (doc is in linux-xxx/Documentation). For instance, the first column is the percentage of memory used by writes before starting to write on disk. When using tcpdump intensively, I lower this one to about 1%. 
Willy
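To put numbers on the "not writing back fast enough" suspicion, the share of RAM tied up in dirty and writeback pages can be computed from the two meminfo dumps quoted above. The helper below is only an illustrative sketch (`pct_of_total` is a made-up name). As far as I know, on 2.6 kernels the thresholds that such a percentage gets compared against are exposed as /proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio, while /proc/sys/vm/bdflush dates from 2.4; treat that mapping as an assumption to check against the Documentation directory of the kernel in use.

```c
#include <assert.h>

/*
 * Integer percentage of total_kb represented by part_kb, rounded down.
 * Inputs are kB values as printed in /proc/meminfo.
 */
unsigned int pct_of_total(unsigned long part_kb, unsigned long total_kb)
{
	assert(total_kb > 0);
	return (unsigned int)((part_kb * 100UL) / total_kb);
}
```

For the post-OOM dump, Dirty alone is 35208 kB of 62520 kB (56%), and Dirty plus Writeback is 40824 kB (65%), which is consistent with writeback lagging far behind the writer on a 64M box with no swap.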
Re: SATA hotplug from the user side ?
On Sat, 2007-01-13 at 10:55 +0900, Tejun Heo wrote: > Soeren Sonnenburg wrote: > > It is true it detects a removal and newly plugged devices immediately... > > However it still prints warnings and errors that it could not > > synchronize SCSI cache for the disks. Then it prints regular 'rejects > > I/O to dead device' warning messages and on replugging the disks puts > > them to the next free sd device (e.g. sdc -> sdd). > > You need to stop using the devices before unplugging. If you have no > pending IO to the device, there won't be 'rejects IO to dead device' > messages. You can ignore the SCSI cache sync failure if the device is > properly closed before being unplugged. Jeff & Tejun thanks *a lot* for clarifying this. I am quite happy to see that this is working very reliably! > > These messages sound eval - so now the question is should I care ? > > ( On the other hand it did not crash the machine ) > > So, no, you don't really have to care. Just make sure the device is > unmounted prior to unplugging. OK, but then this really should be in the SATA hotplug FAQ (or can one fix this somehow?)... No user will ignore messages like this. 
What is especially annoying is that udev on the first remove/insert cycle created a new device node so the disk became /dev/sde (was /dev/sdd): dmesg output of reinserting the disk 2 times follows: ata4: exception Emask 0x10 SAct 0x0 SErr 0x1 action 0x2 frozen ata4: hard resetting port ata4: SATA link down (SStatus 0 SControl 310) ata4: failed to recover some devices, retrying in 5 secs ata4: hard resetting port ata4: SATA link down (SStatus 0 SControl 310) ata4: failed to recover some devices, retrying in 5 secs ata4: hard resetting port ata4: SATA link down (SStatus 0 SControl 310) ata4.00: disabled ata4: EH complete ata4.00: detaching (SCSI 3:0:0:0) Synchronizing SCSI cache for disk sdd: FAILED status = 0, message = 00, host = 4, driver = 00 <3>ata4: exception Emask 0x10 SAct 0x0 SErr 0x5 action 0x2 frozen ata4: hard resetting port ata4: COMRESET failed (device not ready) ata4: hardreset failed, retrying in 5 secs ata4: hard resetting port ata4: COMRESET failed (device not ready) ata4: hardreset failed, retrying in 5 secs ata4: hard resetting port ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata4.00: ATA-7, max UDMA/133, 1465149168 sectors: LBA48 NCQ (depth 0/32) ata4.00: configured for UDMA/100 ata4: EH complete scsi 3:0:0:0: Direct-Access ATA ST3750640AS 3.AA PQ: 0 ANSI: 5 SCSI device sde: 1465149168 512-byte hdwr sectors (750156 MB) sde: Write Protect is off sde: Mode Sense: 00 3a 00 00 SCSI device sde: drive cache: write back SCSI device sde: 1465149168 512-byte hdwr sectors (750156 MB) sde: Write Protect is off sde: Mode Sense: 00 3a 00 00 SCSI device sde: drive cache: write back sde: unknown partition table sd 3:0:0:0: Attached scsi disk sde sd 3:0:0:0: Attached scsi generic sg3 type 0 ata4: exception Emask 0x10 SAct 0x0 SErr 0x1 action 0x2 frozen ata4: hard resetting port ata4: SATA link down (SStatus 0 SControl 310) ata4: failed to recover some devices, retrying in 5 secs ata4: hard resetting port ata4: SATA link down (SStatus 0 SControl 310) 
ata4: failed to recover some devices, retrying in 5 secs ata4: hard resetting port ata4: SATA link down (SStatus 0 SControl 310) ata4.00: disabled ata4: EH complete ata4.00: detaching (SCSI 3:0:0:0) Synchronizing SCSI cache for disk sde: FAILED status = 0, message = 00, host = 4, driver = 00 <3>ata4: exception Emask 0x10 SAct 0x0 SErr 0x5 action 0x2 frozen ata4: hard resetting port ata4: COMRESET failed (device not ready) ata4: hardreset failed, retrying in 5 secs ata4: hard resetting port ata4: COMRESET failed (device not ready) ata4: hardreset failed, retrying in 5 secs ata4: hard resetting port ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata4.00: ATA-7, max UDMA/133, 1465149168 sectors: LBA48 NCQ (depth 0/32) ata4.00: configured for UDMA/100 ata4: EH complete scsi 3:0:0:0: Direct-Access ATA ST3750640AS 3.AA PQ: 0 ANSI: 5 SCSI device sde: 1465149168 512-byte hdwr sectors (750156 MB) sde: Write Protect is off sde: Mode Sense: 00 3a 00 00 SCSI device sde: drive cache: write back SCSI device sde: 1465149168 512-byte hdwr sectors (750156 MB) sde: Write Protect is off sde: Mode Sense: 00 3a 00 00 SCSI device sde: drive cache: write back sde: unknown partition table sd 3:0:0:0: Attached scsi disk sde sd 3:0:0:0: Attached scsi generic sg3 type 0 remains /dev/sde ... Soeren -- Sometimes, there's a moment as you're waking, when you become aware of the real world around you, but you're still dreaming. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.20-rc5: known regressions with patches
This email lists some known regressions in 2.6.20-rc5 compared to 2.6.19 with patches available.

If you find your name in the Cc header, you are either the submitter of one of the bugs, the maintainer of an affected subsystem or driver, the author of a patch that caused a breakage, or someone I consider in some other way possibly involved with one or more of these issues.

Due to the large number of recipients, please trim the Cc when answering.

Subject    : WARNING: "profile_hits" [drivers/kvm/kvm-intel.ko] undefined!
References : http://lkml.org/lkml/2007/1/12/16
Submitter  : Miles Lane <[EMAIL PROTECTED]>
Caused-By  : Ingo Molnar <[EMAIL PROTECTED]>
             commit 07031e14c1127fc7e1a5b98dfcc59f434e025104
Handled-By : Andrew Morton <[EMAIL PROTECTED]>
Patch      : http://lkml.org/lkml/2007/1/12/18
Status     : patch available

Subject    : KVM: guest crash
References : http://lkml.org/lkml/2007/1/8/163
Submitter  : Roland Dreier <[EMAIL PROTECTED]>
Handled-By : Avi Kivity <[EMAIL PROTECTED]>
Patch      : http://lkml.org/lkml/2007/1/9/280
Status     : patch available

Subject    : compile error: USB_HID must depend on INPUT
References : http://lkml.org/lkml/2007/1/12/157
Submitter  : Russell King <[EMAIL PROTECTED]>
Handled-By : Russell King <[EMAIL PROTECTED]>
Patch      : http://lkml.org/lkml/2007/1/12/177
Status     : patch available
2.6.20-rc5: known unfixed regressions
On Fri, Jan 12, 2007 at 02:27:48PM -0500, Linus Torvalds wrote:
>...
> A lot of developers (including me) will be gone next week for
> Linux.Conf.Au, so you have a week of rest and quiet to test this, and
> report any problems.
>
> Not that there will be any, right? You all behave now!
>...

This still leaves the old regressions we have not yet fixed...

This email lists some known regressions in 2.6.20-rc5 compared to 2.6.19.

If you find your name in the Cc header, you are either the submitter of one of the bugs, the maintainer of an affected subsystem or driver, the author of a patch that caused a breakage, or someone I consider in some other way possibly involved with one or more of these issues.

Due to the large number of recipients, please trim the Cc when answering.

Subject    : pktcdvd fails with pata_amd
References : http://bugzilla.kernel.org/show_bug.cgi?id=7810
Submitter  : [EMAIL PROTECTED]
Status     : unknown

Subject    : problems with CD burning
References : http://www.spinics.net/lists/linux-ide/msg06545.html
Submitter  : Uwe Bugla <[EMAIL PROTECTED]>
Status     : unknown

Subject    : BUG: scheduling while atomic: hald-addon-stor/...
             cdrom_{open,release,ioctl} in trace
References : http://lkml.org/lkml/2006/12/26/105
             http://lkml.org/lkml/2006/12/29/22
             http://lkml.org/lkml/2006/12/31/133
Submitter  : Jon Smirl <[EMAIL PROTECTED]>
             Damien Wyart <[EMAIL PROTECTED]>
             Aaron Sethman <[EMAIL PROTECTED]>
Status     : unknown

Subject    : 'shutdown -h now' reboots the system (CONFIG_USB_SUSPEND)
References : http://lkml.org/lkml/2006/12/25/40
Submitter  : Berthold Cogel <[EMAIL PROTECTED]>
Handled-By : Alexey Starikovskiy <[EMAIL PROTECTED]>
Status     : problem is being debugged

Subject    : USB keyboard unresponsive after some time
References : http://lkml.org/lkml/2006/12/25/35
             http://lkml.org/lkml/2006/12/26/106
Submitter  : Florin Iucha <[EMAIL PROTECTED]>
Handled-By : Jiri Kosina <[EMAIL PROTECTED]>
             Alan Stern <[EMAIL PROTECTED]>
Status     : problem is being debugged

Subject    : BUG: at fs/inotify.c:172 set_dentry_child_flags()
References : http://bugzilla.kernel.org/show_bug.cgi?id=7785
Submitter  : Cijoml Cijomlovic Cijomlov <[EMAIL PROTECTED]>
Handled-By : Nick Piggin <[EMAIL PROTECTED]>
Status     : problem is being debugged

Subject    : BUG: at mm/truncate.c:60 cancel_dirty_page() (XFS)
References : http://lkml.org/lkml/2007/1/5/308
Submitter  : Sami Farin <[EMAIL PROTECTED]>
Handled-By : David Chinner <[EMAIL PROTECTED]>
Status     : problem is being discussed

Subject    : BUG: at mm/truncate.c:60 cancel_dirty_page() (reiserfs)
References : http://lkml.org/lkml/2007/1/7/117
             http://lkml.org/lkml/2007/1/10/202
Submitter  : Malte Schröder <[EMAIL PROTECTED]>
Handled-By : Vladimir V. Saveliev <[EMAIL PROTECTED]>
             Nick Piggin <[EMAIL PROTECTED]>
Patch      : http://lkml.org/lkml/2007/1/10/202
Status     : problem is being discussed
Re: [patch] sched: avoid div in rebalance_tick
On Fri, Jan 12, 2007 at 09:59:40AM +, Alan wrote:
> On Fri, 12 Jan 2007 07:02:13 +0100
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > Just noticed this while looking at a bug.
> > Avoid an expensive integer divide 3 times per CPU per tick.
>
> Integer divide is cheap on some modern processors, and multibit shift
> isn't on all embedded ones.
>
> How about putting back scale = 1 and using
>
> scale += scale;
>
> instead of the shift and getting what ought to be even better results

OK, how about this? It only works out to be around 0.01% of my P3's CPU time at 1000HZ, but it also did make the x86 code 16 bytes smaller.

--
Avoid expensive integer divide 3 times per CPU per tick.

A userspace test of this loop went from 26ns, down to 19ns on a G5; and from 123ns down to 28ns on a P3.

(Also avoid a variable bit shift, as suggested by Alan. The effect of this wasn't noticeable on the CPUs I tested with).

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2887,14 +2887,16 @@ static void active_load_balance(struct r
 static void update_load(struct rq *this_rq)
 {
 	unsigned long this_load;
-	int i, scale;
+	unsigned int i, scale;
 
 	this_load = this_rq->raw_weighted_load;
 
 	/* Update our load: */
-	for (i = 0, scale = 1; i < 3; i++, scale <<= 1) {
+	for (i = 0, scale = 1; i < 3; i++, scale += scale) {
 		unsigned long old_load, new_load;
 
+		/* scale is effectively 1 << i now, and >> i divides by scale */
+
 		old_load = this_rq->cpu_load[i];
 		new_load = this_load;
 		/*
@@ -2904,7 +2906,7 @@ static void update_load(struct rq *this_
 		 */
 		if (new_load > old_load)
 			new_load += scale-1;
-		this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) / scale;
+		this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
 	}
 }
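The equivalence the patch relies on (with scale == 1 << i, dividing by scale gives the same result as shifting right by i for unsigned values) can be checked in userspace. The sketch below is not the kernel function: it lifts the loop body out of update_load() and replaces `struct rq` with a plain array, and the names `update_load_div`, `update_load_shift` and `NR_LOAD_IDX` are made up for the illustration.

```c
#include <assert.h>

#define NR_LOAD_IDX 3	/* the patch iterates over three cpu_load[] slots */

/* Old form: decaying average using an integer divide by scale. */
void update_load_div(unsigned long cpu_load[NR_LOAD_IDX], unsigned long this_load)
{
	unsigned int i, scale;

	for (i = 0, scale = 1; i < NR_LOAD_IDX; i++, scale <<= 1) {
		unsigned long old_load = cpu_load[i];
		unsigned long new_load = this_load;

		/* Round up when the load is increasing. */
		if (new_load > old_load)
			new_load += scale - 1;
		cpu_load[i] = (old_load * (scale - 1) + new_load) / scale;
	}
}

/* New form: scale doubles by addition, and ">> i" replaces "/ scale". */
void update_load_shift(unsigned long cpu_load[NR_LOAD_IDX], unsigned long this_load)
{
	unsigned int i, scale;

	assert(NR_LOAD_IDX < 8 * sizeof(unsigned long));	/* shift stays in range */
	for (i = 0, scale = 1; i < NR_LOAD_IDX; i++, scale += scale) {
		unsigned long old_load = cpu_load[i];
		unsigned long new_load = this_load;

		if (new_load > old_load)
			new_load += scale - 1;
		/* scale == 1 << i here, so the shift divides by scale. */
		cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
	}
}
```

Feeding both variants the same starting loads and the same `this_load` should leave the two arrays identical, which is a quick way to convince oneself that the change is purely a strength reduction.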
Re: [patch]cleanup and error reporting for sound/core/init.c
On Friday, 12 January 2007 18:42, Takashi Iwai wrote:
> At Fri, 12 Jan 2007 14:49:57 +0100,
> Oliver Neukum wrote:
> >
> > + } else {
> > +if (idx < snd_ecards_limit) {
> > + if (snd_cards_lock & (1 << idx))
> > + err = -EBUSY; /* invalid */
> > + } else if (idx < SNDRV_CARDS)
> > + snd_ecards_limit = idx + 1; /* increase the limit */
> > + else
> > + err = -ENODEV;
>
> The indent looks strange in the above three lines.
> Also, for me it's not much better than before... :)
> (all if's are comparisons of idx with other values.)

Hi,

OK, how about this one? The original indentation makes the control flow very hard to follow.

Regards
Oliver

Signed-off-by: Oliver Neukum <[EMAIL PROTECTED]>
--

--- sound/core/init.c.alt	2007-01-12 14:26:47.0 +0100
+++ sound/core/init.c	2007-01-13 07:34:29.0 +0100
@@ -114,22 +114,28 @@
 	if (idx < 0) {
 		int idx2;
 		for (idx2 = 0; idx2 < SNDRV_CARDS; idx2++)
+			/* idx == -1 == 0x means: take any free slot */
 			if (~snd_cards_lock & idx & 1<<idx2) {
 				idx = idx2;
 				if (idx >= snd_ecards_limit)
 					snd_ecards_limit = idx + 1;
 				break;
 			}
-	} else if (idx < snd_ecards_limit) {
-		if (snd_cards_lock & (1 << idx))
-			err = -ENODEV;	/* invalid */
-	} else if (idx < SNDRV_CARDS)
-		snd_ecards_limit = idx + 1; /* increase the limit */
-	else
-		err = -ENODEV;
+	} else {
+		if (idx < snd_ecards_limit) {
+			if (snd_cards_lock & (1 << idx))
+				err = -EBUSY;	/* invalid */
+		} else {
+			if (idx < SNDRV_CARDS)
+				snd_ecards_limit = idx + 1; /* increase the limit */
+			else
+				err = -ENODEV;
+		}
+	}
 	if (idx < 0 || err < 0) {
 		mutex_unlock(&snd_card_mutex);
-		snd_printk(KERN_ERR "cannot find the slot for index %d (range 0-%i)\n", idx, snd_ecards_limit - 1);
+		snd_printk(KERN_ERR "cannot find the slot for index %d (range 0-%i), error: %d\n",
+			idx, snd_ecards_limit - 1, err);
 		goto __error;
 	}
 	snd_cards_lock |= 1 << idx;	/* lock it */
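For readers trying to follow the control flow under discussion, here is a standalone userspace sketch of the explicit-index branch (idx >= 0) as the patch proposes it. It is an illustration only, not the ALSA code: the globals are gathered into a hypothetical `struct slot_state`, the function name `claim_explicit_slot` is invented, and the value 8 for `SNDRV_CARDS` is an assumption made so the example compiles on its own.

```c
#include <assert.h>
#include <errno.h>

#define SNDRV_CARDS 8	/* assumed value, for illustration only */

/* Hypothetical stand-in for the snd_cards_lock / snd_ecards_limit globals. */
struct slot_state {
	unsigned int cards_lock;	/* bit n set: slot n is taken */
	int ecards_limit;		/* slots below this index are "existing" */
};

/*
 * Try to claim slot idx (idx >= 0).
 * Returns 0 on success, -EBUSY if the slot is already locked,
 * or -ENODEV if idx is outside the range of possible cards.
 */
int claim_explicit_slot(struct slot_state *st, int idx)
{
	assert(idx >= 0);

	if (idx < st->ecards_limit) {
		if (st->cards_lock & (1U << idx))
			return -EBUSY;		/* slot in use */
	} else if (idx < SNDRV_CARDS) {
		st->ecards_limit = idx + 1;	/* increase the limit */
	} else {
		return -ENODEV;			/* no such slot */
	}

	st->cards_lock |= 1U << idx;		/* lock it */
	return 0;
}
```

The point of the -EBUSY change is visible here: a caller can tell "that slot exists but is taken" apart from "that index can never exist", which the old code reported identically as -ENODEV.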
reiserfs BUGs
Running fsx-linux (akpm ext3-tools version) on reiserfs, 2.6.20-rc5 on x86_64. [ 4496.964604] [ cut here ] [ 4496.964614] Kernel BUG at 880b4499 [verbose debug info unavailable] [ 4496.964621] invalid opcode: [1] SMP [ 4496.964629] CPU 2 [ 4496.964635] Modules linked in: reiserfs xfs jfs loop [ 4496.964650] Pid: 298, comm: pdflush Not tainted 2.6.20-rc5 #1 [ 4496.964655] RIP: 0010:[] [] :reiserfs:flush_commit_list+0x532/0x60a [ 4496.964684] RSP: 0018:81011fa47bf0 EFLAGS: 00010246 [ 4496.964690] RAX: RBX: c2001090f240 RCX: [ 4496.964697] RDX: RSI: 72b3 RDI: c2001090f240 [ 4496.964703] RBP: 81011fa47c60 R08: 81011e521000 R09: [ 4496.964710] R10: 810005044100 R11: fffa R12: 81011d497180 [ 4496.964716] R13: 81011e521000 R14: 0088 R15: [ 4496.964723] FS: () GS:81011fc78cc0() knlGS: [ 4496.964730] CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b [ 4496.964737] CR2: 2b370e2a3000 CR3: 00011e118000 CR4: 06e0 [ 4496.964744] Process pdflush (pid: 298, threadinfo 81011fa46000, task 81011fb02140) [ 4496.964749] Stack: 0003 00010282 0058 0282 [ 4496.964767] 0058 c20010884000 1fa47c60 8012ddcb [ 4496.964785] 81011a3abf00 810117180980 45a858b5 [ 4496.964798] Call Trace: [ 4496.964815] [] __wake_up+0x43/0x50 [ 4496.964834] [] :reiserfs:do_journal_end+0xc95/0xced [ 4496.964845] [] find_busiest_group+0x24e/0x68f [ 4496.964856] [] keventd_create_kthread+0x0/0x79 [ 4496.964875] [] :reiserfs:journal_end_sync+0x75/0x7e [ 4496.964886] [] pdflush+0x0/0x1d4 [ 4496.964904] [] :reiserfs:reiserfs_sync_fs+0x41/0x67 [ 4496.964922] [] :reiserfs:reiserfs_write_super+0xe/0x10 [ 4496.964932] [] sync_supers+0x67/0xb6 [ 4496.964942] [] wb_kupdate+0x4d/0x133 [ 4496.964951] [] pdflush+0x0/0x1d4 [ 4496.964958] [] pdflush+0x129/0x1d4 [ 4496.964967] [] wb_kupdate+0x0/0x133 [ 4496.964975] [] kthread+0xd8/0x10c [ 4496.964984] [] schedule_tail+0x45/0xad [ 4496.964994] [] child_rip+0xa/0x12 [ 4496.965002] [] keventd_create_kthread+0x0/0x79 [ 4496.965011] [] kthread+0x0/0x10c [ 4496.965019] [] child_rip+0x0/0x12 [ 
4496.965030] [ 4496.965031] Code: 0f 0b eb fe 48 8b 03 f0 0f ba 30 10 48 8b 13 8b 02 a9 00 00 [ 4496.965073] RIP [] :reiserfs:flush_commit_list+0x532/0x60a [ 4496.965094] RSP [ 4496.965395] BUG: at kernel/exit.c:860 do_exit() [ 4496.965407] [ 4496.965409] Call Trace: [ 4496.965420] [] profile_task_exit+0x15/0x17 [ 4496.965430] [] do_exit+0x55/0x81f [ 4496.965439] [] kernel_math_error+0x0/0x96 [ 4496.965450] [] do_trap+0xdc/0xeb [ 4496.965458] [] notifier_call_chain+0x29/0x3e [ 4496.965468] [] do_invalid_op+0xa7/0xb3 [ 4496.965488] [] :reiserfs:flush_commit_list+0x532/0x60a [ 4496.965498] [] __wait_on_bit+0x67/0x77 [ 4496.965508] [] sync_buffer+0x0/0x42 [ 4496.965516] [] sync_buffer+0x0/0x42 [ 4496.965525] [] error_exit+0x0/0x84 [ 4496.965545] [] :reiserfs:flush_commit_list+0x532/0x60a [ 4496.965556] [] __wake_up+0x43/0x50 [ 4496.965581] [] :reiserfs:do_journal_end+0xc95/0xced [ 4496.965591] [] find_busiest_group+0x24e/0x68f [ 4496.965601] [] keventd_create_kthread+0x0/0x79 [ 4496.965620] [] :reiserfs:journal_end_sync+0x75/0x7e [ 4496.965630] [] pdflush+0x0/0x1d4 [ 4496.965649] [] :reiserfs:reiserfs_sync_fs+0x41/0x67 [ 4496.965668] [] :reiserfs:reiserfs_write_super+0xe/0x10 [ 4496.965678] [] sync_supers+0x67/0xb6 [ 4496.965687] [] wb_kupdate+0x4d/0x133 [ 4496.965696] [] pdflush+0x0/0x1d4 [ 4496.965705] [] pdflush+0x129/0x1d4 [ 4496.965713] [] wb_kupdate+0x0/0x133 [ 4496.965722] [] kthread+0xd8/0x10c [ 4496.965731] [] schedule_tail+0x45/0xad [ 4496.965740] [] child_rip+0xa/0x12 [ 4496.965748] [] keventd_create_kthread+0x0/0x79 [ 4496.965758] [] kthread+0x0/0x10c [ 4496.965772] [] child_rip+0x0/0x12 msg log: http://oss.oracle.com/~rdunlap/kerneltest/logs/2620-rc5-reis-fsx.log config: http://oss.oracle.com/~rdunlap/kerneltest/configs/config-2620-rc5-reis-fsx --- ~Randy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the 
FAQ at http://www.tux.org/lkml/
Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read & 195MB/s write)
Justin Piszcz wrote:
> On Sat, 13 Jan 2007, Al Boldi wrote:
> > Justin Piszcz wrote:
> > > Btw, max sectors did improve my performance a little bit but
> > > stripe_cache+read_ahead were the main optimizations that made
> > > everything go faster by about ~1.5x. I have individual bonnie++
> > > benchmarks of [only] the max_sector_kb tests as well, it improved the
> > > times from 8min/bonnie run -> 7min 11 seconds or so, see below and
> > > then after that is what you requested.
> >
> > Can you repeat with /dev/sda only?
>
> For sda-- (is a 74GB raptor only)-- but ok.

Do you get the same results for the 150GB-raptor on sd{e,g,i,k}?

> # uptime
> 16:25:38 up 1 min, 3 users, load average: 0.23, 0.14, 0.05
> # cat /sys/block/sda/queue/max_sectors_kb
> 512
> # echo 3 > /proc/sys/vm/drop_caches
> # dd if=/dev/sda of=/dev/null bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 150.891 seconds, 71.2 MB/s
> # echo 192 > /sys/block/sda/queue/max_sectors_kb
> # echo 3 > /proc/sys/vm/drop_caches
> # dd if=/dev/sda of=/dev/null bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 150.192 seconds, 71.5 MB/s
> # echo 128 > /sys/block/sda/queue/max_sectors_kb
> # echo 3 > /proc/sys/vm/drop_caches
> # dd if=/dev/sda of=/dev/null bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 150.15 seconds, 71.5 MB/s
>
> Does this show anything useful?

Probably a latency issue. md is highly latency sensitive.

What CPU type/speed do you have? Bootlog/dmesg?

Thanks!

--
Al
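The dd figures quoted above can be cross-checked with a line of arithmetic, which also shows how small the spread between the three max_sectors_kb settings is. The helper below is a sketch with a made-up name (`throughput_mb_per_s`); it assumes dd's "MB" means 10^6 bytes, which matches the "11 GB" it prints for 10737418240 bytes.

```c
#include <assert.h>

/* Throughput in decimal megabytes (10^6 bytes) per second. */
double throughput_mb_per_s(unsigned long long bytes, double seconds)
{
	assert(seconds > 0.0);
	return (double)bytes / seconds / 1e6;
}
```

For 10737418240 bytes this gives roughly 71.2 MB/s at 150.891 s and roughly 71.5 MB/s at 150.15 s, a difference of about half a percent, so on the single disk the max_sectors_kb setting makes no measurable difference to sequential reads.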
Re: 2.6.17 - weird, boot CPU (#0) not listed by the BIOS.
On Friday 12 January 2007 10:50, Mark Hounschell wrote: > Mark Hounschell wrote: > > I have a Tyan S4881 Thunder K8QW 4 processor (8 cores). Kernel 2.6.16.37 > > boots > > and runs fine. > > However kernel 2.6.17 and up doesn't. Here is my boot error msg. > > > > > > kernel /vmlinuz-2.6.17-smp root=/dev/sda5inux version 2.6.17-smp ([EMAIL > > PROTECTED]) > > (gcc version 4.1.0 (SUSE Linux)) #1 SMP PREEMPT Fri Jan 12 07:53:35 EST 2007 > > BIOS-provided physical RAM map: > > BIOS-e820: - 00093800 (usable) > > BIOS-e820: 00093800 - 000a (reserved) > > BIOS-e820: 000c2000 - 0010 (reserved) > > BIOS-e820: 0010 - cfea (usable) > > BIOS-e820: cfea - cfea4000 (ACPI data) > > BIOS-e820: cfea4000 - cff0 (ACPI NVS) > > BIOS-e820: cff0 - d000 (reserved) > > BIOS-e820: e000 - f000 (reserved) > > BIOS-e820: fec0 - fec00400 (reserved) > > BIOS-e820: fee0 - fee01000 (reserved) > > BIOS-e820: fff8 - 0001 (reserved) > > BIOS-e820: 0001 - 00023000 (usable) > > Warning only 4GB will be used. > > Use a PAE enabled kernel. > > 3200MB HIGHMEM available. > > 896MB LOWMEM available. > > found SMP MP-table at 000f71f0 > > DMI present. > > ACPI: PM-Timer IO Port: 0x8008 > > ACPI: LAPIC (acpi_id[0x00] lapic_id[0x10] enabled) > > Processor #16 15:1 APIC version 16 The APIC id for the 1st processor here is 16. Usually it is 0. Apparently this has confused some of the smpboot code with all their new nifty bitmaps for processors online and offline... Does the latest kernel work any better, say 2.6.19? What if you throw CONFIG_NR_CPUS=32 at it? 
-Len > > ACPI: LAPIC (acpi_id[0x01] lapic_id[0x11] enabled) > > Processor #17 15:1 APIC version 16 > > ACPI: LAPIC (acpi_id[0x02] lapic_id[0x12] enabled) > > Processor #18 15:1 APIC version 16 > > ACPI: LAPIC (acpi_id[0x03] lapic_id[0x13] enabled) > > Processor #19 15:1 APIC version 16 > > ACPI: LAPIC (acpi_id[0x04] lapic_id[0x14] enabled) > > Processor #20 15:1 APIC version 16 > > ACPI: LAPIC (acpi_id[0x05] lapic_id[0x15] enabled) > > Processor #21 15:1 APIC version 16 > > ACPI: LAPIC (acpi_id[0x06] lapic_id[0x16] enabled) > > Processor #22 15:1 APIC version 16 > > ACPI: LAPIC (acpi_id[0x07] lapic_id[0x17] enabled) > > Processor #23 15:1 APIC version 16 > > ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1]) > > ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1]) > > ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1]) > > ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1]) > > ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1]) > > ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1]) > > ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1]) > > ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1]) > > ACPI: IOAPIC (id[0x00] address[0xfec0] gsi_base[0]) > > IOAPIC[0]: apic_id 0, version 17, address 0xfec0, GSI 0-23 > > ACPI: IOAPIC (id[0x01] address[0xda20] gsi_base[24]) > > IOAPIC[1]: apic_id 1, version 17, address 0xda20, GSI 24-27 > > ACPI: IOAPIC (id[0x02] address[0xda201000] gsi_base[28]) > > IOAPIC[2]: apic_id 2, version 17, address 0xda201000, GSI 28-31 > > ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level) > > Enabling APIC mode: Flat. Using 3 I/O APICs > > Using ACPI (MADT) for SMP configuration information > > Allocating PCI resources starting at d100 (gap: d000:1000) > > Built 1 zonelists > > Kernel command line: root=/dev/sda5 vga=normal resume=/dev/sda2 > > splash=silent > > "console=ttyS0,19200" > > Enabling fast FPU save and restore... done. > > Enabling unmasked SIMD FPU exception support... done. 
> > Initializing CPU#0 > > PID hash table entries: 4096 (order: 12, 16384 bytes) > > Detected 2411.454 MHz processor. > > Using pmtmr for high-res timesource > > Console: colour VGA+ 80x25 > > Dentry cache hash table entries: 131072 (order: 7, 524288 bytes) > > Inode-cache hash table entries: 65536 (order: 6, 262144 bytes) > > Memory: 3366304k/4194304k available (1529k kernel code, 38968k reserved, > > 633k > > data, 184k init, 2488960k highmem) > > Checking if this processor honours the WP bit even in supervisor mode... Ok. > > Calibrating delay using timer specific routine.. 4827.61 BogoMIPS > > (lpj=9655232) > > Security Framework v1.0.0 initialized > > Capability LSM initialized > > Mount-cache hash table entries: 512 > > CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) > > CPU: L2 Cache: 1024K (64 bytes/line) > > CPU 0(2) -> Core 0 > > Intel machine check architecture supported. > > Intel machine check reporting enabled on CPU#0. > > Checking 'hlt' instruction... OK. > > Freeing SMP alternatives: 12k freed > > ACPI Warning (nsload-0106): Zero-length AML block in table [SSDT] [20060127] > > CPU0: AMD Athlon(tm) or Opteron(tm) CPU-model unknown stepping
Re: Linux v2.6.20-rc5
From: Jeff Chua <[EMAIL PROTECTED]> CC [M] drivers/kvm/vmx.o {standard input}: Assembler messages: {standard input}:3257: Error: bad register name `%sil' make[2]: *** [drivers/kvm/vmx.o] Error 1 make[1]: *** [drivers/kvm] Error 2 make: *** [drivers] Error 2 I'm not using the kernel profiler, so here's a patch to make it work without CONFIG_PROFILING. Thanks, Jeff --- linux/drivers/kvm/vmx.c.org 2007-01-13 12:57:28 +0800 +++ linux/drivers/kvm/vmx.c 2007-01-13 14:01:17 +0800 @@ -21,7 +21,11 @@ #include #include #include + +#ifdef CONFIG_PROFILING #include +#endif + #include #include @@ -1861,11 +1865,13 @@ asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS)); #endif +#ifdef CONFIG_PROFILING /* * Profile KVM exit RIPs: */ if (unlikely(prof_on == KVM_PROFILING)) profile_hit(KVM_PROFILING, (void *)vmcs_readl(GUEST_RIP)); +#endif kvm_run->exit_type = 0; if (fail) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ahci_softreset prevents acpi_power_off
On Sat, 13 Jan 2007 at 03:12, Tejun Heo wrote:
> Hello,

Hello,

Thanks for the response.

> [...]
> Does everything else work okay?
> Can you access devices attached to
> ahci?

Yes. While the machine is on, there seems to be no problem at all. Everything works great.

> What happens when you try to shutdown?

It does not shut down; it freezes. Hand-copied last messages seen on the console:

Synchronizing SCSI cache for disk sda:
ACPI: PCI Interrupt for device :06:08.0 disabled
Power down.
acpi_power_off called
hwsleep-0285 [01] enter_sleep_state: Entering sleep state [S5]

> If possible, please post
> dmesg of shutting down.

Following is the netcat output. Please ask if you need anything else.

Regards,
- Faik

Linux version 2.6.20-rc4 ([EMAIL PROTECTED]) (gcc version 3.4.6) #58 SMP Sat Jan 13 07:38:22 EET 2007
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start: size: 0009f800 end: 0009f800 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 0009f800 size: 0800 end: 000a type: 2
copy_e820_map() start: 000d8000 size: 00028000 end: 0010 type: 2
copy_e820_map() start: 0010 size: 1fd9 end: 1fe9 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 1fe9 size: d000 end: 1fe9d000 type: 3
copy_e820_map() start: 1fe9d000 size: 00063000 end: 1ff0 type: 4
copy_e820_map() start: 1ff0 size: 0010 end: 2000 type: 2
copy_e820_map() start: e000 size: 10006000 end: f0006000 type: 2
copy_e820_map() start: f0008000 size: 4000 end: f000c000 type: 2
copy_e820_map() start: fed2 size: 0007 end: fed9 type: 2
copy_e820_map() start: ff00 size: 0100 end: 0001 type: 2
BIOS-e820: - 0009f800 (usable)
BIOS-e820: 0009f800 - 000a (reserved)
BIOS-e820: 000d8000 - 0010 (reserved)
BIOS-e820: 0010 - 1fe9 (usable)
BIOS-e820: 1fe9 - 1fe9d000 (ACPI data)
BIOS-e820: 1fe9d000 - 1ff0 (ACPI NVS)
BIOS-e820: 1ff0 - 2000 (reserved)
BIOS-e820: e000 - f0006000 (reserved)
BIOS-e820: f0008000 - f000c000 (reserved)
BIOS-e820: fed2 - fed9 (reserved)
BIOS-e820: ff00 - 0001 (reserved)
0MB HIGHMEM available.
510MB LOWMEM available.
Zone PFN ranges:
  DMA 0 -> 4096
  Normal 4096 -> 130704
  HighMem 130704 -> 130704
early_node_map[1] active PFN ranges
  0:0 -> 130704
DMI 2.3 present.
ACPI: PM-Timer IO Port: 0x1008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:13 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] disabled)
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x01] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 1, version 32, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Enabling APIC mode: Flat. Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 3000 (gap: 2000:c000)
Detected 1729.118 MHz processor.
Built 1 zonelists. Total pages: 129045
Kernel command line: root=/dev/sda1 mudur=language:tr init=/bin/bash [EMAIL PROTECTED]/eth0,[EMAIL PROTECTED]/00:13:02:50:5C:2B
netconsole: local port
netconsole: local IP 192.168.1.8
netconsole: interface eth0
netconsole: remote port 9353
netconsole: remote IP 192.168.1.3
netconsole: remote ethernet address 00:13:02:50:5c:2b
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 2048 (order: 11, 8192 bytes)
Console: colour VGA+ 80x25
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:8
... MAX_LOCK_DEPTH: 30
... MAX_LOCKDEP_KEYS:2048
... CLASSHASH_SIZE: 1024
... MAX_LOCKDEP_ENTRIES: 8192
... MAX_LOCKDEP_CHAINS: 16384
... CHAINHASH_SIZE: 8192
memory used by lock dependency info: 1064 kB
per task-struct memory footprint: 1200 bytes
| Locking API testsuite:
                 | spin |wlock |rlock |mutex | wsem | rsem |
--
A-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-A deadlock: ok | ok | ok
Re: [PATCH 0/4] Linux Kernel Markers
Hi Richard,

* Richard J Moore ([EMAIL PROTECTED]) wrote:
>
> Mathieu Desnoyers <[EMAIL PROTECTED]> wrote on 20/12/2006 23:52:16:
>
> > Hi,
> >
> > You will find, in the following posts, the latest revision of the Linux Kernel
> > Markers. Due to the need some tracing projects (LTTng, SystemTAP) have for this
> > kind of mechanism, it could be nice to consider it for mainstream inclusion.
> >
> > The following patches apply on 2.6.20-rc1-git7.
> >
> > Signed-off-by : Mathieu Desnoyers <[EMAIL PROTECTED]>
>
> Mathieu, FWIW I like this idea. A few years ago I implemented something
> similar, but that had no explicit clients. Consequently I made my hooks
> code more generalized than is needed in practice. I do remember that Karim
> reworked the LTT instrumentation to use hooks and it worked fine.
>

Yes, I think some features you implemented in GKHI, like chained calls to multiple probes, should be implemented in a "probe management module" which would be built on top of the marker infrastructure. One of my goals is to concentrate on getting the core right so that, afterward, building on top of it will be easy.

> You've got the same optimizations for x86 by modifying an instruction's
> immediate operand and thus avoiding a d-cache hit. The only real caveat is
> the need to avoid the unsynchronised cross modification erratum. Which
> means that all processors will need to issue a serializing operation before
> executing a Marker whose state is changed. How is that handled?
>

Good catch. I thought that modifying only 1 byte would spare us from this erratum, but looking at it in detail tells me that it's not the case.

I see three different ways to address the problem:

1 - Adding some synchronization code in the marker and using synchronize_sched().
2 - Using an IPI to make other CPUs busy loop while we change the code and then execute a serializing instruction (iret, cpuid...).
3 - First, write an int3 in place of the instruction's first byte. The handler would do the following:
      int3_handler:
        single-step the original instruction.
        iret
    Secondly, we send an IPI that does a smp_processor_id() on each CPU and wait for them to complete. This makes sure a synchronizing instruction is executed on every CPU even if it does not execute the trap handler. Then, we atomically write the new 2-byte instruction over the int3 and the immediate value.

I exclude (1) because of the performance impact and (2) because it does not deal with NMIs. That leaves (3). Does it make sense?

> One additional thing we did, which might be useful at some future point,
> was adding a /proc interface. We reflected the current instrumentation
> through /proc and gave the status of each hook. We even talked about being
> able to enable or disable instrumentation by writing to /proc but I don't
> think we ever implemented this.
>

Adding a /proc output to list the active probes and their callbacks will be trivial to add to the markers. I think the probe management module should have its own /proc file too, to list the chains of connected handlers once we get there.

> It's high time we settled the issue of instrumentation. It gets my vote,
>
> Good luck!
>
> Richard
>

Thanks,

Mathieu

> - -
> Richard J Moore
> IBM Linux Technology Centre
>

--
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
[PATCH] 2.6.15-rc5 - removes "video device notify" message (fwd)
Here's a line fix to ignore the "video device notify" message ...

--- linux/drivers/acpi/video.c.org 2007-01-12 23:05:23 +0800
+++ linux/drivers/acpi/video.c 2007-01-12 23:05:29 +0800
@@ -1771,1 +1771,1 @@
-	printk("video device notify\n");
+	//printk("video device notify\n");

Thanks,
Jeff
Re: Linux v2.6.20-rc5
On Fri, Jan 12, 2007 at 02:26:45PM -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 14:27:48 -0500 (EST)
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> >
> > Ok, there it is, in all its shining glory.
>
> It still doesn't run Excel.
>...

It should work with CrossOver.

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: tuning/tweaking VM settings for low memory (preventing OOM)
Kumar Gala wrote:
> I'm working on an embedded PPC setup with 64M of memory and no swap. I'm trying to figure out how best to tune the VM for an OOM situation I'm running into. I'm running a 2.6.16.35 kernel and have a bittorrent app that appears to be initializing a large file for it to download into.
>
> What I see before running the app:
>
> /bigfoot/usb_disk # cat /proc/meminfo
> MemTotal:        62520 kB
> MemFree:         49192 kB
> Buffers:          8240 kB
> Cached:            740 kB
> SwapCached:          0 kB
> Active:           8196 kB
> Inactive:         1236 kB
> HighTotal:           0 kB
> HighFree:            0 kB
> LowTotal:        62520 kB
> LowFree:         49192 kB
> SwapTotal:           0 kB
> SwapFree:            0 kB
> Dirty:               0 kB
> Writeback:           0 kB
> Mapped:            916 kB
> Slab:             2224 kB
> CommitLimit:     31260 kB
> Committed_AS:     1704 kB
> PageTables:         88 kB
> VmallocTotal:   933872 kB
> VmallocUsed:      9416 kB
> VmallocChunk:   923628 kB
>
> after the OOM:
>
> /bigfoot/usb_disk # cat /proc/meminfo
> MemTotal:        62520 kB
> MemFree:          1608 kB
> Buffers:          8212 kB
> Cached:          42780 kB
> SwapCached:          0 kB
> Active:           6228 kB
> Inactive:        45176 kB
> HighTotal:           0 kB
> HighFree:            0 kB
> LowTotal:        62520 kB
> LowFree:          1608 kB
> SwapTotal:           0 kB
> SwapFree:            0 kB
> Dirty:           35208 kB
> Writeback:        5616 kB
> Mapped:            892 kB
> Slab:             7788 kB
> CommitLimit:     31260 kB
> Committed_AS:     1704 kB
> PageTables:         88 kB
> VmallocTotal:   933872 kB
> VmallocUsed:      9416 kB
> VmallocChunk:   923628 kB
>
> Which makes me think that we aren't writing back fast enough. If I mount the drive "sync" the issue clearly goes away.
>
> It appears from an strace we are doing ftruncate64(5, 178257920) when we OOM.
>
> Any ideas on VM parameters to tweak so we throttle this from occurring?

You don't give us the actual OOM message.

In newer kernels, there has been quite a bit of work done to improve the OOM situation -- search changelogs in mm/oom_kill.c mm/vmscan.c mm/page_alloc.c.

--
SUSE Labs, Novell Inc.
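To put a number on the "not writing back fast enough" suspicion, the share of RAM sitting in Dirty plus Writeback can be computed from the two /proc/meminfo dumps above. The following is only a sketch: the helper name dirty_percent() is made up for illustration, and the idea that writeback knobs such as /proc/sys/vm/dirty_ratio or /proc/sys/vm/dirty_background_ratio would help on this particular setup is an assumption, not something the thread establishes.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel API): percentage of total memory that
 * is dirty or under writeback, from /proc/meminfo values given in kB.
 * The result is truncated to a whole percent.
 */
unsigned long dirty_percent(unsigned long total_kb,
			    unsigned long dirty_kb,
			    unsigned long writeback_kb)
{
	assert(total_kb > 0);
	return (dirty_kb + writeback_kb) * 100 / total_kb;
}
```

With the post-OOM values (MemTotal 62520, Dirty 35208, Writeback 5616) this gives 65, meaning roughly two thirds of RAM was waiting to reach the disk; with the pre-run values it gives 0.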
Re: O_DIRECT question
Bill Davidsen wrote:
> The point is that if you want to be able to allocate at all, sometimes you will have to write dirty pages, garbage collect, and move or swap programs. The hardware is just too limited to do something less painful, and the user can't see memory to do things better. Linus is right, 'Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there is.' I think any solution is going to be ugly, unfortunately.

It seems quite robust and clean to me, actually. Any userspace memory that absolutely must be a large contiguous region has to be allocated at boot or from a pool reserved at boot. All other allocations can be broken into smaller ones.

Writing dirty pages, garbage collecting, and moving or swapping programs isn't going to be robust, because there is lots of vital kernel memory that cannot be moved and will cause fragmentation.

The reclaimable zone work that went on a while ago for hugepages is exactly how you would also fix this problem and still have a reasonable degree of flexibility at runtime. It isn't really ugly or hard, compared with some of the non-working "solutions" that have been proposed.

The other good thing is that the core mm already has practically everything required, so the functionality is unintrusive.

--
SUSE Labs, Novell Inc.
Re: High lock spin time for zone->lru_lock under extreme conditions
Ravikiran G Thirumalai wrote:
> Hi,
> We noticed high interrupt hold off times while running some memory intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed softlockups,

[...]

> We did not use any lock debugging options and used plain old rdtsc to measure cycles. (We disable cpu freq scaling in the BIOS). All we did was this:
>
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
>         local_irq_disable();
>         > rdtsc(t1);
>         preempt_disable();
>         spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
>         _raw_spin_lock(lock);
>         > rdtsc(t2);
>         if (lock->spin_time < (t2 - t1))
>                 lock->spin_time = t2 - t1;
> }
>
> On some runs, we found that the zone->lru_lock spun for 33 seconds or more while the maximal CS time was 3 seconds or so.

What is the "CS time"?

It would be interesting to know how long the maximal lru_lock *hold* time is, which could give us a better indication of whether it is a hardware problem.

For example, if the maximum hold time is 10ms, then it might indicate a hardware fairness problem.

--
SUSE Labs, Novell Inc.
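The distinction between spin time (waiting for the lock) and hold time (owning it) can be sketched as below. This is a userspace model, not kernel code: struct lock_stats and the helper names are hypothetical, timestamps are plain integers in whatever unit the caller uses, and the bound assumes an idealised lock where each of the other CPUs gets the lock at most once before the waiter does. The spinlock being measured in this thread makes no such ordering promise, so the comparison is only a rough heuristic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-lock statistics; field names are illustrative only. */
struct lock_stats {
	uint64_t max_spin;	/* longest wait from request to acquisition */
	uint64_t max_hold;	/* longest time from acquisition to release */
};

/* Record one acquisition: the lock was requested at t_request, obtained at t_acquired. */
void record_acquire(struct lock_stats *s, uint64_t t_request, uint64_t t_acquired)
{
	assert(t_acquired >= t_request);
	if (s->max_spin < t_acquired - t_request)
		s->max_spin = t_acquired - t_request;
}

/* Record one release: the lock was obtained at t_acquired, dropped at t_released. */
void record_release(struct lock_stats *s, uint64_t t_acquired, uint64_t t_released)
{
	assert(t_released >= t_acquired);
	if (s->max_hold < t_released - t_acquired)
		s->max_hold = t_released - t_acquired;
}

/*
 * Longest wait expected if every other CPU holds the lock exactly once,
 * each for max_hold, before the waiter gets it (idealised fair lock).
 */
uint64_t fair_wait_bound(unsigned int ncpus, uint64_t max_hold)
{
	assert(ncpus >= 1);
	return (uint64_t)(ncpus - 1) * max_hold;
}

/* Nonzero if the observed spin time cannot be explained by hold times alone. */
int exceeds_fair_bound(const struct lock_stats *s, unsigned int ncpus)
{
	return s->max_spin > fair_wait_bound(ncpus, s->max_hold);
}
```

Working in milliseconds on 16 CPUs: a 33000 ms spin with a 10 ms maximum hold exceeds the 150 ms bound by a wide margin, which would point at unfairness; with a 3000 ms maximum hold the bound is 45000 ms, so a 33000 ms spin is explainable by long hold times alone.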
Re: 2.6.19.1 failing
On Sat, 13 Jan 2007 03:58:19 +0100 Von Wolher wrote:

> Hi,
>
> I just build a 2.6.19.1 vanilla kernel based on the previous config
> (make oldconfig) but for some reason it is not starting. Despite
> following the usual procedure with lilo like many times before it seems
> that lilo tries to boot it and jumps back to the menu screen.

Was your previous config 2.6.18* or 2.6.19? If it was 2.6.18* and you are using SATA, the config symbol names for SATA changed and you'll need to set them via make *config.

Otherwise we'll probably need more info.

> But selecting the old kernel boots just fine.
>
> Any one can advise on what could cause such behaviour beside the obvious
> steps like did i run lilo after kernel compile, check paths ...

---
~Randy
Re: [PATCH] Kdump documentation update for 2.6.20: ia64 portion
Hi, this patch fills in the portions for ia64 kexec. I'm actually not sure what options are required for the dump-capture kernel, but "init 1 irqpoll maxcpus=1" has been working fine for me. Or more to the point, I'm not sure if irqpoll is needed or not. This patch requires the documentation patch update that Vivek Goyal has been circulating, and I believe is currently in mm. Feel free to fold it into that change if it makes things easier for anyone. Take II Nanhai, I have noted that vmlinux.gz may also be used. And added a note about the kernel being able to automatically place the crashkernel region. Furthermore, I added a note that if manually specified, the region should be 64Mb aligned to avoid wastage. I notice that the auto placement code uses 64Mb. But is this strictly neccessary for all page sizes? Take III Fixed some typos, thaniks to Andreas Schwab Signed-off-by: Simon Horman <[EMAIL PROTECTED]> Index: linux-2.6/Documentation/kdump/kdump.txt === --- linux-2.6.orig/Documentation/kdump/kdump.txt2007-01-12 17:45:19.0 +0900 +++ linux-2.6/Documentation/kdump/kdump.txt 2007-01-12 17:59:42.0 +0900 @@ -17,7 +17,7 @@ memory image to a dump file on the local disk, or across the network to a remote system. -Kdump and kexec are currently supported on the x86, x86_64, ppc64 and IA64 +Kdump and kexec are currently supported on the x86, x86_64, ppc64 and ia64 architectures. When the system kernel boots, it reserves a small section of memory for @@ -229,7 +229,23 @@ Dump-capture kernel config options (Arch Dependent, ia64) -- -(To be filled) + +- No specific options are required to create a dump-capture kernel + for ia64, other than those specified in the arch idependent section + above. This means that it is possible to use the system kernel + as a dump-capture kernel if desired. + + The crashkernel region can be automatically placed by the system + kernel at run time. This is done by specifying the base address as 0, + or omitting it all together. 
+ + [EMAIL PROTECTED] + or + crashkernel=256M + + If the start address is specified, note that the start address of the + kernel will be aligned to 64Mb, so if the start address is not then + any space below the alignment point will be wasted. Boot into System Kernel @@ -248,6 +264,10 @@ On ppc64, use "[EMAIL PROTECTED]". + On ia64, [EMAIL PROTECTED] is a generous value that typically works. + The region may be automatically placed on ia64, see the + dump-capture kernel config option notes above. + Load the Dump-capture Kernel @@ -266,7 +286,8 @@ For ppc64: - Use vmlinux For ia64: - (To be filled) + - Use vmlinux or vmlinuz.gz + If you are using a uncompressed vmlinux image then use following command to load dump-capture kernel. @@ -282,18 +303,19 @@ --initrd= \ --append="root= " +Please note, that --args-linux does not need to be specified for ia64. +It is planned to make this a no-op on that architecture, but for now +it should be omitted + Following are the arch specific command line options to be used while loading dump-capture kernel. -For i386 and x86_64: +For i386, x86_64 and ia64: "init 1 irqpoll maxcpus=1" For ppc64: "init 1 maxcpus=1 noirqdistrib" -For IA64 - (To be filled) - Notes on loading the dump-capture kernel: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
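The wastage mentioned in the alignment note can be worked out with a little arithmetic. The sketch below assumes the start address is rounded up to the next 64MB boundary, which is how the wording "any space below the alignment point will be wasted" reads; the function names are made up for illustration and are not taken from the ia64 code.

```c
#include <assert.h>
#include <stdint.h>

#define CRASH_ALIGN (64ULL << 20)	/* 64MB, per the documentation text */

/* Round addr up to the next multiple of align (align must be a power of two). */
uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Bytes of the reserved region lost below the aligned kernel start. */
uint64_t wasted_below_alignment(uint64_t start)
{
	return align_up(start, CRASH_ALIGN) - start;
}
```

For example, a region that starts at 16MB would have its kernel placed at 64MB, leaving 48MB of the reservation unused, while a region that starts at 128MB loses nothing.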
Re: [Fastboot] [PATCH] Kdump documentation update for 2.6.20: ia64 portion
On Fri, Jan 12, 2007 at 11:46:39AM -0800, Jay Lan wrote: > Horms wrote: > > Hi, > > > > this patch fills in the portions for ia64 kexec. > > > > I'm actually not sure what options are required for the dump-capture > > kernel, but "init 1 irqpoll maxcpus=1" has been working fine for me. > > Or more to the point, I'm not sure if irqpoll is needed or not. > > > > This patch requires the documentation patch update that Vivek Goyal has > > been circulating, and I believe is currently in mm. Feel free to fold it > > into that change if it makes things easier for anyone. > > > > Take II > > > > Nanhai, > > > > I have noted that vmlinux.gz may also be used. And added a note about the > > kernel being able to automatically place the crashkernel region. > > Furthermore, I added a note that if manually specified, the region should > > be 64Mb aligned to avoid wastage. I notice that the auto placement code > > uses 64Mb. But is this strictly neccessary for all page sizes? > > > > Signed-off-by: Simon Horman <[EMAIL PROTECTED]> > > > > Index: linux-2.6/Documentation/kdump/kdump.txt > > === > > --- linux-2.6.orig/Documentation/kdump/kdump.txt2007-01-12 > > 17:45:19.0 +0900 > > +++ linux-2.6/Documentation/kdump/kdump.txt 2007-01-12 17:59:42.0 > > +0900 > > @@ -17,7 +17,7 @@ > > memory image to a dump file on the local disk, or across the network to > > a remote system. > > > > -Kdump and kexec are currently supported on the x86, x86_64, ppc64 and IA64 > > +Kdump and kexec are currently supported on the x86, x86_64, ppc64 and ia64 > > architectures. > > > > When the system kernel boots, it reserves a small section of memory for > > @@ -229,7 +229,23 @@ > > > > Dump-capture kernel config options (Arch Dependent, ia64) > > -- > > -(To be filled) > > + > > +- No specific options are required to create a dump-capture kernel > > + for ia64, other than those specified in the arch idependent section > > + above. 
This means that it is possible to use the system kernel > > + as a dump-capture kernel if desired. > > + > > + The crashkernel region can be automatically placed by the system > > + kernel at run time. This is done by specifying the base address as 0, > > + or omitting it all together. > > In my testing, i found the base address was ignored. Whatever value > specified was fine. Not necessary to be 0. But i guess it is fine to > give people a guideline telling them to specify 0. I submitted a patch to honour non-zero base addresses, I'm pretty sure it is in there now. -- Horms H: http://www.vergenet.net/~horms/ W: http://www.valinux.co.jp/en/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20-rc3 regression: suspend to RAM broken on Mac mini Core Duo
On Sat, Jan 13, 2007 at 04:05:28 +0100, Tino Keitel wrote:

[...]

> I think I found the problem. In 2.6.18, I had a slightly different
> config. With 2.6.20-rc4, I had successful suspend/resume cycles without
> the USB DVB-T box attached. I tweaked the USB options a bit and
> activated some options (CONFIG_USB_SUSPEND,
> CONFIG_USB_MULTITHREAD_PROBE, CONFIG_USB_EHCI_SPLIT_ISO,
> CONFIG_USB_EHCI_ROOT_HUB_TT, CONFIG_USB_EHCI_TT_NEWSCHED) and now I can
> suspend/resume without hangs. At least I haven't seen one until now.

Just after I sent the mail, I had 2 failures again. :-(

Regards,
Tino
Re: Linux v2.6.20-rc5
On 1/13/07, Jeff Chua <[EMAIL PROTECTED]> wrote:
> On 1/13/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > On Fri, 12 Jan 2007 14:27:48 -0500 (EST)
> > Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>   CC [M]  drivers/kvm/vmx.o
> {standard input}: Assembler messages:
> {standard input}:3257: Error: bad register name `%sil'
> make[2]: *** [drivers/kvm/vmx.o] Error 1
> make[1]: *** [drivers/kvm] Error 2
> make: *** [drivers] Error 2
>
> Am I missing something or this is a real problem? Applied
> 2.6.20-rc5-mm-fixes and got this problem.
>
> Using gcc version 3.4.5, binutils-2.17.50.0.8

Same problem with vanilla linux-2.6.20-rc5.

Thanks,
Jeff.
Re: Choosing a HyperThreading/SMP/MultiCore kernel ?
On Fri, 12 Jan 2007 10:03:49 EST, Lennart Sorensen said:
>
> I would expect any distribution should work on these (as long as the
> kernel they use isn't too old.). Of course if it is a Mac, you need a
> distribution that supports their firmware (which is of course not a PC
> bios). As long as you can boot it, any i386 or amd64 kernel with smp
> enabled should use all the processors present (well amd64 on the
> core2duo and on the p4 if it is em64t enabled).

amd64 will only work on a core2duo if it's a T7200 or higher - the lower numbers are 32-bit-only chipsets. I admit not knowing what exact variant the Mac has.

> I believe the closest optimization for a Core2 is probably the Pentium M
> (certainly not the P4/netburst). Not entirely sure though.

CONFIG_MCORE2=y

That's probably even closer :) At least in 2.6.20-rc4-mm1.
Re: Linux v2.6.20-rc5
On 1/13/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Fri, 12 Jan 2007 14:27:48 -0500 (EST)
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> http://userweb.kernel.org/~akpm/2.6.20-rc5-mm-fixes
>
> The KVM and direct-io changes are significant, so if people are testing
> those things, please be sure to have that patch applied.

  CC [M]  drivers/kvm/vmx.o
{standard input}: Assembler messages:
{standard input}:3257: Error: bad register name `%sil'
make[2]: *** [drivers/kvm/vmx.o] Error 1
make[1]: *** [drivers/kvm] Error 2
make: *** [drivers] Error 2

Am I missing something or is this a real problem? Applied 2.6.20-rc5-mm-fixes and got this problem.

Using gcc version 3.4.5, binutils-2.17.50.0.8

Thanks,
Jeff.
2.6.19.1 failing
Hi,

I just built a 2.6.19.1 vanilla kernel based on the previous config (make oldconfig), but for some reason it does not start. I followed the usual procedure with lilo, as many times before, but lilo seems to try to boot the new kernel and then jumps back to the menu screen.

Selecting the old kernel boots just fine.

Can anyone advise on what could cause such behaviour, besides the obvious steps such as checking that I ran lilo after the kernel compile, checking paths ...

Thanks
Mark
[patch 4/7] mm: merge populate and nopage into fault (fixes nonlinear)
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes the virtual address -> file offset differently from linear mappings. I can't see why the filesystem/pagecache code should need to know anything about it, except for the fact that the ->nopage handler didn't quite pass down enough information (ie. pgoff). But it is more logical to pass pgoff rather than have the ->nopage function calculate it itself anyway. And having the nopage handler install the pte itself is sort of nasty. This patch introduces a new fault handler that replaces ->nopage and ->populate and (later) ->nopfn. Most of the old mechanism is still in place so there is a lot of duplication and nice cleanups that can be removed if everyone switches over. The rationale for doing this in the first place is that nonlinear mappings are subject to the pagefault vs invalidate/truncate race too, and it seemed stupid to duplicate the synchronisation logic rather than just consolidate the two. After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in pagecache. Seems like a fringe functionality anyway. NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no users have hit mainline yet. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -168,11 +168,12 @@ extern unsigned int kobjsize(const void #define VM_NONLINEAR 0x0080 /* Is non-linear (remap_file_pages) */ #define VM_MAPPED_COPY 0x0100 /* T if mapped copy of data (nommu mmap) */ #define VM_INSERTPAGE 0x0200 /* The vma has had "vm_insert_page()" done on it */ -#define VM_CAN_INVALIDATE 0x0400 /* The mapping may be invalidated, +#define VM_CAN_INVALIDATE 0x0400 /* The mapping may be invalidated, * eg. truncate or invalidate_inode_*. * In this case, do_no_page must * return with the page locked. 
*/ +#define VM_CAN_NONLINEAR 0x0800/* Has ->fault & does nonlinear pages */ #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS @@ -196,6 +197,26 @@ extern unsigned int kobjsize(const void */ extern pgprot_t protection_map[16]; +#define FAULT_FLAG_WRITE 0x01 +#define FAULT_FLAG_NONLINEAR 0x02 + +/* + * fault_data is filled in the the pagefault handler and passed to the + * vma's ->fault function. That function is responsible for filling in + * 'type', which is the type of fault if a page is returned, or the type + * of error if NULL is returned. + * + * pgoff should be used in favour of address, if possible. If pgoff is + * used, one may set VM_CAN_NONLINEAR in the vma->vm_flags to get + * nonlinear mapping support. + */ +struct fault_data { + unsigned long address; + pgoff_t pgoff; + unsigned int flags; + + int type; +}; /* * These are the virtual MM functions - opening of an area, closing and @@ -205,6 +226,7 @@ extern pgprot_t protection_map[16]; struct vm_operations_struct { void (*open)(struct vm_area_struct * area); void (*close)(struct vm_area_struct * area); + struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata); struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type); unsigned long (*nopfn)(struct vm_area_struct * area, unsigned long address); int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock); @@ -635,7 +657,6 @@ static inline int page_mapped(struct pag */ #define NOPAGE_SIGBUS (NULL) #define NOPAGE_OOM ((struct page *) (-1)) -#define NOPAGE_REFAULT ((struct page *) (-2)) /* Return to userspace, rerun */ /* * Error return values for the *_nopfn functions @@ -669,14 +690,13 @@ extern void pagefault_out_of_memory(void extern void show_free_areas(void); #ifdef CONFIG_SHMEM -struct page *shmem_nopage(struct vm_area_struct *vma, - unsigned long address, int 
*type); +struct page *shmem_fault(struct vm_area_struct *vma, struct fault_data *fdata); int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new); struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr); int shmem_lock(struct file *file, int lock, struct user_struct *user); #else -#define shmem_nopage filemap_nopage +#define shmem_fault filemap_fault static inline int shmem_lock(struct file *file, int lock, struct user_struct *user) @@ -1069,9 +1089,11 @@ extern void truncate_inode_pages_range(s
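The pgoff that the pagefault code would hand to ->fault for a linear mapping is simple arithmetic on the faulting address. The sketch below is a userspace model only: struct vma_model stands in for the two vm_area_struct fields involved, MODEL_PAGE_SHIFT of 12 (4KB pages) is an assumption, and linear_pgoff() is a made-up name. The formula itself mirrors the one used by the legacy filemap_nopage wrapper in this series.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4KB pages; the real value is per-arch */

/* Minimal stand-in for the vm_area_struct fields used here. */
struct vma_model {
	unsigned long vm_start;	/* first virtual address of the mapping */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/* File page offset backing 'address' in a linear (non-remapped) mapping. */
unsigned long linear_pgoff(const struct vma_model *vma, unsigned long address)
{
	assert(address >= vma->vm_start);
	return ((address - vma->vm_start) >> MODEL_PAGE_SHIFT) + vma->vm_pgoff;
}
```

A mapping that starts at 0x10000000 with vm_pgoff 16 and faults at 0x10003abc lands on file page 19. For a nonlinear mapping the offset cannot be derived this way, since it is encoded in the pte, which is one reason for passing pgoff down explicitly.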
[patch 7/7] mm: remove legacy cruft
Remove legacy filemap_nopage and all of the .populate API cruft. This patch is optional and can be left out (eg. for a cleaner merge with -mm), and rebased after the previous patches go upstream. include/linux/mm.h |9 -- mm/filemap.c | 195 - mm/fremap.c| 71 ++- mm/memory.c| 37 ++ 4 files changed, 21 insertions(+), 291 deletions(-) Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -228,8 +228,6 @@ struct vm_operations_struct { void (*close)(struct vm_area_struct * area); struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata); struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type); - int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock); - /* notification that a previously read-only page is about to become * writable, if an error is returned it will cause a SIGBUS */ int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page); @@ -771,8 +769,6 @@ static inline void unmap_shared_mapping_ extern int vmtruncate(struct inode * inode, loff_t offset); extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); -extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot); -extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot); #ifdef CONFIG_MMU extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, @@ -1083,10 +1079,6 @@ extern void truncate_inode_pages_range(s /* generic vm_area_ops exported for stackable file systems */ extern struct page *filemap_fault(struct vm_area_struct *, struct fault_data *); -extern struct page * __deprecated_for_modules filemap_nopage( - struct vm_area_struct *, unsigned long, int *); -extern int 
__deprecated_for_modules filemap_populate(struct vm_area_struct *, - unsigned long, unsigned long, pgprot_t, unsigned long, int); /* mm/page-writeback.c */ int write_one_page(struct page *page, int wait); Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1496,201 +1496,6 @@ page_not_uptodate: } EXPORT_SYMBOL(filemap_fault); -/* - * filemap_nopage and filemap_populate are legacy exports that are not used - * in tree. Scheduled for removal. - */ -struct page *filemap_nopage(struct vm_area_struct *area, - unsigned long address, int *type) -{ - struct page *page; - struct fault_data fdata; - fdata.address = address; - fdata.pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) - + area->vm_pgoff; - fdata.flags = 0; - - page = filemap_fault(area, ); - if (type) - *type = fdata.type; - - return page; -} -EXPORT_SYMBOL(filemap_nopage); - -static struct page * filemap_getpage(struct file *file, unsigned long pgoff, - int nonblock) -{ - struct address_space *mapping = file->f_mapping; - struct page *page; - int error; - - /* -* Do we have something in the page cache already? -*/ -retry_find: - page = find_get_page(mapping, pgoff); - if (!page) { - if (nonblock) - return NULL; - goto no_cached_page; - } - - /* -* Ok, found a page in the page cache, now we need to check -* that it's up-to-date. -*/ - if (!PageUptodate(page)) { - if (nonblock) { - page_cache_release(page); - return NULL; - } - goto page_not_uptodate; - } - -success: - /* -* Found the page and have a reference on it. -*/ - mark_page_accessed(page); - return page; - -no_cached_page: - error = page_cache_read(file, pgoff); - - /* -* The page we want has now been added to the page cache. -* In the unlikely event that someone removed it in the -* meantime, we'll just come back here and read it again. 
-*/ - if (error >= 0) - goto retry_find; - - /* -* An error return from page_cache_read can result if the -* system is low on memory, or a problem occurs while trying -* to schedule I/O. -*/ - return NULL; - -page_not_uptodate: - lock_page(page); - - /* Did it get truncated
[patch 6/7] mm: merge nopfn into fault
Remove ->nopfn and reimplement the only existing handler using ->fault Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/drivers/char/mspec.c === --- linux-2.6.orig/drivers/char/mspec.c +++ linux-2.6/drivers/char/mspec.c @@ -182,24 +182,25 @@ mspec_close(struct vm_area_struct *vma) /* - * mspec_nopfn + * mspec_fault * * Creates a mspec page and maps it to user space. */ -static unsigned long -mspec_nopfn(struct vm_area_struct *vma, unsigned long address) +static struct page * +mspec_fault(struct fault_data *fdata) { unsigned long paddr, maddr; unsigned long pfn; - int index; - struct vma_data *vdata = vma->vm_private_data; + int index = fdata->pgoff; + struct vma_data *vdata = fdata->vma->vm_private_data; - index = (address - vma->vm_start) >> PAGE_SHIFT; maddr = (volatile unsigned long) vdata->maddr[index]; if (maddr == 0) { maddr = uncached_alloc_page(numa_node_id()); - if (maddr == 0) - return NOPFN_OOM; + if (maddr == 0) { + fdata->type = VM_FAULT_OOM; + return NULL; + } spin_lock(>lock); if (vdata->maddr[index] == 0) { @@ -219,13 +220,21 @@ mspec_nopfn(struct vm_area_struct *vma, pfn = paddr >> PAGE_SHIFT; - return pfn; + fdata->type = VM_FAULT_MINOR; + /* +* vm_insert_pfn can fail with -EBUSY, but in that case it will +* be because another thread has installed the pte first, so it +* is no problem. 
+*/ + vm_insert_pfn(fdata->vma, fdata->address, pfn); + + return NULL; } static struct vm_operations_struct mspec_vm_ops = { .open = mspec_open, .close = mspec_close, - .nopfn = mspec_nopfn + .fault = mspec_fault, }; /* Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -228,7 +228,6 @@ struct vm_operations_struct { void (*close)(struct vm_area_struct * area); struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata); struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type); - unsigned long (*nopfn)(struct vm_area_struct * area, unsigned long address); int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock); /* notification that a previously read-only page is about to become @@ -659,12 +658,6 @@ static inline int page_mapped(struct pag #define NOPAGE_OOM ((struct page *) (-1)) /* - * Error return values for the *_nopfn functions - */ -#define NOPFN_SIGBUS ((unsigned long) -1) -#define NOPFN_OOM ((unsigned long) -2) - -/* * Different kinds of faults, as returned by handle_mm_fault(). * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -1288,6 +1288,11 @@ EXPORT_SYMBOL(vm_insert_page); * * This function should only be called from a vm_ops->fault handler, and * in that case the handler should return NULL. + * + * vma cannot be a COW mapping. + * + * As this is called only for pages that do not currently exist, we + * do not need to flush old virtual caches or the TLB. 
*/ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { @@ -2346,54 +2351,6 @@ static int do_nonlinear_fault(struct mm_ } /* - * do_no_pfn() tries to create a new page mapping for a page without - * a struct_page backing it - * - * As this is called only for pages that do not currently exist, we - * do not need to flush old virtual caches or the TLB. - * - * We enter with non-exclusive mmap_sem (to exclude vma changes, - * but allow concurrent faults), and pte mapped but not yet locked. - * We return with mmap_sem still held, but pte unmapped and unlocked. - * - * It is expected that the ->nopfn handler always returns the same pfn - * for a given virtual mapping. - * - * Mark this `noinline' to prevent it from bloating the main pagefault code. - */ -static noinline int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma, -unsigned long address, pte_t *page_table, pmd_t *pmd, -int write_access) -{ - spinlock_t *ptl; - pte_t entry; - unsigned long pfn; - int ret = VM_FAULT_MINOR; - - pte_unmap(page_table); - BUG_ON(!(vma->vm_flags & VM_PFNMAP)); - BUG_ON(is_cow_mapping(vma->vm_flags)); - - pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK); - if
[patch 5/7] mm: add vm_insert_pfn
Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn functionality by installing their own pte and returning NULL.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1151,6 +1151,7 @@ unsigned long vmalloc_to_pfn(void *addr)
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
+int vm_insert_pfn(struct vm_area_struct *, unsigned long addr, unsigned long pfn);
 
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
 			unsigned int foll_flags);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1277,6 +1277,50 @@ int vm_insert_page(struct vm_area_struct
 }
 EXPORT_SYMBOL(vm_insert_page);
 
+/**
+ * vm_insert_pfn - insert single pfn into user vma
+ * @vma: user vma to map to
+ * @addr: target user address of this page
+ * @pfn: source kernel pfn
+ *
+ * Similar to vm_inert_page, this allows drivers to insert individual pages
+ * they've allocated into a user vma. Same comments apply.
+ *
+ * This function should only be called from a vm_ops->fault handler, and
+ * in that case the handler should return NULL.
+ */
+int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int retval;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+
+	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+	BUG_ON(is_cow_mapping(vma->vm_flags));
+
+	retval = -ENOMEM;
+	pte = get_locked_pte(mm, addr, &ptl);
+	if (!pte)
+		goto out;
+	retval = -EBUSY;
+	if (!pte_none(*pte))
+		goto out_unlock;
+
+	/* Ok, finally just insert the thing.. */
+	entry = pfn_pte(pfn, vma->vm_page_prot);
+	set_pte_at(mm, addr, pte, entry);
+	update_mmu_cache(vma, addr, entry);
+
+	retval = 0;
+out_unlock:
+	pte_unmap_unlock(pte, ptl);
+
+out:
+	return retval;
+}
+EXPORT_SYMBOL(vm_insert_pfn);
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
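The return-value contract of vm_insert_pfn can be illustrated outside the kernel. What follows is only a userspace sketch, not kernel code: toy_insert_pfn, toy_fault, struct toy_page_table and the TOY_* constants are hypothetical stand-ins, locking is left out entirely, and -ENOMEM is modelled as an out-of-range index. It shows why a ->fault handler can treat -EBUSY as success: the slot is already populated, which in the kernel means another thread installed the pte first.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_NR_PTES	16	/* hypothetical: size of the toy page table */
#define TOY_PTE_NONE	0UL	/* hypothetical: encoding of an empty pte */

typedef unsigned long toy_pte_t;

struct toy_page_table {
	toy_pte_t pte[TOY_NR_PTES];
};

/*
 * Model of the return contract only:
 *   -ENOMEM  no pte slot could be obtained (here: index out of range)
 *   -EBUSY   a pte is already installed in this slot
 *   0        the pfn was installed
 * The pfn is stored as pfn + 1 so that pfn 0 differs from "none".
 */
static int toy_insert_pfn(struct toy_page_table *pt, size_t index,
			  unsigned long pfn)
{
	assert(pt != NULL);

	if (index >= TOY_NR_PTES)
		return -ENOMEM;
	if (pt->pte[index] != TOY_PTE_NONE)
		return -EBUSY;
	pt->pte[index] = pfn + 1;
	return 0;
}

/*
 * Model of a fault handler that ignores -EBUSY: if the slot is already
 * filled, somebody else got there first and the fault is resolved.
 */
static int toy_fault(struct toy_page_table *pt, size_t index,
		     unsigned long pfn)
{
	int ret = toy_insert_pfn(pt, index, pfn);

	if (ret == -EBUSY)
		ret = 0;
	return ret;
}
```

A second toy_fault on the same index returns 0 and leaves the existing entry in place, which is the behaviour the handler relies on when it returns NULL after calling vm_insert_pfn.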
[patch 3/7] mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.

Andrea Arcangeli identified a subtle race between invalidation of pages from pagecache with userspace mappings, and do_no_page.

The issue is that invalidation has to shoot down all mappings to the page, before it can be discarded from the pagecache. Between shooting down ptes to a particular page, and actually dropping the struct page from the pagecache, do_no_page from any process might fault on that page and establish a new mapping to the page just before it gets discarded from the pagecache.

The most common case where such invalidation is used is in file truncation. This case was catered for by doing a sort of open-coded seqlock between the file's i_size, and its truncate_count.

Truncation will decrease i_size, then increment truncate_count before unmapping userspace pages; do_no_page will read truncate_count, then find the page if it is within i_size, and then check truncate_count under the page table lock and back out and retry if it had subsequently been changed (ptl will serialise against unmapping, and ensure a potentially updated truncate_count is actually visible).

Complexity and documentation issues aside, the locking protocol fails in the case where we would like to invalidate pagecache inside i_size. do_no_page can come in anytime and filemap_nopage is not aware of the invalidation in progress (as it is when it is outside i_size). The end result is that dangling (->mapping == NULL) pages that appear to be from a particular file may be mapped into userspace with nonsense data. Valid mappings to the same place will see a different page.

Andrea implemented two working fixes, one using a real seqlock, another using a page->flags bit. He also proposed using the page lock in do_no_page, but that was initially considered too heavyweight.
However, it is not a global or per-file lock, and the page cacheline is modified in do_no_page to increment _count and _mapcount anyway, so a further modification should not be a large performance hit. Scalability is not an issue. This patch implements this latter approach. ->nopage implementations return with the page locked if it is possible for their underlying file to be invalidated (in that case, they must set a special vm_flags bit to indicate so). do_no_page only unlocks the page after setting up the mapping completely. invalidation is excluded because it holds the page lock during invalidation of each page (and ensures that the page is not mapped while holding the lock). This also allows significant simplifications in do_no_page, because we have the page locked in the right place in the pagecache from the start. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -168,6 +168,11 @@ extern unsigned int kobjsize(const void #define VM_NONLINEAR 0x0080 /* Is non-linear (remap_file_pages) */ #define VM_MAPPED_COPY 0x0100 /* T if mapped copy of data (nommu mmap) */ #define VM_INSERTPAGE 0x0200 /* The vma has had "vm_insert_page()" done on it */ +#define VM_CAN_INVALIDATE 0x0400 /* The mapping may be invalidated, +* eg. truncate or invalidate_inode_*. +* In this case, do_no_page must +* return with the page locked. 
+*/ #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1349,9 +1349,10 @@ struct page *filemap_nopage(struct vm_ar unsigned long size, pgoff; int did_readaround = 0, majmin = VM_FAULT_MINOR; + BUG_ON(!(area->vm_flags & VM_CAN_INVALIDATE)); + pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff; -retry_all: size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; if (pgoff >= size) goto outside_data_content; @@ -1373,7 +1374,7 @@ retry_all: * Do we have something in the page cache already? */ retry_find: - page = find_get_page(mapping, pgoff); + page = find_lock_page(mapping, pgoff); if (!page) { unsigned long ra_pages; @@ -1407,7 +1408,7 @@ retry_find: start = pgoff - ra_pages / 2; do_page_cache_readahead(mapping, file, start, ra_pages); } - page = find_get_page(mapping, pgoff); + page = find_lock_page(mapping, pgoff); if (!page) goto no_cached_page; } @@ -1416,13 +1417,19 @@
[patch 1/7] mm: debug check for the fault vs invalidate race
Add a bugcheck for Andrea's pagefault vs invalidate race. This is triggerable for both linear and nonlinear pages with a userspace test harness (using direct IO and truncate, respectively).

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -120,6 +120,8 @@ void __remove_from_page_cache(struct pag
 	page->mapping = NULL;
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
+
+	BUG_ON(page_mapped(page));
 }
 
 void remove_from_page_cache(struct page *page)
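The invariant behind this check (a page leaving the pagecache must have no user mappings) can be modelled in userspace. This is a single-threaded sketch under stated assumptions, not kernel code and not a reproduction of the race: struct toy_page and the toy_* functions are hypothetical, the page lock is a plain flag, and the fault path follows the page-lock protocol described in patch 3/7, where the mapping is only set up while the page is locked and still in the pagecache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool locked;	/* models the page lock */
	void *mapping;	/* NULL once removed from the pagecache */
	int mapcount;	/* number of user ptes pointing at the page */
};

static bool toy_page_mapped(const struct toy_page *page)
{
	return page->mapcount > 0;
}

/* Models removal from the pagecache, including the new debug check. */
static void toy_remove_from_page_cache(struct toy_page *page)
{
	page->mapping = NULL;
	assert(!toy_page_mapped(page));	/* stands in for BUG_ON() */
}

/* Invalidation: lock the page, shoot down mappings, then remove it. */
static void toy_invalidate(struct toy_page *page)
{
	assert(!page->locked);
	page->locked = true;
	page->mapcount = 0;
	toy_remove_from_page_cache(page);
	page->locked = false;
}

/*
 * Fault: map the page only while holding its lock and only if it is
 * still in the pagecache. Returns false if the page was invalidated,
 * in which case a real fault handler would look the page up again.
 */
static bool toy_fault_map(struct toy_page *page)
{
	bool mapped = false;

	assert(!page->locked);
	page->locked = true;
	if (page->mapping != NULL) {
		page->mapcount++;
		mapped = true;
	}
	page->locked = false;
	return mapped;
}
```

Because both paths take the lock, a fault either completes before invalidation unmaps the page or sees ->mapping == NULL afterwards, so the assertion in toy_remove_from_page_cache never fires in this model.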
[patch 0/7] fault vs truncate/invalidate race fix
The following set of patches fixes the fault vs invalidate and fault vs truncate_range races for filemap_nopage mappings, plus those races and the fault vs truncate race for nonlinear mappings.

It hasn't changed since I last submitted it, when it was rejected because it made one of the buffered write deadlocks easier to hit. I'll try again.

Patches based on 2.6.20-rc4. Comments?

Thanks,
Nick

--
SuSE Labs
[patch 2/7] mm: simplify filemap_nopage
Identical block is duplicated twice: contrary to the comment, we have been re-reading the page *twice* in filemap_nopage rather than once.

If any retry logic or anything is needed, it belongs in lower levels anyway. Only retry once. Linus agrees.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1468,30 +1468,6 @@ page_not_uptodate:
 		majmin = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 	}
-	lock_page(page);
-
-	/* Did it get unhashed while we waited for it? */
-	if (!page->mapping) {
-		unlock_page(page);
-		page_cache_release(page);
-		goto retry_all;
-	}
-
-	/* Did somebody else get it up-to-date? */
-	if (PageUptodate(page)) {
-		unlock_page(page);
-		goto success;
-	}
-
-	error = mapping->a_ops->readpage(file, page);
-	if (!error) {
-		wait_on_page_locked(page);
-		if (PageUptodate(page))
-			goto success;
-	} else if (error == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry_find;
-	}
 
 	/*
 	 * Umm, take care of errors if the page isn't up-to-date.
[patch 4/10] mm: generic_file_buffered_write cleanup
From: Andrew Morton <[EMAIL PROTECTED]>

Clean up buffered write code. Rename some variables and fix some types.

Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1854,16 +1854,15 @@ generic_file_buffered_write(struct kiocb
 		size_t count, ssize_t written)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space * mapping = file->f_mapping;
+	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
 	struct page	*page;
 	struct page	*cached_page = NULL;
-	size_t		bytes;
 	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
-	size_t		iov_base = 0;	   /* offset in the current iovec */
+	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
 	pagevec_init(&lru_pvec, 0);
@@ -1874,31 +1873,33 @@ generic_file_buffered_write(struct kiocb
 	if (likely(nr_segs == 1))
 		buf = iov->iov_base + written;
 	else {
-		filemap_set_next_iovec(&cur_iov, &iov_base, written);
-		buf = cur_iov->iov_base + iov_base;
+		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+		buf = cur_iov->iov_base + iov_offset;
 	}
 
 	do {
-		unsigned long index;
-		unsigned long offset;
-		unsigned long maxlen;
-		size_t copied;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long maxlen;	/* Bytes remaining in current iovec */
+		size_t bytes;		/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
+		maxlen = cur_iov->iov_len - iov_offset;
+		if (maxlen > bytes)
+			maxlen = bytes;
+
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		maxlen = cur_iov->iov_len - iov_base;
-		if (maxlen > bytes)
-			maxlen = bytes;
 		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1929,7 +1930,7 @@ generic_file_buffered_write(struct kiocb
 							buf, bytes);
 		else
 			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_base, bytes);
+						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
@@ -1947,12 +1948,12 @@ generic_file_buffered_write(struct kiocb
 				buf += status;
 				if (unlikely(nr_segs > 1)) {
 					filemap_set_next_iovec(&cur_iov,
-							&iov_base, status);
+							&iov_offset, status);
 					if (count)
 						buf = cur_iov->iov_base +
-							iov_base;
+							iov_offset;
 				} else {
-					iov_base += status;
+					iov_offset += status;
 				}
 			}
 		}
[patch 6/10] mm: be sure to trim blocks
If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then we may have failed the write operation despite prepare_write having instantiated blocks past i_size. Fix this, and consolidate the trimming into one place.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,22 +1911,9 @@ generic_file_buffered_write(struct kiocb
 		}
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
-		if (unlikely(status)) {
-			loff_t isize = i_size_read(inode);
+		if (unlikely(status))
+			goto fs_write_aop_error;
 
-			if (status != AOP_TRUNCATED_PAGE)
-				unlock_page(page);
-			page_cache_release(page);
-			if (status == AOP_TRUNCATED_PAGE)
-				continue;
-			/*
-			 * prepare_write() may have instantiated a few blocks
-			 * outside i_size.  Trim these off again.
-			 */
-			if (pos + bytes > isize)
-				vmtruncate(inode, isize);
-			break;
-		}
 		if (likely(nr_segs == 1))
 			copied = filemap_copy_from_user(page, offset,
 							buf, bytes);
@@ -1935,10 +1922,9 @@ generic_file_buffered_write(struct kiocb
 						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
-		if (status == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			continue;
-		}
+		if (unlikely(status))
+			goto fs_write_aop_error;
+
 		if (likely(copied > 0)) {
 			if (!status)
 				status = copied;
@@ -1969,6 +1955,25 @@ generic_file_buffered_write(struct kiocb
 			break;
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
+		continue;
+
+fs_write_aop_error:
+		if (status != AOP_TRUNCATED_PAGE)
+			unlock_page(page);
+		page_cache_release(page);
+
+		/*
+		 * prepare_write() may have instantiated a few blocks
+		 * outside i_size.  Trim these off again. Don't need
+		 * i_size_read because we hold i_mutex.
+		 */
+		if (pos + bytes > inode->i_size)
+			vmtruncate(inode, inode->i_size);
+		if (status == AOP_TRUNCATED_PAGE)
+			continue;
+		else
+			break;
+
 	} while (count);
 	*ppos = pos;
 
[patch 10/10] mm: fix pagecache write deadlocks
Modify the core write() code so that it won't take a pagefault while holding a lock on the pagecache page. There are a number of different deadlocks possible if we try to do such a thing:

  1.  generic_buffered_write
  2.   lock_page
  3.    prepare_write
  4.     unlock_page+vmtruncate
  5.     copy_from_user
  6.      mmap_sem(r)
  7.       handle_mm_fault
  8.        lock_page (filemap_nopage)
  9.    commit_write
  a.   unlock_page

  b. sys_munmap / sys_mlock / others
  c.  mmap_sem(w)
  d.   make_pages_present
  e.    get_user_pages
  f.     handle_mm_fault
  g.      lock_page (filemap_nopage)

  2,8      - recursive deadlock if page is same
  2,8;2,8  - ABBA deadlock if page is different
  2,6;c,g  - ABBA deadlock if page is same

The solution is as follows:

1.  If we find the destination page is uptodate, continue as normal, but use atomic usercopies which do not take pagefaults and do not zero the uncopied tail of the destination. The destination is already uptodate, so we can commit_write the full length even if there was a partial copy: it does not matter that the tail was not modified, because if it is dirtied and written back to disk it will not cause any problems (uptodate *means* that the destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access information, because atomic usercopies cannot distinguish non-present pages in a readable mapping from the lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be made slightly more optimal), then find and pin the source page with get_user_pages. Relock the destination page and continue with the copy. However, instead of a usercopy (which might take a fault), copy the data via the kernel address space.
(also, rename maxlen to seglen, because it was confusing) Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1843,11 +1843,12 @@ generic_file_buffered_write(struct kiocb filemap_set_next_iovec(_iov, nr_segs, _offset, written); do { + struct page *src_page; struct page *page; pgoff_t index; /* Pagecache index for current page */ unsigned long offset; /* Offset into pagecache page */ - unsigned long maxlen; /* Bytes remaining in current iovec */ - size_t bytes; /* Bytes to write to page */ + unsigned long seglen; /* Bytes remaining in current iovec */ + unsigned long bytes;/* Bytes to write to page */ size_t copied; /* Bytes copied from user */ buf = cur_iov->iov_base + iov_offset; @@ -1857,20 +1858,30 @@ generic_file_buffered_write(struct kiocb if (bytes > count) bytes = count; - maxlen = cur_iov->iov_len - iov_offset; - if (maxlen > bytes) - maxlen = bytes; + /* +* a non-NULL src_page indicates that we're doing the +* copy via get_user_pages and kmap. +*/ + src_page = NULL; + + seglen = cur_iov->iov_len - iov_offset; + if (seglen > bytes) + seglen = bytes; -#ifndef CONFIG_DEBUG_VM /* * Bring in the user page that we will copy from _first_. * Otherwise there's a nasty deadlock on copying from the * same page as we're writing to, without it being marked * up-to-date. +* +* Not only is this an optimisation, but it is also required +* to check that the address is actually valid, when atomic +* usercopies are used, below. */ - fault_in_pages_readable(buf, maxlen); -#endif - + if (unlikely(fault_in_pages_readable(buf, seglen))) { + status = -EFAULT; + break; + } page = __grab_cache_page(mapping, index); if (!page) { @@ -1878,31 +1889,88 @@ generic_file_buffered_write(struct kiocb break; } + /* +* non-uptodate pages cannot cope with short copies, and we +* cannot take a pagefault with the destination page locked. +* So pin the source page to copy it. 
+*/ + if (!PageUptodate(page)) { + unlock_page(page); + + bytes = min(bytes, PAGE_CACHE_SIZE - +((unsigned long)buf & ~PAGE_CACHE_MASK)); + + /* +* Cannot
[patch 9/10] mm: generic_file_buffered_write iovec cleanup
Hide some of the open-coded nr_segs tests into the iovec helpers. This is all to simplify generic_file_buffered_write, because that gets more complex in the next patch. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/mm/filemap.h === --- linux-2.6.orig/mm/filemap.h +++ linux-2.6/mm/filemap.h @@ -22,82 +22,82 @@ __filemap_copy_from_user_iovec_inatomic( /* * Copy as much as we can into the page and return the number of bytes which - * were sucessfully copied. If a fault is encountered then clear the page - * out to (offset+bytes) and return the number of bytes which were copied. - * - * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache - * to *NOT* zero any tail of the buffer that it failed to copy. If it does, - * and if the following non-atomic copy succeeds, then there is a small window - * where the target page contains neither the data before the write, nor the - * data after the write (it contains zero). A read at this time will see - * data that is inconsistent with any ordering of the read and the write. - * (This has been detected in practice). + * were sucessfully copied. If a fault is encountered then return the number of + * bytes which were copied. 
*/ static inline size_t -filemap_copy_from_user(struct page *page, unsigned long offset, - const char __user *buf, unsigned bytes) +filemap_copy_from_user_atomic(struct page *page, unsigned long offset, + const struct iovec *iov, unsigned long nr_segs, + size_t base, size_t bytes) { char *kaddr; - int left; + size_t copied; kaddr = kmap_atomic(page, KM_USER0); - left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes); + if (likely(nr_segs == 1)) { + int left; + char __user *buf = iov->iov_base + base; + left = __copy_from_user_inatomic_nocache(kaddr + offset, + buf, bytes); + copied = bytes - left; + } else { + copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, + iov, base, bytes); + } kunmap_atomic(kaddr, KM_USER0); - if (left != 0) { - /* Do it the slow way */ - kaddr = kmap(page); - left = __copy_from_user_nocache(kaddr + offset, buf, bytes); - kunmap(page); - } - return bytes - left; + return copied; } /* - * This has the same sideeffects and return value as filemap_copy_from_user(). - * The difference is that on a fault we need to memset the remainder of the - * page (out to offset+bytes), to emulate filemap_copy_from_user()'s - * single-segment behaviour. + * This has the same sideeffects and return value as + * filemap_copy_from_user_atomic(). + * The difference is that it attempts to resolve faults. 
*/ static inline size_t -filemap_copy_from_user_iovec(struct page *page, unsigned long offset, - const struct iovec *iov, size_t base, size_t bytes) +filemap_copy_from_user(struct page *page, unsigned long offset, + const struct iovec *iov, unsigned long nr_segs, +size_t base, size_t bytes) { char *kaddr; size_t copied; - kaddr = kmap_atomic(page, KM_USER0); - copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov, -base, bytes); - kunmap_atomic(kaddr, KM_USER0); - if (copied != bytes) { - kaddr = kmap(page); - copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov, -base, bytes); - if (bytes - copied) - memset(kaddr + offset + copied, 0, bytes - copied); - kunmap(page); + kaddr = kmap(page); + if (likely(nr_segs == 1)) { + int left; + char __user *buf = iov->iov_base + base; + left = __copy_from_user_nocache(kaddr + offset, buf, bytes); + copied = bytes - left; + } else { + copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, + iov, base, bytes); } + kunmap(page); return copied; } static inline void -filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes) +filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs, +size_t *basep, size_t bytes) { - const struct iovec *iov = *iovp; - size_t base = *basep; - - while (bytes) { - int copy = min(bytes, iov->iov_len - base); - -
[patch 8/10] mm: generic_file_buffered_write cleanup more
No need to do the confusing switch of variables from copied into status.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1898,28 +1898,22 @@ generic_file_buffered_write(struct kiocb
 			goto fs_write_aop_error;
 
 		if (likely(copied > 0)) {
-			if (!status)
-				status = copied;
-
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-				if (unlikely(nr_segs > 1)) {
-					filemap_set_next_iovec(&cur_iov,
-							&iov_offset, status);
-					if (count)
-						buf = cur_iov->iov_base +
-							iov_offset;
-				} else {
-					iov_offset += status;
-				}
+			written += copied;
+			count -= copied;
+			pos += copied;
+			buf += copied;
+			if (unlikely(nr_segs > 1)) {
+				filemap_set_next_iovec(&cur_iov,
+						&iov_offset, copied);
+				if (count)
+					buf = cur_iov->iov_base + iov_offset;
+			} else {
+				iov_offset += copied;
 			}
 		}
 		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
+			status = -EFAULT;
+
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);
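For readers unfamiliar with the iovec bookkeeping this hunk touches, here is a userspace sketch of the kind of cursor advance that filemap_set_next_iovec performs. It is an illustration written for this note, not the kernel helper: the name advance_iovec is hypothetical, it uses struct iovec from <sys/uio.h>, and it assumes the caller never advances past the total length of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Advance an (iovec, offset) cursor by 'bytes'.
 *
 * *iovp points at the current segment and *basep is the offset inside
 * it. When a segment is fully consumed the cursor moves to the start
 * of the next one. If the whole array is consumed, *iovp ends up one
 * past the last segment and must not be dereferenced.
 */
static void advance_iovec(const struct iovec **iovp, size_t *basep,
			  size_t bytes)
{
	const struct iovec *iov;
	size_t base;

	assert(iovp != NULL && basep != NULL);
	iov = *iovp;
	base = *basep;

	while (bytes) {
		size_t room = iov->iov_len - base;
		size_t step = bytes < room ? bytes : room;

		bytes -= step;
		base += step;
		if (base == iov->iov_len) {
			iov++;
			base = 0;
		}
	}
	*iovp = iov;
	*basep = base;
}
```

The one-past-the-end case is why the patched code only recomputes buf from cur_iov when count is still non-zero.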
[patch 5/10] mm: debug write deadlocks
Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the difficult race where the page may be unmapped before calling copy_from_user. Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a subsequent patch.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1894,6 +1894,7 @@ generic_file_buffered_write(struct kiocb
 		if (maxlen > bytes)
 			maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -1901,6 +1902,7 @@ generic_file_buffered_write(struct kiocb
 		 * up-to-date.
 		 */
 		fault_in_pages_readable(buf, maxlen);
+#endif
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {
[patch 7/10] mm: cleanup pagecache insertion operations
Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file, to make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -686,26 +686,22 @@ EXPORT_SYMBOL(find_lock_page);
 struct page *find_or_create_page(struct address_space *mapping,
 		unsigned long index, gfp_t gfp_mask)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_lock_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = alloc_page(gfp_mask);
-			if (!cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, gfp_mask);
-		if (!err) {
-			page = cached_page;
-			cached_page = NULL;
-		} else if (err == -EEXIST)
-			goto repeat;
+		page = alloc_page(gfp_mask);
+		if (!page)
+			return NULL;
+		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			page = NULL;
+			if (err == -EEXIST)
+				goto repeat;
+		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 EXPORT_SYMBOL(find_or_create_page);
@@ -891,11 +887,9 @@ void do_generic_mapping_read(struct addr
 	unsigned long next_index;
 	unsigned long prev_index;
 	loff_t isize;
-	struct page *cached_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
-	cached_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -1059,23 +1053,20 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page) {
-				desc->error = -ENOMEM;
-				goto out;
-			}
+		page = page_cache_alloc_cold(mapping);
+		if (!page) {
+			desc->error = -ENOMEM;
+			goto out;
 		}
-		error = add_to_page_cache_lru(cached_page, mapping,
+		error = add_to_page_cache_lru(page, mapping,
 						index, GFP_KERNEL);
 		if (error) {
+			page_cache_release(page);
 			if (error == -EEXIST)
 				goto find_page;
 			desc->error = error;
 			goto out;
 		}
-		page = cached_page;
-		cached_page = NULL;
 		goto readpage;
 	}
 
@@ -1083,8 +1074,6 @@ out:
 	*_ra = ra;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
-	if (cached_page)
-		page_cache_release(cached_page);
 	if (filp)
 		file_accessed(filp);
 }
@@ -1542,35 +1531,28 @@ static inline struct page *__read_cache_
 				int (*filler)(void *,struct page*),
 				void *data)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page)
-				return ERR_PTR(-ENOMEM);
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, GFP_KERNEL);
-		if (err == -EEXIST)
-			goto
[patch 3/10] mm: revert "generic_file_buffered_write(): deadlock on vectored write"
From: Andrew Morton <[EMAIL PROTECTED]>

Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the first segment of the iovec. If the second or
  successive segment described a mmapping of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to lock
  the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

  (These demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment. So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once.

The other problem with the fix is that it doesn't fix all the locking problems.

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.
Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1881,21 +1881,14 @@ generic_file_buffered_write(struct kiocb
 	do {
 		unsigned long index;
 		unsigned long offset;
+		unsigned long maxlen;
 		size_t copied;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
-
-		/* Limit the size of the copy to the caller's write size */
-		bytes = min(bytes, count);
-
-		/*
-		 * Limit the size of the copy to that of the current segment,
-		 * because fault_in_pages_readable() doesn't know how to walk
-		 * segments.
-		 */
-		bytes = min(bytes, cur_iov->iov_len - iov_base);
+		if (bytes > count)
+			bytes = count;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -1903,7 +1896,10 @@ generic_file_buffered_write(struct kiocb
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		fault_in_pages_readable(buf, bytes);
+		maxlen = cur_iov->iov_len - iov_base;
+		if (maxlen > bytes)
+			maxlen = bytes;
+		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {
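
For readers following the arithmetic in this revert, here is a minimal userspace sketch of the two length computations. It is not kernel code: `SKETCH_PAGE_SIZE`, `chunk_bytes()` and `prefault_len()` are hypothetical names made up for illustration, and the page size of 4096 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed stand-in for PAGE_CACHE_SIZE */

/*
 * Bytes to copy in one loop iteration: the rest of the current page,
 * capped by what is left of the whole write.
 */
static size_t chunk_bytes(unsigned long long pos, size_t count)
{
	size_t offset = (size_t)(pos & (SKETCH_PAGE_SIZE - 1));	/* within page */
	size_t bytes = SKETCH_PAGE_SIZE - offset;

	if (bytes > count)
		bytes = count;
	return bytes;
}

/*
 * Bytes to prefault: whatever is left of the current iovec segment,
 * but never more than the chunk being copied.
 */
static size_t prefault_len(size_t seg_len, size_t seg_base, size_t bytes)
{
	size_t maxlen;

	assert(seg_base <= seg_len);	/* cursor never passes the segment end */
	maxlen = seg_len - seg_base;
	if (maxlen > bytes)
		maxlen = bytes;
	return maxlen;
}
```

With the revert applied, the copy length is no longer clamped to the current segment; only the prefault length is, so a chunk may span several segments while only the first one is faulted in up front.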
[patch 2/10] mm: revert "generic_file_buffered_write(): handle zero length iovec segments"
From: Andrew Morton <[EMAIL PROTECTED]>

Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6.

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,12 +1911,6 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
-		if (unlikely(bytes == 0)) {
-			status = 0;
-			copied = 0;
-			goto zero_length_segment;
-		}
-
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
 			loff_t isize = i_size_read(inode);
@@ -1946,8 +1940,7 @@ generic_file_buffered_write(struct kiocb
 			page_cache_release(page);
 			continue;
 		}
-zero_length_segment:
-		if (likely(copied >= 0)) {
+		if (likely(copied > 0)) {
 			if (!status)
 				status = copied;
 
Index: linux-2.6/mm/filemap.h
===
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
 	const struct iovec *iov = *iovp;
 	size_t base = *basep;
 
-	do {
+	while (bytes) {
 		int copy = min(bytes, iov->iov_len - base);
 
 		bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
 			iov++;
 			base = 0;
 		}
-	} while (bytes);
+	}
 	*iovp = iov;
 	*basep = base;
 }
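
The behavioural difference between the two loop forms in filemap_set_next_iovec() is easy to miss in the diff, so here is a small userspace sketch of it. Everything here is illustrative: `struct seg` is a hypothetical stand-in for struct iovec, and `advance_while()`, `advance_do_while()` and `demo_end_index()` are invented names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

struct seg {			/* hypothetical stand-in for struct iovec */
	size_t len;
};

/* while form (after the revert): bytes == 0 leaves the cursor untouched. */
static void advance_while(const struct seg **segp, size_t *basep, size_t bytes)
{
	const struct seg *seg = *segp;
	size_t base = *basep;

	while (bytes) {
		size_t copy = seg->len - base;

		if (copy > bytes)
			copy = bytes;
		bytes -= copy;
		base += copy;
		if (seg->len == base) {	/* segment used up: step to the next */
			seg++;
			base = 0;
		}
	}
	*segp = seg;
	*basep = base;
}

/* do/while form (being reverted): the body runs once even for bytes == 0. */
static void advance_do_while(const struct seg **segp, size_t *basep,
			     size_t bytes)
{
	const struct seg *seg = *segp;
	size_t base = *basep;

	do {
		size_t copy = seg->len - base;

		if (copy > bytes)
			copy = bytes;
		bytes -= copy;
		base += copy;
		if (seg->len == base) {
			seg++;
			base = 0;
		}
	} while (bytes);
	*segp = seg;
	*basep = base;
}

/*
 * Demo helper: start at the first of the segments {0, 8, 4} and return the
 * index of the segment the cursor ends on. The last array entry is a
 * sentinel so the cursor may step past the final real segment.
 */
static size_t demo_end_index(int use_do_while, size_t bytes)
{
	static const struct seg segs[] = { { 0 }, { 8 }, { 4 }, { 0 } };
	const struct seg *cur = segs;
	size_t base = 0;

	assert(bytes <= 12);	/* total length of the demo segments */
	if (use_do_while)
		advance_do_while(&cur, &base, bytes);
	else
		advance_while(&cur, &base, bytes);
	return (size_t)(cur - segs);
}
```

In this sketch only the do/while form steps over a leading zero-length segment when asked to advance by zero bytes; for non-zero byte counts both forms end on the same segment.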
[patch 1/10] fs: libfs buffered write leak fix
simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
Fix the former, make a note of the latter. Several other filesystems seem
to be iffy here, too.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
 int simple_prepare_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
-	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE) {
-			void *kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr, 0, from);
-			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
+	if (PageUptodate(page))
+		return 0;
+
+	if (to - from != PAGE_CACHE_SIZE) {
+		clear_highpage(page);
+		flush_dcache_page(page);
 		SetPageUptodate(page);
 	}
+
 	return 0;
 }
 
 int simple_commit_write(struct file *file, struct page *page,
-			unsigned offset, unsigned to)
+			unsigned from, unsigned to)
 {
-	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold the i_mutex.
-	 */
-	if (pos > inode->i_size)
-		i_size_write(inode, pos);
-	set_page_dirty(page);
+	if (to > from) {
+		struct inode *inode = page->mapping->host;
+		loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+		if (to - from == PAGE_CACHE_SIZE)
+			SetPageUptodate(page);
+		/*
+		 * No need to use i_size_read() here, the i_size
+		 * cannot change under us because we hold the i_mutex.
+		 */
+		if (pos > inode->i_size)
+			i_size_write(inode, pos);
+		set_page_dirty(page);
+	}
 	return 0;
 }
 
Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
 
 	if (is_mapped_to_disk)
 		SetPageMappedToDisk(page);
+
+	/* XXX: information leak vs read(2) */
 	SetPageUptodate(page);
 
 	/*
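
As a rough illustration of the patched simple_prepare_write() logic, the following userspace sketch models a page as a byte buffer with an "uptodate" flag. The names `struct sketch_page`, `sketch_prepare_write()` and `demo_stale_bytes()` are hypothetical, and the 0xAA fill merely plays the role of stale kernel data; this is an assumption-laden model, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed stand-in for PAGE_CACHE_SIZE */

struct sketch_page {		/* hypothetical, not the kernel's struct page */
	bool uptodate;
	unsigned char data[SKETCH_PAGE_SIZE];
};

/*
 * Follows the shape of the patched function: for a partial write to a page
 * that is not up to date, clear the whole page before marking it uptodate,
 * so no old contents can become visible to a later read.
 */
static int sketch_prepare_write(struct sketch_page *page,
				unsigned from, unsigned to)
{
	assert(from <= to && to <= SKETCH_PAGE_SIZE);

	if (page->uptodate)
		return 0;

	if (to - from != SKETCH_PAGE_SIZE) {
		memset(page->data, 0, sizeof(page->data));
		page->uptodate = true;
	}

	return 0;
}

/* Demo helper: count stale bytes left in the page after prepare. */
static size_t demo_stale_bytes(unsigned from, unsigned to)
{
	static struct sketch_page page;
	size_t i, stale = 0;

	memset(page.data, 0xAA, sizeof(page.data));	/* pretend old data */
	page.uptodate = false;
	sketch_prepare_write(&page, from, to);

	for (i = 0; i < sizeof(page.data); i++)
		if (page.data[i] == 0xAA)
			stale++;
	return stale;
}
```

A full-page write skips the clearing because every byte is about to be overwritten; in the patch, that case is marked up to date later, in simple_commit_write().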
[patch 0/10] buffered write deadlock fix
The following set of patches attempts to fix the buffered write locking
problems (and there are a couple of peripheral patches and cleanups there
too). This does pass the write deadlock tests that otherwise fail. Has
survived a few hours of fsx-linux on ext2 and ext3.

Patches against 2.6.20-rc4. I didn't have the heart to attempt to rebase
them on -mm, at least until I get some feedback ;)

Thanks,
Nick

-- 
SuSE Labs
Re: 2.6.20-rc3 regression: suspend to RAM broken on Mac mini Core Duo
On Fri, Jan 12, 2007 at 14:50:25 +, Pavel Machek wrote:
> Hi!
>
> > > >> > It didn't. It looks like it is unusable, because it isn't reliable in
> > > >> > 2.6.20-rc3.
> > > >>
> > > >> Is this issue still present in -rc4?
> > > >
> > > >I used 2.6.20-rc4 in single user mode, and applied 2 patches from
> > > >netdev to get wake on LAN support. This way I was able to set up an
> > > >automatic suspend/resume loop. It looked good, but after e.g. 20
> > > >minutes, the resume hung. So it is reproducible with 2.6.20-rc4.
> > > >Unfortunately, I can not test the same with 2.6.18, as the wake on LAN
> > > >patches need 2.6.20-rc.
> > >
> > > Hmm, do you mean this is the first time of this kind of testing?
> > > Is this issue related to LAN driver?
> > > I guess you should be able to set up an automatic suspend/resume loop
> > > with /proc/acpi/alarm, and test similar with 2.6.18.
> >
> > Thanks for the hint. I just used /proc/acpi/alarm to set up a
> > suspend/resume loop and did ca. 100 cycles in a row with 2.6.18.2 in
> > single user mode, without a failure.
>
> Can you do similar test on 2.6.20 -- w/o network driver loaded (and
> generally minimum drivers?)

I think I found the problem. In 2.6.18, I had a slightly different
config. With 2.6.20-rc4, I had successful suspend/resume cycles without
the USB DVB-T box attached. I tweaked the USB options a bit and
activated some options (CONFIG_USB_SUSPEND, CONFIG_USB_MULTITHREAD_PROBE,
CONFIG_USB_EHCI_SPLIT_ISO, CONFIG_USB_EHCI_ROOT_HUB_TT,
CONFIG_USB_EHCI_TT_NEWSCHED) and now I can suspend/resume without hangs.
At least I haven't seen one until now.

Thanks for your patience and regards,
Tino
Re: Linux v2.6.20-rc5
Hi.

On Fri, 2007-01-12 at 14:26 -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 14:27:48 -0500 (EST)
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> >
> > Ok, there it is, in all its shining glory.
> >
>
> It still doesn't run Excel.

Heretic! :)
Re: "svc: unknown version (3)" when CONFIG_NFSD_V4=y
On Sat, Jan 13, 2007 at 06:43:07AM +1100, Neil Brown wrote:
>
> Ok, thanks. I must have missed something else wrong in the code..
>
> Probably this 'break' in the wrong place...
>
> Could you try this patch instead please - or just move the 'break' to
> where it should be.

Now it worked :)

Thanks,
Wu

> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
>
> ### Diffstat output
>  ./fs/nfsd/nfssvc.c |    8
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff .prev/fs/nfsd/nfssvc.c ./fs/nfsd/nfssvc.c
> --- .prev/fs/nfsd/nfssvc.c	2007-01-11 14:55:38.0 +1100
> +++ ./fs/nfsd/nfssvc.c	2007-01-13 06:40:12.0 +1100
> @@ -72,7 +72,7 @@ static struct svc_program nfsd_acl_progr
>  	.pg_prog		= NFS_ACL_PROGRAM,
>  	.pg_nvers		= NFSD_ACL_NRVERS,
>  	.pg_vers		= nfsd_acl_versions,
> -	.pg_name		= "nfsd",
> +	.pg_name		= "nfsacl",
>  	.pg_class		= "nfsd",
>  	.pg_stats		= &nfsd_acl_svcstats,
>  	.pg_authenticate	= &svc_set_client,
> @@ -118,16 +118,16 @@ int nfsd_vers(int vers, enum vers_op cha
>  	switch(change) {
>  	case NFSD_SET:
>  		nfsd_versions[vers] = nfsd_version[vers];
> -		break;
>  #if defined(CONFIG_NFSD_V2_ACL) || defined(CONFIG_NFSD_V3_ACL)
>  		if (vers < NFSD_ACL_NRVERS)
> -			nfsd_acl_version[vers] = nfsd_acl_version[vers];
> +			nfsd_acl_versions[vers] = nfsd_acl_version[vers];
>  #endif
> +		break;
>  	case NFSD_CLEAR:
>  		nfsd_versions[vers] = NULL;
>  #if defined(CONFIG_NFSD_V2_ACL) || defined(CONFIG_NFSD_V3_ACL)
>  		if (vers < NFSD_ACL_NRVERS)
> -			nfsd_acl_version[vers] = NULL;
> +			nfsd_acl_versions[vers] = NULL;
>  #endif
>  		break;
>  	case NFSD_TEST:
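
The misplaced 'break' that Neil mentions is the kind of bug that compiles silently, so a tiny userspace sketch may help. The tables, the enum and the function names below are hypothetical illustrations and do not reflect the real nfsd data structures.

```c
#include <assert.h>

enum sketch_op { SKETCH_SET, SKETCH_CLEAR };

#define SKETCH_NRVERS     4
#define SKETCH_ACL_NRVERS 4

struct sketch_tables {		/* hypothetical per-version enable flags */
	int main_vers[SKETCH_NRVERS];
	int acl_vers[SKETCH_ACL_NRVERS];
};

/* 'break' placed too early: the ACL update is never executed. */
static void vers_buggy(struct sketch_tables *t, int vers, enum sketch_op op)
{
	assert(vers >= 0 && vers < SKETCH_NRVERS);
	switch (op) {
	case SKETCH_SET:
		t->main_vers[vers] = 1;
		break;
		if (vers < SKETCH_ACL_NRVERS)	/* dead code after the break */
			t->acl_vers[vers] = 1;
		break;
	case SKETCH_CLEAR:
		t->main_vers[vers] = 0;
		if (vers < SKETCH_ACL_NRVERS)
			t->acl_vers[vers] = 0;
		break;
	}
}

/* 'break' moved after the ACL update, as in the patch above. */
static void vers_fixed(struct sketch_tables *t, int vers, enum sketch_op op)
{
	assert(vers >= 0 && vers < SKETCH_NRVERS);
	switch (op) {
	case SKETCH_SET:
		t->main_vers[vers] = 1;
		if (vers < SKETCH_ACL_NRVERS)
			t->acl_vers[vers] = 1;
		break;
	case SKETCH_CLEAR:
		t->main_vers[vers] = 0;
		if (vers < SKETCH_ACL_NRVERS)
			t->acl_vers[vers] = 0;
		break;
	}
}

/* Demo helper: is the ACL entry for version 3 enabled after a SET? */
static int demo_acl_after_set(int use_fixed)
{
	struct sketch_tables t = { { 0 }, { 0 } };

	if (use_fixed)
		vers_fixed(&t, 3, SKETCH_SET);
	else
		vers_buggy(&t, 3, SKETCH_SET);
	return t.acl_vers[3];
}
```

In the buggy variant the main table is updated but the ACL table is not, which in this sketch mirrors a version being enabled for one program and left unknown to the other.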
Re: Proposed changes for libata speed handling
Alan wrote:
> I'm currently hacking on the speed handling code a bit
>
> I'd like to do the following unless anyone has any objections
>
> - Remove post_set_mode and make drivers wrap the guts of the existing
> set_mode() function. This allows a driver to wrap and see success/failure
> while removing a callback, and also to add pre-mode code. (ie you'd do
>
> foo_set_mode() {
>	ata_default_set_mode()
>	my_fiddling();
> }
>
> - Fix the ->set_mode method FIXMEs in the current tree [DONE]
>
> - Add set_specific_mode, with a default behaviour that works for most
> controllers. Those using a private ->set_mode might need a private
> ->set_specific_mode, in some cases like it8212 simply to error the request
>
> - Hook set_specific_mode to the ata command parser so that instead of
> erroring set_features commands we snoop them and force the mode change
> desired on the controller (if valid)
>
> - Send the command to set the speed before setting the controller speed,
> so that we send them at the right rate.
>
> Any comments ?

Wouldn't it be better to have ->determine_xfer_mask() and
->set_specific_mode() than having two somewhat overlapping callbacks?
Or is there some problem that can't be handled that way?

Thanks.

-- 
tejun
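
The "driver wraps the default set_mode" idea from Alan's first bullet can be sketched in plain C with a function-pointer table. This is only a model of the pattern: `struct sketch_port`, `sketch_default_set_mode()`, `sketch_do_set_mode()` and the flags are hypothetical and are not the libata API.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_port;

struct sketch_port_ops {		/* hypothetical driver callback table */
	int (*set_mode)(struct sketch_port *port);
};

struct sketch_port {
	const struct sketch_port_ops *ops;
	int mode_set;			/* set by the default routine */
	int fiddled;			/* set by driver-specific code */
};

/* Stand-in for the generic mode-setting routine. */
static int sketch_default_set_mode(struct sketch_port *port)
{
	assert(port != NULL);
	port->mode_set = 1;
	return 0;
}

/*
 * Driver wrapper: call the default, see its success or failure, then do
 * the work that previously lived in a separate post_set_mode callback.
 */
static int foo_set_mode(struct sketch_port *port)
{
	int rc = sketch_default_set_mode(port);

	if (rc)
		return rc;
	port->fiddled = 1;		/* the "my_fiddling()" step */
	return 0;
}

static const struct sketch_port_ops foo_ops = { foo_set_mode };

/* Core side: use the driver's set_mode if present, else the default. */
static int sketch_do_set_mode(struct sketch_port *port)
{
	if (port->ops && port->ops->set_mode)
		return port->ops->set_mode(port);
	return sketch_default_set_mode(port);
}

/* Demo helper: bit 0 = mode set, bit 1 = driver fiddling ran. */
static int demo_flags(int use_wrapper)
{
	struct sketch_port port = { NULL, 0, 0 };

	if (use_wrapper)
		port.ops = &foo_ops;
	if (sketch_do_set_mode(&port))
		return -1;
	return port.mode_set | (port.fiddled << 1);
}
```

The wrapper owns the ordering, so pre-mode code could be added before the call to the default routine in the same way.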
Re: 2.6.20-rc3-mm1 - git-block.patch causes hard lockups
On , [EMAIL PROTECTED] said:
> On Thu, 04 Jan 2007 22:02:00 PST, Andrew Morton said:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc3/2.6.20-rc3-mm1/

Still seeing this in -rc4-mm1..

> With git-block.patch applied, my system locks up *hard* at system shutdown
> time - even alt-sysrq doesn't do anything. Need to do the "power button for 5"
> stunt to get the system back.

And today's chef's special is wild crow, tastefully prepared in a stir-fry
with broccoli and mixed asiatic vegetables, served on a bed of steamed rice...

It doesn't look quite as locked up hard when you fix the ##$*&% script that did
an 'echo 0 > /proc/sys/kernel/sysrq' :) Here's the hand-copied traceback:

__mutex_lock_slowpath+0x22/0xaa
mutex_lock+0xe/0x10
synchronize_rcu+0x23/0xc5
blk_sync_queue+0x1d/0x5a
blk_release_queue+0x19/0x65
kobject_cleanup+0x53/0x72
kobject_release+0x0/0xf
kobject_release+0xd/0xf
kref_put+0x5f/0x6b
kobject_put+0x19/0x1b
blk_put_queue+0x43/0x48
dm_put+0x11f/0x133
dev_remove+0xa3/0xb7
ctl_ioctl+0x24f/0x29f
dev_remove+0x0/0xb7
file_has_perm+0xa7/0xb6
do_ioctl+0x5e/0x77
vfs_ioctl+0x252/0x26f
sys_ioctl+0x5f/0x82
tracesys+0xdc/0xe1

> The system is Fedora Core 6/Rawhide, and the last command issued (from
> /etc/rc6.d/S01reboot) is "/sbin/cryptsetup remove swap". It hits that,
> and *wham* we're dead. Works fine if I revert git-block.patch.
>
> The line from /etc/crypttab for the encrypted swap:
>
> swap /dev/mapper/VolGroup00-swap /dev/urandom swap,cipher=aes-cbc-essiv:sha256
Re: /sys/$DEVPATH/uevent vs uevent attributes
Greg KH wrote:
> On Fri, Jan 12, 2007 at 10:32:10PM +0300, Michael Tokarev wrote:
>> (No patch at this time, -- just asking about an.. idea ;)
>
> Let's see what such a patch looks like to see if it would be workable or
> not.

Umm.. it's definitely workable, and even almost trivial. Just splitting
the kobject_uevent() routine into two parts, one to format the environment
variables, and one to actually send things over netlink and execute
the hotplug_helper if defined, and using the first part to format the
content of the `uevent' file will do the trick.

I don't know how to do the last part.

> And no one forces you to use udev, I have machines with a static /dev
> that work just fine :)

It has less and less chance of working correctly. For example, this dynamic
sdX thing, when I don't know anymore which sdX is which, without some
help from /dev/disk/by-XXX/.

And more and more software requires udev, at least as packaged by distros.
For example, today I've got rid of udev on one of our servers, which had
been installed (debian) due to xen-utils having Depends: udev. Even when it
doesn't *really* *require* udev, -- I replaced the whole thing with a 5-line
shell script.

/mjt
Re: SATA hotplug from the user side ?
Soeren Sonnenburg wrote:
> It is true it detects a removal and newly plugged devices immediately...
> However it still prints warnings and errors that it could not
> synchronize SCSI cache for the disks. Then it prints regular 'rejects
> I/O to dead device' warning messages and on replugging the disks puts
> them to the next free sd device (e.g. sdc -> sdd).

You need to stop using the devices before unplugging. If you have no
pending IO to the device, there won't be 'rejects IO to dead device'
messages. You can ignore the SCSI cache sync failure if the device is
properly closed before being unplugged.

> These messages sound evil - so now the question is should I care ?
> ( On the other hand it did not crash the machine )

So, no, you don't really have to care. Just make sure the device is
unmounted prior to unplugging.

-- 
tejun
Re: [patch] Fix bttv and friends on 64bit machines with lots of memory.
On Friday, 12.01.2007, at 22:42 -0200, Mauro Carvalho Chehab wrote:
> On Thu, 2007-01-11 at 00:41 +0100, hermann pitton wrote:
> > On Wednesday, 10.01.2007, at 09:58 +0100, Gerd Hoffmann wrote:
> > > Hi,
> > >
> > > We have a DMA32 zone now, let's use it to make sure the card
> > > can reach the memory we have allocated for the video frame
> > > buffers.
> > >
> > > please apply,
> > >
> > > Gerd
> >
> > Hi,
> >
> > did anybody already pick up, comment, review Gerd's patch ?
> >
> > Walks into his own home like a stranger ...
> >
> > Gerd, THANKS for all you did.
> > It was an incredible lot!
>
> Hermann,
>
> I just picked it today. I was out this week due to a physical damage at
> the hd on my notebook, where my mailboxes are retrieved. Only today I
> have it on a stable condition to return back to activities, successfully
> recovering my /home on it.

Mauro, Gerd,

sorry to be a pain with this one, just thought it could be a case of
missing each other.

Our maintainers don't need to excuse themselves for anything!

Adrian and all, thanks for fixing the remaining bugs.

Cheers,
Hermann
Re: [PATCH 0/4] Linux Kernel Markers
Mathieu Desnoyers <[EMAIL PROTECTED]> wrote on 20/12/2006 23:52:16:
> Hi,
>
> You will find, in the following posts, the latest revision of the Linux Kernel
> Markers. Due to the need some tracing projects (LTTng, SystemTAP) have of this
> kind of mechanism, it could be nice to consider it for mainstream inclusion.
>
> The following patches apply on 2.6.20-rc1-git7.
>
> Signed-off-by : Mathieu Desnoyers <[EMAIL PROTECTED]>

Mathieu, FWIW I like this idea. A few years ago I implemented something
similar, but that had no explicit clients. Consequently I made my hooks
code more generalized than is needed in practice. I do remember that Karim
reworked the LTT instrumentation to use hooks and it worked fine.

You've got the same optimizations for x86 by modifying an instruction's
immediate operand and thus avoiding a d-cache hit. The only real caveat is
the need to avoid the unsynchronised cross modification erratum, which
means that all processors will need to issue a serializing operation before
executing a Marker whose state is changed. How is that handled?

One additional thing we did, which might be useful at some future point,
was adding a /proc interface. We reflected the current instrumentation
through /proc and gave the status of each hook. We even talked about being
able to enable or disable instrumentation by writing to /proc but I don't
think we ever implemented this.

It's high time we settled the issue of instrumentation. It gets my vote,

Good luck!

Richard

- -
Richard J Moore
IBM Linux Technology Centre
Re: ahci_softreset prevents acpi_power_off
Hello,

Faik Uygur wrote:
> We have a Sony PCG-6H1M laptop. It started failing to poweroff with our switch
> from 2.6.16 stable series kernels to 2.6.18 stable series. Rebooting works.
>
> While searching for the cause, I have found these reported bug reports in the
> kernel bugzilla which may be related to this bug:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=6982
> http://bugzilla.kernel.org/show_bug.cgi?id=7447

Seems mostly unrelated.

> According to git bisect, this is the first bad commit:
>
> 4658f79bec0b51222e769e328c2923f39f3bda77 is first bad commit
> commit 4658f79bec0b51222e769e328c2923f39f3bda77
> Author: Tejun Heo <[EMAIL PROTECTED]>
> Date: Wed Mar 22 21:07:03 2006 +0900
>
> [PATCH] ahci: add softreset
>
> Now that libata is smart enough to handle both soft and hard resets,
> add softreset method.
>
> Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
> Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
>
> :04 04 ba0a16d0ef82b6577bb61cfb18e6d9df9ee0984e
> d0fc78d8f9bbe238f98ac8964562a33e64b30605 M drivers
>
> With v2.6.20-rc4 from git, it is still failing to poweroff. By not compiling
> CONFIG_SCSI_SATA_AHCI, it successfully powers off.
>
> Also with CONFIG_SCSI_SATA_AHCI, reverting this patch manually by setting
> softreset to NULL in ata_do_eh calls in ahci.c makes the machine poweroff.

Wow, this is one of the most amazing error reports. ahci softreset
preventing system halt?

> I have attached the dmesg output with defined ATA_DEBUG, ATA_VERBOSE_DEBUG
> if it helps. Also you may find lspci output attached.
>
> Please let me know if anything else is needed.

Does everything else work okay? Can you access devices attached to
ahci? What happens when you try to shutdown? If possible, please post
dmesg of shutting down. You can store it easily using netconsole
(Documentation/networking/netconsole.txt).

Thanks.

-- 
tejun
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, 12 Jan 2007 17:00:39 -0800
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

> But is
> lru_lock an issue is another question.

I doubt it, although there might be changes we can make in there to
work around it.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 01:45:43PM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> Moreover most atomic operations are to remote memory which is also
> increasing the problem by making the atomic ops take longer. Typically
> mature NUMA systems have implemented hardware provisions that can deal with
> such high degrees of contention. If this is simply a SMP system that was
> turned into a NUMA box then this is a new hardware scenario for the
> engineers.

This is using HT as all AMD systems do, but this is one of the 8 socket
systems.

I ran the same test on a 2 node Tyan AMD box, and did not notice the
atrocious spin times. It would be interesting to see how a 4 socket HT box
would fare. Unfortunately, I do not have access to one. If someone has
access to such a box, I can provide the test case and instrumentation
patches.

It could very well be the hardware limitation in this case, which means,
all the more reason to enable interrupts with spin locks while spinning. But is
lru_lock an issue is another question.

Thanks,
Kiran
Re: PROBLEM: sata_sil24 lockups under heavy i/o
Mark Wagner wrote:
> The sil24-connected sata drives are external and connected to their own
> power supply.
>
> I've replaced the sil24-based card with a Promise SATA300 TX4 controller
> card and everything seems to work now.

Hmmm... sil24 fares well with four ports occupied. Weird. Care to give
it another shot? Maybe pci bus contact was bad or something.

-- 
tejun
Re: [BUG] NMI watchdog lockups caused by mwait_idle
Pallipadi, Venkatesh wrote:
> Darrick,
>
> I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked
> fine. No watchdog lockups.
> Can you try idle routine with hlt instead of mwait. There is no boot
> option for this in x86_64, but you can change
> arch/x86_64/kernel/process.c:select_idle_routine() not to enable mwait.
> With that default kernel should use hlt based idle.
>
> Also, worth seeing will be, what happens when nmi_watchdog=0,
> nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us
> whether nmi_watchdog is raising some false alarm or the CPUs are indeed
> getting locked up here..
>

Locks up with hlt-based idle too. :(

Here's what I get with nmi_watchdog=0:

[ 206.088703] BUG: soft lockup detected on CPU#0!
[ 206.093284]
[ 206.093286] Call Trace:
[ 206.097324][] softlockup_tick+0xd4/0xe9
[ 206.103618] [] do_flush_tlb_all+0x0/0x68
[ 206.109238] [] run_local_timers+0x13/0x15
[ 206.114949] [] update_process_times+0x4c/0x78
[ 206.121008] [] smp_local_timer_interrupt+0x34/0x51
[ 206.127498] [] smp_apic_timer_interrupt+0x49/0x60
[ 206.133901] [] apic_timer_interrupt+0x66/0x70
[ 206.139956][] __smp_call_function+0x66/0x87
[ 206.146594] [] __smp_call_function+0x62/0x87
[ 206.152564] [] do_flush_tlb_all+0x0/0x68
[ 206.158188] [] do_flush_tlb_all+0x0/0x68
[ 206.163813] [] smp_call_function+0x32/0x49
[ 206.169611] [] do_flush_tlb_all+0x0/0x68
[ 206.175236] [] on_each_cpu+0x30/0x67
[ 206.180514] [] flush_tlb_all+0x1c/0x1e
[ 206.185965] [] unmap_vm_area+0x1c3/0x265
[ 206.191590] [] init_level4_pgt+0xc20/0x1000
[ 206.197474] [] remove_vm_area+0x41/0x67
[ 206.203010] [] iounmap+0x8e/0xc8
[ 206.207933] [] acpi_os_unmap_memory+0x9/0xb
[ 206.213810] [] acpi_ev_system_memory_region_setup+0x52/0x105
[ 206.221174] [] acpi_ut_delete_internal_obj+0x2c4/0x3b2
[ 206.228012] [] acpi_ut_update_ref_count+0x180/0x1d2
[ 206.234587] [] acpi_ut_update_object_reference+0x160/0x207
[ 206.241770] [] acpi_ut_remove_reference+0xb5/0xd5
[ 206.248173] [] acpi_ns_detach_object+0xca/0xee
[ 206.254318] [] acpi_ns_delete_namespace_by_owner+0xcf/0x154
[ 206.261597] [] acpi_ds_terminate_control_method+0xb5/0x14f
[ 206.268779] [] acpi_ps_parse_aml+0x242/0x3a0
[ 206.274750] [] acpi_ps_execute_pass+0xd5/0x10b
[ 206.280895] [] acpi_ps_execute_method+0x1bf/0x2cb
[ 206.287298] [] acpi_ns_evaluate+0x1f8/0x315
[ 206.293180] [] acpi_evaluate_object+0x1d9/0x2fa
[ 206.299411] [] kmem_cache_alloc+0xce/0xda
[ 206.305125] [] :processor:acpi_processor_start+0x656/0x6fd
[ 206.312307] [] kmem_cache_zalloc+0xce/0xf4
[ 206.318103] [] acpi_start_single_object+0x2a/0x54
[ 206.324509] [] acpi_bus_register_driver+0xcd/0x14c
[ 206.331001] [] :processor:acpi_processor_init+0x61/0xb7
[ 206.337923] [] sys_init_module+0xac/0x16c
[ 206.343630] [] system_call+0x7e/0x83

nmi_watchdog={1,2} produce the same errors.

--D
Re: /sys/$DEVPATH/uevent vs uevent attributes
On Fri, Jan 12, 2007 at 10:32:10PM +0300, Michael Tokarev wrote:
>
> (No patch at this time, -- just asking about an.. idea ;)

Let's see what such a patch looks like to see if it would be workable or
not.

And no one forces you to use udev, I have machines with a static /dev
that work just fine :)

thanks,

greg k-h
Re: Early ACPI lockup (was Re: 2.6.20-rc4-mm1)
On Sat, Jan 13, 2007 at 01:08:46AM +0100, Michal Piotrowski wrote:
> Jiri Slaby wrote:
> > Frederik Deweerdt wrote:
> >> On Fri, Jan 12, 2007 at 05:53:08PM -0500, Len Brown wrote:
> >>> On Friday 12 January 2007 05:20, Frederik Deweerdt wrote:
> On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc3/2.6.20-rc4-mm1/
> >
> Hi,
>
> The git-acpi.patch replaces earlier "if(!handler) return -EINVAL" by
> "BUG_ON(!handler)". This locks my machine early at boot with a message
> along the lines of (It's hand copied):
> Int 6: cr2: eip: c0570e05 flags: 00010046 cs: 60
> stack: c054ffac c011db2b c04936d0 c054ff68 c054ffc0 c054fff4 c057da2c
>
> Reverting the change as follows, allows booting:
> Any ideas to debug this further?
> diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> index db0c5f6..fba018c 100644
> --- a/drivers/acpi/tables.c
> +++ b/drivers/acpi/tables.c
> @@ -414,7 +414,9 @@ int __init acpi_table_parse(enum acpi_ta
> unsigned int index;
> unsigned int count = 0;
>
> -BUG_ON(!handler);
> +if (!handler)
> +return -EINVAL;
> +/*BUG_ON(!handler);*/
>
> for (i = 0; i < sdt_count; i++) {
> if (sdt_entry[i].id != id)
> >>> What do you see if on failure you also print out the params, like below?
> >
> > I get this:
> >
> > ACPI: RSDP (v000 GBT ) @ 0x000f6e80
> > ACPI: RSDT (v001 GBTAWRDACPI 0x42302e31 AWRD 0x01010101) @ 0x3fff3000
> > ACPI: FADT (v001 GBTAWRDACPI 0x42302e31 AWRD 0x01010101) @ 0x3fff3040
> > ACPI: MADT (v001 GBTAWRDACPI 0x42302e31 AWRD 0x01010101) @ 0x3fff7100
> > ACPI: DSDT (v001 GBTAWRDACPI 0x1000 MSFT 0x010c) @ 0x
> > ACPI: PM-Timer IO Port: 0x1008
> > ACPI: Local APIC address 0xfee0
> > ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
> > Processor #0 15:2 APIC version 20
> > ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
> > Processor #1 15:2 APIC version 20
> > ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
> > ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
> > ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
> > IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-23
> > ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> > ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> > ACPI: IRQ0 used by override.
> > ACPI: IRQ2 used by override.
> > ACPI: IRQ9 used by override.
> > Enabling APIC mode: Flat. Using 1 I/O APICs
> > ACPI: acpi_table_parse(17, ) HPET NULL handler!
> > Using ACPI (MADT) for SMP configuration information
> >
> > ACPI: acpi_table_parse(17, ) HPET NULL handler!

So the BUG_ON is triggered by CONFIG_HPET_TIMER not being defined,
causing acpi_parse_hpet to be NULL.
Should the acpi_table_parse() call be ifdef'ed, or is the previous
behaviour (returning -EINVAL) just OK?

Regards,
Frederik
Re: Early ACPI lockup (was Re: 2.6.20-rc4-mm1)
Jiri Slaby wrote:
>> On Fri, Jan 12, 2007 at 05:53:08PM -0500, Len Brown wrote:
>>> What do you see if on failure you also print out the params, like below?
[...]
> ACPI: acpi_table_parse(17, ) HPET NULL handler!

After re-enabling HPET, it disappeared.

regards,
--
http://www.fi.muni.cz/~xslaby/            Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
Re: [patch] Fix bttv and friends on 64bit machines with lots of memory.
On Thu, 2007-01-11 at 00:41 +0100, hermann pitton wrote:
> On Wednesday, 10.01.2007, at 09:58 +0100, Gerd Hoffmann wrote:
> > Hi,
> >
> > We have a DMA32 zone now, lets use it to make sure the card
> > can reach the memory we have allocated for the video frame
> > buffers.
> >
> > please apply,
> >
> > Gerd
>
> Hi,
>
> did anybody already pick up, comment, review Gerd's patch ?
>
> Walks in into his own home like a stranger ...
>
> Gerd, THANKS for all you did.
> It was a incredible lot!

Hermann,

I just picked it up today. I was out this week due to physical damage to the hard disk of my notebook, where my mailboxes are retrieved. Only today did I get it into a stable enough condition to return to work, after successfully recovering my /home on it.
Re: list_del corruption with fedora 6 kernels (fc5 was ok)
On Fri, Jan 12, 2007 at 07:27:30PM -0500, Lee Revell wrote:
> On Sat, 2007-01-13 at 00:34 +0100, Karl Kiniger wrote:
> > how to track this down?
>
> Reproduce it with an untainted kernel (no nvidia or vmware modules) and
> repost.

How about big fat advice in every tainted oops to bugger off?
Re: list_del corruption with fedora 6 kernels (fc5 was ok)
On Sat, 2007-01-13 at 00:34 +0100, Karl Kiniger wrote:
> how to track this down?

Reproduce it with an untainted kernel (no nvidia or vmware modules) and repost.

Lee
RE: [BUG] NMI watchdog lockups caused by mwait_idle
Darrick,

I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked fine. No watchdog lockups.

Can you try the idle routine with hlt instead of mwait? There is no boot option for this on x86_64, but you can change arch/x86_64/kernel/process.c:select_idle_routine() so that it does not enable mwait. With that change, the default kernel should use the hlt-based idle.

It would also be worth seeing what happens with the nmi_watchdog=0, nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us whether nmi_watchdog is raising a false alarm or the CPUs are indeed getting locked up here.

Thanks,
Venki

>-----Original Message-----
>From: Darrick J. Wong [mailto:[EMAIL PROTECTED]
>Sent: Friday, January 12, 2007 1:01 PM
>To: Pallipadi, Venkatesh
>Cc: Linux Kernel Mailing List
>Subject: [BUG] NMI watchdog lockups caused by mwait_idle
>
>Hi Venkatesh,
>
>I have an IBM IntelliStation Z30 with two Dempsey CPUs. When I try to
>boot 2.6.20-rc4 on it, the system prints messages about NMI watchdog
>lockups. git-bisect determined that the patch "[PATCH] x86-64: Fix
>interrupt race in idle callback (3rd try)" was the source of these
>problems, and I can work around the problem either by passing
>"idle=poll" to get avoid mwait_idle or by reverting the patch.
>
>Other non-Dempsey Xeon machines with mwait support do not exhibit these
>symptoms. I will try to determine if this is a bug specific to Dempsey
>CPUs or this particular type of machine. I suspect the latter, but I
>don't know enough about monitor/mwait to pursue this much further.
>
>What else can I do to diagnose this?
>
>--D
Re: Early ACPI lockup (was Re: 2.6.20-rc4-mm1)
Jiri Slaby napisał(a): > Frederik Deweerdt wrote: >> On Fri, Jan 12, 2007 at 05:53:08PM -0500, Len Brown wrote: >>> On Friday 12 January 2007 05:20, Frederik Deweerdt wrote: On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote: > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc3/2.6.20-rc4-mm1/ > Hi, The git-acpi.patch replaces earlier "if(!handler) return -EINVAL" by "BUG_ON(!handler)". This locks my machine early at boot with a message along the lines of (It's hand copied): Int 6: cr2: eip: c0570e05 flags: 00010046 cs: 60 stack: c054ffac c011db2b c04936d0 c054ff68 c054ffc0 c054fff4 c057da2c Reverting the change as follows, allows booting: Any ideas to debug this further? diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index db0c5f6..fba018c 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -414,7 +414,9 @@ int __init acpi_table_parse(enum acpi_ta unsigned int index; unsigned int count = 0; - BUG_ON(!handler); + if (!handler) + return -EINVAL; + /*BUG_ON(!handler);*/ for (i = 0; i < sdt_count; i++) { if (sdt_entry[i].id != id) >>> What do you see if on failure you also print out the params, like below? 
> > I get this: > > ACPI: RSDP (v000 GBT ) @ 0x000f6e80 > ACPI: RSDT (v001 GBTAWRDACPI 0x42302e31 AWRD 0x01010101) @ 0x3fff3000 > ACPI: FADT (v001 GBTAWRDACPI 0x42302e31 AWRD 0x01010101) @ 0x3fff3040 > ACPI: MADT (v001 GBTAWRDACPI 0x42302e31 AWRD 0x01010101) @ 0x3fff7100 > ACPI: DSDT (v001 GBTAWRDACPI 0x1000 MSFT 0x010c) @ 0x > ACPI: PM-Timer IO Port: 0x1008 > ACPI: Local APIC address 0xfee0 > ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) > Processor #0 15:2 APIC version 20 > ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled) > Processor #1 15:2 APIC version 20 > ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1]) > ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1]) > ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0]) > IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-23 > ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) > ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) > ACPI: IRQ0 used by override. > ACPI: IRQ2 used by override. > ACPI: IRQ9 used by override. > Enabling APIC mode: Flat. Using 1 I/O APICs > ACPI: acpi_table_parse(17, ) HPET NULL handler! > Using ACPI (MADT) for SMP configuration information > ACPI: RSDP (v000 ACPIAM) @ 0x000f9e30 ACPI: RSDT (v001 A M I OEMRSDT 0x1414 MSFT 0x0097) @ 0x7ff3 ACPI: FADT (v002 A M I OEMFACP 0x1414 MSFT 0x0097) @ 0x7ff30200 ACPI: MADT (v001 A M I OEMAPIC 0x1414 MSFT 0x0097) @ 0x7ff30390 ACPI: OEMB (v001 A M I OEMBIOS 0x1414 MSFT 0x0097) @ 0x7ff40040 ACPI: DSDT (v001 P4P81 P4P81104 0x0104 INTL 0x02002026) @ 0x ACPI: PM-Timer IO Port: 0x808 ACPI: Local APIC address 0xfee0 ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled) Processor #0 15:2 APIC version 20 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled) Processor #1 15:2 APIC version 20 ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0]) IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-23 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: IRQ0 used by override. 
ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. Enabling APIC mode: Flat. Using 1 I/O APICs ACPI: acpi_table_parse(17, ) HPET NULL handler! Using ACPI (MADT) for SMP configuration information Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (http://www.stardust.webpages.pl/ltg/)
list_del corruption with fedora 6 kernels (fc5 was ok)
Hi, these trigger about 1-2 times per week at random times. I dont see a pattern, one time it happened after plugging in the USB headphone, another time it happened while the machine was more or less idle. machine does not reboot automatically ( /proc/sys/kernel/panic is set to 20) most of the time the panic does not make it into the syslog but I have been lucky three times. how to track this down? Greetings, Karl NB: the v4l/bt848 stuff is not being used at all. /var/log/messages.3:Dec 19 11:58:38 wszip-kinigka kernel: list_del corruption. next->prev should be c6e1f2c0, but was 35b0 /var/log/messages.4:Dec 13 15:57:07 wszip-kinigka kernel: list_del corruption. next->prev should be c9a24be0, but was c1284fe0 (backtraces are essentially the same) from today: Jan 12 10:57:23 wszip-kinigka kernel: list_del corruption. prev->next should be ea24aa20, but was ea240080 Jan 12 10:57:23 wszip-kinigka kernel: [ cut here ] Jan 12 10:57:23 wszip-kinigka kernel: kernel BUG at lib/list_debug.c:65! Jan 12 10:57:23 wszip-kinigka kernel: invalid opcode: [#1] Jan 12 10:57:23 wszip-kinigka kernel: SMP Jan 12 10:57:23 wszip-kinigka kernel: last sysfs file: /class/net/lo/ifindex Jan 12 10:57:23 wszip-kinigka kernel: Modules linked in: snd_usb_audio vfat fat hfsplus nls_utf8 cifs sbp2 sg usb_storage tun snd_usb_lib autofs4 hidp rfcomm l2cap bluetooth vmnet(U) vmmon(U) sunrpc ib_iser rdma_cm ib_addr ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi scsi_transport_iscsi ipv6 reiserfs loop dm_multipath parport_pc lp parport bt878 snd_bt87x snd_cmipci tuner tvaudio snd_seq_dummy gameport snd_seq_oss snd_opl3_lib bttv video_buf ir_common snd_hwdep snd_seq_midi_event nvidia(U) snd_mpu401_uart snd_seq compat_ioctl32 snd_pcm_oss i2c_algo_bit snd_rawmidi btcx_risc snd_mixer_oss snd_seq_device snd_pcm tveeprom snd_timer videodev ide_cd ohci1394 3c59x v4l1_compat v4l2_common snd i2c_core ieee1394 snd_page_alloc cdrom floppy mii soundcore serio_raw pcspkr dm_snapshot dm_zero dm_mirror dm_mod aic7xxx 
scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Jan 12 10:57:23 wszip-kinigka kernel: CPU:0 Jan 12 10:57:23 wszip-kinigka kernel: EIP:0060:[]Tainted: P VLI Jan 12 10:57:23 wszip-kinigka kernel: EFLAGS: 00010096 (2.6.18-1.2869.fc6 #1) Jan 12 10:57:23 wszip-kinigka kernel: EIP is at list_del+0x23/0x6c Jan 12 10:57:23 wszip-kinigka kernel: eax: 0048 ebx: ea24aa20 ecx: c067e1d0 edx: 0092 Jan 12 10:57:23 wszip-kinigka kernel: esi: f7ffd6c0 edi: cb841000 ebp: f7fffe80 esp: f7fefef8 Jan 12 10:57:23 wszip-kinigka kernel: ds: 007b es: 007b ss: 0068 Jan 12 10:57:23 wszip-kinigka kernel: Process events/0 (pid: 5, ti=f7fef000 task=f7d80030 task.ti=f7fef000) Jan 12 10:57:23 wszip-kinigka kernel: Stack: c0641c4f ea24aa20 ea240080 ea24aa20 c046b553 f7f7a1c0 0005 0004 Jan 12 10:57:23 wszip-kinigka kernel:f7ffdef0 f7ffdee0 0005 f7ffdec0 c046b656 Jan 12 10:57:23 wszip-kinigka kernel:f7fffe80 f7ffd6e4 f7ffd6c0 f7fffe80 c18fd340 0282 c046ca7a Jan 12 10:57:23 wszip-kinigka kernel: Call Trace: Jan 12 10:57:23 wszip-kinigka kernel: [] free_block+0x63/0xdc Jan 12 10:57:23 wszip-kinigka kernel: [] drain_array+0x8a/0xb5 Jan 12 10:57:23 wszip-kinigka kernel: [] cache_reap+0x53/0x117 Jan 12 10:57:23 wszip-kinigka kernel: [] run_workqueue+0x83/0xc5 Jan 12 10:57:23 wszip-kinigka kernel: [] worker_thread+0xd9/0x10d Jan 12 10:57:23 wszip-kinigka kernel: [] kthread+0xc0/0xed Jan 12 10:57:23 wszip-kinigka kernel: [] kernel_thread_helper+0x7/0x10 Jan 12 10:57:23 wszip-kinigka kernel: DWARF2 unwinder stuck at kernel_thread_helper+0x7/0x10 Jan 12 10:57:23 wszip-kinigka kernel: Leftover inexact backtrace: Jan 12 10:57:23 wszip-kinigka kernel: === Jan 12 10:57:23 wszip-kinigka kernel: Code: 00 00 89 c3 eb e8 90 90 53 89 c3 83 ec 0c 8b 40 04 8b 00 39 d8 74 1c 89 5c 24 04 89 44 24 08 c7 04 24 4f 1c 64 c0 e8 2b be f3 ff <0f> 0b 41 00 8c 1c 64 c0 8b 03 8b 40 04 39 d8 74 1c 89 5c 24 04 Jan 12 10:57:23 wszip-kinigka kernel: EIP: [] list_del+0x23/0x6c SS:ESP 0068:f7fefef8 Jan 12 10:57:23 
wszip-kinigka kernel: <3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20 Jan 12 10:57:23 wszip-kinigka kernel: in_atomic():0, irqs_disabled():1 Jan 12 10:57:23 wszip-kinigka kernel: [] dump_trace+0x69/0x1af Jan 12 10:57:23 wszip-kinigka kernel: [] show_trace_log_lvl+0x18/0x2c Jan 12 10:57:23 wszip-kinigka kernel: [] show_trace+0xf/0x11 Jan 12 10:57:23 wszip-kinigka kernel: [] dump_stack+0x15/0x17 Jan 12 10:57:23 wszip-kinigka kernel: [] down_read+0x12/0x20 Jan 12 10:57:23 wszip-kinigka kernel: [] blocking_notifier_call_chain+0xe/0x29 Jan 12 10:57:23 wszip-kinigka kernel: [] do_exit+0x1b/0x776 Jan 12 10:57:23 wszip-kinigka kernel: [] die+0x29d/0x2c2 Jan 12 10:57:23 wszip-kinigka kernel: [] do_invalid_op+0xa2/0xab Jan 12 10:57:23
Re: [PATCH 2/5] fixing errors handling during pci_driver resume stage [ata]
On Tue, Jan 09, 2007 at 12:01:28PM +0300, Dmitriy Monakhov wrote:
> ata pci drivers have to return correct error code during resume stage in
> case of errors.
...
> @@ -6246,8 +6253,10 @@ int ata_pci_device_suspend(struct pci_de
>  int ata_pci_device_resume(struct pci_dev *pdev)
>  {
>  	struct ata_host *host = dev_get_drvdata(&pdev->dev);
> +	int err;
>
> -	ata_pci_device_do_resume(pdev);
> +	if ((err = ata_pci_device_do_resume(pdev)))
> +		return err;

nit: in every other case I looked at you did:

	err = foo()
	if (err)
		...

Can you make that consistent here too?

thanks,
grant
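For reference, the style the nit asks for can be sketched in plain C as follows. This is only an illustration under stated assumptions: `do_resume()` and `device_resume()` are hypothetical stand-ins for `ata_pci_device_do_resume()` and `ata_pci_device_resume()`, and the `resumed` flag is invented here to make the early return visible.

```c
#include <errno.h>

/* Hypothetical stand-in for ata_pci_device_do_resume(); fails on request. */
int do_resume(int should_fail)
{
	return should_fail ? -EIO : 0;
}

/*
 * Resume path in the requested style: assign first, then test,
 * instead of folding the assignment into the if () condition.
 */
int device_resume(int should_fail, int *resumed)
{
	int err;

	err = do_resume(should_fail);
	if (err)
		return err;	/* propagate the error, skip the rest of resume */

	*resumed = 1;		/* reached only when do_resume() succeeded */
	return 0;
}
```

Both forms behave the same; the separate assignment mainly keeps the function consistent with the rest of the patch and avoids the doubled parentheses.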
[PATCH 1/1] Char: mxser_new, fix sparc compile error
mxser_new, fix sparc compile error

On sparc B400 is not defined. Use B200 for special baudrate, which is defined on all platforms.

Signed-off-by: Jiri Slaby <[EMAIL PROTECTED]>

---
commit 2826e3a35f34046890c84a77bc2784a184f9bf6a
tree fcfd15b000e703d91361f2b2c3c1bafb0d18b05d
parent 1ed2feac68d7b7cd50ffcd28cb0830b435e7d120
author Jiri Slaby <[EMAIL PROTECTED]> Sat, 13 Jan 2007 00:27:05 +0059
committer Jiri Slaby <[EMAIL PROTECTED]> Sat, 13 Jan 2007 00:27:05 +0059

 drivers/char/mxser_new.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/char/mxser_new.c b/drivers/char/mxser_new.c
index 1997390..4c80549 100644
--- a/drivers/char/mxser_new.c
+++ b/drivers/char/mxser_new.c
@@ -189,6 +189,8 @@ static unsigned int mxvar_baud_table1[] = {
 };
 #define BAUD_TABLE_NO ARRAY_SIZE(mxvar_baud_table)
 
+#define B_SPEC	B200
+
 static int ioaddr[MXSER_BOARDS] = { 0, 0, 0, 0 };
 static int ttymajor = MXSERMAJOR;
 static int calloutmajor = MXSERCUMAJOR;
@@ -544,7 +546,7 @@ static int mxser_change_speed(struct mxser_port *info,
 		return ret;
 
 	if (mxser_set_baud_method[info->tty->index] == 0) {
-		if ((cflag & (CBAUD | CBAUDEX)) == B400)
+		if ((cflag & CBAUD) == B_SPEC)
 			baud = info->speed;
 		else
 			baud = tty_get_baud_rate(info->tty);
@@ -1700,7 +1702,7 @@ static int mxser_ioctl(struct tty_struct *tty, struct file *file,
 				if (speed == mxvar_baud_table[i])
 					break;
 			if (i == BAUD_TABLE_NO) {
-				info->tty->termios->c_cflag |= B400;
+				info->tty->termios->c_cflag |= B_SPEC;
 			} else if (speed != 0)
 				info->tty->termios->c_cflag |= mxvar_baud_table1[i];
Re: Early ACPI lockup (was Re: 2.6.20-rc4-mm1)
Frederik Deweerdt wrote: > On Fri, Jan 12, 2007 at 05:53:08PM -0500, Len Brown wrote: >> On Friday 12 January 2007 05:20, Frederik Deweerdt wrote: >>> On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc3/2.6.20-rc4-mm1/ >>> Hi, >>> >>> The git-acpi.patch replaces earlier "if(!handler) return -EINVAL" by >>> "BUG_ON(!handler)". This locks my machine early at boot with a message >>> along the lines of (It's hand copied): >>> Int 6: cr2: eip: c0570e05 flags: 00010046 cs: 60 >>> stack: c054ffac c011db2b c04936d0 c054ff68 c054ffc0 c054fff4 c057da2c >>> >>> Reverting the change as follows, allows booting: >>> Any ideas to debug this further? >> >>> diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c >>> index db0c5f6..fba018c 100644 >>> --- a/drivers/acpi/tables.c >>> +++ b/drivers/acpi/tables.c >>> @@ -414,7 +414,9 @@ int __init acpi_table_parse(enum acpi_ta >>> unsigned int index; >>> unsigned int count = 0; >>> >>> - BUG_ON(!handler); >>> + if (!handler) >>> + return -EINVAL; >>> + /*BUG_ON(!handler);*/ >>> >>> for (i = 0; i < sdt_count; i++) { >>> if (sdt_entry[i].id != id) >> What do you see if on failure you also print out the params, like below? 

I get this:

ACPI: RSDP (v000 GBT ) @ 0x000f6e80
ACPI: RSDT (v001 GBTAWRDACPI 0x42302e31 AWRD 0x01010101) @ 0x3fff3000
ACPI: FADT (v001 GBTAWRDACPI 0x42302e31 AWRD 0x01010101) @ 0x3fff3040
ACPI: MADT (v001 GBTAWRDACPI 0x42302e31 AWRD 0x01010101) @ 0x3fff7100
ACPI: DSDT (v001 GBTAWRDACPI 0x1000 MSFT 0x010c) @ 0x
ACPI: PM-Timer IO Port: 0x1008
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:2 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:2 APIC version 20
ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode: Flat. Using 1 I/O APICs
ACPI: acpi_table_parse(17, ) HPET NULL handler!
Using ACPI (MADT) for SMP configuration information

regards,
--
http://www.fi.muni.cz/~xslaby/            Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
Re: [kvm-devel] kvm & dyntick
On Fri, 2007-01-12 at 15:25 -0800, Dor Laor wrote:
> This is great news for PV guests.
>
> Never-the-less we still need to improve our full virtualized guest
> support.

Full virtualized guests, which have their own dyntick support, are fine as long as we provide local apic emulation for them. If a guest does not have that, it will use the periodic mode. There is no way to circumvent this. We do not know whether the guest relies on that periodic interrupt or not.

	tglx
Re: Choosing a HyperThreading/SMP/MultiCore kernel ?
> Trying to understand, should I set CPUSETS=y

You don't need CPUSETS for this small a system. But setting it is harmless - for example, at least one major commercial distribution enables CPUSETS on almost all of its products, most of which run on PCs less powerful than yours.

CPUSETS provides a facility for managing the memory and processor placement of jobs running on what are typically big NUMA systems. Job X runs on CPUs 0-3 with memory on Nodes 0-1, while Job Y runs on CPUs 4-7 and Nodes 2-3. And bigger ... to hundreds and thousands of CPUs and Nodes.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
RE: [kvm-devel] kvm & dyntick
>* Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
>> > dyntick-enabled guest:
>> > - reduce the load on the host when the guest is idling
>> > (currently an idle guest consumes a few percent cpu)
>>
>> yeah. KVM under -rt already works with dynticks enabled on both the
>> host and the guest. (but it's more optimal to use a dedicated
>> hypercall to set the next guest-interrupt)
>
>using the dynticks code from the -rt kernel makes the overhead of an
>idle guest go down by a factor of 10-15:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2556 mingo 15 0 598m 159m 157m R 1.5 8.0 0:26.20 qemu
>
>( for this to work on my system i have added a 'hyper' clocksource
> hypercall API for KVM guests to use - this is needed instead of the
> running-to-slowly TSC. )
>
> Ingo

This is great news for PV guests.

Nevertheless, we still need to improve our fully virtualized guest support. First we need a mechanism (can we use the timeout_granularity?) to dynamically change the host timer frequency, so that we can support guests running at 100hz that dynamically change their frequency to 1000hz and back. Afterwards we'll need to compensate for the alarm signals the guests lose, using one of:

- hrtimers to inject the lost interrupts for specific guests. The problem is that this will increase the overall load.
- Injecting several virtual irqs into the guest one after another (using interrupt window exit). The question is how the guest will be affected by this unfair behavior.

Can dyntick help HVMs? Will the answer be the same for guest-dense hosts? I understood that the main gain of dyntick is for idle time.
Re: Choosing a HyperThreading/SMP/MultiCore kernel ?
On 1/12/07, Lennart Sorensen <[EMAIL PROTECTED]> wrote:
> I would expect any distribution should work on these (as long as the
> kernel they use isn't too old.). Of course if it is a Mac, you need a
> distribution that supports their firmware (which is of course not a PC
> bios). As long as you can boot it, any i386 or amd64 kernel with smp
> enabled should use all the processors present (well amd64 on the
> core2duo and on the p4 if it is em64t enabled).

It is not a Mac here, it is an IBM workstation. I can see the processor reported as Pentium 4 CPU 3. GHz (family 15, model 4). How can I tell whether EM64T is enabled? Is there a command?

Trying to understand: should I set CPUSETS=y and SCHED_MC=y, or ignore them?

> I believe the closest optimization for a Core2 is probably the Pentium M
> (certainly not the P4/netburst). Not entirely sure though.

Yep, this is a MacBookPro. I have decided on the distro. This doubt came up when I went for a custom kernel compilation from source after installation. What I have seen in Kconfig is that MPENTIUM4 is used for the Xeon processor too.

I will try this soon on my laptop (with SMP, since it's a Core2Duo). Anyway, I shall post here.

> --
> Len Sorensen

Thanks,
~Sunil
Re: Early ACPI lockup (was Re: 2.6.20-rc4-mm1)
On Fri, Jan 12, 2007 at 05:53:08PM -0500, Len Brown wrote: > On Friday 12 January 2007 05:20, Frederik Deweerdt wrote: > > On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote: > > > > > > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc3/2.6.20-rc4-mm1/ > > > > > Hi, > > > > The git-acpi.patch replaces earlier "if(!handler) return -EINVAL" by > > "BUG_ON(!handler)". This locks my machine early at boot with a message > > along the lines of (It's hand copied): > > Int 6: cr2: eip: c0570e05 flags: 00010046 cs: 60 > > stack: c054ffac c011db2b c04936d0 c054ff68 c054ffc0 c054fff4 c057da2c > > > > Reverting the change as follows, allows booting: > > Any ideas to debug this further? > > > > diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c > > index db0c5f6..fba018c 100644 > > --- a/drivers/acpi/tables.c > > +++ b/drivers/acpi/tables.c > > @@ -414,7 +414,9 @@ int __init acpi_table_parse(enum acpi_ta > > unsigned int index; > > unsigned int count = 0; > > > > - BUG_ON(!handler); > > + if (!handler) > > + return -EINVAL; > > + /*BUG_ON(!handler);*/ > > > > for (i = 0; i < sdt_count; i++) { > > if (sdt_entry[i].id != id) > > What do you see if on failure you also print out the params, like below? > I'm sorry, I might not be able to try it until monday. Michal reported a similar problem though, adding him to CC list. 

Regards,
Frederik

> thanks,
> -Len
>
> diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> index 3fce3db..e2d08a5 100644
> --- a/drivers/acpi/tables.c
> +++ b/drivers/acpi/tables.c
> @@ -415,7 +415,12 @@ int __init acpi_table_parse(enum acpi_table_id id,
> acpi_table_handler handler)
> 	unsigned int index = 0;
> 	unsigned int count = 0;
>
> -	BUG_ON(!handler);
> +	if (!handler) {
> +		printk(KERN_WARNING PREFIX
> +			"acpi_table_parse(%d, %p) %s NULL handler!\n",
> +			id, handler, acpi_table_signatures[id]);
> +		return -EINVAL;
> +	}
>
> 	for (i = 0; i < sdt_count; i++) {
> 		if (sdt_entry[i].id != id)
Re: 2.6.20-rc3-mm1: umount reiser4 FS stuck in D state
On 06.01.2007 19:58, Vladimir V. Saveliev wrote:

Hello

On Saturday 06 January 2007 13:58, Laurent Riffard wrote:

Hello,

got this with 2.6.20-rc3-mm1:

===
SysRq : Show Blocked State
free sibling
task PC stack pid father child younger older
umount D C013135E 6044 1168 1150 (NOTLB)
de591ae4 0086 de591abc c013135e dff979c8 c012a6fe 0046 0007
dfd94ac0 128d3000 0026 dfd94bcc dff979c8 de591ae4 dffda038
0002 dff979c0 dff979bc dff979c8 de591b10 c012d600 dff979f8
Call Trace:
[] synchronize_qrcu+0x70/0x8c
[] __make_request+0x4c/0x29b
[] generic_make_request+0x1b0/0x1de
[] submit_bio+0xda/0xe2
[] write_jnodes_to_disk_extent+0x920/0x974 [reiser4]
[] update_journal_footer+0x29f/0x2b7 [reiser4]
[] write_tx_back+0x149/0x185 [reiser4]
[] reiser4_write_logs+0xea4/0xfd2 [reiser4]
[] try_commit_txnh+0x7e6/0xa4f [reiser4]
[] reiser4_txn_end+0x148/0x3cf [reiser4]
[] reiser4_txn_restart+0xb/0x1a [reiser4]
[] reiser4_txn_restart_current+0x73/0x75 [reiser4]
[] force_commit_atom+0x258/0x261 [reiser4]
[] txnmgr_force_commit_all+0x406/0x697 [reiser4]
[] release_format40+0x10c/0x193 [reiser4]
[] reiser4_put_super+0x134/0x16a [reiser4]
[] generic_shutdown_super+0x55/0xd8
[] kill_block_super+0x20/0x32
[] deactivate_super+0x3f/0x51
[] mntput_no_expire+0x42/0x5f
[] path_release_on_umount+0x15/0x18
[] sys_umount+0x1a3/0x1cb
[] sys_oldumount+0x19/0x1b
[] sysenter_past_esp+0x5f/0x99
===

Scenario:
- umount a reiser4 FS (no need to write something before)

Hmm, I can not reproduce this with 2.6.20-rc3-mm1. Probably I need to config the kernel more close to your system.

Earlier kernels were OK.

This still happens with 2.6.20-rc4-mm1... Should I open a bug report at http://bugzilla.kernel.org?

--
laurent
Re: Early ACPI lockup (was Re: 2.6.20-rc4-mm1)
On Friday 12 January 2007 05:20, Frederik Deweerdt wrote:
> On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote:
> >
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc3/2.6.20-rc4-mm1/
> >
> Hi,
>
> The git-acpi.patch replaces earlier "if(!handler) return -EINVAL" by
> "BUG_ON(!handler)". This locks my machine early at boot with a message
> along the lines of (It's hand copied):
> Int 6: cr2: eip: c0570e05 flags: 00010046 cs: 60
> stack: c054ffac c011db2b c04936d0 c054ff68 c054ffc0 c054fff4 c057da2c
>
> Reverting the change as follows, allows booting:
> Any ideas to debug this further?

> diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> index db0c5f6..fba018c 100644
> --- a/drivers/acpi/tables.c
> +++ b/drivers/acpi/tables.c
> @@ -414,7 +414,9 @@ int __init acpi_table_parse(enum acpi_ta
> 	unsigned int index;
> 	unsigned int count = 0;
>
> -	BUG_ON(!handler);
> +	if (!handler)
> +		return -EINVAL;
> +	/*BUG_ON(!handler);*/
>
> 	for (i = 0; i < sdt_count; i++) {
> 		if (sdt_entry[i].id != id)

What do you see if on failure you also print out the params, like below?

thanks,
-Len

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 3fce3db..e2d08a5 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -415,7 +415,12 @@ int __init acpi_table_parse(enum acpi_table_id id, acpi_table_handler handler)
 	unsigned int index = 0;
 	unsigned int count = 0;
 
-	BUG_ON(!handler);
+	if (!handler) {
+		printk(KERN_WARNING PREFIX
+			"acpi_table_parse(%d, %p) %s NULL handler!\n",
+			id, handler, acpi_table_signatures[id]);
+		return -EINVAL;
+	}
 
 	for (i = 0; i < sdt_count; i++) {
 		if (sdt_entry[i].id != id)
Re: [patch] raw: don't allow the creation of a raw device with minor number 0
==> Regarding Re: [patch] raw: don't allow the creation of a raw device with minor number 0; Jan Engelhardt <[EMAIL PROTECTED]> adds:

jengelh> On Jan 12 2007 11:32, Jeff Moyer wrote:
>> Date: Fri, 12 Jan 2007 11:32:11 -0500
>> From: Jeff Moyer <[EMAIL PROTECTED]>
>> To: Linux Kernel Mailing List
>> Cc: Steven Fernandez <[EMAIL PROTECTED]>, Andrew Morton <[EMAIL PROTECTED]>
>> Subject: [patch] raw: don't allow the creation of a raw device with minor
>> number 0
>>
>> Hi,
>>
>> Minor number 0 (under the raw major) is reserved for the rawctl device
>> file, which is used to query, set, and unset raw device bindings.
>> However, the ioctl interface does not protect the user from specifying
>> a raw device with minor number 0:

jengelh> No idea what to say about this... probably:

jengelh> What: RAW driver (CONFIG_RAW_DRIVER)
jengelh> When: December 2005
jengelh> Why: declared obsolete since kernel 2.6.3
jengelh> O_DIRECT can be used instead
jengelh> Who: Adrian Bunk <[EMAIL PROTECTED]>

It's still present, still used, and so would benefit from being fixed, in my opinion.

Cheers,
Jeff
Re: O_DIRECT question
On Fri, 12 Jan 2007 15:35:09 -0700 Erik Andersen <[EMAIL PROTECTED]> wrote:

> On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > I suspect a lot of people actually have other reasons to avoid caches.
> >
> > For example, the reason to do O_DIRECT may well not be that you want to
> > avoid caching per se, but simply because you want to limit page cache
> > activity. In which case O_DIRECT "works", but it's really the wrong thing
> > to do. We could export other ways to do what people ACTUALLY want, that
> > doesn't have the downsides.
>
> I was rather fond of the old O_STREAMING patch by Robert Love,

That was an akpm patch which I did for the Digeo kernel. Robert picked it up to dehackify it and get it into mainline, but we ended up deciding that posix_fadvise() was the way to go because it's standards-based.

It's a bit more work in the app to use posix_fadvise() well. But the results will be better. The app should also use sync_file_range() intelligently to control its pagecache use.

The problem with all of these things is that the application needs to be changed, and people often cannot do that. If we want a general way of stopping particular apps from swamping pagecache then it'd really need to be an externally-imposed thing - probably via additional accounting and a new rlimit.
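As a rough illustration of the application-side approach described above, the following sketch writes a file through the page cache and then asks the kernel to drop the cached pages with posix_fadvise(). It is a minimal example under stated assumptions, not a tuned streaming writer: the function name `write_then_drop_cache()` and the temporary file location are made up for the example, a Linux/POSIX system is assumed, and fdatasync() is used as a simple portable stand-in for the finer-grained sync_file_range() mentioned in the message.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write nbytes of filler to a fresh temporary file, flush it, then advise
 * the kernel that the cached pages are no longer needed.
 * Returns 0 on success, -1 on any failure.
 */
int write_then_drop_cache(size_t nbytes)
{
	char path[] = "/tmp/fadvise-sketch-XXXXXX";	/* hypothetical location */
	char buf[4096];
	size_t left = nbytes;
	int fd, rc = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the file disappears once fd is closed */

	memset(buf, 'x', sizeof(buf));
	while (left > 0) {
		size_t chunk = left < sizeof(buf) ? left : sizeof(buf);
		ssize_t n = write(fd, buf, chunk);

		if (n <= 0)
			goto out;
		left -= (size_t)n;
	}

	/* Dirty pages cannot be discarded, so push them to storage first. */
	if (fdatasync(fd) != 0)
		goto out;

	/*
	 * offset 0 with len 0 covers the whole file. posix_fadvise()
	 * returns an error number directly rather than setting errno.
	 */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
		goto out;

	rc = 0;
out:
	close(fd);
	return rc;
}
```

A real streaming application would more likely flush and drop in windows as it goes, rather than once at the end, so that its page cache footprint stays bounded. That is the extra work in the app which the message refers to.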
Re: [patch] raw: don't allow the creation of a raw device with minor number 0
On Jan 12 2007 11:32, Jeff Moyer wrote:
>Date: Fri, 12 Jan 2007 11:32:11 -0500
>From: Jeff Moyer <[EMAIL PROTECTED]>
>To: Linux Kernel Mailing List
>Cc: Steven Fernandez <[EMAIL PROTECTED]>, Andrew Morton <[EMAIL PROTECTED]>
>Subject: [patch] raw: don't allow the creation of a raw device with minor
>number 0
>
>Hi,
>
>Minor number 0 (under the raw major) is reserved for the rawctl device
>file, which is used to query, set, and unset raw device bindings.
>However, the ioctl interface does not protect the user from specifying
>a raw device with minor number 0:

No idea what to say about this... probably:

What: RAW driver (CONFIG_RAW_DRIVER)
When: December 2005
Why:  declared obsolete since kernel 2.6.3
      O_DIRECT can be used instead
Who:  Adrian Bunk <[EMAIL PROTECTED]>

-`J'
--
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
Eric Sandeen wrote:
> Alex Tomas wrote:
>
>> yes, but it shouldn't allow to re-link such inode back, IMHO.
>> a filesystem may start some non-revertable activity in its
>> unlink method.
>>
>> thanks, Alex
>>
>
> I tend to agree, chatting w/ Al I think he does too. :) I'll test
> a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit
> that if things go well.
>

Well this seems to fix things up for ext3 (and ext4 by extension):

---

Return -ENOENT from ext[34]_link if we've raced with unlink and
i_nlink is 0. Doing otherwise has the potential to corrupt the
orphan inode list, because we'd wind up with an inode with a
non-zero link count on the list, and it will never get properly
cleaned up.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

Index: linux-2.6.19/fs/ext3/namei.c
===================================================================
--- linux-2.6.19.orig/fs/ext3/namei.c
+++ linux-2.6.19/fs/ext3/namei.c
@@ -2191,6 +2191,8 @@ static int ext3_link (struct dentry * ol
 
 	if (inode->i_nlink >= EXT3_LINK_MAX)
 		return -EMLINK;
+	if (inode->i_nlink == 0)
+		return -ENOENT;
 
 retry:
 	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
Index: linux-2.6.19/fs/ext4/namei.c
===================================================================
--- linux-2.6.19.orig/fs/ext4/namei.c
+++ linux-2.6.19/fs/ext4/namei.c
@@ -2189,6 +2189,8 @@ static int ext4_link (struct dentry * ol
 
 	if (inode->i_nlink >= EXT4_LINK_MAX)
 		return -EMLINK;
+	if (inode->i_nlink == 0)
+		return -ENOENT;
 
 retry:
 	handle = ext4_journal_start(dir, EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
Re: O_DIRECT question
On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> I suspect a lot of people actually have other reasons to avoid caches.
>
> For example, the reason to do O_DIRECT may well not be that you want to
> avoid caching per se, but simply because you want to limit page cache
> activity. In which case O_DIRECT "works", but it's really the wrong thing
> to do. We could export other ways to do what people ACTUALLY want, that
> doesn't have the downsides.

I was rather fond of the old O_STREAMING patch by Robert Love, which added an open() flag telling the kernel not to keep data from the current file in cache, by dropping pages from the pagecache before the current index. O_STREAMING was very nice for when you know you want to read a large file sequentially without polluting the rest of the cache with GB of data that you plan to read only once and discard. It worked nicely at doing what many people want to use O_DIRECT for.

Using O_STREAMING you would get normal read/write semantics since you still had the pagecache caching your data, but only the not-yet-written write-behind data and the not-yet-read read-ahead data. With the additional hint the kernel should drop free-able pages from the pagecache behind the current position, because we know we will never want them again. I thought that was a very nice way of handling things.

 -Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
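The "drop pages behind the current position" behaviour described for O_STREAMING can be approximated from userspace with posix_fadvise(). The following is a minimal sketch under that assumption; stream_read_file() is a hypothetical name, the buffer size is arbitrary, and the advice calls are hints the kernel is free to ignore.

```c
#define _XOPEN_SOURCE 700 /* for posix_fadvise() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: read `path` sequentially from start to end and
 * ask the kernel to drop the pages behind the read position as we go.
 * Returns the number of bytes read, or -1 on error.
 */
long long stream_read_file(const char *path)
{
    assert(path != NULL);

    char buf[1 << 16];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Tell the kernel we will read front to back (encourages read-ahead). */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    off_t pos = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        pos += n;
        /* Everything in [0, pos) has been consumed and will not be reread. */
        (void)posix_fadvise(fd, 0, pos, POSIX_FADV_DONTNEED);
    }

    close(fd);
    return n < 0 ? -1 : (long long)pos;
}
```

Unlike an open() flag, this requires the application to issue the hints itself, which is the extra work in the app that the replies in this thread refer to.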
Re: How can I create or read/write a file in linux device driver?
On Jan 12 2007 09:27, linux-os (Dick Johnson) wrote:
>
>First, since file-operations require process context, and the kernel
>is not a process, you need to create a kernel thread to handle your file
>I/O.

Not always. If you do file I/O as part of a device driver, you are fine. quad_dsp is such an example, where writing to /dev/Qdsp_* will trigger writes to /dev/dsp and /dev/adsp.

>Once you set up this "internal environment," you use the appropriate
>kernel function(s) such as sys_open()

What speaks against filp_open()? That avoids the unnecessary getname() stuff in most syscalls.

-`J'
--
Re: Linux v2.6.20-rc5
On Fri, 12 Jan 2007 14:27:48 -0500 (EST) Linus Torvalds <[EMAIL PROTECTED]> wrote:

>
> Ok, there it is, in all its shining glory.
>

It still doesn't run Excel.

> A lot of developers (including me) will be gone next week for
> Linux.Conf.Au,

me too.

> so you have a week of rest and quiet to test this, and
> report any problems.

I have a few fixes pending:

kvm-add-vm-exit-profiling-fix.patch
revert-nmi_known_cpu-check-during-boot-option-parsing.patch
blockdev-direct_io-fix-signedness-bug.patch
submitchecklist-update.patch
paravirt-mark-the-paravirt_ops-export-internal.patch
kvm-make-sure-there-is-a-vcpu-context-loaded-when.patch
kvm-fix-race-between-mmio-reads-and-injected-interrupts.patch
kvm-x86-emulator-fix-bit-string-instructions.patch
kvm-fix-asm-constraints-with-config_frame_pointer=n.patch
kvm-fix-bogus-pagefault-on-writable-pages.patch
rtc-sh-act-on-rtc_wkalrmenabled-when-setting-an-alarm.patch
fix-blk_direct_io-bio-preparation.patch
tlclk-bug-fix-misc-fixes.patch
mbind-restrict-nodes-to-the-currently-allowed-cpuset.patch
reiserfs-avoid-tail-packing-if-an-inode-was-ever-mmapped.patch

all of which are present in http://userweb.kernel.org/~akpm/2.6.20-rc5-mm-fixes

The KVM and direct-io changes are significant, so if people are testing those things, please be sure to have that patch applied.
Re: O_DIRECT question
Linus Torvalds wrote:
>
> On Sat, 13 Jan 2007, Michael Tokarev wrote:
>>> At that point, O_DIRECT would be a way of saying "we're going to do
>>> uncached accesses to this pre-allocated file". Which is a half-way
>>> sensible thing to do.
>> Half-way?
>
> I suspect a lot of people actually have other reasons to avoid caches.
>
> For example, the reason to do O_DIRECT may well not be that you want to
> avoid caching per se, but simply because you want to limit page cache
> activity. In which case O_DIRECT "works", but it's really the wrong thing
> to do. We could export other ways to do what people ACTUALLY want, that
> doesn't have the downsides.
>
> For example, the page cache is absolutely required if you want to mmap.
> There's no way you can do O_DIRECT and mmap at the same time and expect
> any kind of sane behaviour. It may not be what a DB wants to use, but it's
> an example of where O_DIRECT really falls down.

Only when the two touch the same part of a file. If they don't, and the file is "divided" on a proper boundary (sector/page/whatever-aligned), there are no issues, at least not if all the blocks of the file have been allocated (no gaps, that is).

What I was referring to in my last email - and said it's a corner case - is: mmap() the start of a file, say, the first megabyte of it, where some index/bitmap is located, and use direct-io on the rest. So the two don't overlap. Still problematic?

>>> But what O_DIRECT does right now is _not_ really sensible, and the
>>> O_DIRECT propeller-heads seem to have some problem even admitting that
>>> there _is_ a problem, because they don't care.
>> Well. In fact, there's NO problems to admit.
>>
>> Yes, yes, yes yes - when you think about it from a general point of
>> view, and think how non-O_DIRECT and O_DIRECT access fits together,
>> it's a complete mess, and you're 100% right it's a mess.
>
> You can't admit that even O_DIRECT _without_ any non-O_DIRECT actually
> fails in many ways right now.
>
> I've already mentioned ftruncate and block allocation. You don't seem to
> understand that those are ALSO a problem.

I do understand this. And this, too, is solved right now in userspace. For example, when oracle allocates a file for its data, or when it extends the file, it writes something to every block of new space (using O_DIRECT while at it, but that's a different story). The thing is: while it is doing that, no process tries to do anything with that (part of a) file (not counting some external processes run by evil hackers ;) So there are still no races or fundamental brokenness *in usage*.

It uses ftruncate() to create or extend a file, *and* does O_DIRECT writes to force block allocations. That's probably not right, and that alone is probably difficult to implement in kernel (I just don't know; what I know for sure is that this way is very slow on ext3). Maybe because there's no way to tell kernel something like "set the file size to this and actually *allocate* space for it" (if it doesn't write some structure to the file).

What I dislike very much is - half-solutions. And current O_DIRECT indeed looks like half-a-solution, because sometimes it works, and sometimes, in a *wrong* usage scenario, it doesn't, or is racy, etc, and kernel *allows* such a wrong scenario. Software should either work correctly, or disallow a usage where it can't guarantee correctness. Currently, kernel allows incorrect usage, and that, plus all the ugly things in code done in attempt to fix that, suxx.

But the whole thing is not (fundamentally) broken.

/mjt
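The userspace allocation scheme described in this message (ftruncate() to set the size, then a write to every block to force allocation) could look roughly like the sketch below. It is an illustration only: preallocate_by_writing() and the 4096-byte PREALLOC_BLOCK are assumptions, and plain buffered pwrite() plus fsync() stands in for the O_DIRECT writes mentioned above so that the example has no alignment requirements.

```c
#define _XOPEN_SOURCE 700 /* for pwrite() and ftruncate() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define PREALLOC_BLOCK 4096 /* assumed block size for this sketch */

/*
 * Hypothetical helper: create `path` as a new file of `size` bytes and
 * write a filler pattern into every block so that no holes remain.
 * Returns the resulting file size, or -1 on error.
 */
long long preallocate_by_writing(const char *path, off_t size)
{
    assert(path != NULL && size >= 0);

    char block[PREALLOC_BLOCK];
    struct stat st;
    memset(block, 0xA5, sizeof block);

    /* O_TRUNC: this sketch is meant for new files, not for extending. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Step 1: set the file size; the file may still be entirely sparse. */
    if (ftruncate(fd, size) != 0)
        goto fail;

    /* Step 2: touch every block so the filesystem has to allocate it. */
    for (off_t off = 0; off < size; off += PREALLOC_BLOCK) {
        size_t n = (size - off) < PREALLOC_BLOCK ? (size_t)(size - off)
                                                 : PREALLOC_BLOCK;
        if (pwrite(fd, block, n, off) != (ssize_t)n)
            goto fail;
    }

    /* Step 3: make sure the data, and thus the allocation, reached disk. */
    if (fsync(fd) != 0 || fstat(fd, &st) != 0)
        goto fail;

    close(fd);
    return (long long)st.st_size;

fail:
    close(fd);
    return -1;
}
```

POSIX also specifies posix_fallocate(), which expresses "allocate this range" in a single call; whether it avoids the per-block writes depends on the C library and filesystem in use, so it is worth checking before relying on it.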
Re: Ext3 mounted as ext2 but journal still in effect.
Hi!

> You were right, even after making the changes, it seems to be
> telling lies:
>
> # mount
> /dev/hda2 on / type ext2 (rw,usrquota)
> [...]
>
> However, I think I am still not mounting as ext2:
>
> # dmesg | grep 'Kernel command'
> Kernel command line: ro root=/dev/hda2 rootfstype=ext2 ...
> rootfs / rootfs rw 0 0
> /dev/root / ext3 rw 0 0

> Do I need to mess with the initrd? My grub lines look like
> this:

Yes, probably.

Pavel
--
Thanks for all the (sleeping) penguins.
Re: 2.6.20-rc3 regression: suspend to RAM broken on Mac mini Core Duo
Hi!

> > >> > It didn't. It looks like it is unusable, because it isn't reliable in
> > >> > 2.6.20-rc3.
> > >>
> > >> Is this issue still present in -rc4?
> > >
> > >I used 2.6.20-rc4 in single user mode, and applied 2 patches from
> > >netdev to get wake on LAN support. This way I was able to set up an
> > >automatic suspend/resume loop. It looked good, but after e.g. 20
> > >minutes, the resume hung. So it is reproducible with 2.6.20-rc4.
> > >Unfortunately, I can not test the same with 2.6.18, as the wake on LAN
> > >patches need 2.6.20-rc.
> >
> > Hmm, do you mean this is the first time of this kind of testing?
> > Is this issue related to LAN driver?
> > I guess you should be able to set up an automatic suspend/resume loop
> > with /proc/acpi/alarm, and test similar with 2.6.18.
>
> Thanks for the hint. I just used /proc/acpi/alarm to set up a
> suspend/resume loop and did ca. 100 cycles in a row with 2.6.18.2 in
> single user mode, without a failure.

Can you do a similar test on 2.6.20 -- w/o the network driver loaded (and generally with a minimum of drivers)?

Pavel
--
Thanks for all the (sleeping) penguins.
Re: O_DIRECT question
On Sat, 13 Jan 2007, Michael Tokarev wrote:
> >
> > At that point, O_DIRECT would be a way of saying "we're going to do
> > uncached accesses to this pre-allocated file". Which is a half-way
> > sensible thing to do.
>
> Half-way?

I suspect a lot of people actually have other reasons to avoid caches.

For example, the reason to do O_DIRECT may well not be that you want to avoid caching per se, but simply because you want to limit page cache activity. In which case O_DIRECT "works", but it's really the wrong thing to do. We could export other ways to do what people ACTUALLY want, that doesn't have the downsides.

For example, the page cache is absolutely required if you want to mmap. There's no way you can do O_DIRECT and mmap at the same time and expect any kind of sane behaviour. It may not be what a DB wants to use, but it's an example of where O_DIRECT really falls down.

> > But what O_DIRECT does right now is _not_ really sensible, and the
> > O_DIRECT propeller-heads seem to have some problem even admitting that
> > there _is_ a problem, because they don't care.
>
> Well. In fact, there's NO problems to admit.
>
> Yes, yes, yes yes - when you think about it from a general point of
> view, and think how non-O_DIRECT and O_DIRECT access fits together,
> it's a complete mess, and you're 100% right it's a mess.

You can't admit that even O_DIRECT _without_ any non-O_DIRECT actually fails in many ways right now.

I've already mentioned ftruncate and block allocation. You don't seem to understand that those are ALSO a problem.

Linus
Re: 'struct task_struct' has no member named 'mems_allowed' (was: Re: 2.6.20-rc4-mm1)
On Fri, 12 Jan 2007 14:00:16 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Fri, 12 Jan 2007, Paul Jackson wrote:
>
> > It might look clearer to someone who is focused on that particular
> > change, but it adds unnecessary noise for the other 90% of the readers
> > of that code who are not concerned with cpusets at that point in time.
>
> This is in NUMA specific code. And they should be concerned about cpusets
> since cpusets may affect the node masks they can set. If this is hidden in
> a macro then it may be overlooked.

bah. No ifdefs!
Re: Disk Cache, Was: O_DIRECT question
Zan Lynx wrote:
> On Sat, 2007-01-13 at 00:03 +0300, Michael Tokarev wrote:
> [snip]
>> And sure thing, withOUT O_DIRECT, the whole system is almost dead under this
>> load - because everything is thrown away from the cache, even caches of /bin
>> /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but not a lot).
>
> One thing that I've been using, and seems to work well, is a customized
> version of the readahead program several distros use during boot up.

[idea to lock some (commonly-used) cache pages in memory]

> Something like that could keep your system responsive no matter what the
> disk cache is doing otherwise.

Unfortunately it doesn't. Sure, things like libc.so etc will be force-cached and will start fast. But not my data files and other stuff (what an unfortunate thing: memory usually is smaller in size than disks ;)

I can do usual work without noticing that something is working the disks intensively, doing O_DIRECT I/O. For example, I can run a large report on a database, which requires a lot of disk I/O, and run a kernel compile at the same time. Sure, disk access is a lot slower, but disk cache helps a lot, too. My kernel compile will not be much slower than usual. But if I turn O_DIRECT off, the compile will take ages to finish. *And* the report run, too! Because the system tries hard to cache the WRONG pages! (yes I remember fadvise - which isn't used by the database(s) currently, and quite a lot of words have been said about that, too; I also noticed it's slower as well, at least currently.)

/mjt
Re: 'struct task_struct' has no member named 'mems_allowed' (was: Re: 2.6.20-rc4-mm1)
Christoph wrote:
> If this is hidden in a macro then it may be overlooked.

Sooner or later, every line of code is important. Shouting any one of them in #ifdef brackets creates a noisier environment, increasing the chance of missing another.

And besides ... the other umpteen cpuset hooks all use the cpuset_*() style macros (except for fs/proc/base.c, which has its own style ...). Consistency in style is important in these matters.

--
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
> Eric Sandeen (ES) writes:

 ES> Al says "no" and I'm not arguing. :)

 ES> Apparently this may be OK with some filesystems, and Al says he doesn't
 ES> want to know about i_nlink in the vfs in any case.

well, generic_drop_inode() uses i_nlink ...

 ES> But I suppose there may be other filesystems which DO care, and should
 ES> be checking if they're not.

this is why I thought VFS could take care.

thanks, Alex
Re: SATA hotplug from the user side ?
On Fri, 2007-01-12 at 12:04 -0500, Jeff Garzik wrote:
> Soeren Sonnenburg wrote:
> > Dear all,
> >
> > I'd like to try out SATA hotplugging using a SIL3114. Though I was
> > harvesting the web, I could not find any useful information how this is
> > done in practice.
> >
> > Well I realized that I can still use scsiadd to print and remove
> > devices, e.g.:
>
> For SIL3114, you shouldn't have to run any commands at all. It should
> notice when you yank the cable, or plug in a new device.

It is true that it detects a removal and newly plugged devices immediately... However it still prints warnings and errors that it could not synchronize the SCSI cache for the disks. Then it prints regular 'rejecting I/O to dead device' warning messages, and on replugging the disks it puts them on the next free sd device (e.g. sdc -> sdd).

These messages sound evil - so now the question is: should I care? (On the other hand it did not crash the machine.)

What follows is a swap of two SATA drives attached to ports 4/5 of the sil (ata5/ata6 here):

ata6: exception Emask 0x10 SAct 0x0 SErr 0x1 action 0x2 frozen
ata6: hard resetting port
ata6: SATA link down (SStatus 0 SControl 310)
ata6: failed to recover some devices, retrying in 5 secs
ata5: exception Emask 0x10 SAct 0x0 SErr 0x1 action 0x2 frozen
ata5: hard resetting port
ata5: SATA link down (SStatus 0 SControl 310)
ata5: failed to recover some devices, retrying in 5 secs
ata6: hard resetting port
ata6: SATA link down (SStatus 0 SControl 310)
ata6: failed to recover some devices, retrying in 5 secs
ata5: hard resetting port
ata5: SATA link down (SStatus 0 SControl 310)
ata5: failed to recover some devices, retrying in 5 secs
ata6: hard resetting port
ata6: SATA link down (SStatus 0 SControl 310)
ata6.00: disabled
ata6: EH complete
ata6.00: detaching (SCSI 5:0:0:0)
Synchronizing SCSI cache for disk sdd: FAILED status = 0, message = 00, host = 4, driver = 00
<6>ata5: hard resetting port
ata5: SATA link down (SStatus 0 SControl 310)
ata5.00: disabled
ata5: EH complete
ata5.00: detaching (SCSI 4:0:0:0)
Synchronizing SCSI cache for disk sdc: FAILED status = 0, message = 00, host = 4, driver = 00
<3>ata6: exception Emask 0x10 SAct 0x0 SErr 0x5 action 0x2 frozen
ata6: hard resetting port
ata6: port is slow to respond, please be patient (Status 0xff)
ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata6.00: ATA-7, max UDMA/133, 1465149168 sectors: LBA48 NCQ (depth 0/32)
ata6.00: configured for UDMA/100
ata6: EH complete
scsi 5:0:0:0: Direct-Access ATA ST3750640AS 3.AA PQ: 0 ANSI: 5
SCSI device sdf: 1465149168 512-byte hdwr sectors (750156 MB)
sdf: Write Protect is off
sdf: Mode Sense: 00 3a 00 00
SCSI device sdf: drive cache: write back
SCSI device sdf: 1465149168 512-byte hdwr sectors (750156 MB)
sdf: Write Protect is off
sdf: Mode Sense: 00 3a 00 00
SCSI device sdf: drive cache: write back
 sdf: unknown partition table
sd 5:0:0:0: Attached scsi disk sdf
sd 5:0:0:0: Attached scsi generic sg2 type 0
scsi 4:0:0:0: rejecting I/O to dead device
scsi 4:0:0:0: rejecting I/O to dead device
scsi 5:0:0:0: rejecting I/O to dead device
scsi 5:0:0:0: rejecting I/O to dead device
ata5: exception Emask 0x10 SAct 0x0 SErr 0x5 action 0x2 frozen
ata5: hard resetting port
ata5: port is slow to respond, please be patient (Status 0xff)
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata5.00: ATA-7, max UDMA/133, 1465149168 sectors: LBA48 NCQ (depth 0/32)
ata5.00: configured for UDMA/100
ata5: EH complete
scsi 4:0:0:0: Direct-Access ATA ST3750640AS 3.AA PQ: 0 ANSI: 5
SCSI device sdg: 1465149168 512-byte hdwr sectors (750156 MB)
sdg: Write Protect is off
sdg: Mode Sense: 00 3a 00 00
SCSI device sdg: drive cache: write back
SCSI device sdg: 1465149168 512-byte hdwr sectors (750156 MB)
sdg: Write Protect is off
sdg: Mode Sense: 00 3a 00 00
SCSI device sdg: drive cache: write back
 sdg: unknown partition table
sd 4:0:0:0: Attached scsi disk sdg
sd 4:0:0:0: Attached scsi generic sg3 type 0

Best,
Soeren
--
For the one fact about the future of which we can be certain is that it will be utterly fantastic. -- Arthur C. Clarke, 1962
[PATCH] Cell SPU task notification
Subject: Enable SPU switch notification to detect currently active SPU tasks.

From: Maynard Johnson <[EMAIL PROTECTED]>

This patch adds to the capability of spu_switch_event_register so that the
caller is also notified of currently active SPU tasks. It also exports
spu_switch_event_register and spu_switch_event_unregister.

Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]>

Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/sched.c
===================================================================
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/sched.c	2006-12-04 10:56:04.730698720 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/sched.c	2007-01-11 09:45:37.918333128 -0600
@@ -46,6 +46,8 @@
 
 #define SPU_MIN_TIMESLICE	(100 * HZ / 1000)
 
+int notify_active[MAX_NUMNODES];
+
 #define SPU_BITMAP_SIZE (((MAX_PRIO+BITS_PER_LONG)/BITS_PER_LONG)+1)
 struct spu_prio_array {
 	unsigned long bitmap[SPU_BITMAP_SIZE];
@@ -81,18 +83,45 @@
 static void spu_switch_notify(struct spu *spu, struct spu_context *ctx)
 {
 	blocking_notifier_call_chain(&spu_switch_notifier,
-			    ctx ? ctx->object_id : 0, spu);
+				     ctx ? ctx->object_id : 0, spu);
+}
+
+static void notify_spus_active(void)
+{
+	int node;
+	/* Wake up the active spu_contexts. When the awakened processes
+	 * sees their notify_active flag is set, they will call
+	 * spu_notify_already_active().
+	 */
+	for (node = 0; node < MAX_NUMNODES; node++) {
+		struct spu *spu;
+		mutex_lock(&spu_prio->active_mutex[node]);
+		list_for_each_entry(spu, &spu_prio->active_list[node], list) {
+			struct spu_context *ctx = spu->ctx;
+			wake_up_all(&ctx->stop_wq);
+			notify_active[ctx->spu->number] = 1;
+			smp_mb();
+		}
+		mutex_unlock(&spu_prio->active_mutex[node]);
+	}
+	yield();
 }
 
 int spu_switch_event_register(struct notifier_block * n)
 {
-	return blocking_notifier_chain_register(&spu_switch_notifier, n);
+	int ret;
+	ret = blocking_notifier_chain_register(&spu_switch_notifier, n);
+	if (!ret)
+		notify_spus_active();
+	return ret;
 }
+EXPORT_SYMBOL_GPL(spu_switch_event_register);
 
 int spu_switch_event_unregister(struct notifier_block * n)
 {
 	return blocking_notifier_chain_unregister(&spu_switch_notifier, n);
 }
+EXPORT_SYMBOL_GPL(spu_switch_event_unregister);
 
 
 static inline void bind_context(struct spu *spu, struct spu_context *ctx)
@@ -250,6 +279,14 @@
 	return spu_get_idle(ctx, flags);
 }
 
+void spu_notify_already_active(struct spu_context *ctx)
+{
+	struct spu *spu = ctx->spu;
+	if (!spu)
+		return;
+	spu_switch_notify(spu, ctx);
+}
+
 /* The three externally callable interfaces
  * for the scheduler begin here.
  *
Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/spufs.h
===================================================================
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/spufs.h	2007-01-08 18:18:40.093354608 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/spufs.h	2007-01-08 18:31:03.610345792 -0600
@@ -183,6 +183,7 @@
 void spu_yield(struct spu_context *ctx);
 int __init spu_sched_init(void);
 void __exit spu_sched_exit(void);
+void spu_notify_already_active(struct spu_context *ctx);
 
 extern char *isolated_loader;
 
Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/run.c
===================================================================
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/run.c	2007-01-08 18:33:51.979311680 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/run.c	2007-01-11 10:17:20.777344984 -0600
@@ -10,6 +10,8 @@
 
 #include "spufs.h"
 
+extern int notify_active[MAX_NUMNODES];
+
 /* interrupt-level stop callback function. */
 void spufs_stop_callback(struct spu *spu)
 {
@@ -45,7 +47,9 @@
 	u64 pte_fault;
 
 	*stat = ctx->ops->status_read(ctx);
-	if (ctx->state != SPU_STATE_RUNNABLE)
+	smp_mb();
+
+	if (ctx->state != SPU_STATE_RUNNABLE || notify_active[ctx->spu->number])
 		return 1;
 	spu = ctx->spu;
 	pte_fault = spu->dsisr &
@@ -319,6 +323,11 @@
 		ret = spufs_wait(ctx->stop_wq, spu_stopped(ctx, &status));
 		if (unlikely(ret))
 			break;
+		if (unlikely(notify_active[ctx->spu->number])) {
+			notify_active[ctx->spu->number] = 0;
+			if (!(status & SPU_STATUS_STOPPED_BY_STOP))
+				spu_notify_already_active(ctx);
+		}
 		if ((status & SPU_STATUS_STOPPED_BY_STOP) &&
 		    (status >> SPU_STOP_STATUS_SHIFT == 0x2104)) {
 			ret = spu_process_callback(ctx);
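The idea behind this patch, that registering for switch events should also report the tasks that are already running, can be modelled in plain userspace C. The sketch below is an illustration only and does not use any kernel API: struct switch_notifier, notifier_switch(), notifier_register() and record_cb() are all hypothetical names. It also simplifies the mechanism: the patch wakes the active contexts so that they call spu_switch_notify() themselves, whereas this model replays the active entries directly to the newly registered listener.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LISTENERS 8
#define MAX_SLOTS     4 /* stands in for the set of SPUs */

/* Callback type: `object_id` is 0 when the slot becomes idle. */
typedef void (*switch_cb)(int slot, int object_id, void *data);

struct switch_notifier {
    switch_cb cb[MAX_LISTENERS];
    void     *data[MAX_LISTENERS];
    int       nlisteners;
    int       active[MAX_SLOTS]; /* object id running on each slot, 0 = idle */
};

/* Record a context switch on `slot` and tell every registered listener. */
int notifier_switch(struct switch_notifier *sn, int slot, int object_id)
{
    assert(sn != NULL);
    if (slot < 0 || slot >= MAX_SLOTS)
        return -1;
    sn->active[slot] = object_id;
    for (int i = 0; i < sn->nlisteners; i++)
        sn->cb[i](slot, object_id, sn->data[i]);
    return 0;
}

/* Register a listener, then replay the slots that are already active. */
int notifier_register(struct switch_notifier *sn, switch_cb cb, void *data)
{
    assert(sn != NULL && cb != NULL);
    if (sn->nlisteners == MAX_LISTENERS)
        return -1;
    sn->cb[sn->nlisteners] = cb;
    sn->data[sn->nlisteners] = data;
    sn->nlisteners++;

    /* Without this loop a late listener would miss long-running tasks. */
    for (int slot = 0; slot < MAX_SLOTS; slot++)
        if (sn->active[slot] != 0)
            cb(slot, sn->active[slot], data);
    return 0;
}

/* Example listener: remembers the last object id seen on each slot. */
void record_cb(int slot, int object_id, void *data)
{
    int *seen = data;
    seen[slot] = object_id;
}
```

A profiler-style consumer would register record_cb() (or something like it) once and from then on see both the tasks that were already running and every later switch.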
Re: Fwd: [PATCH] Fix some ARM builds due to HID brokenness
On Fri, 12 Jan 2007 13:44:05 -0800 Andrew Morton wrote:

> On Fri, 12 Jan 2007 21:00:15 +
> Russell King <[EMAIL PROTECTED]> wrote:
>
> > Could we please have this (or a proper fix) in before 2.6.20 to resolve
> > the regression please?
> >
> >
> > ...
> >
> > --- a/drivers/hid/Kconfig
> > +++ b/drivers/hid/Kconfig
> > @@ -6,6 +6,7 @@ menu "HID Devices"
> >
> >  config HID
> >  	tristate "Generic HID support"
> > +	depends on INPUT
> >  	default y
> >  	---help---
> >  	  Say Y here if you want generic HID support to connect keyboards,
> >
>
> This was merged a week ago..

Right, we are past that to a new patch now.

---
~Randy
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
Alex Tomas wrote:
>> Eric Sandeen (ES) writes:
> ES> I tend to agree, chatting w/ Al I think he does too. :) I'll test
> ES> a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit
> ES> that if things go well.
>
> shouldn't VFS do that?

Al says "no" and I'm not arguing. :)

Apparently this may be OK with some filesystems, and Al says he doesn't want to know about i_nlink in the vfs in any case.

But I suppose there may be other filesystems which DO care, and should be checking if they're not.

-Eric