Re: [PATCH] cpuset and sched domains: sched_load_balance flag
> > Yup - it's asking for load balancing over that set. That is why it is > > called that. There's no idea here of better or worse load balancing, > > that's an internal kernel scheduler subtlety -- it's just a request that > > load balancing be done. > > OK, if it prohibits balancing when sched_load_balance is 0, then it is > slightly more useful. It doesn't prohibit load balancing just because sched_load_balance is 0. Only if there are no overlapping cpusets still needing balancing does it prohibit balancing when 0. > Yeah, but the interface is not very nice. As an interface for hard > partitioning, it doesn't work nicely because it is hierarchical. Yeah -- cpusets are hierarchical. And some of the use cases for which cpusets are designed are hierarchical. > > > You would do this by creating partitioning cpusets which carve up the > > > root cpuset (basically -- have multiple roots). > > > > You would do this with the current, single rooted cpuset (and now > > cgroup) mechanism by having multiple immediate child cpusets of the > > root cpuset, which partition the system CPUs. There is no need to > > invent some bastardized multiple root structure. > > What do you mean by bastardized? Changing cpusets from single root to multiple roots would be bastardizing it. My proposed sched_load_balance API is already quite capable of representing what you see the need for - hard partitioning. It is also quite capable of representing some other situations, such as I've described in other replies, that you don't seem to see the need for. To repeat myself, in some cases, such as batch schedulers running in a subset of the CPUs on a large system, the code that knows some of the needs for load balancing does not have system wide control to mandate hard partitioning. 
The batch scheduler can state where it is depending on load balancing being present, and the system administrator can choose or not to turn off load balancing in the top cpuset, thereby granting or not control over load balancing on the CPUs controlled by the batch scheduler to the batch scheduler. Hard partitioning is not the only use case here. If you don't appreciate the other cases, then fine ... but I don't think that gives you grounds to reject a patch just because it is not precisely the ideal, narrowly focused, API for the case you do appreciate. > What's wrong with having a real > (and sane) representation of the requested hard-partitions in the system? What's wrong with it is that 1) it doesn't cover all the use cases, 2) it would require a new and different mechanism other than cpusets which are not multiple rooted, and do robustly support overlapping sets and hence are not a hard partitioning, and 3) we'd still need the cpuset based API to cover the remaining use cases. Good grief -- I must be misunderstanding you here, Nick. I can't imagine that you want to turn cpusets into a multiple rooted hard partition mechanism. If you are, then "bastardized" is the right word. > Not your proposal, just the idea to have enough information to be able > to work out a more optimal set of sched-domains automatically. I can't figure out what the sentence is saying ... sorry. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] sched: fix sched-domains partitioning by cpusets
Ingo wrote:
> i've merged your patch to my scheduler queue - see the patch below. (And
> could you send me your SoB line too?) Paul, if we went with the patch
> below, what else would be needed for your purposes?

Nick and I already resolved that, when he first posted this patch in October of 2006. The cpu_exclusive flag doesn't work for this.

Here's a copy of the key message, from Nick, near the end of that thread in which he earlier proposed this patch, also available at:

http://lkml.org/lkml/2006/10/21/12

Paul Jackson wrote:
> Nick wrote:
>
>>Or, another question, how does my patch hijack cpus_allowed? In
>>what way does it change the semantics of cpus_allowed?
>
>
> It limits load balancing for tasks in cpusets containing
> a superset of that cpuset's cpus.
>
> There are always such cpusets - the top cpuset if no other.

Ah OK, and there is my misunderstanding with cpusets. From the documentation it appears as though cpu_exclusive cpusets are made in order to do the partitioning thing. If you always have other domains overlapping them (regardless that it is a parent), then what actual use does cpu_exclusive flag have?

A couple messages later in that thread, Nick wrote:
> But even the way cpu_exclusive semantics are defined makes it not
> quite compatible with partitioning anyway, unfortunately.

I agree with Nick on this conclusion, and with his other conclusion that the 'cpu_exclusive' flag is pretty near useless. Some per-cpuset flag other than 'cpu_exclusive' is required to control sched domains from cpusets.

This has specific impact on one of the key users of cpusets, the various developers of batch schedulers. One by one, they have determined that the cpu_exclusive flag is incompatible with the way they set up cpusets, and have decided they should not enable that flag on any cpuset under their control. It gets in their way, and serves no useful purpose for them.

However we need some way for them to specify where they need load balancing, so that on large systems, they can allow the admin to avoid the cost of load balancing over the batch scheduler's entire subset of the system at once, but rather just load balance over the smaller sets where the batch scheduler has active jobs running that might depend on load balancing.

Batch schedulers need to be able to specify where they need load balancing and where they don't, and they can't use the 'cpu_exclusive' flag. The defining characteristic of 'cpu_exclusive' is no overlap of CPUs with sibling cpusets. That is incompatible with their needs. Therefore, they need a different flag.

I must NAK this patch, and I'm surprised to see Nick propose it again, as I thought he had already agreed that it didn't suffice.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
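The point about 'cpu_exclusive' can be illustrated with a small user-space model. This is only a sketch, not kernel code: CPU sets are modelled as 64-bit masks, and both function names are hypothetical helpers invented for this illustration. It shows that a cpuset can satisfy the sibling rule (no overlap with siblings) while still overlapping its parent, which is why the flag alone does not yield a hard partition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: bit i of a mask is set when CPU i belongs to the cpuset.
 * Returns 1 when `cpus` shares no CPU with any sibling mask, i.e. when the
 * sibling rule described for 'cpu_exclusive' holds; returns 0 otherwise.
 */
int disjoint_from_siblings(uint64_t cpus, const uint64_t *siblings, size_t n)
{
    assert(siblings != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cpus & siblings[i])
            return 0;
    }
    return 1;
}

/*
 * Returns 1 when `cpus` shares at least one CPU with `parent`. For any
 * non-empty child this is true, since a child's CPUs come from its parent.
 */
int overlaps_parent(uint64_t cpus, uint64_t parent)
{
    return (cpus & parent) != 0;
}
```

With a parent of 0xFF and siblings 0x0F and 0xF0, each child passes the sibling check, yet both still overlap the parent, which matches the observation that "there are always such cpusets - the top cpuset if no other".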
Re: Man page for revised timerfd API
Davide Libenzi wrote:
> On Thu, 27 Sep 2007, Michael Kerrisk wrote:
>
>> Davide,
>>
>> A further question: what is the expected behavior in the
>> following scenario:
>>
>> 1. Create a timerfd and arm it.
>> 2. Wait until M timer expirations have occurred
>> 3. Modify the settings of the timer
>> 4. Wait until N further timer expirations have occurred
>> 5. read() from the timerfd
>>
>> Does the buffer returned by the read() contain the value
>> N or (M+N)? In other words, should modifying the timer
>> settings reset the expiration count to zero?
>
> Every timerfd_settime() zeroes the tick counter. So in your scenario it'll
> return N.

Thanks Davide. I modified the first para of the read description to make this clear:

read(2)
    If the timer has already expired one or more times since its settings were last modified using timerfd_settime(), or since the last successful read(2), then the buffer given to read(2) returns an unsigned 8-byte integer (uint64_t) containing the number of expirations that have occurred.

(In the earlier version of the page the text talked about expirations "since the timer was created".)

Cheers,

Michael

--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance? Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
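The scenario discussed above can be tried from user space. The following is a minimal sketch, assuming a Linux system whose C library provides timerfd_create() and timerfd_settime(); the function name expirations_after_rearm() and the helper sleep_ms() are made up for this illustration, and the chosen intervals are arbitrary. It arms a 1 ms periodic timer, waits so that many expirations (M) accumulate, re-arms the timer as a one-shot (so N is 1), and then reads the count.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/timerfd.h>

/* Sleep for roughly the given number of milliseconds (sketch helper). */
static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/*
 * Returns the expiration count delivered by read(2) after the timer's
 * settings were modified, or 0 if any call fails.
 */
uint64_t expirations_after_rearm(void)
{
    /* Steps 1-2: periodic timer, 1 ms period, first expiry after 1 ms. */
    struct itimerspec periodic = {
        .it_interval = { .tv_sec = 0, .tv_nsec = 1000000L },
        .it_value    = { .tv_sec = 0, .tv_nsec = 1000000L },
    };
    /* Step 3: new settings, one-shot timer that fires once after 10 ms. */
    struct itimerspec oneshot = {
        .it_interval = { .tv_sec = 0, .tv_nsec = 0 },
        .it_value    = { .tv_sec = 0, .tv_nsec = 10000000L },
    };
    uint64_t count = 0;
    int fd;

    assert(sizeof(count) == 8);   /* read(2) delivers an 8-byte integer */

    fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd < 0)
        return 0;

    if (timerfd_settime(fd, 0, &periodic, NULL) < 0)
        goto out;
    sleep_ms(30);                 /* M expirations pile up, unread */

    if (timerfd_settime(fd, 0, &oneshot, NULL) < 0)
        goto out;
    sleep_ms(30);                 /* step 4: the one-shot expires once */

    /* Step 5: fetch the count accumulated since the last settime. */
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        count = 0;
out:
    close(fd);
    return count;
}
```

If timerfd_settime() zeroes the tick counter as Davide describes, the function returns N (here 1) rather than M+N. Using a one-shot timer for the second phase keeps the result independent of how long the second sleep actually takes.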
[patch] sched: fix sched-domains partitioning by cpusets
* Nick Piggin <[EMAIL PROTECTED]> wrote:

> BTW. as far as the sched.c changes in your patch go, I much prefer the
> partition_sched_domains API: http://lkml.org/lkml/2006/10/19/85
>
> The caller should manage everything itself, rather than
> partition_sched_domains doing half of the memory allocation.

i've merged your patch to my scheduler queue - see the patch below. (And could you send me your SoB line too?)

Paul, if we went with the patch below, what else would be needed for your purposes?

	Ingo

->
Subject: sched: fix sched-domains partitioning by cpusets
From: Nick Piggin <[EMAIL PROTECTED]>

Fix sched-domains partitioning by cpusets. Walk the whole cpusets tree after something interesting changes, and recreate all partitions.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/cpuset.h |    2
 include/linux/sched.h  |    3 -
 kernel/cpuset.c        |  109 ++---
 kernel/sched.c         |   31 +++--
 4 files changed, 70 insertions(+), 75 deletions(-)

Index: linux/include/linux/cpuset.h
===
--- linux.orig/include/linux/cpuset.h
+++ linux/include/linux/cpuset.h
@@ -14,6 +14,8 @@
 #ifdef CONFIG_CPUSETS
+extern int cpuset_hotplug_update_sched_domains(void);
+
 extern int number_of_cpusets;	/* How many cpusets are defined in system? */
 extern int cpuset_init_early(void);
Index: linux/include/linux/sched.h
===
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -798,8 +798,7 @@ struct sched_domain {
 #endif
 };
-extern int partition_sched_domains(cpumask_t *partition1,
-				    cpumask_t *partition2);
+extern int partition_sched_domains(cpumask_t *partition);
 #endif	/* CONFIG_SMP */
Index: linux/kernel/cpuset.c
===
--- linux.orig/kernel/cpuset.c
+++ linux/kernel/cpuset.c
@@ -752,6 +752,24 @@ static int validate_change(const struct
 	return 0;
 }
+static void update_cpu_domains_children(struct cpuset *par,
+					cpumask_t *non_partitioned)
+{
+	struct cpuset *c;
+
+	list_for_each_entry(c, &par->children, sibling) {
+		if (cpus_empty(c->cpus_allowed))
+			continue;
+		if (is_cpu_exclusive(c)) {
+			if (!partition_sched_domains(&c->cpus_allowed)) {
+				cpus_andnot(*non_partitioned,
+					*non_partitioned, c->cpus_allowed);
+			}
+		} else
+			update_cpu_domains_children(c, non_partitioned);
+	}
+}
+
 /*
  * For a given cpuset cur, partition the system as follows
  * a. All cpus in the parent cpuset's cpus_allowed that are not part of any
@@ -761,53 +779,38 @@ static int validate_change(const struct
  * Build these two partitions by calling partition_sched_domains
  *
  * Call with manage_mutex held. May nest a call to the
- * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
- * Must not be called holding callback_mutex, because we must
- * not call lock_cpu_hotplug() while holding callback_mutex.
+ * lock_cpu_hotplug()/unlock_cpu_hotplug() pair. Must not be called holding
+ * callback_mutex, because we must not call lock_cpu_hotplug() while holding
+ * callback_mutex.
  */
-static void update_cpu_domains(struct cpuset *cur)
+static void update_cpu_domains(void)
 {
-	struct cpuset *c, *par = cur->parent;
-	cpumask_t pspan, cspan;
-
-	if (par == NULL || cpus_empty(cur->cpus_allowed))
-		return;
+	cpumask_t non_partitioned;
-	/*
-	 * Get all cpus from parent's cpus_allowed not part of exclusive
-	 * children
-	 */
-	pspan = par->cpus_allowed;
-	list_for_each_entry(c, &par->children, sibling) {
-		if (is_cpu_exclusive(c))
-			cpus_andnot(pspan, pspan, c->cpus_allowed);
-	}
-	if (!is_cpu_exclusive(cur)) {
-		cpus_or(pspan, pspan, cur->cpus_allowed);
-		if (cpus_equal(pspan, cur->cpus_allowed))
-			return;
-		cspan = CPU_MASK_NONE;
-	} else {
-		if (cpus_empty(pspan))
-			return;
-		cspan = cur->cpus_allowed;
-		/*
-		 * Get all cpus from current cpuset's cpus_allowed not part
-		 * of exclusive children
-		 */
-		list_for_each_entry(c, &cur->children, sibling) {
-			if (is_cpu_exclusive(c))
-				cpus_andnot(cspan, cspan, c->cpus_allowed);
-		}
-	}
+	BUG_ON(!mutex_is_locked(&manage_mutex));
 	lock_cpu_hotplug();
-	partition_sched_domains(&
Re: [PATCH] mark read_crX() asm code as volatile
Nick Piggin wrote:

This should work because the result gets used before reading again:

read_cr3(a);
write_cr3(a | 1);
read_cr3(a);

But this might be reordered so that b gets read before the write:

read_cr3(a);
write_cr3(a | 1);
read_cr3(b);

?

I don't see how, as write_cr3 clobbers memory.

Because read_cr3() doesn't depend on memory, and b could be stored in a register.

	-hpa
Re: [PATCH] Fix SH DMAC code to handle PVR2 cascade
On Tue, Oct 02, 2007 at 10:09:27PM +0100, Adrian McMenamin wrote: > Fix SH DMAC code to correctly handle PVR2 cascade DMA. > > This updates http://lkml.org/lkml/2007/10/2/276 > > (I decided it was better to have the true size of the transfer put in > via the API and refactor this here. And calc_xmit_shift(chan) should > return 5 but only returns 3 so I've not used it here) > It would be helpful to know why calc_xmit_shift() is broken here rather than just coding around it, as this will have implications for the other DMA channels on SH7091/SH7750. Now that you've completely bypassed the rest of the SH-DMAC ->xfer_dma() op, it's clear that the existing infrastructure needs a bit of rework for dealing with the cascaded DMACs (especially for single-address mode only, unidirectionally). It would be nice to get the mach-specific kludges for cascade out of dma-sh entirely. This can certainly be fixed for 2.6.24, though a larger overhaul is 2.6.25 material at this point.
Re: [PATCH] cpuset and sched domains: sched_load_balance flag
On Tuesday 02 October 2007 04:15, Paul Jackson wrote: > Nick wrote: > > which you could equally achieve by adding > > a second set of sched domains (and the global domains could keep > > globally balancing). > > Hmmm ... this could be the key to this discussion. > > Nick - can two sched domains overlap? And if they do, what does that > mean on any user or application behaviour. Yes, sched domains can be completely arbitrary, and of course in the current kernel, parent domains always overlap their children. A sched domain usually means that the scheduler can move tasks around among that group of CPUs, given the correct flags (but if there are no flags, then it would be a superfluous domain and should get trimmed away I think). BTW. as far as the sched.c changes in your patch go, I much prefer the partition_sched_domains API: http://lkml.org/lkml/2006/10/19/85 The caller should manage everything itself, rather than partition_sched_domains doing half of the memory allocation. > From the cpuset side - this patch handles overlap by joining the 'cpus' > into one sched domain. If two cpusets with overlapping 'cpus' are both > marked 'sched_load_balance', then this patch forms a single, combined > sched domain. > > As best as I can tell, you and I are actually in agreement in the > case that there is no overlap. If the several cpusets which have > 'sched_load_balance' enabled have mutually disjoint 'cpus' (no > overlap), then my patch forms exactly one sched domain for each such > cpuset, having the same 'cpus'. OK, I don't think your patch actually does the wrong thing technically (although admittedly your rebuild_sched_domains isn't something I really applied my poor brain to). > The issue is the overlapping cases - are overlapping sched domains > allowed, and if so, how do they affect user space? For hard partitions, you don't want them of course. And I think we should come up with a cpusets solution for that first. 
Afterwards, overlapping sched domains are allowed and could be used to make balancing more efficient (rather than any real effect on userspace). At the moment, the domain builder probably wouldn't cope very well, though.
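The overlap handling Paul describes (overlapping 'sched_load_balance' cpusets joined into one sched domain, disjoint ones each getting their own) can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the patch's rebuild_sched_domains code: CPU sets are 64-bit masks, and struct cpuset_model and build_domains() are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cpuset: a CPU bitmask plus the proposed flag. */
struct cpuset_model {
    uint64_t cpus;            /* bit i set => CPU i is in this cpuset */
    int sched_load_balance;   /* nonzero => balancing requested over cpus */
};

/*
 * Merge the masks of all cpusets that request load balancing into
 * pairwise-disjoint domains. Masks are written to `out`, which must have
 * room for n entries; the number of domains is returned.
 */
size_t build_domains(const struct cpuset_model *sets, size_t n, uint64_t *out)
{
    size_t ndoms = 0;

    assert(sets != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        uint64_t merged;
        size_t k = 0;

        /* Cpusets with the flag cleared (or no CPUs) contribute nothing. */
        if (!sets[i].sched_load_balance || sets[i].cpus == 0)
            continue;

        merged = sets[i].cpus;
        while (k < ndoms) {
            if (out[k] & merged) {
                /* Overlap: absorb domain k and drop it from the list. */
                merged |= out[k];
                out[k] = out[--ndoms];
                /* Do not advance k: the swapped-in entry is unchecked. */
            } else {
                k++;
            }
        }
        out[ndoms++] = merged;
    }
    return ndoms;
}
```

In this model, a root cpuset spanning every CPU with the flag set absorbs all of its children into a single domain. Clearing the flag on the root while leaving it set on disjoint children yields one domain per child, which corresponds to the hard-partition setup discussed in this thread.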
Re: wibbling over the cpuset shed domain connnection
On Wednesday 03 October 2007 15:21, Paul Jackson wrote: > > In the meantime, that patch should be merged though, shouldn't it? > > Which patch do you refer to: > 1) the year old patch to disconnect cpusets and sched domains: > cpuset-remove-sched-domain-hooks-from-cpusets.patch > 2) my patch of a few days ago to add a 'sched_load_balance' flag: > cpuset and sched domains: sched_load_balance flag The one quoted, of course. > I can't push one without the other, because some real time folks are > depending on the sched domain hooks that (1) would remove, so need some > alternative, such as in (2). Even though (1) is rather broken, as you > note, it still provides a way that the real time folks can disable load > balancing at runtime on selected CPUs, so is essential to their work. OK. > I can't delay any more resolving this, because the cgroup (aka > container) code is tangled up with (1), and Andrew needs a clear path > to send cgroups to Linus real soon now. If code isn't ready to go, it doesn't need to rush, it can just be untangled or fixed properly etc. > In my last message to you, a couple of days ago, I asked what I thought > were a couple of key and simple questions -- can sched domains overlap, > and what does it mean for user space if they overlap? A further > question comes to mind now -- if sched domains can overlap, does this > provide some capability to user space that is important to provide? > > Could you take a minute, Nick, to consider these questions? Thanks. Yeah, it arrived after I had a 24 hour flight. I just see it now.
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
Linus Torvalds wrote: > Security, on the other hand, very much does depend on the circumstances > and the wishes of the users (or policy-makers). And if we had one module > that everybody would be happy with, I'd not make it pluggable either. But > as it is, we _know_ that's not the case. > And you claim you are not a security expert :-) Crispin -- Crispin Cowan, Ph.D. http://crispincowan.com/~crispin/ Itanium. Vista. GPLv3. Complexity at work
Re: [PATCH] cpuset and sched domains: sched_load_balance flag
On Monday 01 October 2007 13:42, Paul Jackson wrote:
> Nick wrote:
> > Moreover, sched_load_balance doesn't really sound like a good name
> > for asking for a partition.
>
> Yup - it's not a good name for asking for a partition.
>
> That's because it isn't asking for a partition.
>
> It's asking for load balancing over the CPUs in the cpuset so marked.

Yeah yeah OK, you turn it off in the parent cpuset of the child cpusets which you want the partitioning to occur in, and ensure there are no other overlapping cpusets with that flag turned on in order to create a hard partition. I don't think this makes the API any nicer.

> > It's more like you're just asking to have better
> > load balancing over that set,
>
> Yup - it's asking for load balancing over that set. That is why it is
> called that. There's no idea here of better or worse load balancing,
> that's an internal kernel scheduler subtlety -- it's just a request that
> load balancing be done.

OK, if it prohibits balancing when sched_load_balance is 0, then it is slightly more useful.

> That is what is visible to user space: whether or not tasks get moved
> from overloaded CPUs to underloaded, though still allowed, CPUs.
>
> This is visible to user space in two ways:
> 1) as task movement, which may or may not be what is desired, and
> 2) as kernel CPU cycles spent, because load balancing costs CPU cycles
> that increase more than linearly with the number of CPUs being
> balanced.
>
> The user doesn't give a hoot what a 'sched domain' is. They care to
> manage (1) whether their tasks might move under a load imbalance, and
> (2) how many CPU cycles the kernel spends providing this service.

Yeah, but the interface is not very nice. As an interface for hard partitioning, it doesn't work nicely because it is hierarchical.

> > You would do this by creating partitioning cpusets which carve up the
> > root cpuset (basically -- have multiple roots).
>
> You would do this with the current, single rooted cpuset (and now
> cgroup) mechanism by having multiple immediate child cpusets of the
> root cpuset, which partition the system CPUs. There is no need to
> invent some bastardized multiple root structure.

What do you mean by bastardized? What's wrong with having a real (and sane) representation of the requested hard-partitions in the system?

> > You can't (easily) do this now because you have so many tasks in the
> > root cpuset that it is impossible to know whether or not you
> > actually want to load balance them.
>
> I don't know what proposal you are reacting to here. Clearly not this
> patch that I have proposed, as it is trivially easy to indicate whether
> you want to load balance the root cpuset - by setting or clearing the
> 'sched_load_balance' flag in the root cpuset.

Not your proposal, just the idea to have enough information to be able to work out a more optimal set of sched-domains automatically. Actually we can do most of it already automatically, but not hard partitioning.

[snip]

As I said, neither is really semantically more powerful than the other. So yeah those things are possible to do with your API, but I don't like the API.
More ext3 panics on 2.6.22 [Fwd: ext3_ordered_writepage panic on shiny new 2.6.22]
Yes i know my kernel is tainted with vmblock and nvidia, but i'm not convinced it is related. I'm putting the panics here if anyone interested. Oct 2 20:08:39 home kernel: Assertion failure in journal_unmap_buffer() at fs/jbd/transaction.c:1886: "!buffer_jbddirty(bh)" Oct 2 20:08:39 home kernel: [ cut here ] Oct 2 20:08:39 home kernel: kernel BUG at fs/jbd/transaction.c:1886! Oct 2 20:08:39 home kernel: invalid opcode: [#1] Oct 2 20:08:39 home kernel: PREEMPT SMP Oct 2 20:08:39 home kernel: Modules linked in: iuu_phoenix cdc_acm nvidia(P) vmnet(P) parport_pc parport vmblock(P) vmmon(P) udf isofs nls_iso8859_1 nls_cp43 7 vfat fat appletalk psnap llc nfsd exportfs lockd sunrpc ftdi_sio usbserial uhci_hcd ohci_hcd i2c_nforce2 forcedeth usblp snd_hda_intel snd_seq_oss snd_seq_m idi_event snd_seq snd_seq_device snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd usb_storage it87 hwmon_vid i2c_isa i2c_dev i2c_core Oct 2 20:08:39 home kernel: CPU:0 Oct 2 20:08:39 home kernel: EIP: 0060:[journal_invalidatepage+505/1088]Tainted: P VLI Oct 2 20:08:39 home kernel: EFLAGS: 00210296 (2.6.22 #3) Oct 2 20:08:39 home kernel: EIP is at journal_invalidatepage+0x1f9/0x440 Oct 2 20:08:39 home kernel: eax: 0064 ebx: d53fc770 ecx: c03e8a5c edx: 0022 Oct 2 20:08:39 home kernel: esi: 6f72 edi: f6e14a00 ebp: d53fc770 esp: dfe63e4c Oct 2 20:08:39 home kernel: ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 Oct 2 20:08:39 home kernel: Process rm (pid: 3813, ti=dfe62000 task=c672da90 task.ti=dfe62000) Oct 2 20:08:39 home kernel: Stack: c03a44c4 c035850a c03a2ef0 075e c03a2ff3 f6e14adc f6e14a14 Oct 2 20:08:39 home kernel:c2092360 0001 0001 000a f1ac0964 c01a8b80 Oct 2 20:08:39 home kernel:6f72 000b c014e116 c2092360 c014e425 c2092360 c014e514 Oct 2 20:08:39 home kernel: Call Trace: Oct 2 20:08:39 home kernel: [ext3_invalidatepage+0/48] ext3_invalidatepage+0x0/0x30 Oct 2 20:08:39 home kernel: [do_invalidatepage+22/32] do_invalidatepage+0x16/0x20 Oct 2 20:08:39 home kernel: 
[truncate_complete_page+69/80] truncate_complete_page+0x45/0x50 Oct 2 20:08:39 home kernel: [truncate_inode_pages_range+228/720] truncate_inode_pages_range+0xe4/0x2d0 Oct 2 20:08:39 home kernel: [truncate_inode_pages+23/32] truncate_inode_pages+0x17/0x20 Oct 2 20:08:39 home kernel: [ext3_delete_inode+19/208] ext3_delete_inode+0x13/0xd0 Oct 2 20:08:39 home kernel: [ext3_delete_inode+0/208] ext3_delete_inode+0x0/0xd0 Oct 2 20:08:39 home kernel: [generic_delete_inode+94/208] generic_delete_inode+0x5e/0xd0 Oct 2 20:08:39 home kernel: [iput+92/112] iput+0x5c/0x70 Oct 2 20:08:39 home kernel: [do_unlinkat+239/336] do_unlinkat+0xef/0x150 Oct 2 20:08:39 home kernel: [irq_exit+91/144] irq_exit+0x5b/0x90 Oct 2 20:08:39 home kernel: [smp_apic_timer_interrupt+87/144] smp_apic_timer_interrupt+0x57/0x90 Oct 2 20:08:39 home kernel: [sysenter_past_esp+95/133] sysenter_past_esp+0x5f/0x85 Oct 2 20:08:39 home kernel: === Oct 2 20:08:39 home kernel: Code: 44 24 10 f3 2f 3a c0 c7 44 24 0c 5e 07 00 00 c7 44 24 08 f0 2e 3a c0 c7 44 24 04 0a 85 35 c0 c7 04 24 c4 44 3a c0 e8 b7 a9 f6 ff <0f> 0b eb fe 8b 74 24 1c 85 f6 0f 85 22 fe ff ff 8b 5c 24 28 85 Oct 2 20:08:39 home kernel: EIP: [journal_invalidatepage+505/1088] journal_invalidatepage+0x1f9/0x440 SS:ESP 0068:dfe63e4c And later... 
Oct 3 03:01:51 home kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address Oct 3 03:01:51 home kernel: printing eip: Oct 3 03:01:51 home kernel: c01bbbf0 Oct 3 03:01:51 home kernel: *pdpt = 345fd001 Oct 3 03:01:51 home kernel: *pde = Oct 3 03:01:51 home kernel: Oops: 0002 [#2] Oct 3 03:01:51 home kernel: PREEMPT SMP Oct 3 03:01:51 home kernel: Modules linked in: iuu_phoenix cdc_acm nvidia(P) vmnet(P) parport_pc parport vmblock(P) vmmon(P) udf isofs nls_iso8859_1 nls_cp43 7 vfat fat appletalk psnap llc nfsd exportfs lockd sunrpc ftdi_sio usbserial uhci_hcd ohci_hcd i2c_nforce2 forcedeth usblp snd_hda_intel snd_seq_oss snd_seq_m idi_event snd_seq snd_seq_device snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd usb_storage it87 hwmon_vid i2c_isa i2c_dev i2c_core Oct 3 03:01:51 home kernel: CPU:0 Oct 3 03:01:51 home kernel: EIP: 0060:[journal_grab_journal_head+16/144]Tainted: P VLI Oct 3 03:01:51 home kernel: EFLAGS: 00010202 (2.6.22 #3) Oct 3 03:01:51 home kernel: EIP is at journal_grab_journal_head+0x10/0x90 Oct 3 03:01:51 home kernel: eax: f7c7e000 ebx: ecx: f6f6c8dc edx: c3492360 Oct 3 03:01:51 home kernel: esi: edi: ebp: d862a534 esp: f7c7fe48 Oct 3 03:01:51 home kernel: ds: 007b es: 007b fs: 00d8 gs: ss: 0068 Oct 3 03:01:51 home kernel: Process kswapd0 (pid: 202, ti=f7c7e000 task=c
Re: wibbling over the cpuset shed domain connnection
> In the meantime, that patch should be merged though, shouldn't it? Which patch do you refer to: 1) the year old patch to disconnect cpusets and sched domains: cpuset-remove-sched-domain-hooks-from-cpusets.patch 2) my patch of a few days ago to add a 'sched_load_balance' flag: cpuset and sched domains: sched_load_balance flag I can't push one without the other, because some real time folks are depending on the sched domain hooks that (1) would remove, so need some alternative, such as in (2). Even though (1) is rather broken, as you note, it still provides a way that the real time folks can disable load balancing at runtime on selected CPUs, so is essential to their work. I can't delay any more resolving this, because the cgroup (aka container) code is tangled up with (1), and Andrew needs a clear path to send cgroups to Linus real soon now. In my last message to you, a couple of days ago, I asked what I thought were a couple of key and simple questions -- can sched domains overlap, and what does it mean for user space if they overlap? A further question comes to mind now -- if sched domains can overlap, does this provide some capability to user space that is important to provide? Could you take a minute, Nick, to consider these questions? Thanks. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: File corruption when using kernels 2.6.18+
Hi Neil, On 10/3/07, Neil Romig <[EMAIL PROTECTED]> wrote: > Thanks for your help on this. I have narrowed it down to commit > "c22ce143d15eb288543fe9873e1c5ac1c01b69a1 x86: cache pollution aware > __copy_from_user_ll()". This fits with the errors I'm getting, so now I need > to find out if I can safely ignore this patch, or does it have to be modified? > This is my first Linux bug in many years of simply using it, so I'm a little > nervous! Just to make sure, if you disable CONFIG_X86_INTEL_USERCOPY, the corruption goes away?
Re: [PATCH] Version 4 (2.6.23-rc8-mm2) Smack: Simplified Mandatory Access Control Kernel
On Tue, Oct 02, 2007 at 09:45:42PM -0700, Casey Schaufler wrote: > > From: Casey Schaufler <[EMAIL PROTECTED]> > > Smack is the Simplified Mandatory Access Control Kernel. > > Smack implements mandatory access control (MAC) using labels > attached to tasks and data containers, including files, SVIPC, > and other tasks. Smack is a kernel based scheme that requires > an absolute minimum of application support and a very small > amount of configuration data. I _really_ don't like what you are doing with these symlinks. For one thing, you have no exclusion between reading the list entries and modifying them. For another... WTF is filesystem making assumptions about the locations where the things are mounted? Hell, even if you override your tmp symlink, what happens if we want it in two chroot jails with different layouts? I really don't get it; why not simply have something like /smack/tmp.link resolve to tmp/ and have userland bind or mount whatever you bloody like on /smack/tmp? No problems with absolute paths, can be used in chroot jails with whatever layouts, ditto for namespaces, etc. and both symlink and directory get created at the same time (by one name). Hell, if you keep a reference to dentry of directory in the data associated with symlink, you can simply switch nd->dentry to that, drop the old one and grab the reference to page containing label and return it via nd_set_link(). No need to play with allocations, strcat, yadda, yadda. readlink() can stuff the ->d_name of the same dentry plus / plus label directly into user buffer; again, no allocations needed and works fine anywhere.
Re: wibbling over the cpuset shed domain connnection
On Tuesday 02 October 2007 07:34, Paul Jackson wrote: > In -mm merge plans for 2.6.24, Andrew wrote: > > cpuset-remove-sched-domain-hooks-from-cpusets.patch > > > > Paul continues to wibble over this. Hold, I guess. > > Oh dear ... after looking at the following to figure out what > a wibble is, I wonder which one Andrew had in mind: > > http://www.urbandictionary.com/define.php?term=wibble > > The insanity, the rubbish, being overwhelmed, ... ? > > > > If one of Nick or I can knock some sense into the others head, > then this saga should come to a close soon. In the meantime, that patch should be merged though, shouldn't it? cpusets is currently telling the scheduler to do the wrong thing WRT the user interface definition of cpusets, right?
Re: [PATCH] sky2: jumbo frame regression fix
On Wed, 03 Oct 2007 03:34:34 +0200 Ian Kumlien <[EMAIL PROTECTED]> wrote: > On tis, 2007-10-02 at 18:02 -0700, Stephen Hemminger wrote: > > Remove unneeded check that caused problems with jumbo frame sizes. > > The check was recently added and is wrong. > > When using jumbo frames the sky2 driver does fragmentation, so > > rx_data_size is less than mtu. > > Confirmed working. > > Now running with 9k mtu with no errors, =) > > It also seems that the FIFO bug was the one that affected me before, > damn odd race that one. Does the workaround (forced reset) work? Ian, you are the first person to report triggering it. I haven't found a way to make it happen. What combination of flow control and speeds are you using? -- Stephen Hemminger <[EMAIL PROTECTED]>
Re: [PATCH] sky2: jumbo frame regression fix
Stephen Hemminger wrote: On Tue, 02 Oct 2007 21:07:22 -0400 Jeff Garzik <[EMAIL PROTECTED]> wrote: Stephen Hemminger wrote: Remove unneeded check that caused problems with jumbo frame sizes. The check was recently added and is wrong. When using jumbo frames the sky2 driver does fragmentation, so rx_data_size is less than mtu. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- a/drivers/net/sky2.c2007-10-02 17:56:31.0 -0700 +++ b/drivers/net/sky2.c2007-10-02 17:58:56.0 -0700 @@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending; prefetch(sky2->rx_ring + sky2->rx_next); - if (length < ETH_ZLEN || length > sky2->rx_data_size) - goto len_error; - 2.6.23? 2.6.24? enquiring minds want to know... 2.6.23, since it is a regression You can have regressions in behavior in net-2.6.24.git, too. _Please_ be specific about where you want your patches to go. Thanks. Jeff - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Question] How to represent SYSTEM_RAM in kernel/resource.c
On Tue, 2 Oct 2007 19:52:42 -0600 Matthew Wilcox <[EMAIL PROTECTED]> wrote: > On Wed, Oct 03, 2007 at 10:31:36AM +0900, KAMEZAWA Hiroyuki wrote: > > i386 and x86_64 registers System RAM as IORESOURCE_MEM | IORESOURCE_BUSY. > > ia64 registers System RAM as IORESOURCE_MEM. > > > > Which is better ? > > Should probably be BUSY. Non-BUSY regions can have io resources > requested underneath them, but you wouldn't want a PCI device to be > assigned an address which overlaps with physical memory. Thank you. It seems that I'll have to try modifying ia64 and memory hotplug in the next -mm. Regards, -Kame
Re: [PATCH] sky2: jumbo frame regression fix
On Tue, 02 Oct 2007 21:07:22 -0400 Jeff Garzik <[EMAIL PROTECTED]> wrote: > Stephen Hemminger wrote: > > Remove unneeded check that caused problems with jumbo frame sizes. > > The check was recently added and is wrong. > > When using jumbo frames the sky2 driver does fragmentation, so > > rx_data_size is less than mtu. > > > > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> > > > > --- a/drivers/net/sky2.c2007-10-02 17:56:31.0 -0700 > > +++ b/drivers/net/sky2.c2007-10-02 17:58:56.0 -0700 > > @@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru > > sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending; > > prefetch(sky2->rx_ring + sky2->rx_next); > > > > - if (length < ETH_ZLEN || length > sky2->rx_data_size) > > - goto len_error; > > - > > 2.6.23? 2.6.24? enquiring minds want to know... 2.6.23, since it is a regression -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Document x86-64 iommu kernel parameters
Jeff Garzik wrote: Randy Dunlap wrote: On Tue, 2 Oct 2007 21:34:13 -0400 Jeff Garzik wrote: Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> --- After having to go figure out what some of these means, I figured I would save others the trouble. Some of these are "best guess" based on a quick scan of the code, so it certainly needs a sanity review before going upstream. "iommu" is listed in Documentation/x86_64/boot-options.txt along with more x86_64-specific boot options. A few other arches do something similar... Ah! Well, seeing as how we already have a provision for arch-specific options in kernel-parameters.txt, and some less-obscure arch-specific options can be found there, I think an argument can be made for my patch :) Nonethless, if the maintainer disagrees, they can drop this patch I suppose. or maybe during the x86 merge, we can merge the docs also... -- ~Randy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On Tue, 2 Oct 2007, Bill Davidsen wrote: > > Unfortunately not so, I've been looking at schedulers since MULTICS, and > desktops since the 70s (MP/M), and networked servers since I was the ARPAnet > technical administrator at GE's Corporate R&D Center. And on desktops response > is (and should be king), while on a server, like nntp or mail, I will happily > go from 1ms to 10sec for a message to pass through the system if only I can > pass 30% more messages per hour, because in virtually all cases transit time > in that range is not an issue. Same thing for DNS, LDAP, etc, only smaller > time range. If my goal is <10ms, I will not sacrifice capacity to do it. Bill, that's a *tuning* issue, not a scheduler logic issue. You can do that today. The scheduler has always had (well, *almost* always: I think the really really original one didn't) tuning knobs. It in no way excuses any "pluggable scheduler", because IT DOES NOT CHANGE THE PROBLEM. [ Side note: not only doesn't it change the problem, but a good scheduler tunes itself rather naturally for most things. In particular, for things that really are CPU-limited, the scheduler should be able to notice that, and will not aim for latency to the same degree. In fact, what is really important is that the scheduler notice that some programs are latency-critical AT THE SAME TIME as other programs sharing that CPU are not, which very much implies that you absolutely MUST NOT have a scheduler that does one or the other: it needs to know about *both* behaviors at the same time. IOW, it is very much *not* about multiple different "pluggable modules", because the scheduler must be able to work *across* these kinds of barriers. ] So for example, with the current scheduler, you can actually set things like scheduler latency. Exactly so you can tune things.
However, I actually would argue that you generally shouldn't need to, and if you really do need to, and it's a huge deal for a real load (and not just a few percent for a benchmark), we should consider that a scheduler problem. So your "argument" is nonsense. You're arguing for something else than what you _claim_ to be arguing for. What you state that you want actually has nothing what-so-ever to do with pluggable schedulers, quite the reverse! It's also totally incorrect to state that this is somehow intrinsically a feature of a "server load". Many server loads have very real latency constraints. No, not the traditional UNIX loads of SMTP and NNTP, but in many loads the latency guarantees are a rather important part of it, and you'll have benchmarks that literally test how high the load can be until latency reaches some intolerable value - ie latency ends up being the critical part. There's also a meta-development issue here: I can state with total conviction that historically, if we had had a "server scheduler" and a "desktop scheduler", we'd have been in much worse shape than we are now. Not only are a lot of the loads the same or at least similar (and aiming for _one_ scheduler - especially one that auto-tunes itself at least to some degree - gets you lots of good testing), but the hardware situation changes. For example, even just five years ago, there would have been people who thought that multiprocessing is a server load - and they'd have been largely right at the time. Would you have wanted a "server" (SMP, screw latency) scheduler, a "workstation" (SMP but low-latency) scheduler and a "desktop" (UP) scheduler for the different cases? Because yes, SMP does impact the scheduler a lot... The locking, the migration between CPU's, the CPU affinity.. Things that gamers five years ago would have felt was just totally screwing them over and making the scheduler slower and more complex "for no gain". See? Pluggable things are generally a *bad* thing.
You should generally aim for *never* being pluggable if you can at all avoid it, because it not only fragments the developer base over totally different code bases, it generates unmaintainable decisions as the problem space evolves. To get back to security: I didn't want pluggable security because I thought that was a technically good solution. No, the reason Linux has LSM (and yes, I was the one who pushed hard for the whole thing, even if I didn't actually write any of it) was because the problem wasn't technical to begin with. It was social/political and administrative. See? Another fundamental difference between schedulers and security modules. > > I don't know who came up with it, or why people continue to feed the insane > > ideas. Why do people think that servers don't care about latency? > > Because people who run servers for a living, and have to live with limited > hardware capacity realize that latency isn't the only issue to be addressed, > and that the policy for degradation of latency vs. throughput may be very > different on one server than another or a desktop.
Re: [PATCH] mark read_crX() asm code as volatile
On Wednesday 03 October 2007 04:27, Chuck Ebbert wrote: > On 10/02/2007 11:28 AM, Arjan van de Ven wrote: > > On Tue, 02 Oct 2007 18:08:32 +0400 > > > > Kirill Korotaev <[EMAIL PROTECTED]> wrote: > >> Some gcc versions (I checked at least 4.1.1 from RHEL5 & 4.1.2 from > >> gentoo) can generate incorrect code with read_crX()/write_crX() > >> functions mix up, due to cached results of read_crX(). > > > > I'm not so sure volatile is the right answer, as compared to giving the > > asm more strict constraints > > > > asm volatile tends to mean something else than "the result has > > changed" > > It means "don't eliminate this code if it's reachable" which should be > just enough for this case. But it could still be reordered in some cases > that could break, I think. > > This should work because the result gets used before reading again: > > read_cr3(a); > write_cr3(a | 1); > read_cr3(a); > > But this might be reordered so that b gets read before the write: > > read_cr3(a); > write_cr3(a | 1); > read_cr3(b); > > ? I don't see how, as write_cr3 clobbers memory.
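To make the point about the memory clobber concrete, here is a minimal user-space sketch. It is not the kernel's read_cr3()/write_cr3(): fake_cr3 and all three helpers are hypothetical stand-ins, the asm bodies are empty so the code runs unprivileged, and it assumes GCC/Clang extended asm. It only shows how "asm volatile" plus a "memory" clobber keeps the compiler from reusing a stale value or moving the second read ahead of the write.

```c
/* Hypothetical stand-in for a control register; the real CR3 needs ring 0. */
static unsigned long fake_cr3 = 0x1000;

static inline unsigned long read_fake_cr3(void)
{
	unsigned long val = fake_cr3;

	/*
	 * volatile: the statement is not deleted or merged with another one.
	 * "memory": the compiler must assume memory changed here, so a later
	 * load of fake_cr3 cannot be satisfied from an earlier cached value.
	 */
	__asm__ __volatile__("" : "+r" (val) : : "memory");
	return val;
}

static inline void write_fake_cr3(unsigned long val)
{
	fake_cr3 = val;
	/* The clobber also stops memory accesses from moving across the write. */
	__asm__ __volatile__("" : : "r" (val) : "memory");
}

/* Mirrors the sequence from the thread: read, write (a | 1), read again. */
static unsigned long reread_after_write(void)
{
	unsigned long a = read_fake_cr3();

	write_fake_cr3(a | 1);
	return read_fake_cr3();
}
```

With both helpers carrying the clobber, the second read in reread_after_write() has to observe the value stored by the write, which is the argument made in the reply above.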
Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)
On Tuesday 02 October 2007 21:40, Peter Zijlstra wrote: > On Tue, 2007-10-02 at 13:21 +0200, Kay Sievers wrote: > > How about adding this information to the tree then, instead of > > creating a new top-level hack, just because something that you think > > you need doesn't exist. > > So you suggest adding all the various network filesystems in there > (where?), and adding the concept of a BDI, and ensuring all are properly > linked together - somehow. Feel free to do so. Would something fit better under /sys/fs/? At least filesystems are already an existing concept to userspace.
[PATCH] fix bogus reporting of signals by audit
Async signals should not be reported as sent by current in audit log. As it is, we call audit_signal_info() too early in check_kill_permission(). Note that check_kill_permission() has that test already - it needs to know if it should apply current-based permission checks. So the solution is to move the call of audit_signal_info() between those.

Bogosity in question is easily reproduced - add a rule watching for e.g. kill(2) from specific process (so that audit_signal_info() would not short-circuit to nothing), say load_policy, watch the bogus OBJ_PID entry in audit logs claiming that write(2) on selinuxfs file issued by load_policy(8) had somehow managed to send a signal to syslogd...

Signed-off-by: Al Viro <[EMAIL PROTECTED]>
Acked-by: Steve Grubb <[EMAIL PROTECTED]>
Acked-by: Eric Paris <[EMAIL PROTECTED]>
---
diff --git a/kernel/signal.c b/kernel/signal.c
index 9fb91a3..7929523 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -531,18 +531,18 @@ static int check_kill_permission(int sig, struct siginfo *info,
 	if (!valid_signal(sig))
 		return error;
 
-	error = audit_signal_info(sig, t); /* Let audit system see the signal */
-	if (error)
-		return error;
-
-	error = -EPERM;
-	if ((info == SEND_SIG_NOINFO || (!is_si_special(info) && SI_FROMUSER(info)))
-	    && ((sig != SIGCONT) ||
-		(process_session(current) != process_session(t)))
-	    && (current->euid ^ t->suid) && (current->euid ^ t->uid)
-	    && (current->uid ^ t->suid) && (current->uid ^ t->uid)
-	    && !capable(CAP_KILL))
+	if (info == SEND_SIG_NOINFO || (!is_si_special(info) && SI_FROMUSER(info))) {
+		error = audit_signal_info(sig, t); /* Let audit system see the signal */
+		if (error)
+			return error;
+		error = -EPERM;
+		if (((sig != SIGCONT) ||
+			(process_session(current) != process_session(t)))
+		    && (current->euid ^ t->suid) && (current->euid ^ t->uid)
+		    && (current->uid ^ t->suid) && (current->uid ^ t->uid)
+		    && !capable(CAP_KILL))
 		return error;
+	}
 
 	return security_task_kill(t, info, sig, 0);
 }
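The effect of moving the audit call can be modelled in a few lines of user-space C. This is only a sketch of the control flow described in the changelog: struct sig_origin, audit_records and both functions are hypothetical names, and the real uid, session and capability tests are collapsed into a single flag.

```c
#include <stdbool.h>

/* Hypothetical counter standing in for records written to the audit log. */
static int audit_records;

struct sig_origin {
	bool from_user;	/* models SEND_SIG_NOINFO or SI_FROMUSER(info) */
	bool permitted;	/* models the uid/session/CAP_KILL checks passing */
};

static int model_audit_signal_info(void)
{
	audit_records++;
	return 0;
}

/* Ordering after the patch: audit only signals that current really sent. */
static int model_check_kill_permission(const struct sig_origin *o)
{
	if (o->from_user) {
		int error = model_audit_signal_info();

		if (error)
			return error;
		if (!o->permitted)
			return -1;	/* stands in for -EPERM */
	}
	return 0;	/* stands in for security_task_kill() succeeding */
}
```

In this model an asynchronous (kernel-generated) signal never reaches the audit hook, so it cannot be attributed to whatever task happens to be current.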
My kernel fails with kexec
Greetings, I have a kernel (which is not the Linux kernel) and want to make it work with kexec. That means I want kexec to boot my kernel. Unfortunately, kexec crashes when booting it (with the kexec -e command). My suspicion is that my kernel is not written to "support" kexec. So could anybody tell me what a kernel requires in order to work with kexec? "kexec -e" spits out the message below in QEMU when booting my kernel. Any hints on where the problem lies and how to debug it? Thanks, Jun

(qemu) qemu: fatal: triple fault EAX=1500 EBX=001001f1 ECX= EDX=4081 ESI=002ae8c0 EDI=002ac8c0 EBP=e098 ESP=8ffe EIP=0d05 EFL=0002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0018 00cf9300 CS =4020 00040200 008f9f00 SS =4000 0004 9300 DS =0018 00cf9300 FS =0018 00cf9300 GS =0018 00cf9300 LDT= 8000 TR =0080 c1fd7000 2073 c10089fd GDT= 0040a938 0017 IDT= c000 CR0=0010 CR2=b7ed6200 CR3= CR4= CCS=8000 CCD=4000 CCO=SARL FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80 FPR0= FPR1= FPR2= FPR3= FPR4= FPR5= FPR6=f424 4012 FPR7= XMM00= XMM01= XMM02= XMM03= XMM04= XMM05= XMM06= XMM07=
Re: kdump info request
On Mon, Oct 01, 2007 at 09:31:45AM -0700, Randy Dunlap wrote: > On Mon, 1 Oct 2007 09:35:04 -0600 Mukker, Atul wrote: > > > Thanks for the information and the effort. > > > > We need to support all currently shipping products with kdump support > > available (read Red Hat and SuSE) so the sooner it makes it into the > > upstream kernel the better it is. > > > > So, today there is no alternative way to find out if the driver is being > > loaded under kdump restrictive environment? > > > > Thanks > > -Atul > > I'm not the right person to answer that, but going forward, it would > be nice to have that information and it would be good to do correctly. > > I think that scanning is not actually the good/right > way to do this. It should just be a flag somewhere, and the flag should > be available in all (future) kernels (and likely easily backportable > as well, if that matters), meaning those built without kdump support. > > But it's still up to the kexec developers... > > Hi Atul/Randy, [CCing to LKML] Thinking more about it, it looks like scanning the command line is not a very good idea. I think we should instead look at the total available RAM in the system and then let the driver make a decision about how much memory to allocate. Pavel already suggested it on LKML and I like the idea. This is more generic and can be applied to kdump, kexec based hibernation and all the future users of the kexec on panic infrastructure which will boot a second kernel in a restricted amount of RAM. I am not sure what's the best way to determine the total RAM in the system in an arch independent manner. Some VM guys can tell it better. One of the ways could be to parse /proc/iomem, the way kexec-tools does. Balbir mentioned traversing the nodes and summing up present_pages. Any suggestions on the best way to determine the total amount of RAM in the system from within the kernel? Right now this seems to be one odd case so this code can be inside the module.
If there are more users of it then we can probably create a flag inside the kernel and export it. Thanks Vivek
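For the /proc/iomem route mentioned above, a rough user-space sketch follows. It is not taken from kexec-tools and it does not answer the in-kernel question; sum_system_ram() is a hypothetical helper that assumes the usual "start-end : name" layout with hexadecimal, inclusive ranges, and it works on a text buffer so it can be tried without reading the real file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum the sizes of all "System RAM" ranges in text formatted like
 * /proc/iomem ("start-end : name", hex addresses, inclusive end).
 * Indented child entries such as "Kernel code" carry other names
 * and are therefore not counted twice.
 */
static unsigned long long sum_system_ram(const char *iomem_text)
{
	unsigned long long total = 0;
	const char *line = iomem_text;

	while (line && *line) {
		unsigned long long start, end;
		char name[64];

		if (sscanf(line, " %llx-%llx : %63[^\n]", &start, &end, name) == 3 &&
		    strcmp(name, "System RAM") == 0)
			total += end - start + 1;

		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return total;
}
```

A driver-side decision would still need an in-kernel equivalent, such as the present_pages sum Balbir suggested; the sketch only illustrates the parsing approach.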
Re: [PATCH] Document x86-64 iommu kernel parameters
Randy Dunlap wrote: Maybe we can/should merge the doc files along with the x86 arch merge. Well, the x86 merge is pretty much mechanical. It should be followed up with a lot of manual merging. -hpa
Re: 2.6.23-rc7-mm1 -- powerpc rtas panic
On Wed, 2007-10-03 at 11:19 +1000, Tony Breeds wrote: > On Wed, Oct 03, 2007 at 10:30:16AM +1000, Michael Ellerman wrote: > > > I realise it'll make the patch bigger, but this doesn't seem like a > > particularly good name for the variable anymore. > > Sure, what about? Better .. but .. :D > diff --git a/arch/powerpc/platforms/pseries/rtasd.c > b/arch/powerpc/platforms/pseries/rtasd.c > index 30925d2..73401c8 100644 > --- a/arch/powerpc/platforms/pseries/rtasd.c > +++ b/arch/powerpc/platforms/pseries/rtasd.c > @@ -54,8 +54,9 @@ static unsigned int rtas_event_scan_rate; > static int full_rtas_msgs = 0; > > /* Stop logging to nvram after first fatal error */ > -static int no_more_logging; > - > +static int logging_enabled; /* Until we initialize everything, > + * make sure we don't try logging > + * anything */ Until we initialise what exactly? > static int error_log_cnt; > > /* > @@ -217,7 +218,7 @@ void pSeries_log_error(char *buf, unsigned int err_type, > int fatal) > } > > /* Write error to NVRAM */ > - if (!no_more_logging && !(err_type & ERR_FLAG_BOOT)) > + if (logging_enabled && !(err_type & ERR_FLAG_BOOT)) > nvram_write_error_log(buf, len, err_type, error_log_cnt); > > /* > @@ -229,8 +230,8 @@ void pSeries_log_error(char *buf, unsigned int err_type, > int fatal) > printk_log_rtas(buf, len); > > /* Check to see if we need to or have stopped logging */ > - if (fatal || no_more_logging) { > - no_more_logging = 1; > + if (fatal || !logging_enabled) { > + logging_enabled = 0; > spin_unlock_irqrestore(&rtasd_log_lock, s); > return; > } Hmmm, this routine has 4 separate lock-dropping exit paths .. 
> @@ -302,7 +303,7 @@ static ssize_t rtas_log_read(struct file * file, char > __user * buf, > > spin_lock_irqsave(&rtasd_log_lock, s); > /* if it's 0, then we know we got the last one (the one in NVRAM) */ > - if (rtas_log_size == 0 && !no_more_logging) > + if (rtas_log_size == 0 && logging_enabled) > nvram_clear_error_log(); > spin_unlock_irqrestore(&rtasd_log_lock, s); > > @@ -414,6 +415,8 @@ static int rtasd(void *unused) > memset(logdata, 0, rtas_error_log_max); > rc = nvram_read_error_log(logdata, rtas_error_log_max, > &err_type, &error_log_cnt); > + /* We can use rtas_log_buf now */ > + logging_enabled = 1; > > if (!rc) { > if (err_type != ERR_FLAG_ALREADY_LOGGED) { What exactly happens that allows us to do logging? I don't see any ordering between anything else and the setting of the flag, and AFAICT we're not inside a spinlock or anything here. cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
Re: [git] CFS-devel, latest code
On Tue, Oct 02, 2007 at 09:59:04PM +0200, Dmitry Adamushko wrote: > The following patch (sched: disable sleeper_fairness on SCHED_BATCH) > seems to break GROUP_SCHED. Although, it may be > 'oops'-less due to the possibility of 'p' being always a valid > address. Thanks for catching it! Patch below looks good to me. Acked-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]> > Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]> > > --- > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c > index 8727d17..a379456 100644 > --- a/kernel/sched_fair.c > +++ b/kernel/sched_fair.c > @@ -473,9 +473,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity > *se, int initial) > vruntime += sched_vslice_add(cfs_rq, se); > > if (!initial) { > - struct task_struct *p = container_of(se, struct task_struct, > se); > - > - if (sched_feat(NEW_FAIR_SLEEPERS) && p->policy != SCHED_BATCH) > + if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se) && > + task_of(se)->policy != SCHED_BATCH) > vruntime -= sysctl_sched_latency; > > vruntime = max_t(s64, vruntime, se->vruntime); > > --- > -- Regards, vatsa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
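Why the unconditional container_of() is a problem with group scheduling can be shown with a small user-space model. Everything here is a simplified, hypothetical stand-in (the *_model types, MODEL_SCHED_BATCH, gets_sleeper_bonus()); the only thing carried over from the patch is the idea that an entity owning its own run queue is a group, not a task, so there is no enclosing task structure to look at.

```c
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct cfs_rq_model { int nr_running; };

struct sched_entity_model {
	long long vruntime;
	struct cfs_rq_model *my_q;	/* non-NULL only for a group entity */
};

struct task_model {
	int policy;
	struct sched_entity_model se;	/* embedded, as in a task structure */
};

#define MODEL_SCHED_BATCH 3	/* illustrative policy value */

static int entity_is_task_model(const struct sched_entity_model *se)
{
	return se->my_q == NULL;
}

/* Returns 1 when the sleeper bonus should be applied to this entity. */
static int gets_sleeper_bonus(struct sched_entity_model *se)
{
	/* Only look at the enclosing task once we know there is one. */
	return entity_is_task_model(se) &&
	       container_of(se, struct task_model, se)->policy != MODEL_SCHED_BATCH;
}
```

For a group entity the check short-circuits before container_of() is evaluated, so no memory outside the entity is interpreted as a task.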
Re: [PATCH] Document x86-64 iommu kernel parameters
On Tue, 02 Oct 2007 22:30:31 -0400 Jeff Garzik wrote: > Randy Dunlap wrote: > > On Tue, 2 Oct 2007 21:34:13 -0400 Jeff Garzik wrote: > > > >> Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> > >> --- > >> After having to go figure out what some of these mean, I figured I > >> would save others the trouble. > >> > >> Some of these are "best guess" based on a quick scan of the code, so it > >> certainly needs a sanity review before going upstream. > > > > "iommu" is listed in Documentation/x86_64/boot-options.txt > > along with more x86_64-specific boot options. > > A few other arches do something similar... > > Ah! Well, seeing as how we already have a provision for arch-specific > options in kernel-parameters.txt, and some less-obscure arch-specific > options can be found there, I think an argument can be made for my patch :) > > Nonetheless, if the maintainer disagrees, they can drop this patch I suppose. [sorry if there be duplicates; I thought I sent this but can't find it anywhere] Maybe we can/should merge the doc files along with the x86 arch merge. --- ~Randy
Re: [RFC/PATCH] Add sysfs control to modify a user's cpu share
On Tue, Oct 02, 2007 at 06:12:39PM -0400, Eric St-Laurent wrote: > While a sysfs interface is OK and somewhat orthogonal to the interface > proposed in the containers patches, I think maybe a new syscall should be > considered. We had discussed syscall vs filesystem based interface for resource management [1] and there was a heavy bias favoring filesystem based interface, based on which the container (now "cgroup") filesystem evolved. Where we already have one interface defined, I would be against adding an equivalent syscall interface. Note that this "fair-user" scheduling can in theory be accomplished using the same cgroup based interface, but requires some extra setup in userspace (either to run a daemon which moves tasks to appropriate control groups/containers upon their uid change OR to modify initrd to mount cgroup filesystem at early bootup time). I expect most distros to enable CONFIG_FAIR_CGROUP_SCHED (control group based fair group scheduler) and not CONFIG_FAIR_USER_SCHED (user id based fair group scheduler). The only reason why we are providing CONFIG_FAIR_USER_SCHED and the associated sysfs interface is to help test group scheduler w/o requiring knowledge of cgroup filesystem. Reference: 1. http://marc.info/?l=linux-kernel&m=116231242201300&w=2 -- Regards, vatsa
Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..
On Tue, 2007-10-02 at 11:17 +0200, Thomas Gleixner wrote: [...] > I have uploaded an update of the arch/x86 tree based on -rc9 to > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86.git x86 > [...] > If there is anything we can help with the transition, please do not > hesitate to ask. > > Thanks, > > Thomas, Ingo Hi Thomas, This latest x86 branch builds and boots without problems with my usual x86_64 config. If you remember our conversation one month ago, I was unable to build your tree. I've upgraded my Ubuntu distribution from 7.04 to 7.10 beta this week, maybe this fixed it. But I still had to do some manual fixes to get the packaging steps working: mkdir arch/x86_64/boot/ ln -s ../../../arch/x86/boot/bzImage arch/x86_64/boot/bzImage Best regards, - Eric
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
Linus Torvalds wrote: On Tue, 2 Oct 2007, Bill Davidsen wrote: And yet you can make the exact same case for schedulers as security, you can quantify the behavior, but if your only choice is A it doesn't help to know that B is better. You snipped a key part of the argument. Namely: Another difference is that when it comes to schedulers, I feel like I actually can make an informed decision. Which means that I'm perfectly happy to just make that decision, and take the flak that I get for it. And I do (both decide, and get flak). That's my job. which you seem to not have read or understood (neither did apparently anybody on slashdot). Actually I had quoted that, made a reply, and decided that my reply was too close to a flame and deleted the quote and the nasty reply, because I couldn't find a nice way to say what I wanted. Oh well, I tried to keep to a higher level, but... on this topic you seem to be off on an ego trip. You are not the decider, George Bush is the decider, and the only time he's not wrong he didn't understand the question. I checked the schedule, it's not you week to be God. There are sensible people you respect on other topics, who have the opinion that there is room for behaviors other than CFS, and who have created a pluggable scheduler framework which they are trying to hand you on a platter. And you won't even consider that they might be right, because you believe there can be one scheduler which is close to optimal for all loads. You say "performance" as if it had universal meaning. Blah. Bogus and pointless argument removed. When it comes to schedulers, "performance" *is* pretty damn well-defined, and has effectively universal meaning. The arguments that "servers" have a different profile than "desktop" is pure and utter garbage, and is perpetuated by people who don't know what they are talking about. The whole notion of "server" and "desktop" scheduling being different is nothing but crap. 
Unfortunately not so, I've been looking at schedulers since MULTICS, and desktops since the 70s (MP/M), and networked servers since I was the ARPAnet technical administrator at GE's Corporate R&D Center. And on desktops response is (and should be king), while on a server, like nntp or mail, I will happily go from 1ms to 10sec for a message to pass through the system if only I can pass 30% more messages per hour, because in virtually all cases transit time in that range is not an issue. Same thing for DNS, LDAP, etc, only smaller time range. If my goal is <10ms, I will not sacrifice capacity to do it. I don't know who came up with it, or why people continue to feed the insane ideas. Why do people think that servers don't care about latency? Because people who run servers for a living, and have to live with limited hardware capacity realize that latency isn't the only issue to be addressed, and that the policy for degradation of latency vs. throughput may be very different on one server than another or a desktop. Why do people believe that desktop doesn't have multiple processors or through-put intensive loads? Why are people continuing this *idiotic* scheduler discussion? Because people can't get you to understand that one size doesn't fit all (and I doubt I've broken through). Really - not only is the whole "desktop scheduler" argument totally bogus to begin with (and only brought up by people who either don't know anything about it, or who just want to argue, regardless of whether the argumen is valid or not), quite frankly, when you say that it's the "same issue" as with security models, you're simply FULL OF SH*T. The real issue is that you can't imagine that people who don't share your opinion are not only wrong but don't understand the problem. You may be right, but when you say anyone who disagrees is wrong by definition, then you have lost sight of productive technical differences. 
When your arguments drop to personal attacks and rants it's time to look at your technical values. The issue with LSM is that security people simply cannot even agree on the model. It has nothing to do with performance. It's about management, and it's about totally different models. Have you even *looked* at the differences between AppArmor and SELinux? Did you look at SMACK? They are all done by people who are interested in security, but have totally different notions of what "security" even *IS*ALL*ABOUT. Exactly, and I'm not the only one who doubts that more than one model would be useful. I'm sorry you can't see that about CPU schedulers as well. In contrast, anybody who claims that the CPU scheduler doesn't know what it's all about is just tripping. And anybody who claims that desktop workloads are so radically different from server workloads (or that the hardware is so different) is just totally out to lunch. So next time, think five minutes before you start your argument. I don't disagr
Re: [PATCH 4/5] infiniband: add "dmabarrier" argument to ib_umem_get()
From: [EMAIL PROTECTED] Date: Tue, 2 Oct 2007 19:49:06 -0700 > > Pass a "dmabarrier" argument to ib_umem_get() and use the new > argument to control setting the DMA_BARRIER_ATTR attribute on > the memory that ib_umem_get() maps for DMA. > > Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> Acked-by: David S. Miller <[EMAIL PROTECTED]> However I'm a little unhappy with how IA64 achieves this. The last argument for dma_map_foo() is an enum not an int, every platform other than IA64 properly defines the last argument as "enum dma_data_direction". It can take one of several distinct values, it is not a mask. This hijacking of the DMA direction argument is hokey at best, and at worst is type bypassing which is going to explode subtly for someone in the future and result in a long painful debugging session. Adding another argument could be painful to do this cleanly, but at least with inline functions and macros it could just evaluate to nothing on platforms that don't need it. Either that, or we should turn the thing into an integer "flags" across the board and audit every DMA mapping implementation so that it can handle multiple bits being set. But that's really ugly and invites mistakes as I detailed above. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
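A small userspace sketch of the concern raised above, for readers who want to see the failure mode concretely. It is not kernel code: the enum below only models the kernel's direction values, and ATTR_SHIFT, ATTR_BARRIER, pack_flags(), dir_name() and map_with_attrs() are hypothetical names invented for this illustration (the bit position actually used by the sn-ia64 patch is not assumed here). It shows that a value with borrowed bits no longer matches any enumerator, and how a separate attribute argument avoids that.

```c
#include <assert.h>
#include <string.h>

/* Userspace model of a DMA direction enum; names are illustrative only. */
enum dma_dir {
    DIR_BIDIRECTIONAL = 0,
    DIR_TO_DEVICE = 1,
    DIR_FROM_DEVICE = 2,
    DIR_NONE = 3
};

#define ATTR_SHIFT   8     /* hypothetical position of the borrowed bits */
#define ATTR_BARRIER 0x1u  /* stands in for DMA_BARRIER_ATTR */

/* "Borrow" bits from the direction: the result is no longer a valid enum value. */
static int pack_flags(unsigned attr, enum dma_dir dir)
{
    return (int)(attr << ATTR_SHIFT) | (int)dir;
}

/* A routine written for the plain enum, as most platforms would have it. */
static const char *dir_name(int direction)
{
    switch (direction) {
    case DIR_BIDIRECTIONAL: return "bidirectional";
    case DIR_TO_DEVICE:     return "to-device";
    case DIR_FROM_DEVICE:   return "from-device";
    case DIR_NONE:          return "none";
    default:                return "invalid"; /* packed values land here */
    }
}

/* Alternative: keep the enum intact and pass attributes separately. */
static const char *map_with_attrs(enum dma_dir dir, unsigned attrs, int *barrier)
{
    assert(barrier != NULL);
    *barrier = (attrs & ATTR_BARRIER) != 0;
    return dir_name(dir);
}
```

With these definitions, dir_name(pack_flags(ATTR_BARRIER, DIR_BIDIRECTIONAL)) falls through to the default case, which is the kind of subtle breakage described above for any implementation that was not audited to mask off the extra bits.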
Re: [PATCH 4/5] ibmebus: Move to of_device and of_platform_driver, match eHCA and eHEA drivers
Joachim Fenkes writes: > Replace struct ibmebus_dev and struct ibmebus_driver with struct of_device > and struct of_platform_driver, respectively. Match the external ibmebus > interface and drivers using it. > > Signed-off-by: Joachim Fenkes <[EMAIL PROTECTED]> > --- > drivers/infiniband/hw/ehca/ehca_classes.h |2 +- > drivers/net/ehea/ehea.h |2 +- > include/asm-powerpc/ibmebus.h | 38 +++ > arch/powerpc/kernel/ibmebus.c | 28 ++- > drivers/infiniband/hw/ehca/ehca_eq.c |6 +- > drivers/infiniband/hw/ehca/ehca_main.c| 32 ++-- > drivers/net/ehea/ehea_main.c | 72 ++-- This is somewhat difficult as this patch touches files that are the responsibility of three different maintainers. Is it possible to split the patch into three, one for each maintainer (possibly by keeping both old and new interfaces around for a little while)? If not, then you need to get an Acked-by and an agreement that this change can go via the powerpc.git tree from Roland Dreier and Jeff Garzik. Paul.
Re: [PATCH 3/5] dma: document dma_flags_set_attr()
From: [EMAIL PROTECTED] Date: Tue, 2 Oct 2007 19:47:52 -0700 > > Document dma_flags_set_attr(). > > Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> Acked-by: David S. Miller <[EMAIL PROTECTED]>
Re: [PATCH 1/5] dma: add dma_flags_set_attr() to dma interface
From: [EMAIL PROTECTED] Date: Tue, 2 Oct 2007 19:44:57 -0700 > > Introduce the dma_flags_set_attr() interface and give it a default > no-op implementation. > > Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> Acked-by: David S. Miller <[EMAIL PROTECTED]>
Re: [PATCH 2/5] dma: redefine dma_flags_set_attr() for sn-ia64
From: [EMAIL PROTECTED] Date: Tue, 2 Oct 2007 19:46:41 -0700 > > define dma_flags_set_attr() for sn-ia64 - it "borrows" bits from > the direction argument (renamed "flags") to the dma_map_* routines > to pass additional attributes. Also define routines to retrieve > the original direction and attribute from "flags". > > Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> Acked-by: David S. Miller <[EMAIL PROTECTED]>
Re: [Bugme-new] [Bug 8378] New: Averatec 3156X laptop doesn't reboot with kernels > 2.6.13.5 (responsible commit found)
Andrew Morton wrote (at Sat, 12 May 2007 18:02:40 -0700) : > > OK, thanks. > > So what are we doing here? We try the pre-Truxton code and if that didn't > work we try the post-Truxton code? Hard to see how that could go wrong. > > Truxton, can you please test it for us? Hi, Hiroto Shibuya wrote to tell me that he has a VIA EPIA-EK1 which suffers from the reboot problem when no keyboard is attached. My first patch works for him : http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=59f4e7d572980a521b7bdba74ab71b21f5995538 But the latest patch does not work for him : http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8b93789808756bcc1e5c90c99f1b1ef52f839a51 We found that it was necessary to also set the "disable keyboard" flag in the command byte, as the first patch was doing. The second patch tries to minimally modify the command byte, but it is not enough. Please consider this simple one-line patch to help people with low end VIA motherboards reboot when no keyboard is attached. Hiroto Shibuya has verified that this works for him (as I no longer have an afflicted machine) : This patch is against linux-2.6.23-rc9/include/asm-i386/mach-default/mach_reboot.h --- mach_reboot.h Mon Oct 1 20:24:52 2007 +++ mach_reboot.h.new Tue Oct 2 19:22:13 2007 @@ -49,7 +49,7 @@ udelay(50); kb_wait(); udelay(50); - outb(cmd | 0x04, 0x60); /* set "System flag" */ + outb(cmd | 0x14, 0x60); /* set "System flag" and "Keyboard Disabled" */ udelay(50); kb_wait(); udelay(50); Thanks, -Truxton
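For readers unfamiliar with the magic numbers in the one-line patch above, here is a minimal userspace sketch of how the new value is composed. The macro and function names are made up for this note; only the bit values 0x04 ("System flag") and 0x10 ("Keyboard Disabled", i.e. 0x14 minus 0x04) come from the patch and its comment. No port I/O is performed.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values taken from the patch comment; the names are illustrative. */
#define KBC_CMD_SYSTEM_FLAG  0x04u /* "System flag" */
#define KBC_CMD_KBD_DISABLED 0x10u /* "Keyboard Disabled" */

/*
 * Compute the command byte the patched code would write to port 0x60,
 * given the command byte previously read back from the controller.
 */
static uint8_t reboot_command_byte(uint8_t cmd)
{
    /* The two flags must be distinct bits for the OR to set both. */
    assert((KBC_CMD_SYSTEM_FLAG & KBC_CMD_KBD_DISABLED) == 0);
    return (uint8_t)(cmd | KBC_CMD_SYSTEM_FLAG | KBC_CMD_KBD_DISABLED);
}
```

All other bits of the existing command byte are preserved, so the change only differs from the previous code by additionally setting the keyboard-disabled bit.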
Re: [PATCH 5/5] mthca: allow setting "dmabarrier" on user-allocated memory
On Tue, Oct 02, 2007 at 07:50:07PM -0700, [EMAIL PROTECTED] wrote: > +struct mthca_reg_mr { > + __u32 mr_attrs; > +#define MTHCA_MR_DMAFLUSH 0x1/* flush in-flight DMA on a write to > + * memory region */ > + __u32 reserved; > +}; Seems like a very odd place to #define something new.. -- Heikki Orsila Barbie's law: [EMAIL PROTECTED] "Math is hard, let's go shopping!" http://www.iki.fi/shd
[PATCH 5/5] mthca: allow setting "dmabarrier" on user-allocated memory
Allow setting a "dmabarrier" when the mthca driver registers user- allocated memory. Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> --- mthca_provider.c |7 ++- mthca_user.h | 10 +- 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 17486a4..c818708 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1017,17 +1017,22 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, struct mthca_dev *dev = to_mdev(pd->device); struct ib_umem_chunk *chunk; struct mthca_mr *mr; + struct mthca_reg_mr ucmd; u64 *pages; int shift, n, len; int i, j, k; int err = 0; int write_mtt_size; + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return ERR_PTR(-EFAULT); + mr = kmalloc(sizeof *mr, GFP_KERNEL); if (!mr) return ERR_PTR(-ENOMEM); - mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0); + mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, + ucmd.mr_attrs & MTHCA_MR_DMAFLUSH); if (IS_ERR(mr->umem)) { err = PTR_ERR(mr->umem); goto err; diff --git a/drivers/infiniband/hw/mthca/mthca_user.h b/drivers/infiniband/hw/mthca/mthca_user.h index 02cc0a7..5662aea 100644 --- a/drivers/infiniband/hw/mthca/mthca_user.h +++ b/drivers/infiniband/hw/mthca/mthca_user.h @@ -41,7 +41,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. 
*/ -#define MTHCA_UVERBS_ABI_VERSION 1 +#define MTHCA_UVERBS_ABI_VERSION 2 /* * Make sure that all structs defined in this file remain laid out so @@ -61,6 +61,14 @@ struct mthca_alloc_pd_resp { __u32 reserved; }; +struct mthca_reg_mr { + __u32 mr_attrs; +#define MTHCA_MR_DMAFLUSH 0x1 /* flush in-flight DMA on a write to +* memory region */ + __u32 reserved; +}; + + struct mthca_create_cq { __u32 lkey; __u32 pdn; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/5] dma: document dma_flags_set_attr()
Document dma_flags_set_attr(). Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> --- DMA-API.txt | 27 +++ 1 files changed, 27 insertions(+) diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index cc7a8c3..16e15c0 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -544,3 +544,30 @@ size is the size (and should be a page-sized multiple). The return value will be either a pointer to the processor virtual address of the memory, or an error (via PTR_ERR()) if any part of the region is occupied. + +int +dma_flags_set_attr(u32 attr, enum dma_data_direction dir) + +Amend dir with a platform-specific "dma attribute". + +The only attribute currently defined is DMA_BARRIER_ATTR, which causes +in-flight DMA to be flushed when the associated memory region is written +to (see example below). Setting DMA_BARRIER_ATTR provides a mechanism +to enforce ordering of DMA on platforms that permit DMA to be reordered +between device and host memory (within a NUMA interconnect). On other +platforms this is a nop. + +DMA_BARRIER_ATTR would be set when the memory region is mapped for DMA, +e.g.: + + int count; + int flags = dma_flags_set_attr(DMA_BARRIER_ATTR, DMA_BIDIRECTIONAL); + + count = dma_map_sg(dev, sglist, nents, flags); + +As an example of a situation where this would be useful, suppose that +the device does a DMA write to indicate that data is ready and +available in memory. The DMA of the "completion indication" could +race with data DMA. Using DMA_BARRIER_ATTR on the memory used for +completion indications would prevent the race. + - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/5] infiniband: add "dmabarrier" argument to ib_umem_get()
Pass a "dmabarrier" argument to ib_umem_get() and use the new argument to control setting the DMA_BARRIER_ATTR attribute on the memory that ib_umem_get() maps for DMA. Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> --- drivers/infiniband/core/umem.c |8 ++-- drivers/infiniband/hw/amso1100/c2_provider.c |2 +- drivers/infiniband/hw/cxgb3/iwch_provider.c |2 +- drivers/infiniband/hw/ehca/ehca_mrmw.c |2 +- drivers/infiniband/hw/ipath/ipath_mr.c |2 +- drivers/infiniband/hw/mlx4/cq.c |2 +- drivers/infiniband/hw/mlx4/doorbell.c|2 +- drivers/infiniband/hw/mlx4/mr.c |3 ++- drivers/infiniband/hw/mlx4/qp.c |2 +- drivers/infiniband/hw/mlx4/srq.c |2 +- drivers/infiniband/hw/mthca/mthca_provider.c |2 +- include/rdma/ib_umem.h |4 ++-- 12 files changed, 19 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 664d2fa..093b58d 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -69,9 +69,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d * @addr: userspace virtual address to start at * @size: length of region to pin * @access: IB_ACCESS_xxx flags for memory being pinned + * @dmabarrier: set "dmabarrier" attribute on this memory */ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, - size_t size, int access) + size_t size, int access, int dmabarrier) { struct ib_umem *umem; struct page **page_list; @@ -83,6 +84,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, int ret; int off; int i; + int flags = dmabarrier ? 
dma_flags_set_attr(DMA_BARRIER_ATTR, + DMA_BIDIRECTIONAL) : +DMA_BIDIRECTIONAL; if (!can_do_mlock()) return ERR_PTR(-EPERM); @@ -160,7 +164,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, chunk->nmap = ib_dma_map_sg(context->device, &chunk->page_list[0], chunk->nents, - DMA_BIDIRECTIONAL); + flags); if (chunk->nmap <= 0) { for (i = 0; i < chunk->nents; ++i) put_page(chunk->page_list[i].page); diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index 997cf15..17243b7 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -449,7 +449,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, return ERR_PTR(-ENOMEM); c2mr->pd = c2pd; - c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc); + c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0); if (IS_ERR(c2mr->umem)) { err = PTR_ERR(c2mr->umem); kfree(c2mr); diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index f0c7775..d0a514c 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -601,7 +601,7 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, if (!mhp) return ERR_PTR(-ENOMEM); - mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc); + mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0); if (IS_ERR(mhp->umem)) { err = PTR_ERR(mhp->umem); kfree(mhp); diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index d97eda3..c13c11c 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -329,7 +329,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, } e_mr->umem = ib_umem_get(pd->uobject->context, start, length, -mr_access_flags); +mr_access_flags, 0); if 
(IS_ERR(e_mr->umem)) { ib_mr = (void *)e_mr->umem; goto reg_user_mr_exit1; diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c index e442470..e351222 100644 --- a/drivers/infiniband/hw/ipath/ipath_mr.c +++ b/drivers/infiniband/hw/ipath/ipath_mr.c @@ -195,7 +195,7 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
Re: 2.6.23-rc9 boot failure (megaraid?)
On Tue, 02 Oct 2007 15:38:13 -0500 James Bottomley <[EMAIL PROTECTED]> wrote: > On Tue, 2007-10-02 at 20:15 +0200, Adrian Bunk wrote: > > Cc's added, the complete bug report is at > > http://lkml.org/lkml/2007/10/2/243 > > > > On Tue, Oct 02, 2007 at 12:48:26PM -0400, Burton Windle wrote: > > > 2.6.23-rc9 fails to boot for me; 2.6.22.9 works fine. > > > > > > System is a Dell Poweredge with PERC 2/DC with RAID1 volume. > > >... > > > > Thanks for your report. > > > > Diff'ing the dmesg's shows: > > > > <-- snip --> > > > > scsi0: scanning scsi channel 4 [P0] for physical devices. > > scsi0: scanning scsi channel 5 [P1] for physical devices. > > st: Version 20070203, fixed bufsize 32768, s/g segs 256 > > -sd 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB) > > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512. > > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB) > > sd 0:0:0:0: [sda] Write Protect is off > > sd 0:0:0:0: [sda] Asking for cache data failed > > sd 0:0:0:0: [sda] Assuming drive cache: write through > > -sd 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB) > > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512. > > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB) > > sd 0:0:0:0: [sda] Write Protect is off > > sd 0:0:0:0: [sda] Asking for cache data failed > > sd 0:0:0:0: [sda] Assuming drive cache: write through > > sda: sda1 > > + sda: p1 exceeds device capacity > > > > <-- snip --> > > > > - case MEGA_BULK_DATA: > > - if (scb->cmd->use_sg == 0) > > - length = scb->cmd->request_bufflen; > > - else { > > - struct scatterlist *sgl = > > - (struct scatterlist *)scb->cmd->request_buffer; > > - length = sgl->length; > > - } > > - pci_unmap_page(adapter->dev, scb->dma_h_bulkdata, > > - length, scb->dma_direction); > > - break; > > - > > This is the problem piece I think. 
We've reintroduced a very old bug: > > commit 51c928c34fa7cff38df584ad01de988805877dba > Author: James Bottomley <[EMAIL PROTECTED]> > Date: Sat Oct 1 09:38:05 2005 -0500 > > [SCSI] Legacy MegaRAID: Fix READ CAPACITY > > Some Legacy megaraid cards can't actually cope with the scatter/gather > version of the READ CAPACITY command (which is what we now send them > since altering all SCSI internal I/O to go via the block layer). Fix > this (and a few other broken megaraid driver assumptions) by sending > the non-sg version of the command if the sg list only has a single > element. > > Signed-off-by: James Bottomley <[EMAIL PROTECTED]> > > So what we have to do is put back the check for use_sg == 1 and send > that as a bulk transfer command. Sorry about this. Can this fix the problem? Thanks, diff --git a/drivers/scsi/megaraid.c b/drivers/scsi/megaraid.c index 3907f67..da56163 100644 --- a/drivers/scsi/megaraid.c +++ b/drivers/scsi/megaraid.c @@ -1753,6 +1753,14 @@ mega_build_sglist(adapter_t *adapter, scb_t *scb, u32 *buf, u32 *len) *len = 0; + if (scsi_sg_count(cmd) == 1 && !adapter->has_64bit_addr) { + sg = scsi_sglist(cmd); + scb->dma_h_bulkdata = sg_dma_address(sg); + *buf = (u32)scb->dma_h_bulkdata; + *len = sg_dma_len(sg); + return 0; + } + scsi_for_each_sg(cmd, sg, sgcnt, idx) { if (adapter->has_64bit_addr) { scb->sgl64[idx].address = sg_dma_address(sg); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/5] dma: redefine dma_flags_set_attr() for sn-ia64
Define dma_flags_set_attr() for sn-ia64 - it "borrows" bits from the direction argument (renamed "flags") to the dma_map_* routines to pass additional attributes. Also define routines to retrieve the original direction and attribute from "flags". Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> --- arch/ia64/sn/pci/pci_dma.c | 37 +++-- include/asm-ia64/sn/io.h | 23 +++ 2 files changed, 50 insertions(+), 10 deletions(-) diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c index d79ddac..84b6227 100644 --- a/arch/ia64/sn/pci/pci_dma.c +++ b/arch/ia64/sn/pci/pci_dma.c @@ -153,7 +153,7 @@ EXPORT_SYMBOL(sn_dma_free_coherent); * @dev: device to map for * @cpu_addr: kernel virtual address of the region to map * @size: size of the region - * @direction: DMA direction + * @flags: DMA direction, and arch-specific attributes * * Map the region pointed to by @cpu_addr for DMA and return the * DMA address. @@ -167,17 +167,23 @@ EXPORT_SYMBOL(sn_dma_free_coherent); * figure out how to save dmamap handle so can use two step. 
*/ dma_addr_t sn_dma_map_single(struct device *dev, void *cpu_addr, size_t size, -int direction) +int flags) { dma_addr_t dma_addr; unsigned long phys_addr; struct pci_dev *pdev = to_pci_dev(dev); struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev); + int dmabarrier = dma_flags_get_attr(flags) & DMA_BARRIER_ATTR; BUG_ON(dev->bus != &pci_bus_type); phys_addr = __pa(cpu_addr); - dma_addr = provider->dma_map(pdev, phys_addr, size, SN_DMA_ADDR_PHYS); + if (dmabarrier) + dma_addr = provider->dma_map_consistent(pdev, phys_addr, size, + SN_DMA_ADDR_PHYS); + else + dma_addr = provider->dma_map(pdev, phys_addr, size, +SN_DMA_ADDR_PHYS); if (!dma_addr) { printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__); return 0; @@ -240,18 +246,20 @@ EXPORT_SYMBOL(sn_dma_unmap_sg); * @dev: device to map for * @sg: scatterlist to map * @nhwentries: number of entries - * @direction: direction of the DMA transaction + * @flags: direction of the DMA transaction, and arch-specific attributes * * Maps each entry of @sg for DMA. */ int sn_dma_map_sg(struct device *dev, struct scatterlist *sg, int nhwentries, - int direction) + int flags) { unsigned long phys_addr; struct scatterlist *saved_sg = sg; struct pci_dev *pdev = to_pci_dev(dev); struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev); int i; + int dmabarrier = dma_flags_get_attr(flags) & DMA_BARRIER_ATTR; + int dir = dma_flags_get_dir(flags); BUG_ON(dev->bus != &pci_bus_type); @@ -259,19 +267,28 @@ int sn_dma_map_sg(struct device *dev, struct scatterlist *sg, int nhwentries, * Setup a DMA address for each entry in the scatterlist. 
*/ for (i = 0; i < nhwentries; i++, sg++) { + dma_addr_t dma_addr; phys_addr = SG_ENT_PHYS_ADDRESS(sg); - sg->dma_address = provider->dma_map(pdev, - phys_addr, sg->length, - SN_DMA_ADDR_PHYS); - if (!sg->dma_address) { + if (dmabarrier) { + dma_addr = provider->dma_map_consistent(pdev, + phys_addr, + sg->length, + SN_DMA_ADDR_PHYS); + } else { + dma_addr = provider->dma_map(pdev, +phys_addr, sg->length, +SN_DMA_ADDR_PHYS); + } + + if (!(sg->dma_address = dma_addr)) { printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__); /* * Free any successfully allocated entries. */ if (i > 0) - sn_dma_unmap_sg(dev, saved_sg, i, direction); + sn_dma_unmap_sg(dev, saved_sg, i, dir); return 0; } diff --git a/include/asm-ia64/sn/io.h b/include/asm-ia64/sn/io.h index 41c73a7..d2b94ce 100644 --- a/include/asm-ia64/sn/io.h +++ b/include/asm-ia64/sn/io.h @@ -271,4 +271,27 @@ sn_pci_set_vchan(struct pci_dev *pci_dev, unsigned long *addr, int vchan) return 0; } +#define ARCH_USES_DMA_ATTRS +/* we pass additional dma attributes in the
[PATCH 1/5] dma: add dma_flags_set_attr() to dma interface
Introduce the dma_flags_set_attr() interface and give it a default no-op implementation. Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> --- dma-mapping.h |8 1 files changed, 8 insertions(+) diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 2dc21cb..4990aaf 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -99,4 +99,12 @@ static inline void dmam_release_declared_memory(struct device *dev) } #endif /* ARCH_HAS_DMA_DECLARE_COHERENT_MEMORY */ +#define DMA_BARRIER_ATTR 0x1 +#ifndef ARCH_USES_DMA_ATTRS +static inline int dma_flags_set_attr(u32 attr, enum dma_data_direction dir) +{ + return dir; +} +#endif /* ARCH_USES_DMA_ATTRS */ + #endif
[PATCH 0/5] allow drivers to flush in-flight DMA v3
On Altix, DMA may be reordered between a device and host memory. This reordering can happen in the NUMA interconnect, and it usually results in correct operation and improved performance. In some situations it may be necessary to explicitly synchronize DMA from the device. This patchset allows a memory region to be mapped with a "dmabarrier". Writes to the memory region will cause in-flight DMA to be flushed, providing a mechanism to order DMA from a device. There are 5 patches in this patchset: [1/5] dma: add dma_flags_set_attr() to dma interface [2/5] dma: redefine dma_flags_set_attr() for sn-ia64 [3/5] dma: document dma_flags_set_attr() [4/5] infiniband: add "dmabarrier" argument to ib_umem_get() [5/5] mthca: allow setting "dmabarrier" on user-allocated memory -- Arthur
Traffic Controller Performance in Kernel 2.4 vs 2.6
Hello, This is a repost; there seems to have been a misunderstanding before. I hope this is the right place to ask this. Does anyone know if there is a substantial difference in the performance of the traffic controller between kernel 2.4 and 2.6? We tested it using 1 iperf server with 250 and 500 clients, altering the burst. This is the set-up: iperf client - router (w/ traffic controller) - iperf server. We use the top command inside the router to check the router's idle time. The results we got from the 2.4 kernel show around 65-70% idle time, while the 2.6 kernel shows 60-65% idle time. We tried to use MRTG and we're not getting any results either. We want to know if we could improve the bandwidth by upgrading the kernel; otherwise we would have to get a new bandwidth manager. Has anyone performed a similar test, or can anyone suggest a better way to do this? Thanks in advance.
Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote: > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote: > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote: > > > wbc.pages_skipped = 0; > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned > > > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write; > > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { > > > /* Wrote less than expected */ > > > - congestion_wait(WRITE, HZ/10); > > > - if (!wbc.encountered_congestion) > > > + if (wbc.encountered_congestion || wbc.more_io) > > > + congestion_wait(WRITE, HZ/10); > > > + else > > > break; > > > } > > > > Why do you call congestion_wait() if there is more I/O to issue? If > > we have a fast filesystem, this might cause the device queues to > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we > > will have trouble keeping the queues full, right? > > You mean slow writers and fast RAID? That would be exactly the case > these patches try to improve. I mean any writers and a fast block device (raid or otherwise). > This patchset makes kupdate/background writeback more responsible, > so that if (avg-write-speed < device-capabilities), the dirty data are > synced timely, and we don't have to go for balance_dirty_pages(). Sure, but I'm asking about the effect of the patches on the (avg-write-speed == device-capabilities) case. I agree that they are necessary for timely syncing of data but I'm trying to understand what effect they have on the normal write case (i.e. keeping the disk at full write throughput). > So for your question of queue depth, the answer is: the queue length > will not build up in the first place. Which queue are you talking about here? The block device queue? 
> Also the name of congestion_wait() could be misleading: > - when not congested, congestion_wait() will wakeup on write > completions; > - when congested, congestion_wait() could also wakeup on write > completions on other non-congested devices. > So congestion_wait(100ms) normally only takes 0.1-10ms. True, but if we know we are not congested and have more work to do, why sleep at all? > For the more_io case, congestion_wait() serves more like 'to take a > breath'. Tests show that the system could go mad without it. I'm interested to know what tests show that pushing more I/O when you don't have block device congestion makes the system go mad (and what mad means). It sounds to me like it's hiding (yet another) bug in the writeback code.. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
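To make the point of disagreement in this thread easier to follow, the decision in the quoted background_writeout() hunk can be modelled as a pure function. This is a sketch only: struct wb_state, enum wb_action and both functions are names invented here, and the second function is just one reading of the question "why sleep at all?", not a patch anyone proposed. The reply quoted above reports that skipping the wait misbehaved in testing, so the variant is shown for comparison rather than as a recommendation.

```c
#include <assert.h>
#include <stddef.h>

/* What the writeback loop does after one pass; names are illustrative. */
enum wb_action { WB_CONTINUE, WB_WAIT, WB_STOP };

/* The subset of writeback_control fields the quoted hunk looks at. */
struct wb_state {
    long nr_to_write;           /* pages still unwritten from this pass's quota */
    long pages_skipped;         /* pages skipped during this pass */
    int encountered_congestion; /* a device queue was congested */
    int more_io;                /* more inodes are waiting for writeback */
};

/* Logic of the patched hunk: wait if congested OR more I/O is pending. */
static enum wb_action patched_action(const struct wb_state *s)
{
    assert(s != NULL);
    if (s->nr_to_write > 0 || s->pages_skipped > 0) {
        /* Wrote less than expected */
        if (s->encountered_congestion || s->more_io)
            return WB_WAIT;  /* congestion_wait(), then loop again */
        return WB_STOP;      /* nothing left to do */
    }
    return WB_CONTINUE;
}

/* Hypothetical variant: only sleep when a queue is actually congested. */
static enum wb_action no_sleep_variant(const struct wb_state *s)
{
    assert(s != NULL);
    if (s->nr_to_write > 0 || s->pages_skipped > 0) {
        if (s->encountered_congestion)
            return WB_WAIT;
        if (s->more_io)
            return WB_CONTINUE; /* keep issuing I/O without sleeping */
        return WB_STOP;
    }
    return WB_CONTINUE;
}
```

The two functions differ in exactly one case, more_io set without congestion, which is the case being debated.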
Re: [PATCH] Document x86-64 iommu kernel parameters
Randy Dunlap wrote: On Tue, 2 Oct 2007 21:34:13 -0400 Jeff Garzik wrote: Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> --- After having to go figure out what some of these mean, I figured I would save others the trouble. Some of these are "best guess" based on a quick scan of the code, so it certainly needs a sanity review before going upstream. "iommu" is listed in Documentation/x86_64/boot-options.txt along with more x86_64-specific boot options. A few other arches do something similar... Ah! Well, seeing as how we already have a provision for arch-specific options in kernel-parameters.txt, and some less-obscure arch-specific options can be found there, I think an argument can be made for my patch :) Nonetheless, if the maintainer disagrees, they can drop this patch I suppose. Jeff
Re: [PATCH] Document x86-64 iommu kernel parameters
On Tue, 2 Oct 2007 21:34:13 -0400 Jeff Garzik wrote: > > Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> > --- > After having to go figure out what some of these means, I figured I > would save others the trouble. > > Some of these are "best guess" based on a quick scan of the code, so it > certainly needs a sanity review before going upstream. "iommu" is listed in Documentation/x86_64/boot-options.txt along with more x86_64-specific boot options. A few other arches do something similar... > diff --git a/Documentation/kernel-parameters.txt > b/Documentation/kernel-parameters.txt > index 4d175c7..8afea9b 100644 > --- a/Documentation/kernel-parameters.txt > +++ b/Documentation/kernel-parameters.txt > @@ -763,6 +763,30 @@ and is between 256 and 4096 characters. It is defined in > the file > > inttest=[IA64] > > + iommu=option[,option..] [X86-64] > + off Disable IOMMU. > + force Unconditionally enable IOMMU. > + noforce Disable IOMMU and IOMMU merging, by default. > + biomergeUnconditionally enable IOMMU, IOMMU merging, > + and set BIO IOMMU vmerge boundary to 4096. > + panic Panic on IOMMU overflow. > + nopanic Do not panic on IOMMU overflow. > + merge Unconditionally enable IOMMU, IOMMU merging. > + nomerge Disable IOMMU merging. > + forcesacForce single address cycle (SAC, 32-bit). > + allowdacPermit dual address cycle (DAC, 64-bit). > + nodac Forbid dual address cycle (DAC, 64-bit). > + softEnable swiotlb. > + calgary Use Calgary IOMMU. > + > + (GART-only options follow...) > +Specify size of remapping area. > + fullflush Disable optimizing flushing strategy. > + nofullflush Enable optimizing flushing strategy. > + noagp Use entire aperture, AGP isn't using it. > + noaperture Disable aperture fixups / hole init. > + memaper= malloc an aperture of order N. > + > io7=[HW] IO7 for Marvel based alpha systems > See comment before marvel_specify_io7 in > arch/alpha/kernel/core_marvel.c. 
> - --- ~Randy
Re: [PATCH 4/5] writeback: remove pages_skipped accounting in __block_write_full_page()
On Wed, Oct 03, 2007 at 09:43:33AM +0800, Fengguang Wu wrote:
> On Wed, Oct 03, 2007 at 07:55:18AM +1000, David Chinner wrote:
> > >
> > > do not quite agree with each other. The page writeback should be skipped for
> > > 'locked buffer', but here it is 'clean buffer'!
> >
> > Ok, so that means we need an equivalent fix in xfs_start_page_writeback()
> > as it will skip pages with clean buffers just like this. Something like
> > this (untested)?
>
> Sure OK - as long as it is 'no write because of clean buffer'.

Yes, that's the case here.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: RTC wakealarm write-only, still has 644 permissions
> > > > [EMAIL PROTECTED]:/sys/class/rtc/rtc0# cat wakealarm > > > > [EMAIL PROTECTED]:/sys/class/rtc/rtc0# echo 132719 > wakealarm > > > > At which point I'd expect > > > > # echo $? > > > > would indicate the write failed. That's a LONG time in the > > past (January 2, 1970), so that setting would be rejected. > > echo $? says 0 here :-(. I stand corrected. What it should do -- and does -- in that case involves disabling the alarm, then succeeding.
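The behaviour described in the reply above (a write of a time in the past disables the alarm and still reports success) can be modelled in a few lines. This is only a userspace sketch: `struct alarm_model` and `set_wakealarm_model` are hypothetical names made up for illustration, not the rtc sysfs code.

```c
#include <stdbool.h>

/* Hypothetical model of an RTC alarm; not the kernel's data structure. */
struct alarm_model {
    bool enabled;
    long when;      /* seconds since the epoch */
};

/*
 * Model of a write to wakealarm as described in the thread:
 * a future time arms the alarm, a past (or current) time disables it.
 * Both cases return 0, so "echo $?" prints 0 either way.
 */
static int set_wakealarm_model(struct alarm_model *a, long now, long requested)
{
    if (requested > now) {
        a->enabled = true;
        a->when = requested;
    } else {
        a->enabled = false;     /* a time in the past just turns the alarm off */
    }
    return 0;
}
```

With `now` somewhere in 2007, a request for 132719 (January 2, 1970) leaves the alarm disabled and returns 0, which matches what was observed.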
Re: [Question] How to represent SYSTEM_RAM in kernel/resource.c
On Wed, Oct 03, 2007 at 10:31:36AM +0900, KAMEZAWA Hiroyuki wrote:
> i386 and x86_64 register System RAM as IORESOURCE_MEM | IORESOURCE_BUSY.
> ia64 registers System RAM as IORESOURCE_MEM.
>
> Which is better ?

Should probably be BUSY. Non-BUSY regions can have io resources
requested underneath them, but you wouldn't want a PCI device to be
assigned an address which overlaps with physical memory.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
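To make the BUSY argument concrete, here is a toy userspace model, not the kernel's resource code. `RES_MEM`, `RES_BUSY`, `struct region` and `can_request` are hypothetical stand-ins, and the flag values are made up. The only rule it encodes is the one stated above: a new request may nest under a non-busy region, but is refused if it overlaps a busy one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RES_MEM  0x1u   /* hypothetical stand-in for IORESOURCE_MEM */
#define RES_BUSY 0x2u   /* hypothetical stand-in for IORESOURCE_BUSY */

struct region {
    uint64_t start;     /* inclusive */
    uint64_t end;       /* inclusive */
    unsigned flags;
};

static bool overlaps(const struct region *r, uint64_t start, uint64_t end)
{
    return start <= r->end && end >= r->start;
}

/*
 * Return true if [start, end] may be requested given the existing regions.
 * Overlapping a busy region is refused; overlapping a non-busy region is
 * allowed, which is the case where a device could land on top of RAM.
 */
static bool can_request(const struct region *existing, size_t n,
                        uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < n; i++) {
        if (overlaps(&existing[i], start, end) &&
            (existing[i].flags & RES_BUSY))
            return false;
    }
    return true;
}
```

In this model, a RAM region registered without the busy flag lets a request for an address range inside it succeed, which is the situation the reply warns about.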
Re: [PATCH 4/5] writeback: remove pages_skipped accounting in __block_write_full_page()
On Wed, Oct 03, 2007 at 07:55:18AM +1000, David Chinner wrote: > > > > do not quite agree with each other. The page writeback should be skipped for > > 'locked buffer', but here it is 'clean buffer'! > > Ok, so that means we need an equivalent fix in xfs_start_page_writeback() > as it will skip pages with clean buffers just like this. Something like > this (untested)? Sure OK - as long as it is 'no write because of clean buffer'. The only user of pages_skipped is obviously using that semantics. Andrew, here is the expanded patch: --- writeback: remove pages_skipped accounting in __block_write_full_page() Miklos Szeredi <[EMAIL PROTECTED]> and me identified a writeback bug: > The following strange behavior can be observed: > > 1. large file is written > 2. after 30 seconds, nr_dirty goes down by 1024 > 3. then for some time (< 30 sec) nothing happens (disk idle) > 4. then nr_dirty again goes down by 1024 > 5. repeat from 3. until whole file is written > > So basically a 4Mbyte chunk of the file is written every 30 seconds. > I'm quite sure this is not the intended behavior. 
It can be produced by the following test scheme: # cat bin/test-writeback.sh grep nr_dirty /proc/vmstat echo 1 > /proc/sys/fs/inode_debug dd if=/dev/zero of=/var/x bs=1K count=204800& while true; do grep nr_dirty /proc/vmstat; sleep 1; done # bin/test-writeback.sh nr_dirty 19207 nr_dirty 19207 nr_dirty 30924 204800+0 records in 204800+0 records out 209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s nr_dirty 47150 nr_dirty 47141 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47205 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47215 nr_dirty 47216 nr_dirty 47216 nr_dirty 47216 nr_dirty 47154 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47134 nr_dirty 47134 nr_dirty 47135 nr_dirty 47135 nr_dirty 47135 nr_dirty 46097 <== -1038 nr_dirty 46098 nr_dirty 46098 nr_dirty 46098 [...] nr_dirty 46091 nr_dirty 46092 nr_dirty 46092 nr_dirty 45069 <== -1023 nr_dirty 45056 nr_dirty 45056 nr_dirty 45056 [...] nr_dirty 37822 nr_dirty 36799 <== -1023 [...] nr_dirty 36781 nr_dirty 35758 <== -1023 [...] nr_dirty 34708 nr_dirty 33672 <== -1024 [...] nr_dirty 33692 nr_dirty 32669 <== -1023 % ls -li /var/x 847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x % dmesg|grep 847824 # generated by a debug printk [ 529.263184] redirtied inode 847824 line 548 [ 564.250872] redirtied inode 847824 line 548 [ 594.272797] redirtied inode 847824 line 548 [ 629.231330] redirtied inode 847824 line 548 [ 659.224674] redirtied inode 847824 line 548 [ 689.219890] redirtied inode 847824 line 548 [ 724.226655] redirtied inode 847824 line 548 [ 759.198568] redirtied inode 847824 line 548 # line 548 in fs/fs-writeback.c: 543 if (wbc->pages_skipped != pages_skipped) { 544 /* 545 * writeback is not making progress due to locked 546 * buffers. Skip this inode for now. 
547                      */
548                     redirty_tail(inode);
549             }

More debug efforts show that __block_write_full_page()
never has the chance to call submit_bh() for that big dirty file:
the buffer head is *clean*. So basically no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.

Also the comment in generic_sync_sb_inodes():

544                     /*
545                      * writeback is not making progress due to locked
546                      * buffers. Skip this inode for now.
547                      */

and the comment in __block_write_full_page():

1713            /*
1714             * The page was marked dirty, but the buffers were
1715             * clean. Someone wrote them back by hand with
1716             * ll_rw_block/submit_bh. A rare case.
1717             */

do not quite agree with each other. The page writeback should be skipped for
'locked buffer', but here it is 'clean buffer'!

This patch fixes this bug. Though I'm not sure why __block_write_full_page()
is called only to do nothing and who actually issued the writeback for us.

These are the two possible new behaviors after the patch:

1) pretty nice: wait 30s and write ALL:)
2) not so good:
        - during the dd: ~16M
        - after 30s: ~4M
        - after 5s: ~4M
        - after 5s: ~176M

The next patch will fix case (2).

Cc: Ken Chen <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
Signed-off-by: David Chinner <[EMAIL PROTECTED]>
---
 fs/buffer.c                 |    1 -
 fs/xfs/linux-2.6/xfs_aops.c |    5 ++---
 2 files changed, 2 insertions(+), 4 deletions(-)

--- linux-2.6.23-rc8-mm2.orig/fs/buffer.c
+++ linux-2.6.23-rc8-mm2/fs/buffer.c
@@ -1737,7 +1737,6 @@ done:
Re: linux cache routines for Write-back cache policy on MIPS24KE
Hi Geert,

By 'flush' I mean 'write-back only'.

Regards,
Veerasena.

--- Geert Uytterhoeven <[EMAIL PROTECTED]> wrote:

> On Mon, 1 Oct 2007, veerasena reddy wrote:
> > In linux-2.6.18 (for MIPS24KE processor):
> > suppose if i want to do flush only then which API i
> > should use?
>
> `flush' is fuzzy terminology: some people mean invalidate, others mean
> write back, others mean both.
>
> > Similarly, if i want to do invalidation only which API
> > i should use?
>
> dma_cache_inv().
>
> > --- Geert Uytterhoeven <[EMAIL PROTECTED]> wrote:
> >
> > > On Mon, 1 Oct 2007, veerasena reddy wrote:
> > > > I have ported Linux-2.6.18 kernel on MIPS24KE
> > > > processor. I am using write back cache policy.
> > > >
> > > > Could you please guide me under what cases the below
> > > > cache API's are being used:
> > > > - dma_cache_wback_inv() : Could you explain what
> > > > exactly this function does
> > >
> > > It does both write back and invalidate.
> > >
> > > > - dma_cache_wback() : This function write back the
> > > > cache data to memory
> > > > - dma_cache_inv : This function invalidate the cache
> > > > tags. so subsequent access will fetch from memory.
> > >
> > > Note that 2.6.18 is old. The above functions are
> > > intended to be removed.
>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds
Re: kswapd min order, slub max order [was Re: -mm merge plans for 2.6.24]
On Wednesday 03 October 2007 02:06, Hugh Dickins wrote:
> On Mon, 1 Oct 2007, Andrew Morton wrote:
> > #
> > # slub && antifrag
> > #
> > have-kswapd-keep-a-minimum-order-free-other-than-order-0.patch
> > only-check-absolute-watermarks-for-alloc_high-and-alloc_harder-allocations.patch
> > slub-exploit-page-mobility-to-increase-allocation-order.patch
> > slub-reduce-antifrag-max-order.patch
> >
> > I think this stuff is in the "mm stuff we don't want to merge"
> > category. If so, I really should have dropped it ages ago.
>
> I agree. I spent a while last week bisecting down to see why my heavily
> swapping loads take 30%-60% longer with -mm than mainline, and it was
> here that they went bad. Trying to keep higher orders free is costly.

Yeah, there's no way we'd merge that.

> On the other hand, hasn't SLUB efficiency been built on the expectation
> that higher orders can be used? And it would be a twisted shame for
> high performance to be held back by some idiot's swapping load.

IMO it's a bad idea to create all these dependencies like this.

If SLUB can get _more_ performance out of using higher order allocations,
then fine. If it is starting off at a disadvantage at the same order, then
that should be fixed first, right?
[PATCH] libata: fix for sata_mv >64KB DMA segments
Fix bug in sata_mv for cases where the IOMMU layer has merged SG entries
to larger than 64KB. They need to be split up before being sent to the
driver.

Just for simplicity's sake, split up at 64K boundary instead of 64K size,
since that's what the common code does anyway.

Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>

---

On Tue, Oct 02, 2007 at 07:23:10PM -0400, Jeff Garzik wrote:
> Olof Johansson wrote:
>> On Mon, Oct 01, 2007 at 06:37:44PM -0400, Jeff Garzik wrote:
>> Looks like it's caused by enabling vmerge (which tends to be on for the
>> common PPC defconfigs). If I disable it, things look OK.
>> Perhaps the Marvell controller doesn't like requests larger than 64K,
>> or wrapping some boundary. Do you have access to erratas/docs?
>> I have verified it on a powermac now as well (had a quick scare that it
>> might have been some problem with the PA Semi IOMMU, but no).
>
> FWIW: I just tried the 6042 on my AMD64 platform with iommu=force, and was
> unable to reproduce any trouble.
>
> You could try changing MV_DMA_BOUNDARY to 0xU and see what happens.

As per discussion on IRC, it was really caused by mv_fill_sg() not
handling SG entries larger than 64K properly. The patch below fixes it to
behave like ata_fill_sg() instead.

Works OK here after some light testing (restoring my corrupted root
filesystem and running a few fscks on it, among other things).
Thanks,

Olof

diff --git a/drivers/ata/sata_mv.c b/drivers/ata/sata_mv.c
index 11bf6c7..1a82e22 100644
--- a/drivers/ata/sata_mv.c
+++ b/drivers/ata/sata_mv.c
@@ -1139,15 +1139,27 @@ static unsigned int mv_fill_sg(struct ata_queued_cmd *qc)
 		dma_addr_t addr = sg_dma_address(sg);
 		u32 sg_len = sg_dma_len(sg);
 
-		mv_sg->addr = cpu_to_le32(addr & 0xffffffff);
-		mv_sg->addr_hi = cpu_to_le32((addr >> 16) >> 16);
-		mv_sg->flags_size = cpu_to_le32(sg_len & 0xffff);
+		while (sg_len) {
+			u32 offset = addr & 0xffff;
+			u32 len = sg_len;
 
-		if (ata_sg_is_last(sg, qc))
-			mv_sg->flags_size |= cpu_to_le32(EPRD_FLAG_END_OF_TBL);
+			if ((offset + sg_len > 0x10000))
+				len = 0x10000 - offset;
+
+			mv_sg->addr = cpu_to_le32(addr & 0xffffffff);
+			mv_sg->addr_hi = cpu_to_le32((addr >> 16) >> 16);
+			mv_sg->flags_size = cpu_to_le32(len);
+
+			sg_len -= len;
+			addr += len;
+
+			if (!sg_len && ata_sg_is_last(sg, qc))
+				mv_sg->flags_size |= cpu_to_le32(EPRD_FLAG_END_OF_TBL);
+
+			mv_sg++;
+			n_sg++;
+		}
 
-		mv_sg++;
-		n_sg++;
 	}
 
 	return n_sg;
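The splitting rule in the patch can be tried outside the kernel. Below is a minimal userspace sketch, assuming the boundary is 64 KB (0x10000) as the changelog says. `struct seg` and `split_at_64k` are hypothetical names used only for illustration; they are not part of sata_mv or libata.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one hardware SG entry (not the driver's struct). */
struct seg {
    uint64_t addr;
    uint32_t len;
};

/*
 * Split [addr, addr + len) so that no piece crosses a 64 KB boundary.
 * At most max pieces are stored in out; the return value is the number
 * of pieces the segment needs, which may be larger than max.
 */
static size_t split_at_64k(uint64_t addr, uint32_t len,
                           struct seg *out, size_t max)
{
    size_t n = 0;

    while (len) {
        uint32_t offset = (uint32_t)(addr & 0xffff);  /* position inside the 64 KB window */
        uint32_t chunk = len;

        /* Clamp the piece so it ends exactly at the next boundary. */
        if ((uint64_t)offset + chunk > 0x10000)
            chunk = 0x10000 - offset;

        if (n < max) {
            out[n].addr = addr;
            out[n].len = chunk;
        }
        n++;
        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

For example, a 0x3000-byte segment starting at 0x1f000 comes out as two pieces: 0x1000 bytes at 0x1f000 and 0x2000 bytes at 0x20000. Splitting at the boundary rather than at a fixed 64 KB size can produce a short first piece, which is the simplification the changelog mentions.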
Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote: > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote: > > wbc.pages_skipped = 0; > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned > > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write; > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { > > /* Wrote less than expected */ > > - congestion_wait(WRITE, HZ/10); > > - if (!wbc.encountered_congestion) > > + if (wbc.encountered_congestion || wbc.more_io) > > + congestion_wait(WRITE, HZ/10); > > + else > > break; > > } > > Why do you call congestion_wait() if there is more I/O to issue? If > we have a fast filesystem, this might cause the device queues to > fill, then drain on congestion_wait(), then fill again, etc. i.e. we > will have trouble keeping the queues full, right? You mean slow writers and fast RAID? That would be exactly the case these patches try to improve. The old writeback behaviors are sluggish when there is - single big dirty file; - single congested device the queues may well build up slowly, hit background_limit, and continue to build up, until hit dirty_limit. That means: - kupdate writeback could leave behind many expired dirty data - background writeback used to return prematurely - eventually it relies on balance_dirty_pages() to do the job, which means - writers get throttled unnecessarily - dirty_limit pages are pinned unnecessarily This patchset makes kupdate/background writeback more responsible, so that if (avg-write-speed < device-capabilities), the dirty data are synced timely, and we don't have to go for balance_dirty_pages(). So for your question of queue depth, the answer is: the queue length will not build up in the first place. Also the name of congestion_wait() could be misleading: - when not congested, congestion_wait() will wakeup on write completions; - when congested, congestion_wait() could also wakeup on write completions on other non-congested devices. 
So congestion_wait(100ms) normally only takes 0.1-10ms. For the more_io case, congestion_wait() serves more like 'to take a breath'. Tests show that the system could go mad without it. Regards, Fengguang
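For readers following the hunk quoted at the top of this message, the patched decision in background_writeout() can be written as a small pure function. This is a sketch under assumptions: `struct wbc_model` holds only the four fields the hunk touches, and `enum wb_action` and `next_action` are hypothetical names, not kernel API.

```c
#include <stdbool.h>

/* Hypothetical subset of writeback_control, for illustration only. */
struct wbc_model {
    long nr_to_write;               /* pages still left from the requested batch */
    long pages_skipped;
    bool encountered_congestion;
    bool more_io;
};

enum wb_action {
    WB_CONTINUE,    /* full batch written: go round the loop again */
    WB_WAIT,        /* the patched code calls congestion_wait() here */
    WB_STOP         /* nothing more to do: break out of the loop */
};

static enum wb_action next_action(const struct wbc_model *wbc)
{
    if (wbc->nr_to_write > 0 || wbc->pages_skipped > 0) {
        /* Wrote less than expected */
        if (wbc->encountered_congestion || wbc->more_io)
            return WB_WAIT;
        return WB_STOP;
    }
    return WB_CONTINUE;
}
```

The point of the patch shows up in the second branch: a short write with more_io set now waits briefly and retries instead of ending background writeback early.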
Re: [PATCH] sky2: jumbo frame regression fix
On Tue, 2007-10-02 at 18:02 -0700, Stephen Hemminger wrote:
> Remove unneeded check that caused problems with jumbo frame sizes.
> The check was recently added and is wrong.
> When using jumbo frames the sky2 driver does fragmentation, so
> rx_data_size is less than mtu.

Confirmed working. Now running with 9k mtu with no errors, =)

It also seems that the FIFO bug was the one that affected me before,
damn odd race that one.

> Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

Tested-by: Ian Kumlien <[EMAIL PROTECTED]>

(if that tag exists now)

Btw, Sorry but all mail directly to you will be blocked. I have yet to
fix the relaying properly with ISPs blocking port 25 etc so for some of
you this mail will only show up on the ML.

> --- a/drivers/net/sky2.c 2007-10-02 17:56:31.0 -0700
> +++ b/drivers/net/sky2.c 2007-10-02 17:58:56.0 -0700
> @@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru
>  	sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending;
>  	prefetch(sky2->rx_ring + sky2->rx_next);
>
> -	if (length < ETH_ZLEN || length > sky2->rx_data_size)
> -		goto len_error;
> -
>  	/* This chip has hardware problems that generates bogus status.
>  	 * So do only marginal checking and expect higher level protocols
>  	 * to handle crap frames.

--
Ian Kumlien -- http://pomac.netswarm.net
[PATCH] Document x86-64 iommu kernel parameters
Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
---
After having to go figure out what some of these mean, I figured I
would save others the trouble.

Some of these are "best guess" based on a quick scan of the code, so it
certainly needs a sanity review before going upstream.

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4d175c7..8afea9b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -763,6 +763,30 @@ and is between 256 and 4096 characters. It is defined in the file
 
 	inttest=	[IA64]
 
+	iommu=option[,option..]	[X86-64]
+		off		Disable IOMMU.
+		force		Unconditionally enable IOMMU.
+		noforce		Disable IOMMU and IOMMU merging, by default.
+		biomerge	Unconditionally enable IOMMU, IOMMU merging,
+				and set BIO IOMMU vmerge boundary to 4096.
+		panic		Panic on IOMMU overflow.
+		nopanic		Do not panic on IOMMU overflow.
+		merge		Unconditionally enable IOMMU, IOMMU merging.
+		nomerge		Disable IOMMU merging.
+		forcesac	Force single address cycle (SAC, 32-bit).
+		allowdac	Permit dual address cycle (DAC, 64-bit).
+		nodac		Forbid dual address cycle (DAC, 64-bit).
+		soft		Enable swiotlb.
+		calgary		Use Calgary IOMMU.
+
+		(GART-only options follow...)
+				Specify size of remapping area.
+		fullflush	Disable optimizing flushing strategy.
+		nofullflush	Enable optimizing flushing strategy.
+		noagp		Use entire aperture, AGP isn't using it.
+		noaperture	Disable aperture fixups / hole init.
+		memaper=	malloc an aperture of order N.
+
 	io7=		[HW] IO7 for Marvel based alpha systems
 			See comment before marvel_specify_io7 in
 			arch/alpha/kernel/core_marvel.c.
[Question] How to represent SYSTEM_RAM in kernel/resource.c
Hi,

Now, SYSTEM_RAM is registered in the resource list, and a user can see the
memory map from /proc/iomem, like the following.
==
[EMAIL PROTECTED] linux-2.6.23-rc8-mm2]$ grep RAM /proc/iomem
-0009 : System RAM
0010-03ff : System RAM
0400-04f1bfff : System RAM
04f1c000-6b4b9fff : System RAM
6b4ba000-6b797fff : System RAM
6b798000-6b799fff : System RAM
6b79a000-6b79dfff : System RAM
6b79e000-6b79efff : System RAM
6b79f000-6b7fbfff : System RAM
6b7fc000-6c629fff : System RAM
6c62a000-6c800fff : System RAM
6c801000-6c843fff : System RAM
6c844000-6c847fff : System RAM
6c848000-6c849fff : System RAM
6c84a000-6c85dfff : System RAM
6c85e000-6c85efff : System RAM
6c85f000-6cbfbfff : System RAM
6cbfc000-6d349fff : System RAM
6d34a000-6d3fbfff : System RAM
6d3fc000-6d455fff : System RAM
6d4fc000-6d773fff : System RAM
1-7 : System RAM
408000-40 : System RAM
1400400-147 : System RAM
==

But there is some confusion.

i386 and x86_64 register System RAM as IORESOURCE_MEM | IORESOURCE_BUSY.
ia64 registers System RAM as IORESOURCE_MEM.

Which is better ?

I ask this because the current memory hotplug code treats memory as
IORESOURCE_MEM. When memory hot-add occurs on x86_64, new memory is added
as IORESOURCE_MEM. Memory hot-remove can (for now) remove only
IORESOURCE_MEM.

If ia64 should treat System RAM as IORESOURCE_MEM | IORESOURCE_BUSY, I'll
write a fix.

Thanks,
-Kame
Re: 2.6.23-rc7-mm1 -- powerpc rtas panic
On Wed, Oct 03, 2007 at 10:30:16AM +1000, Michael Ellerman wrote: > I realise it'll make the patch bigger, but this doesn't seem like a > particularly good name for the variable anymore. Sure, what about? Clarify when RTAS logging is enabled. Signed-off-by: Tony Breeds <[EMAIL PROTECTED]> --- arch/powerpc/platforms/pseries/rtasd.c | 15 +-- 1 files changed, 9 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/pseries/rtasd.c b/arch/powerpc/platforms/pseries/rtasd.c index 30925d2..73401c8 100644 --- a/arch/powerpc/platforms/pseries/rtasd.c +++ b/arch/powerpc/platforms/pseries/rtasd.c @@ -54,8 +54,9 @@ static unsigned int rtas_event_scan_rate; static int full_rtas_msgs = 0; /* Stop logging to nvram after first fatal error */ -static int no_more_logging; - +static int logging_enabled; /* Until we initialize everything, + * make sure we don't try logging + * anything */ static int error_log_cnt; /* @@ -217,7 +218,7 @@ void pSeries_log_error(char *buf, unsigned int err_type, int fatal) } /* Write error to NVRAM */ - if (!no_more_logging && !(err_type & ERR_FLAG_BOOT)) + if (logging_enabled && !(err_type & ERR_FLAG_BOOT)) nvram_write_error_log(buf, len, err_type, error_log_cnt); /* @@ -229,8 +230,8 @@ void pSeries_log_error(char *buf, unsigned int err_type, int fatal) printk_log_rtas(buf, len); /* Check to see if we need to or have stopped logging */ - if (fatal || no_more_logging) { - no_more_logging = 1; + if (fatal || !logging_enabled) { + logging_enabled = 0; spin_unlock_irqrestore(&rtasd_log_lock, s); return; } @@ -302,7 +303,7 @@ static ssize_t rtas_log_read(struct file * file, char __user * buf, spin_lock_irqsave(&rtasd_log_lock, s); /* if it's 0, then we know we got the last one (the one in NVRAM) */ - if (rtas_log_size == 0 && !no_more_logging) + if (rtas_log_size == 0 && logging_enabled) nvram_clear_error_log(); spin_unlock_irqrestore(&rtasd_log_lock, s); @@ -414,6 +415,8 @@ static int rtasd(void *unused) memset(logdata, 0, rtas_error_log_max); rc 
= nvram_read_error_log(logdata, rtas_error_log_max, &err_type, &error_log_cnt); + /* We can use rtas_log_buf now */ + logging_enabled = 1; if (!rc) { if (err_type != ERR_FLAG_ALREADY_LOGGED) { Yours Tony linux.conf.au http://linux.conf.au/ || http://lca2008.linux.org.au/ Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!
Re: [PATCH 05/12] mm: trylock_page
On Sunday 30 September 2007 01:01, Peter Zijlstra wrote: > On Fri, 2007-09-28 at 13:11 +1000, Nick Piggin wrote: > > On Friday 28 September 2007 17:42, Peter Zijlstra wrote: > > > Replace raw TestSetPageLocked() usage with trylock_page() > > > > I have such a thing queued too, for the lock bitops patches for when > > 2.6.24 opens, Andrew promises me :). > > > > I guess they should be identical, except I don't like doing trylock_page > > in place of SetPageLocked, for memory ordering performance and aesthetic > > reasons... I've got an init_page_locked (or set_page_locked... I can't > > remember, the patch is at home). > > Sure, that might work, or we could just make it so that add_to_*_cache > is never passed an unlocked page. But sure... It does kind of make sense to have it there (because you want the page to be locked iff it gets inserted into pagecache). And wherever you lock the page, we'll still want an init_page_locked to be able to lock it while we are the only owner of it, for the same performance reason. > > Fine idea to lockdep the page lock, anyway. Does it show up any of the > > buffered write deadlock possibilities? :) > > Not yet, it might just be that the concessions done to annotate this > type of lock were too severe. > > What I basically did was treat all the page locks as a single recursive > lock. Hmm, OK. There are a couple of page lock deadlocks there that wouldn't be picked up, but the page lock vs mmap_sem one probably should be. > > buffer lock is another notable bit-mutex that might be converted (I have > > the patch to do the similar nice !tas->trylock conversion for that too). > > I think it is used widely enough by tricky code that it would be useful > > to annotate as well. > > Not at all familiar with that lock, but yeah, we could have a look at > doing that too. Should be pretty well identical to the page lock. I'll cc you on the patch to do this equivalent API conversion, if you'd like. 
> > Unfortunately we can't convert bit_spinlock.h easily, I guess?
>
> Yeah, the space constraints make that rather hard. Each of these locks
> needs some form of external meta-data.

Yeah.

> For the page lock I used one lock instance per file system type.

Seems OK.
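As an aside on the "!tas->trylock" conversion discussed in this thread, the relationship between the two styles can be shown with C11 atomics in userspace. This is only a model: `struct page_model`, `PG_LOCKED_BIT` and the three functions are hypothetical names, and the memory orderings are an assumption about what a lock-style bit operation would want, not a description of the kernel's bitops.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LOCKED_BIT 0x1ul     /* hypothetical bit value for this model */

struct page_model {
    atomic_ulong flags;
};

/* Old style: returns the previous bit state, so true means "already locked". */
static bool test_set_page_locked(struct page_model *p)
{
    unsigned long old = atomic_fetch_or_explicit(&p->flags, PG_LOCKED_BIT,
                                                 memory_order_acquire);
    return (old & PG_LOCKED_BIT) != 0;
}

/* New style: true means the caller now holds the lock. */
static bool trylock_page_model(struct page_model *p)
{
    return !test_set_page_locked(p);
}

static void unlock_page_model(struct page_model *p)
{
    atomic_fetch_and_explicit(&p->flags, ~PG_LOCKED_BIT, memory_order_release);
}
```

The inverted return value is the whole API change: callers that wrote `if (TestSetPageLocked(page)) goto busy;` would read `if (!trylock_page(page)) goto busy;` after the conversion.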
Re: [PATCH] Add ability to dump mapped pages with /proc/sys/vm/drop_caches
On Monday 01 October 2007 04:03, Soeren Sandmann wrote: > This patch adds the ability to drop mapped pages with > /proc/sys/vm/drop_caches. This is useful to get repeatable > measurements of startup time for applications. > > Without it, pages that are mapped in already-running applications will > not get dropped, so the time measured will not be a true cold-cache > time. invalidate_inode_pages2_range is a nasty function... only to be used if you really know you need it (and even in that case, it's probably wrong!). I have on my todo list a spring clean of mm/truncate.c to attempt to figure out how to fix this thing properly but until then, it's a bad idea to put in this kind of function. Please just unmap_mapping_range before the existing invalidate call. Also, I don't know if there is any use in making an extra mode for this -- presumably if someone wants to drop all pagecache, they really want to drop all pagecache. Aside from that, I think the idea is fine and indeed should make this more useful (I never realised it didn't throw out mapped pages, which we really should do to have reliable testing). So, thanks. > Rik pointed out that "be_atomic" is a bit pointless since there is a > race on SMP anyway where pages can be added. However, it is there in > the existing code, so I added it for the new code as well. Does anyone > know why it's there? 
> > > Soren > > Signed-off-by: Soren Sandmann <[EMAIL PROTECTED]> > > --- > fs/drop_caches.c | 13 - > include/linux/fs.h |3 +++ > include/linux/mm.h |2 +- > mm/truncate.c | 36 ++-- > 4 files changed, 34 insertions(+), 20 deletions(-) > > diff --git a/fs/drop_caches.c b/fs/drop_caches.c > index 59375ef..9f3741d 100644 > --- a/fs/drop_caches.c > +++ b/fs/drop_caches.c > @@ -12,7 +12,7 @@ > /* A global variable is a bit ugly, but it keeps the code simple */ > int sysctl_drop_caches; > > -static void drop_pagecache_sb(struct super_block *sb) > +static void drop_pagecache_sb(struct super_block *sb, int unmap) > { > struct inode *inode; > > @@ -20,12 +20,15 @@ static void drop_pagecache_sb(struct super_block *sb) > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { > if (inode->i_state & (I_FREEING|I_WILL_FREE)) > continue; > - __invalidate_mapping_pages(inode->i_mapping, 0, -1, true); > + if (unmap) > + __invalidate_inode_pages2_range(inode->i_mapping, 0, > -1, true); > + else > + __invalidate_mapping_pages(inode->i_mapping, 0, -1, > true); > } > spin_unlock(&inode_lock); > } > > -void drop_pagecache(void) > +void drop_pagecache(int unmap) > { > struct super_block *sb; > > @@ -36,7 +39,7 @@ restart: > spin_unlock(&sb_lock); > down_read(&sb->s_umount); > if (sb->s_root) > - drop_pagecache_sb(sb); > + drop_pagecache_sb(sb, unmap); > up_read(&sb->s_umount); > spin_lock(&sb_lock); > if (__put_super_and_need_restart(sb)) > @@ -60,7 +63,7 @@ int drop_caches_sysctl_handler(ctl_table *table, int > write, proc_dointvec_minmax(table, write, file, buffer, length, ppos); > if (write) { > if (sysctl_drop_caches & 1) > - drop_pagecache(); > + drop_pagecache(sysctl_drop_caches & 4); > if (sysctl_drop_caches & 2) > drop_slab(); > } > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 16421f6..c112aa3 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1536,6 +1536,9 @@ static inline void invalidate_remote_inode(struct > inode *inode) 
invalidate_mapping_pages(inode->i_mapping, 0, -1); > } > extern int invalidate_inode_pages2(struct address_space *mapping); > +extern int __invalidate_inode_pages2_range(struct address_space *mapping, > +pgoff_t start, pgoff_t end, > +int be_atomic); > extern int invalidate_inode_pages2_range(struct address_space *mapping, >pgoff_t start, pgoff_t end); > extern int write_inode_now(struct inode *, int); > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 1692dd6..6719c41 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1207,7 +1207,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, > int, struct file *, void __user *, size_t *, loff_t *); > unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, > unsigned long lru_pages); > -void drop_pagecache(void); > +void drop_pagecache(int unmap); > void drop_slab(void); > > #ifndef CONFIG_MMU > diff --git a/mm/truncate.c b/mm/truncate.c > index 5cdfbc1..568ac77 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -371,19 +371,9 @@ static int do_laund
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Tuesday 02 October 2007 06:50, Christoph Lameter wrote:
> On Fri, 28 Sep 2007, Nick Piggin wrote:
> > I thought it was slower. Have you fixed the performance regression?
> > (OK, I read further down that you are still working on it but not
> > confirmed yet...)
>
> The problem is with the weird way of Intel testing and communication.
> Every 3-6 month or so they will tell you the system is X% up or down on
> arch Y (and they wont give you details because its somehow secret). And
> then there are conflicting statements by the two or so performance test
> departments. One of them repeatedly assured me that they do not see any
> regressions.

Just so long as there aren't known regressions that would require higher
order allocations to fix them.

> > OK, so long as it isn't going to depend on using higher order pages,
> > that's fine. (if they help even further as an optional thing, that's fine
> > too. You can turn them on your huge systems and not even bother about
> > adding this vmap fallback -- you won't have me to nag you about these
> > purely theoretical issues).
>
> Well the vmap fallback is generally useful AFAICT. Higher order
> allocations are common on some of our platforms. Order 1 failures even
> affect essential things like stacks that have nothing to do with SLUB and
> the LBS patchset.

I don't know if it is worth the trouble, though. The best thing to do is to
ensure that contiguous memory is not wasted on frivolous things... a few
order-1 or 2 allocations aren't too much of a problem.

The only high order allocation failure I've seen from fragmentation for a
long time IIRC are the order-3 failures coming from e1000. And obviously
they cannot use vmap.
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Tuesday 02 October 2007 07:01, Christoph Lameter wrote: > On Sat, 29 Sep 2007, Peter Zijlstra wrote: > > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: > > > Really? That means we can no longer even allocate stacks for forking. > > > > I think I'm running with 4k stacks... > > 4k stacks will never fly on an SGI x86_64 NUMA configuration given the > additional data that may be kept on the stack. We are currently > considering to go from 8k to 16k (or even 32k) to make things work. So > having the ability to put the stacks in vmalloc space may be something to > look at. i386 and x86-64 already used 8K stacks for years and they have never really been much problem before. They only started failing when contiguous memory is getting used up by other things, _even with_ those anti-frag patches in there. Bottom line is that you do not use higher order allocations when you do not need them. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sky2: jumbo frame regression fix
Stephen Hemminger wrote: Remove unneeded check that caused problems with jumbo frame sizes. The check was recently added and is wrong. When using jumbo frames the sky2 driver does fragmentation, so rx_data_size is less than mtu. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- a/drivers/net/sky2.c2007-10-02 17:56:31.0 -0700 +++ b/drivers/net/sky2.c2007-10-02 17:58:56.0 -0700 @@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending; prefetch(sky2->rx_ring + sky2->rx_next); - if (length < ETH_ZLEN || length > sky2->rx_data_size) - goto len_error; - 2.6.23? 2.6.24? enquiring minds want to know... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] allow drivers to flush in-flight DMA v2
On Fri, Sep 28, 2007 at 03:43:39PM -0700, David Miller wrote:
>
> My only beef with this patch set is that it seems
> a bit much to create a totally new function name every
> time we want to set some kind of new attribute on some
> DMA object. Why not add a "dma_set_flags()" or similar
> that can be used later on to set other kinds of aspects
> we'd like to change?
>
> You can make the arguments "u32 flags" and "int dir".
> Actually you should probably use the dma direction
> enumeration instead of 'int'.

OK, this will be in the next version along with the coding style changes you mentioned.

-- Arthur
Re: [4/4] mthca: allow setting "dmabarrier" on user-allocated memory
On Fri, Sep 28, 2007 at 12:50:00PM -0700, Roland Dreier wrote:
> Sorry for not mentioning this earlier, but this patch should really be
> two (or more) patches: one to add dmabarrier support to the core user
> memory stuff in drivers/infiniband, and a second one to add support to
> mthca (and more patches to add support to mlx4, cxgb3, etc, etc).

Makes sense.

> > > + * @dmabarrier: set "dmabarrier" attribute on this memory, if necessary
>
> Nit: just delete the "if necessary" since I don't think it makes
> things clearer (and actually doesn't make much sense in this context)

OK.

> Other than that this looks fine to me, and I'm ready to merge it once
> the necessary core DMA stuff is settled.

Great. A new version of the patchset is on the way.

-- Arthur
Re: [PATCH 0/5] Boot protocol changes
Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
> > I'm proposing that the existing bzImage format be retained, but that
> > the payload of the decompressor (already a gzip file) simply be
> > vmlinux.gz -- i.e. a gzip compressed ELF file, notes and all. A
> > pointer in the header will point to the offset of the payload (this is
> > new, obviously.)
> >
> > The decompression stub is adjusted to expect an ELF image, instead of
> > a raw binary.
>
> It could, or just treat it as a raw binary at 1M+offset to skip the headers.

It would be cleaner to actually parse the ELF; it's only a handful of lines of code (we don't have to support arbitrary placement of sections, obviously, which makes the problem simpler.)

> > Existing bootloaders (16- or 32-bit) simply load the bzImage the way
> > they do now; new bootloaders have the option of accessing the
> > vmlinux.gz directly if they either want to load it themselves or want
> > to examine the notes.
>
> OK, but that has the same problem as making the payload an ELF file:
> 32-bit bootloaders which simply jump to 1M will be jumping into data
> rather than code - and I got the impression from talking to Eric at KS
> that there are such bootloaders.

Uhm, no it doesn't. Those bootloaders jump to the decompressor, not to the payload. The decompressor interface hasn't changed.

> If that's not an issue, then I still think the payload should be a plain
> ELF file (possibly self-decompressing, or just a plain uncompressed
> vmlinux, if that's what's desired). Still think making a protected-mode
> bootloader do the decompression is the wrong way to go about this; ELF
> is enough.

It doesn't have to if it doesn't want to; it only needs to do so if it wants to access the kernel as an ELF. Again, it has the advantage that the ELF is the real vmlinux, no funnies.

-hpa
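To give a rough idea of what "parsing the ELF" in a decompression stub could involve, here is a minimal user-space sketch. It is not the code from this patch series: the structure names, the helper load_elf_segments() and its parameters are all hypothetical, and it assumes a 32-bit ELF image in native byte order that has already been decompressed into memory. It only walks the program headers and copies PT_LOAD segments to their physical offsets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal 32-bit ELF structures, laid out as in the ELF specification. */
struct elf32_ehdr {
    unsigned char e_ident[16];
    uint16_t e_type, e_machine;
    uint32_t e_version, e_entry, e_phoff, e_shoff, e_flags;
    uint16_t e_ehsize, e_phentsize, e_phnum;
    uint16_t e_shentsize, e_shnum, e_shstrndx;
};

struct elf32_phdr {
    uint32_t p_type, p_offset, p_vaddr, p_paddr;
    uint32_t p_filesz, p_memsz, p_flags, p_align;
};

#define PT_LOAD 1

/*
 * Hypothetical helper: copy every PT_LOAD segment of an already
 * decompressed ELF image to dest_base + (p_paddr - phys_base) and zero
 * the remainder of each segment (the .bss part).
 * Returns the number of segments loaded, or -1 on a malformed image.
 */
static int load_elf_segments(const unsigned char *img, size_t img_len,
                             unsigned char *dest_base, size_t dest_len,
                             uint32_t phys_base)
{
    struct elf32_ehdr eh;
    int loaded = 0;

    if (img_len < sizeof(eh))
        return -1;
    memcpy(&eh, img, sizeof(eh));
    if (memcmp(eh.e_ident, "\177ELF", 4) != 0)
        return -1;
    if (eh.e_phentsize != sizeof(struct elf32_phdr))
        return -1;

    for (uint16_t i = 0; i < eh.e_phnum; i++) {
        struct elf32_phdr ph;
        size_t off = (size_t)eh.e_phoff + (size_t)i * sizeof(ph);
        size_t dst;

        if (off > img_len || img_len - off < sizeof(ph))
            return -1;
        memcpy(&ph, img + off, sizeof(ph));
        if (ph.p_type != PT_LOAD)
            continue;               /* notes etc. are not loaded */
        if (ph.p_paddr < phys_base || ph.p_filesz > ph.p_memsz)
            return -1;
        dst = ph.p_paddr - phys_base;
        if (ph.p_offset > img_len || img_len - ph.p_offset < ph.p_filesz)
            return -1;
        if (dst > dest_len || dest_len - dst < ph.p_memsz)
            return -1;
        memcpy(dest_base + dst, img + ph.p_offset, ph.p_filesz);
        memset(dest_base + dst + ph.p_filesz, 0, ph.p_memsz - ph.p_filesz);
        loaded++;
    }
    return loaded;
}
```

Because only program headers are consulted, section placement never matters here, which matches the remark that arbitrary section placement need not be supported.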
[PATCH] sky2: jumbo frame regression fix
Remove unneeded check that caused problems with jumbo frame sizes. The check was recently added and is wrong. When using jumbo frames the sky2 driver does fragmentation, so rx_data_size is less than mtu. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- a/drivers/net/sky2.c2007-10-02 17:56:31.0 -0700 +++ b/drivers/net/sky2.c2007-10-02 17:58:56.0 -0700 @@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending; prefetch(sky2->rx_ring + sky2->rx_next); - if (length < ETH_ZLEN || length > sky2->rx_data_size) - goto len_error; - /* This chip has hardware problems that generates bogus status. * So do only marginal checking and expect higher level protocols * to handle crap frames. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
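A minimal user-space sketch of the condition this patch removes, to show why it misfires once a frame spans more than one receive buffer. The helper name removed_check_rejects() and the buffer sizes used in the note below are assumptions for illustration only; the condition itself is copied from the diff, and ETH_ZLEN is 60 as in linux/if_ether.h.

```c
#include <stdbool.h>

#define ETH_ZLEN 60 /* minimum Ethernet frame length, without FCS */

/*
 * The test deleted from sky2_receive(): a frame is treated as a length
 * error when it is shorter than ETH_ZLEN or longer than rx_data_size.
 * With jumbo frames the driver fragments the receive data, so
 * rx_data_size can be smaller than the MTU and valid frames fail.
 */
static bool removed_check_rejects(unsigned int length,
                                  unsigned int rx_data_size)
{
    return length < ETH_ZLEN || length > rx_data_size;
}
```

For example, with a hypothetical rx_data_size of 1024, a perfectly valid 1510-byte frame is rejected, while the same frame passes when rx_data_size is 1536.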
Re: [PATCH 0/5] Boot protocol changes
H. Peter Anvin wrote:
> I'm proposing that the existing bzImage format be retained, but that
> the payload of the decompressor (already a gzip file) simply be
> vmlinux.gz -- i.e. a gzip compressed ELF file, notes and all. A
> pointer in the header will point to the offset of the payload (this is
> new, obviously.)
>
> The decompression stub is adjusted to expect an ELF image, instead of
> a raw binary.

It could, or just treat it as a raw binary at 1M+offset to skip the headers.

> Existing bootloaders (16- or 32-bit) simply load the bzImage the way
> they do now; new bootloaders have the option of accessing the
> vmlinux.gz directly if they either want to load it themselves or want
> to examine the notes.

OK, but that has the same problem as making the payload an ELF file: 32-bit bootloaders which simply jump to 1M will be jumping into data rather than code - and I got the impression from talking to Eric at KS that there are such bootloaders.

If that's not an issue, then I still think the payload should be a plain ELF file (possibly self-decompressing, or just a plain uncompressed vmlinux, if that's what's desired). Still think making a protected-mode bootloader do the decompression is the wrong way to go about this; ELF is enough.

J
Re: [PATCH 0/5] Boot protocol changes
H. Peter Anvin wrote:
> No, not at all. I'm proposing that the existing bzImage format be
> retained, but that the payload of the decompressor (already a gzip file)
> simply be vmlinux.gz -- i.e. a gzip compressed ELF file, notes and all.
> A pointer in the header will point to the offset of the payload (this is
> new, obviously.)
>
> The decompression stub is adjusted to expect an ELF image, instead of a
> raw binary.
>
> Existing bootloaders (16- or 32-bit) simply load the bzImage the way
> they do now; new bootloaders have the option of accessing the vmlinux.gz
> directly if they either want to load it themselves or want to examine
> the notes.

Slight correction: it does, of course, break loaders which root through the bzImage for a gzip header and decode that themselves and place in memory. These loaders are pretty broken, though; they can't deal with the fact that the physical address of the kernel is configurable, for one thing.

-hpa
[BUG] sky2 errors in 2.6.23-rc9-git1
Hi,

Sorry about this, but the latest sky2 seems damned odd.

I have been running with jumbo frames at home for quite some time, but with this kernel that doesn't work. I instead get loads of:

sky2 eth0: rx length error: status 0x5e60500 length 1510
sky2 eth0: rx length error: status 0x5e60500 length 1510
sky2 eth0: rx length error: status 0x5ea0500 length 1514
sky2 eth0: rx length error: status 0x5ea0500 length 1514

Where length can be just about anything from 800 -> MTU.

That is not enough though; I also, for some reason, got several hangs:

sky2 eth0: hung mac 0:68 fifo 143 (133:76)
sky2 eth0: receiver hang detected
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
sky2 eth0: hung mac 0:125 fifo 195 (93:88)
sky2 eth0: receiver hang detected
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
sky2 eth0: hung mac 0:124 fifo 98 (10:108)
sky2 eth0: receiver hang detected
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
sky2 eth0: hung mac 0:41 fifo 30 (187:17)
sky2 eth0: receiver hang detected
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
...

All during about 2 minutes.

Could this be related to [sky2: sky2 FE+ receive status workaround]:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blobdiff;f=drivers/net/sky2.c;h=a3de0b6127ebb537b87a1849e207909fcc333ee4;hp=0792031a5cf959a1543f32f4e0f2ab4ccb7b0ec2;hb=3b12e0141f7a97c3b84731b5f935ed738bb6f960;hpb=ff0ce6845bc18292e80ea40d11c3d3a539a3fc5e

The chips being used are:
sky2 :02:00.0: v1.18 addr 0xdbffc000 irq 17 Yukon-EC (0xb6) rev 2
sky2 :02:00.0: v1.18 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1

The receiver hang only happens on the REV 2 chip, which also reports:
sky2 :02:00.0: No interrupt generated using MSI, switching to INTx mode.

Ifconfig reports:
REV 2 chip:
RX packets:30492 errors:0 dropped:646 overruns:0 frame:646
TX packets:29229 errors:0 dropped:0 overruns:0 carrier:0
REV 1 chip:
RX packets:19795 errors:0 dropped:131 overruns:0 frame:131
TX packets:18588 errors:0 dropped:0 overruns:0 carrier:0

Let me know when jumbo frames work again, just mail me patches =) (too tired to look into it closer atm)

-- Ian Kumlien -- http://pomac.netswarm.net
Re: [PATCH 0/5] Boot protocol changes
Jeremy Fitzhardinge wrote: H. Peter Anvin wrote: This series looks like a good start for Xen, but we still need to work out where to stash the metadata which normally lives in ELF notes. Using ELF is convenient for Xen because it lets a large chunk of domain builder code be reused; on the other hand, loading a plain bzImage is pretty simple, so maybe it isn't such a big deal. HPA, Eric: if we don't go the "embed ELF" path, where's a good backwards-compatible place to stash the note data? If we do go with "embed ELF", how should we go about doing it? Arrange to put the ELF headers before the 1M mark? This sounds like another good reason to do the ELF image as the postcompression image. The interface to the embedded compression routine is then unchanged, and we get the "full vmlinux" with any notes that belongs there. I'll try to get an implementation of that done -- it really shouldn't be very hard. Please explain what you're proposing again, because my memory of your plan from last time wouldn't help in this case. Are you proposing that the bzImage contains compressed data that its expecting the bootloader to decompress? Won't that completely break backwards compatibility? If we don't care about backwards compatibility with old bootloaders, then it doesn't matter what we do one way or the other. No, not at all. I'm proposing that the existing bzImage format be retained, but that the payload of the decompressor (already a gzip file) simply be vmlinux.gz -- i.e. a gzip compressed ELF file, notes and all. A pointer in the header will point to the offset of the payload (this is new, obviously.) The decompression stub is adjusted to expect an ELF image, instead of a raw binary. Existing bootloaders (16- or 32-bit) simply load the bzImage the way they do now; new bootloaders have the option of accessing the vmlinux.gz directly if they either want to load it themselves or want to examine the notes. 
-hpa
Re: build error
On Tue, 2 Oct 2007 17:28:27 -0700 Randy Dunlap <[EMAIL PROTECTED]> wrote:
> On Tue, 2 Oct 2007 17:19:42 -0700 Stephen Hemminger wrote:
>
> > On Tue, 2 Oct 2007 22:12:13 +0200
> > [EMAIL PROTECTED] wrote:
> >
> > > [please CC: me, my subscribe mail was greylisted]
> > >
> > > Morning!
> > >
> > > My make run for 2.6.23-rc9 ends like this:
> > >
> > > GEN .version
> > > CHK include/linux/compile.h
> > > UPD include/linux/compile.h
> > > CC init/version.o
> > > LD init/built-in.o
> > > LD .tmp_vmlinux1
> > > kernel/built-in.o: In function `getnstimeofday':
> > > (.text+0x1e141): undefined reference to `__udivdi3'
> > > kernel/built-in.o: In function `do_gettimeofday':
> > > (.text+0x1e263): undefined reference to `__udivdi3'
> > > kernel/built-in.o: In function `timekeeping_resume':
> > > timekeeping.c:(.text+0x1e427): undefined reference to `__udivdi3'
> > > kernel/built-in.o: In function `update_wall_time':
> > > (.text+0x1e829): undefined reference to `__udivdi3'
> > > kernel/built-in.o: In function `update_wall_time':
> > > (.text+0x1ec4c): undefined reference to `__udivdi3'
> > > make: *** [.tmp_vmlinux1] Error 1
> > >
> > > .config attached.
> > >
> > > I have already read the diff from -rc8 and found nothing that helped me.
> > >
> > > Any ideas? Further questions?
> > >
> > > Wilfried
> > >
> >
> > What Gcc version? and config/architecture?
>
> .config file was attached. It says X86_32.
>
> I can't reproduce it on 2 different systems & toolchains.

There were earlier reports of gcc 4.3 bogus optimization: http://lkml.org/lkml/2007/5/18/355
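For background on the symbol in the link errors: on 32-bit x86, gcc implements 64-bit unsigned division by calling the libgcc helper __udivdi3, and the kernel is not linked against libgcc. The sketch below is a user-space illustration under assumptions, not the timekeeping code from this thread: struct simple_timespec and add_ns() are hypothetical names. It shows a subtract-in-a-loop normalisation of the kind a compiler may turn into a division, together with one known way (an empty GCC-style asm statement) to discourage that transformation.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for a seconds/nanoseconds pair. */
struct simple_timespec {
    uint64_t sec;
    uint64_t nsec;
};

/* Add ns nanoseconds to *ts, carrying whole seconds into ts->sec. */
static void add_ns(struct simple_timespec *ts, uint64_t ns)
{
    ns += ts->nsec;
    while (ns >= NSEC_PER_SEC) {
        /*
         * Empty asm (GCC/Clang extension): tells the compiler that ns
         * may change here, so it cannot replace the loop with a 64-bit
         * division and modulo.
         */
        __asm__ volatile("" : "+r"(ns));
        ns -= NSEC_PER_SEC;
        ts->sec++;
    }
    ts->nsec = ns;
}
```

Whether this is what bit the reporter is not established in the thread; the compiler version was still being asked for.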
Re: kswapd min order, slub max order [was Re: -mm merge plans for 2.6.24]
On Tue, 2 Oct 2007, Christoph Lameter wrote: > The maximum order of allocation used by SLUB may have to depend on the > number of page structs in the system since small systems (128M was the > case that Peter found) can easier get into trouble. SLAB has similar > measures to avoid order 1 allocations for small systems below 32M. A patch like this? This is based on the number of page structs on the system. Maybe it needs to be based on the number of MAX_ORDER blocks for antifrag? SLUB: Determine slub_max_order depending on the number of pages available Determine the maximum order to be used for slabs and the mininum desired number of objects in a slab from the amount of pages that a system has available (like SLAB does for the order 1/0 distinction). For systems with less than 128M only use order 0 allocations (SLAB does that for <32M only). The order 0 config is useful for small systems to minimize the memory used. Memory easily fragments since we have less than 32k pages to play with. Order 0 insures that higher order allocations are minimized (Larger orders must still be used for objects that do not fit into order 0 pages). Then step up to order 1 for systems < 256000 pages (1G) Order 2 limit to systems < 100 page structs (4G) Order 3 for systems larger than that. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/slub.c | 49 + 1 file changed, 25 insertions(+), 24 deletions(-) Index: linux-2.6/mm/slub.c === --- linux-2.6.orig/mm/slub.c2007-10-02 09:26:16.0 -0700 +++ linux-2.6/mm/slub.c 2007-10-02 16:40:22.0 -0700 @@ -153,25 +153,6 @@ static inline void ClearSlabDebug(struct /* Enable to test recovery from slab corruption on boot */ #undef SLUB_RESILIENCY_TEST -#if PAGE_SHIFT <= 12 - -/* - * Small page size. Make sure that we do not fragment memory - */ -#define DEFAULT_MAX_ORDER 1 -#define DEFAULT_MIN_OBJECTS 4 - -#else - -/* - * Large page machines are customarily able to handle larger - * page orders. 
- */ -#define DEFAULT_MAX_ORDER 2 -#define DEFAULT_MIN_OBJECTS 8 - -#endif - /* * Mininum number of partial slabs. These will be left on the partial * lists even if they are empty. kmem_cache_shrink may reclaim them. @@ -1718,8 +1699,9 @@ static struct page *get_object_page(cons * take the list_lock. */ static int slub_min_order; -static int slub_max_order = DEFAULT_MAX_ORDER; -static int slub_min_objects = DEFAULT_MIN_OBJECTS; +static int slub_max_order; +static int slub_min_objects = 4; +static int manual; /* * Merge control. If this is set then no merging of slab caches will occur. @@ -2237,7 +2219,7 @@ static struct kmem_cache *kmalloc_caches static int __init setup_slub_min_order(char *str) { get_option (&str, &slub_min_order); - + manual = 1; return 1; } @@ -2246,7 +2228,7 @@ __setup("slub_min_order=", setup_slub_mi static int __init setup_slub_max_order(char *str) { get_option (&str, &slub_max_order); - + manual = 1; return 1; } @@ -2255,7 +2237,7 @@ __setup("slub_max_order=", setup_slub_ma static int __init setup_slub_min_objects(char *str) { get_option (&str, &slub_min_objects); - + manual = 1; return 1; } @@ -2566,6 +2548,16 @@ int kmem_cache_shrink(struct kmem_cache } EXPORT_SYMBOL(kmem_cache_shrink); +/* + * Table to autotune the maximum slab order based on the number of pages + * that the system has available. + */ +static unsigned long __initdata phys_pages_for_order[PAGE_ALLOC_COSTLY_ORDER] = { + 32768, /* >128M if using 4K pages, >512M (16k), >2G (64k) */ + 256000, /* >1G if using 4k pages, >4G (16k), >16G (64k) */ + 100 /* >4G if using 4k pages, >16G (16k), >64G (64k) */ +}; + / * Basic setup of slabs ***/ @@ -2575,6 +2567,15 @@ void __init kmem_cache_init(void) int i; int caches = 0; + if (!manual) { + /* No manual parameters. 
Autotune for system */ + for (i = 0; i < PAGE_ALLOC_COSTLY_ORDER; i++) + if (num_physpages > phys_pages_for_order[i]) { + slub_max_order++; + slub_min_objects <<= 1; + } + } + #ifdef CONFIG_NUMA /* * Must first have the slab cache available for the allocations of the - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
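The autotuning loop in the patch can be tried out in user space with a sketch like the following. The struct slub_tuning wrapper and the autotune() function are hypothetical; the loop body and the starting values (order 0, 4 objects) follow the diff. The third table entry appears as "100" in this archive copy, so the value 1000000 used here is an assumption inferred from the ">4G if using 4k pages" comment.

```c
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Hypothetical container for the two values the patch tunes. */
struct slub_tuning {
    int max_order;
    int min_objects;
};

/* Page-count thresholds; each one crossed raises the max order by one. */
static const unsigned long phys_pages_for_order[PAGE_ALLOC_COSTLY_ORDER] = {
    32768,   /* >128M if using 4K pages */
    256000,  /* >1G if using 4K pages */
    1000000  /* >4G if using 4K pages (assumed value, see lead-in) */
};

static struct slub_tuning autotune(unsigned long num_physpages)
{
    struct slub_tuning t = { 0, 4 }; /* slub_max_order, slub_min_objects */
    int i;

    for (i = 0; i < PAGE_ALLOC_COSTLY_ORDER; i++)
        if (num_physpages > phys_pages_for_order[i]) {
            t.max_order++;
            t.min_objects <<= 1;
        }
    return t;
}
```

With 4K pages, a machine with 65536 pages (256M) ends up at order 1 with 8 objects, and one with 2097152 pages (8G) at order 3 with 32 objects, under the assumed table above.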
Re: 2.6.23-rc7-mm1 -- powerpc rtas panic
On Wed, 2007-10-03 at 10:26 +1000, Tony Breeds wrote: > On Tue, Oct 02, 2007 at 06:28:19PM -0500, Linas Vepstas wrote: > > On Mon, Sep 24, 2007 at 01:35:31PM +0100, Andy Whitcroft wrote: > > > Seeing the following from an older power LPAR, pretty sure we had > > > this in the previous -mm also: > > > > I haven't forgetten about this ... and am looking at it now. > > Seems that whenever I go to reserve the machine pSeries-102, > > someone else is using it :-) > > This panic is caused by "[POWERPC] pseries: Fix jumbled no_logging flag." > (79c0108d1b9db4864ab77b2a95dfa04f2dcf264c), in the powerpc/for-2.6.24 > branch. It looks to me that we have logging enabled too early now. > > I think the following is a reasonable fix? > > --- > Explicitly enable RTAS error logging, when it should be ready. > > > Signed-off-by: Tony Breeds <[EMAIL PROTECTED]> > > --- > > arch/powerpc/platforms/pseries/rtasd.c |7 ++- > 1 files changed, 6 insertions(+), 1 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/rtasd.c > b/arch/powerpc/platforms/pseries/rtasd.c > index 30925d2..0df5d0d 100644 > --- a/arch/powerpc/platforms/pseries/rtasd.c > +++ b/arch/powerpc/platforms/pseries/rtasd.c > @@ -54,7 +54,10 @@ static unsigned int rtas_event_scan_rate; > static int full_rtas_msgs = 0; > > /* Stop logging to nvram after first fatal error */ > -static int no_more_logging; > +static int no_more_logging = 1; /* Until we initialize everything, > + * make sure we don't try logging > + * anything */ > + I realise it'll make the patch bigger, but this doesn't seem like a particularly good name for the variable anymore. cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person signature.asc Description: This is a digitally signed message part
Re: build error
On Tue, 2 Oct 2007 17:19:42 -0700 Stephen Hemminger wrote:

> On Tue, 2 Oct 2007 22:12:13 +0200
> [EMAIL PROTECTED] wrote:
>
> > [please CC: me, my subscribe mail was greylisted]
> >
> > Morning!
> >
> > My make run for 2.6.23-rc9 ends like this:
> >
> > GEN .version
> > CHK include/linux/compile.h
> > UPD include/linux/compile.h
> > CC init/version.o
> > LD init/built-in.o
> > LD .tmp_vmlinux1
> > kernel/built-in.o: In function `getnstimeofday':
> > (.text+0x1e141): undefined reference to `__udivdi3'
> > kernel/built-in.o: In function `do_gettimeofday':
> > (.text+0x1e263): undefined reference to `__udivdi3'
> > kernel/built-in.o: In function `timekeeping_resume':
> > timekeeping.c:(.text+0x1e427): undefined reference to `__udivdi3'
> > kernel/built-in.o: In function `update_wall_time':
> > (.text+0x1e829): undefined reference to `__udivdi3'
> > kernel/built-in.o: In function `update_wall_time':
> > (.text+0x1ec4c): undefined reference to `__udivdi3'
> > make: *** [.tmp_vmlinux1] Error 1
> >
> > .config attached.
> >
> > I have already read the diff from -rc8 and found nothing that helped me.
> >
> > Any ideas? Further questions?
> >
> > Wilfried
> >
>
> What Gcc version? and config/architecture?

.config file was attached. It says X86_32.

I can't reproduce it on 2 different systems & toolchains.

---
~Randy
Re: 2.6.23-rc7-mm1 -- powerpc rtas panic
On Tue, Oct 02, 2007 at 06:28:19PM -0500, Linas Vepstas wrote: > On Mon, Sep 24, 2007 at 01:35:31PM +0100, Andy Whitcroft wrote: > > Seeing the following from an older power LPAR, pretty sure we had > > this in the previous -mm also: > > I haven't forgetten about this ... and am looking at it now. > Seems that whenever I go to reserve the machine pSeries-102, > someone else is using it :-) This panic is caused by "[POWERPC] pseries: Fix jumbled no_logging flag." (79c0108d1b9db4864ab77b2a95dfa04f2dcf264c), in the powerpc/for-2.6.24 branch. It looks to me that we have logging enabled too early now. I think the following is a reasonable fix? --- Explicitly enable RTAS error logging, when it should be ready. Signed-off-by: Tony Breeds <[EMAIL PROTECTED]> --- arch/powerpc/platforms/pseries/rtasd.c |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/platforms/pseries/rtasd.c b/arch/powerpc/platforms/pseries/rtasd.c index 30925d2..0df5d0d 100644 --- a/arch/powerpc/platforms/pseries/rtasd.c +++ b/arch/powerpc/platforms/pseries/rtasd.c @@ -54,7 +54,10 @@ static unsigned int rtas_event_scan_rate; static int full_rtas_msgs = 0; /* Stop logging to nvram after first fatal error */ -static int no_more_logging; +static int no_more_logging = 1; /* Until we initialize everything, + * make sure we don't try logging + * anything */ + static int error_log_cnt; @@ -414,6 +417,8 @@ static int rtasd(void *unused) memset(logdata, 0, rtas_error_log_max); rc = nvram_read_error_log(logdata, rtas_error_log_max, &err_type, &error_log_cnt); + /* We can use rtas_log_buf now */ + no_more_logging = 0; if (!rc) { if (err_type != ERR_FLAG_ALREADY_LOGGED) { Yours Tony linux.conf.auhttp://linux.conf.au/ || http://lca2008.linux.org.au/ Jan 28 - Feb 02 2008 The Australian Linux Technical Conference! 
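The idea behind the proposed fix (start with logging disabled and enable it only once the log buffer exists) can be sketched in user space as follows. The names no_more_logging and rtas_log_buf come from the patch; everything else (log_storage, logged_events, log_error(), logging_init()) is hypothetical and much simpler than the real rtasd code.

```c
#include <stddef.h>
#include <string.h>

static char log_storage[64];      /* hypothetical backing store */
static char *rtas_log_buf;        /* NULL until initialisation has run */
static int no_more_logging = 1;   /* start disabled, as in the patch */
static int logged_events;

/* Record an error message, but only once logging has been enabled. */
static void log_error(const char *msg)
{
    if (no_more_logging)
        return;                   /* too early: rtas_log_buf is not ready */
    strncpy(rtas_log_buf, msg, sizeof(log_storage) - 1);
    rtas_log_buf[sizeof(log_storage) - 1] = '\0';
    logged_events++;
}

/* Set up the buffer first, then allow logging. */
static void logging_init(void)
{
    rtas_log_buf = log_storage;
    no_more_logging = 0;
}
```

A call to log_error() before logging_init() is silently dropped instead of writing through a NULL buffer pointer. The sketch also shows why the flag name is awkward once it doubles as a "not yet initialised" marker.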
Re: build error
On Tue, 2 Oct 2007 22:12:13 +0200 [EMAIL PROTECTED] wrote:

> [please CC: me, my subscribe mail was greylisted]
>
> Morning!
>
> My make run for 2.6.23-rc9 ends like this:
>
> GEN .version
> CHK include/linux/compile.h
> UPD include/linux/compile.h
> CC init/version.o
> LD init/built-in.o
> LD .tmp_vmlinux1
> kernel/built-in.o: In function `getnstimeofday':
> (.text+0x1e141): undefined reference to `__udivdi3'
> kernel/built-in.o: In function `do_gettimeofday':
> (.text+0x1e263): undefined reference to `__udivdi3'
> kernel/built-in.o: In function `timekeeping_resume':
> timekeeping.c:(.text+0x1e427): undefined reference to `__udivdi3'
> kernel/built-in.o: In function `update_wall_time':
> (.text+0x1e829): undefined reference to `__udivdi3'
> kernel/built-in.o: In function `update_wall_time':
> (.text+0x1ec4c): undefined reference to `__udivdi3'
> make: *** [.tmp_vmlinux1] Error 1
>
> .config attached.
>
> I have already read the diff from -rc8 and found nothing that helped me.
>
> Any ideas? Further questions?
>
> Wilfried
>

What Gcc version? and config/architecture?

-- Stephen Hemminger <[EMAIL PROTECTED]>
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On Wed, 3 Oct 2007, Alan Cox wrote:
>
> Smack seems a perfectly good simple LSM module, its clean, its based upon
> credible security models and sound theory (unlike AppArmor).

The problem with SELinux isn't the theory. It's the practice.

IOW, it's too hard to use.

Apparently Ubuntu is giving up on it too, for that reason.

And what some people seem to have trouble admitting is that theory counts for nothing, if the practice isn't there.

So quite frankly, the SELinux people would look a whole lot smarter if they didn't blather on about "theory".

Linus
Re: Kernel 2.4 vs 2.6 Traffic Controller performance
On Wed, 3 Oct 2007 08:05:30 +0800 Sonny <[EMAIL PROTECTED]> wrote:

> Hello
> I hope this is the right place to ask this. Does anyone know if there is a
> substantial difference in the performance of the traffic controller
> between kernel 2.4 and 2.6. We tested it using 1 iperf server and use
> 250 and 500 clients, altering the burst. We use the top command to
> check the idle time of our router to see this. The results we got from
> the 2.4 kernel shows around 65-70% idle time while the 2.6 shows
> 60-65% idle time. We tried to use MRTG and we're not getting any
> results either. We want to know if we could improve the bandwidth by
> upgrading the kernel, else we would have to get a new bandwidth
> manager. Could anyone have the similar test regarding this. Thanks in
> advance.

Some related thoughts:

1. Make sure you have the iperf yield fix in place. Otherwise iperf eats cpu.
2. Proper mailing lists are: [EMAIL PROTECTED] and [EMAIL PROTECTED]
3. The latest versions of 2.6 use different clock measurement that should be better than older 2.4 (where there are three choices). The new clock is finer resolution (at slightly higher overhead), which should make accuracy higher but might increase cpu usage.

-- Stephen Hemminger <[EMAIL PROTECTED]>
Re: [Linux-fbdev-devel] [PATCH 0/6] Patch series to add of_platform binding to xilinxfb
On Mon, 2007-10-01 at 09:57 -0600, Grant Likely wrote:
> (resend due to mailer issues. Apologies to anyone receiving this twice)
>
> This patch series reworks the Xilinx framebuffer driver and then adds
> an of_platform bus binding. The of_platform bus binding is needed to use
> the driver in arch/powerpc platforms.
>
> Antonino,
>
> Assuming there are no major issues, I'd like to get this patch series
> queued up for inclusion in 2.6.24.

Okay.

Tony
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
> situations. For example, I find SELinux to be so irrelevant to my usage
> that I don't use it at all. I just don't have any other users on my
> machine

That you know about...

The value of SELinux (or indeed any system compartmentalising access and limiting damage) comes into play when you get breakage - eg via a web browser exploit.

Yes SELinux is much more relevant to servers, and really comes into its own when it's used to write custom rulesets and enforce corporate policy ("No you can't run that screensaver that arrived by email").

Alan
Re: 2.6.23-rc9 boot failure (megaraid?)
On Tue, 02 Oct 2007 15:38:13 -0500
James Bottomley <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-10-02 at 20:15 +0200, Adrian Bunk wrote:
> > Cc's added, the complete bug report is at
> > http://lkml.org/lkml/2007/10/2/243
> >
> > On Tue, Oct 02, 2007 at 12:48:26PM -0400, Burton Windle wrote:
> > > 2.6.23-rc9 fails to boot for me; 2.6.22.9 works fine.
> > >
> > > System is a Dell Poweredge with PERC 2/DC with RAID1 volume.
> > >...
> >
> > Thanks for your report.
> >
> > Diff'ing the dmesg's shows:
> >
> > <-- snip -->
> >
> > scsi0: scanning scsi channel 4 [P0] for physical devices.
> > scsi0: scanning scsi channel 5 [P1] for physical devices.
> > st: Version 20070203, fixed bufsize 32768, s/g segs 256
> > -sd 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB)
> > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> > sd 0:0:0:0: [sda] Write Protect is off
> > sd 0:0:0:0: [sda] Asking for cache data failed
> > sd 0:0:0:0: [sda] Assuming drive cache: write through
> > -sd 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB)
> > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> > sd 0:0:0:0: [sda] Write Protect is off
> > sd 0:0:0:0: [sda] Asking for cache data failed
> > sd 0:0:0:0: [sda] Assuming drive cache: write through
> > sda: sda1
> > + sda: p1 exceeds device capacity
> >
> > <-- snip -->
> >
> > -	case MEGA_BULK_DATA:
> > -		if (scb->cmd->use_sg == 0)
> > -			length = scb->cmd->request_bufflen;
> > -		else {
> > -			struct scatterlist *sgl =
> > -				(struct scatterlist *)scb->cmd->request_buffer;
> > -			length = sgl->length;
> > -		}
> > -		pci_unmap_page(adapter->dev, scb->dma_h_bulkdata,
> > -			       length, scb->dma_direction);
> > -		break;
> > -
>
> This is the problem piece I think. We've reintroduced a very old bug:
>
> commit 51c928c34fa7cff38df584ad01de988805877dba
> Author: James Bottomley <[EMAIL PROTECTED]>
> Date: Sat Oct 1 09:38:05 2005 -0500
>
>     [SCSI] Legacy MegaRAID: Fix READ CAPACITY
>
>     Some Legacy megaraid cards can't actually cope with the scatter/gather
>     version of the READ CAPACITY command (which is what we now send them
>     since altering all SCSI internal I/O to go via the block layer). Fix
>     this (and a few other broken megaraid driver assumptions) by sending
>     the non-sg version of the command if the sg list only has a single
>     element.
>
>     Signed-off-by: James Bottomley <[EMAIL PROTECTED]>
>
> So what we have to do is put back the check for use_sg == 1 and send
> that as a bulk transfer command.

Sorry again. Needs to check sg count before dma mapping.

diff --git a/drivers/scsi/megaraid.c b/drivers/scsi/megaraid.c
index 3907f67..ae0b220 100644
--- a/drivers/scsi/megaraid.c
+++ b/drivers/scsi/megaraid.c
@@ -1737,9 +1737,12 @@ mega_build_sglist(adapter_t *adapter, scb_t *scb, u32 *buf, u32 *len)
 	Scsi_Cmnd *cmd;
 	int sgcnt;
 	int idx;
+	int bulkdata;
 
 	cmd = scb->cmd;
 
+	bulkdata = (scsi_sg_count(cmd) == 1) ? 1 : 0;
+
 	/*
 	 * Copy Scatter-Gather list info into controller structure.
 	 *
@@ -1753,6 +1756,14 @@ mega_build_sglist(adapter_t *adapter, scb_t *scb, u32 *buf, u32 *len)
 
 	*len = 0;
 
+	if (bulkdata && !adapter->has_64bit_addr) {
+		sg = scsi_sglist(cmd);
+		scb->dma_h_bulkdata = sg_dma_address(sg);
+		*buf = (u32)scb->dma_h_bulkdata;
+		*len = sg_dma_len(sg);
+		return 0;
+	}
+
 	scsi_for_each_sg(cmd, sg, sgcnt, idx) {
 		if (adapter->has_64bit_addr) {
 			scb->sgl64[idx].address = sg_dma_address(sg);
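For readers who want to see the shape of that fix outside the driver, here is a rough userspace model of the decision the patch adds: a single-element list on a card without 64-bit addressing goes out as a plain bulk transfer, everything else stays scatter/gather. It is only a sketch. struct sg_entry and build_transfer() are made-up names for illustration and are not megaraid or SCSI midlayer code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatter/gather element. */
struct sg_entry {
	uint32_t dma_addr;
	uint32_t dma_len;
};

/*
 * Simplified model of the decision in the patch above, not driver code.
 * Returns the number of s/g elements the controller is told about;
 * 0 means a plain (non-sg) bulk transfer described by *buf and *len.
 */
static unsigned int build_transfer(const struct sg_entry *sg,
				   unsigned int count, bool has_64bit_addr,
				   uint32_t *buf, uint32_t *len)
{
	unsigned int i;

	assert(sg != NULL && count > 0 && buf != NULL && len != NULL);

	*buf = 0;
	*len = 0;

	/* Single element on a 32-bit-addressing card: send it as bulk data. */
	if (count == 1 && !has_64bit_addr) {
		*buf = sg[0].dma_addr;
		*len = sg[0].dma_len;
		return 0;
	}

	/* Otherwise keep the list. The real driver would point *buf at its
	 * s/g table here; this sketch simply leaves it as 0. */
	for (i = 0; i < count; i++)
		*len += sg[i].dma_len;

	return count;
}
```

The point of returning 0 for the bulk case is that the caller can tell the two paths apart from the element count alone, which appears to be how the early "return 0" in the patch is meant to be read.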
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On Tue, 02 Oct 2007 17:02:13 -0400
Bill Davidsen <[EMAIL PROTECTED]> wrote:

> Linus Torvalds wrote:
> >
> > On Mon, 1 Oct 2007, Stephen Smalley wrote:
> >> You argued against pluggable schedulers, right? Why is security
> >> different?
> >
> > Schedulers can be objectively tested. There's this thing called
> > "performance", that can generally be quantified on a load basis.
> >
> > Yes, you can have crazy ideas in both schedulers and security. Yes, you
> > can simplify both for a particular load. Yes, you can make mistakes in
> > both. But the *discussion* on security seems to never get down to real
> > numbers.
> >
> And yet you can make the exact same case for schedulers as security, you
> can quantify the behavior, but if your only choice is A it doesn't help
> to know that B is better.

To be fair, the discussion on security does get down to real set theory, but
at that point most people's eyes (mine included) glaze over somewhat. You
can reasonably quantify the behaviour and correctness of a security model
based upon mathematical principles - if anything it's *easier* than
schedulers, which are so much based on "feeling right".

Smack seems a perfectly good simple LSM module: it's clean, and it's based
upon credible security models and sound theory (unlike AppArmor). I don't
see why it shouldn't go in.

Alan
Kernel 2.4 vs 2.6 Traffic Controller performance
Hello,

I hope this is the right place to ask this. Does anyone know if there is a
substantial difference in the performance of the traffic controller between
kernel 2.4 and 2.6?

We tested it using 1 iperf server with 250 and 500 clients, altering the
burst. We used the top command to check the idle time of our router. The
results from the 2.4 kernel show around 65-70% idle time, while the 2.6
kernel shows 60-65% idle time. We tried to use MRTG, and we're not getting
any results from that either.

We want to know if we could improve the bandwidth by upgrading the kernel;
otherwise we would have to get a new bandwidth manager. Has anyone run a
similar test?

Thanks in advance.
Re: [PATCH 0/5] Boot protocol changes
H. Peter Anvin wrote:
>> This series looks like a good start for Xen, but we still need to work
>> out where to stash the metadata which normally lives in ELF notes.
>> Using ELF is convenient for Xen because it lets a large chunk of domain
>> builder code be reused; on the other hand, loading a plain bzImage is
>> pretty simple, so maybe it isn't such a big deal.
>>
>> HPA, Eric: if we don't go the "embed ELF" path, where's a good
>> backwards-compatible place to stash the note data? If we do go with
>> "embed ELF", how should we go about doing it? Arrange to put the ELF
>> headers before the 1M mark?
>>
>
> This sounds like another good reason to do the ELF image as the
> postcompression image. The interface to the embedded compression
> routine is then unchanged, and we get the "full vmlinux" with any
> notes that belongs there.
>
> I'll try to get an implementation of that done -- it really shouldn't
> be very hard.

Please explain what you're proposing again, because my memory of your
plan from last time wouldn't help in this case. Are you proposing that
the bzImage contains compressed data that it's expecting the bootloader
to decompress? Won't that completely break backwards compatibility?

If we don't care about backwards compatibility with old bootloaders,
then it doesn't matter what we do one way or the other.

    J
Re: [PATCH 0/5] Boot protocol changes
Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
>> Hi all,
>>
>> Jeremy had some boot changes for bzImages, but buried in there was an
>> update to the boot protocol to support Xen and lguest (and kvm-lite).
>> I've copied those fairly simple patches, and if HPA is happy I'd like to
>> push them for 2.6.24 (after correcting for the Great Arch Merge of
>> course).
>
> Ah, good. I was thinking about reviving this work.
>
> The main problem is that sticking an ELF header at the 1 meg mark (the
> address of the bzImage "payload") breaks 32-bit bootloaders which think
> they can just jump to 32-bit code there. I started a conversation with
> Eric at KS about it, but we didn't reach any conclusions.
>
> This series looks like a good start for Xen, but we still need to work
> out where to stash the metadata which normally lives in ELF notes.
> Using ELF is convenient for Xen because it lets a large chunk of domain
> builder code be reused; on the other hand, loading a plain bzImage is
> pretty simple, so maybe it isn't such a big deal.
>
> HPA, Eric: if we don't go the "embed ELF" path, where's a good
> backwards-compatible place to stash the note data? If we do go with
> "embed ELF", how should we go about doing it? Arrange to put the ELF
> headers before the 1M mark?

This sounds like another good reason to do the ELF image as the
postcompression image. The interface to the embedded compression routine
is then unchanged, and we get the "full vmlinux" with any notes that
belong there.

I'll try to get an implementation of that done -- it really shouldn't be
very hard.

	-hpa
Re: [PATCH 0/5] Boot protocol changes
Rusty Russell wrote:
> Hi all,
>
> Jeremy had some boot changes for bzImages, but buried in there was an
> update to the boot protocol to support Xen and lguest (and kvm-lite).
> I've copied those fairly simple patches, and if HPA is happy I'd like to
> push them for 2.6.24 (after correcting for the Great Arch Merge of
> course).

Ah, good. I was thinking about reviving this work.

The main problem is that sticking an ELF header at the 1 meg mark (the
address of the bzImage "payload") breaks 32-bit bootloaders which think
they can just jump to 32-bit code there. I started a conversation with
Eric at KS about it, but we didn't reach any conclusions.

This series looks like a good start for Xen, but we still need to work
out where to stash the metadata which normally lives in ELF notes.
Using ELF is convenient for Xen because it lets a large chunk of domain
builder code be reused; on the other hand, loading a plain bzImage is
pretty simple, so maybe it isn't such a big deal.

HPA, Eric: if we don't go the "embed ELF" path, where's a good
backwards-compatible place to stash the note data? If we do go with
"embed ELF", how should we go about doing it? Arrange to put the ELF
headers before the 1M mark?

    J
Re: [PATCH 0/5] Boot protocol changes
Rusty Russell wrote:
> Hi all,
>
> Jeremy had some boot changes for bzImages, but buried in there was an
> update to the boot protocol to support Xen and lguest (and kvm-lite).
> I've copied those fairly simple patches, and if HPA is happy I'd like to
> push them for 2.6.24 (after correcting for the Great Arch Merge of
> course).

Acked-by: H. Peter Anvin <[EMAIL PROTECTED]>

	-hpa
[PATCH 4/5] Revert lguest magic and use hook in head.S
Version 2.07 of the boot protocol uses offset 0x23C for the
hardware_subarch field, which for lguest is "1". This allows us to use the
standard boot entry point rather than the "GenuineLguest" string hack.

This entry point also clears the BSS and copies the boot parameters and
commandline for us, saving more code.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 Documentation/lguest/lguest.c |   31 ---
 arch/i386/kernel/head.S       |    8 
 drivers/lguest/lguest_asm.S   |    9 +++--
 3 files changed, 15 insertions(+), 33 deletions(-)

diff -r 2fdc577cfe5c Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c	Tue Oct 02 22:21:05 2007 +1000
+++ b/Documentation/lguest/lguest.c	Tue Oct 02 23:00:09 2007 +1000
@@ -251,23 +251,6 @@ static void *get_pages(unsigned int num)
 	return addr;
 }
 
-/* To find out where to start we look for the magic Guest string, which marks
- * the code we see in lguest_asm.S. This is a hack which we are currently
- * plotting to replace with the normal Linux entry point. */
-static unsigned long entry_point(const void *start, const void *end)
-{
-	const void *p;
-
-	/* The scan gives us the physical starting address. We boot with
-	 * pagetables set up with virtual and physical the same, so that's
-	 * OK. */
-	for (p = start; p < end; p++)
-		if (memcmp(p, "GenuineLguest", strlen("GenuineLguest")) == 0)
-			return to_guest_phys(p + strlen("GenuineLguest"));
-
-	errx(1, "Is this image a genuine lguest?");
-}
-
 /* This routine is used to load the kernel or initrd. It tries mmap, but if
  * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries),
  * it falls back to reading the memory in. */
@@ -303,7 +286,6 @@ static void map_at(int fd, void *addr, u
  * We return the starting address. */
 static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr)
 {
-	void *start = (void *)-1, *end = NULL;
 	Elf32_Phdr phdr[ehdr->e_phnum];
 	unsigned int i;
 
@@ -335,19 +317,13 @@ static unsigned long map_elf(int elf_fd,
 		verbose("Section %i: size %i addr %p\n", i,
 			phdr[i].p_memsz, (void *)phdr[i].p_paddr);
 
-		/* We track the first and last address we mapped, so we can
-		 * tell entry_point() where to scan. */
-		if (from_guest_phys(phdr[i].p_paddr) < start)
-			start = from_guest_phys(phdr[i].p_paddr);
-		if (from_guest_phys(phdr[i].p_paddr) + phdr[i].p_filesz > end)
-			end=from_guest_phys(phdr[i].p_paddr)+phdr[i].p_filesz;
-
 		/* We map this section of the file at its physical address. */
 		map_at(elf_fd, from_guest_phys(phdr[i].p_paddr),
 		       phdr[i].p_offset, phdr[i].p_filesz);
 	}
 
-	return entry_point(start, end);
+	/* The entry point is given in the ELF header. */
+	return ehdr->e_entry;
 }
 
 /*L:160 Unfortunately the entire ELF image isn't compressed: the segments
@@ -374,7 +350,8 @@ static unsigned long unpack_bzimage(int
 
 	verbose("Unpacked size %i addr %p\n", len, img);
 
-	return entry_point(img, img + len);
+	/* The entry point for a bzImage is always the first byte */
+	return (unsigned long)img;
 }
 
 /*L:150 A bzImage, unlike an ELF file, is not meant to be loaded. You're
@@ -1684,8 +1661,15 @@ int main(int argc, char *argv[])
 	*(u32 *)(boot + 0x228) = 4096;
 	concat(boot + 4096, argv+optind+2);
 
-	/* The guest type value of "1" tells the Guest it's under lguest. */
-	*(int *)(boot + 0x23c) = 1;
+	/* Boot protocol version: 2.07 supports the fields for lguest. */
+	*(u16 *)(boot + 0x206) = 0x207;
+
+	/* The hardware_subarch value of "1" tells the Guest it's an lguest. */
+	*(u32 *)(boot + 0x23c) = 1;
+
+	/* Set bit 6 of the loadflags (aka. KEEP_SEGMENTS) so the entry path
+	 * does not try to reload segment registers. */
+	*(u8 *)(boot + 0x211) |= (1 << 6);
 
 	/* We tell the kernel to initialize the Guest: this returns the open
 	 * /dev/lguest file descriptor. */
diff -r 2fdc577cfe5c arch/i386/lguest/boot.c
--- a/arch/i386/lguest/boot.c	Tue Oct 02 22:21:05 2007 +1000
+++ b/arch/i386/lguest/boot.c	Tue Oct 02 23:27:22 2007 +1000
@@ -938,18 +938,8 @@ static unsigned lguest_patch(u8 type, u1
 /*G:030 Once we get to lguest_init(), we know we're a Guest. The paravirt_ops
  * structure in the kernel provides a single point for (almost) every routine
  * we have to override to avoid privileged instructions. */
-__init void lguest_init(void *boot)
-{
-	/* Copy boot parameters first: the Launcher put the physical location
-	 * in %esi, and head.S converted that to a virtual address and handed
-	 * it to us. We use "__memcpy" because "memcpy" sometimes tries to do
-	 * trick
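As a companion to the Launcher hunk in main() above, here is a small standalone sketch of the three boot-parameter writes it performs (version at 0x206, KEEP_SEGMENTS in loadflags at 0x211, hardware_subarch at 0x23c). The offsets and values come from the patch. The helper names (put_le16, put_le32, mark_as_lguest) and the BP_*_OFF macro names are invented for this example, and the byte-wise stores are just one portable way to avoid the unaligned casts. Treat it as an illustration, not as Launcher code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets into the boot parameter page, as used by the patch. */
#define BP_VERSION_OFF   0x206	/* u16: boot protocol version */
#define BP_LOADFLAGS_OFF 0x211	/* u8:  loadflags */
#define BP_SUBARCH_OFF   0x23c	/* u32: hardware_subarch */

#define KEEP_SEGMENTS    (1u << 6)	/* bit 6 of loadflags */
#define SUBARCH_LGUEST   1u		/* value the patch stores for lguest */

/* Store little-endian values byte by byte (x86 boot params are LE). */
static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)(v >> 24);
}

/* Hypothetical helper: fill in the fields a 2.07-aware guest looks at. */
static void mark_as_lguest(uint8_t *boot)
{
	assert(boot != NULL);

	put_le16(boot + BP_VERSION_OFF, 0x0207);
	put_le32(boot + BP_SUBARCH_OFF, SUBARCH_LGUEST);
	/* OR the flag in so any loadflags bits already set survive. */
	boot[BP_LOADFLAGS_OFF] |= KEEP_SEGMENTS;
}
```

Note the |= on loadflags: like the patch, the sketch sets one bit and leaves the rest of the byte alone.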
[PATCH 5/5] lguest: loading bzImage directly
Now that arch/i386/boot/compressed/head.S understands the hardware_subarch
field, we can directly execute bzImages. No more horrific unpacking code.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 Documentation/lguest/lguest.c    |   97 --
 arch/i386/boot/compressed/head.S |    6 ++
 drivers/lguest/lguest.c          |    5 +
 3 files changed, 42 insertions(+), 66 deletions(-)

diff -r b0480fd71a72 Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c	Tue Oct 02 22:28:13 2007 +1000
+++ b/Documentation/lguest/lguest.c	Tue Oct 02 22:52:07 2007 +1000
@@ -326,74 +326,39 @@ static unsigned long map_elf(int elf_fd,
 	return ehdr->e_entry;
 }
 
-/*L:160 Unfortunately the entire ELF image isn't compressed: the segments
- * which need loading are extracted and compressed raw. This denies us the
- * information we need to make a fully-general loader. */
-static unsigned long unpack_bzimage(int fd)
-{
-	gzFile f;
-	int ret, len = 0;
-	/* A bzImage always gets loaded at physical address 1M. This is
-	 * actually configurable as CONFIG_PHYSICAL_START, but as the comment
-	 * there says, "Don't change this unless you know what you are doing".
-	 * Indeed. */
-	void *img = from_guest_phys(0x100000);
-
-	/* gzdopen takes our file descriptor (carefully placed at the start of
-	 * the GZIP header we found) and returns a gzFile. */
-	f = gzdopen(fd, "rb");
-	/* We read it into memory in 64k chunks until we hit the end. */
-	while ((ret = gzread(f, img + len, 65536)) > 0)
-		len += ret;
-	if (ret < 0)
-		err(1, "reading image from bzImage");
-
-	verbose("Unpacked size %i addr %p\n", len, img);
-
-	/* The entry point for a bzImage is always the first byte */
-	return (unsigned long)img;
-}
-
 /*L:150 A bzImage, unlike an ELF file, is not meant to be loaded. You're
- * supposed to jump into it and it will unpack itself. We can't do that
- * because the Guest can't run the unpacking code, and adding features to
- * lguest kills puppies, so we don't want to.
- *
- * The bzImage is formed by putting the decompressing code in front of the
- * compressed kernel code. So we can simple scan through it looking for the
- * first "gzip" header, and start decompressing from there. */
+ * supposed to jump into it and it will unpack itself. We used to have to
+ * perform some hairy magic because the unpacking code scared me.
+ *
+ * Fortunately, Jeremy Fitzhardinge convinced me it wasn't that hard and wrote
+ * a small patch to jump over the tricky bits in the guest, so now we just read
+ * the funky header so we know where in the file to load, and away we go! */
 static unsigned long load_bzimage(int fd)
 {
-	unsigned char c;
-	int state = 0;
-
-	/* GZIP header is 0x1F 0x8B ... . */
-	while (read(fd, &c, 1) == 1) {
-		switch (state) {
-		case 0:
-			if (c == 0x1F)
-				state++;
-			break;
-		case 1:
-			if (c == 0x8B)
-				state++;
-			else
-				state = 0;
-			break;
-		case 2 ... 8:
-			state++;
-			break;
-		case 9:
-			/* Seek back to the start of the gzip header. */
-			lseek(fd, -10, SEEK_CUR);
-			/* One final check: "compressed under UNIX". */
-			if (c != 0x03)
-				state = -1;
-			else
-				return unpack_bzimage(fd);
-		}
-	}
-	errx(1, "Could not find kernel in bzImage");
+	u8 hdr[1024];
+	int r;
+	/* Modern bzImages get loaded at 1M. */
+	void *p = from_guest_phys(0x100000);
+
+	/* Go back to the start of the file and read the header. It should be
+	 * a Linux boot header (see Documentation/i386/boot.txt) */
+	lseek(fd, 0, SEEK_SET);
+	read(fd, hdr, sizeof(hdr));
+
+	/* At offset 0x202, we expect the magic "HdrS" */
+	if (memcmp(hdr + 0x202, "HdrS", 4) != 0)
+		errx(1, "This doesn't look like a bzImage to me");
+
+	/* The byte at 0x1F1 tells us how many extra sectors of
+	 * header: skip over them all. */
+	lseek(fd, (unsigned long)(hdr[0x1F1]+1) * 512, SEEK_SET);
+
+	/* Now read everything into memory, in nice big chunks. */
+	while ((r = read(fd, p, 65536)) > 0)
+		p += r;
+
+	/* Finally, 0x214 tells us where to start the kernel. */
+	return *(unsigned long *)&hdr[0x214];
 }
 
 /*L:140 Loading the kernel is easy when it's a "vmlinux", but most kernels
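To make the header walk in the new load_bzimage() easier to follow, here is a standalone sketch that does the same three lookups on a buffer instead of a file descriptor: the "HdrS" magic at 0x202, the setup sector count at 0x1F1, and the 32-bit start address at 0x214. The struct bzimage_info and parse_bzimage_header() names are made up for the example. It reads the start address byte by byte as a 32-bit little-endian value rather than casting to unsigned long, which is a deliberate difference from the patch. Like the patch, it does not special-case a sector count of 0, which the boot protocol documentation says should be read as 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type: where the payload starts and where to jump. */
struct bzimage_info {
	size_t payload_offset;	/* file offset of the protected-mode code */
	uint32_t code32_start;	/* 32-bit entry address from the header */
};

/*
 * Sketch of the header checks done by load_bzimage() in the patch.
 * Returns 0 on success, -1 if the buffer is too short or lacks the magic.
 */
static int parse_bzimage_header(const uint8_t *hdr, size_t len,
				struct bzimage_info *out)
{
	assert(hdr != NULL && out != NULL);

	/* We touch bytes up to and including 0x217. */
	if (len < 0x218)
		return -1;

	/* At offset 0x202 we expect the magic "HdrS". */
	if (memcmp(hdr + 0x202, "HdrS", 4) != 0)
		return -1;

	/* Byte 0x1F1 counts setup sectors; add one for the boot sector. */
	out->payload_offset = ((size_t)hdr[0x1F1] + 1) * 512;

	/* 0x214 holds the little-endian 32-bit start address. */
	out->code32_start = (uint32_t)hdr[0x214] |
			    (uint32_t)hdr[0x215] << 8 |
			    (uint32_t)hdr[0x216] << 16 |
			    (uint32_t)hdr[0x217] << 24;
	return 0;
}
```

A caller would read the first 1024 bytes of the image, run this, seek to payload_offset and load the rest at the 1M mark, which is roughly what the patch does inline.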
Re: PROBLEM: high load average when idle
Arjan van de Ven wrote:
> On Tue, 02 Oct 2007 18:46:18 -0400
>> On a related note, {set/get}itimer() currently are buggy (since 2.6.11
>> or so), also due to this round_jiffies() thing I believe.
>
> I very much believe that it is totally unrelated... most of all since
> round_jiffies() wasn't in the kernel then and also isn't used anywhere
> near these timers.

Ah, yes, you're correct. The itimer routines do their *own* rounding.

-ml
Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..
On Tue, 2007-10-02 at 18:09 -0400, Bill Davidsen wrote:
> Mel Gorman wrote:
> > On (02/10/07 14:15), Ingo Molnar didst pronounce:
> >> * Mel Gorman <[EMAIL PROTECTED]> wrote:
> >>
> >>> Dirt. Booting with "profile=sleep,2" is broken in 2.6.23-rc9 and
> >>> 2.6.23-rc8 but working in 2.6.22. I was checking it out as part of a
> >>> discussion in another thread and noticed it broken in -mm as well
> >>> (2.6.23-rc8-mm2). Bisect is in progress but suggestions as to the
> >>> prime candidates are welcome or preferably, pointing out that I'm an
> >>> idiot because I missed twiddling some config change.
> >> Mel, does the patch below fix this bug for you? (Note: you will need to
> >> enable CONFIG_SCHEDSTATS=y too.)
> >>
> >
> > Nice one Ingo - got it first try. The problem commit was
> > dd41f596cda0d7d6e4a8b139ffdfabcefdd46528 and it's clear that the code
> > removed
> > in this commit is put back by this latest patch. When applied,
> > profile=sleep
> > works as long as CONFIG_SCHEDSTAT is set.
> >
> And if it isn't set? I can easily see building a new kernel with stats
> off and forgetting to change the boot options.
>

If CONFIG_SCHEDSTATS is off and profile=sleep is set, then with Ingo's
patch readprofile shows only "0 *unknown* 0 total 0". That is a tad
confusing, hence my follow-up patch: with it, readprofile says that
"/proc/profile" doesn't exist, and a warning appears in dmesg.

-- 
Mel Gorman
Re: [git patches] net driver updates
From: Jeff Garzik <[EMAIL PROTECTED]>
Date: Tue, 2 Oct 2007 13:41:50 -0400

> Please pull from the 'upstream' branch of
> master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git upstream

Pulled and pushed back out to net-2.6.24, thanks Jeff!
[PATCH 2/5] add WEAK() for creating weak asm labels
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 include/linux/linkage.h |    6 ++
 1 file changed, 6 insertions(+)

===
--- a/include/linux/linkage.h
+++ b/include/linux/linkage.h
@@ -34,6 +34,12 @@
 	name:
 #endif
 
+#ifndef WEAK
+#define WEAK(name)	\
+	.weak name;	\
+	name:
+#endif
+
 #define KPROBE_ENTRY(name) \
 	.pushsection .kprobes.text, "ax"; \
 	ENTRY(name)
[PATCH 3/5] i386: paravirt boot sequence
This patch uses the updated boot protocol to do paravirtualized boot.
If the boot version is >= 2.07, then it will do two things:

 1. Check the bootparams loadflags to see if we should reload the
    segment registers and clear interrupts. This is appropriate
    for normal native boot and some paravirtualized environments, but
    inappropriate for others.

 2. Check the hardware architecture, and dispatch to the appropriate
    kernel entrypoint. If the bootloader doesn't set this, then we
    simply do the normal boot sequence.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>
Cc: James Bottomley <[EMAIL PROTECTED]>
---
 arch/i386/boot/compressed/head.S |   14 +--
 arch/i386/boot/compressed/misc.c |    4 +++
 arch/i386/boot/header.S          |    7 -
 arch/i386/kernel/head.S          |   47 ++
 4 files changed, 65 insertions(+), 7 deletions(-)

diff -r 5d471e4c931d arch/i386/boot/compressed/head.S
--- a/arch/i386/boot/compressed/head.S	Tue Oct 02 22:13:34 2007 +1000
+++ b/arch/i386/boot/compressed/head.S	Tue Oct 02 22:20:25 2007 +1000
@@ -27,19 +27,30 @@
 #include
 #include
 #include
+#include
 
 .section ".text.head","ax",@progbits
 	.globl startup_32
 
 startup_32:
-	cld
-	cli
+	/* check to see if KEEP_SEGMENTS flag is meaningful */
+	cmpw $0x207, BP_version(%esi)
+	jb 1f
+
+	/* test KEEP_SEGMENTS flag to see if the bootloader is asking
+	 * us to not reload segments */
+	testb $(1<<6), BP_loadflags(%esi)
+	jnz 2f
+
+1:	cli
 	movl $(__BOOT_DS),%eax
 	movl %eax,%ds
 	movl %eax,%es
 	movl %eax,%fs
 	movl %eax,%gs
 	movl %eax,%ss
+
+2:	cld
 
 	/* Calculate the delta between where we were compiled to run
 	 * at and where we were actually loaded at. This can only be done
diff -r 5d471e4c931d arch/i386/boot/compressed/misc.c
--- a/arch/i386/boot/compressed/misc.c	Tue Oct 02 22:13:34 2007 +1000
+++ b/arch/i386/boot/compressed/misc.c	Tue Oct 02 22:13:34 2007 +1000
@@ -246,6 +246,9 @@ static void putstr(const char *s)
 {
 	int x,y,pos;
 	char c;
+
+	if (RM_SCREEN_INFO.orig_video_mode == 0 && lines == 0 && cols == 0)
+		return;
 
 	x = RM_SCREEN_INFO.orig_x;
 	y = RM_SCREEN_INFO.orig_y;
diff -r 5d471e4c931d arch/i386/boot/header.S
--- a/arch/i386/boot/header.S	Tue Oct 02 22:13:34 2007 +1000
+++ b/arch/i386/boot/header.S	Tue Oct 02 22:13:34 2007 +1000
@@ -119,7 +119,7 @@ 1:
 # Part 2 of the header, from the old setup.S
 
 		.ascii	"HdrS"		# header signature
-		.word	0x0206		# header version number (>= 0x0105)
+		.word	0x0207		# header version number (>= 0x0105)
 					# or else old loadlin-1.5 will fail)
 		.globl realmode_swtch
 realmode_swtch:	.word	0, 0		# default_switch, SETUPSEG
@@ -214,6 +214,11 @@ cmdline_size: .long COMMAND_LINE_SIZ
 						#added with boot protocol
 						#version 2.06
 
+hardware_subarch:	.long 0			# subarchitecture, added with 2.07
+						# default to 0 for normal x86 PC
+
+hardware_subarch_data:	.quad 0
+
 # End of setup header #
 
 	.section ".inittext", "ax"
diff -r 5d471e4c931d arch/i386/kernel/head.S
--- a/arch/i386/kernel/head.S	Tue Oct 02 22:13:34 2007 +1000
+++ b/arch/i386/kernel/head.S	Tue Oct 02 22:21:01 2007 +1000
@@ -70,22 +70,30 @@ INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE +
  */
 .section .text.head,"ax",@progbits
 ENTRY(startup_32)
+	/* check to see if KEEP_SEGMENTS flag is meaningful */
+	cmpw $0x207, BP_version(%esi)
+	jb 1f
+
+	/* test KEEP_SEGMENTS flag to see if the bootloader is asking
+		us to not reload segments */
+	testb $(1<<6), BP_loadflags(%esi)
+	jnz 2f
 
 /*
  * Set segments to known values.
  */
-	cld
-	lgdt boot_gdt_descr - __PAGE_OFFSET
+1:	lgdt boot_gdt_descr - __PAGE_OFFSET
 	movl $(__BOOT_DS),%eax
 	movl %eax,%ds
 	movl %eax,%es
 	movl %eax,%fs
 	movl %eax,%gs
+2:
 
 /*
  * Clear BSS first so that there are no surprises...
- * No need to cld as DF is already clear from cld above...
- */
+ */
+	cld
 	xorl %eax,%eax
 	movl $__bss_start - __PAGE_OFFSET,%edi
 	movl $__bss_stop - __PAGE_OFFSET,%ecx
@@ -119,6 +127,35 @@ 2:	movsl
 1:
 
+#ifdef CONFIG_PARAVIRT
+	cmpw $0x207, (boot_params + BP_version - __PAGE_OFFSET)
+	jb default_entry
+
+	/* Paravirt-compatible boot parameters. Look to see what ar
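For anyone who finds the assembly hard to read, the two checks described in the changelog can be written out in C roughly as follows. This is a reading aid only. The offsets (0x206, 0x211, 0x23c), the 0x207 version threshold and bit 6 of loadflags are taken from this series; struct boot_plan, plan_boot() and the helper names are invented here, and the real code works on %esi and boot_params in assembly rather than on a byte buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BP_VERSION_OFF   0x206	/* u16: boot protocol version */
#define BP_LOADFLAGS_OFF 0x211	/* u8:  loadflags */
#define BP_SUBARCH_OFF   0x23c	/* u32: hardware_subarch */
#define KEEP_SEGMENTS    (1u << 6)

/* Hypothetical summary of what the entry code decides to do. */
struct boot_plan {
	bool reload_segments;	/* take the "1:" path: cli and reload */
	uint32_t subarch;	/* 0 means the normal default entry */
};

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static struct boot_plan plan_boot(const uint8_t *bp)
{
	/* Pre-2.07 bootloaders get the traditional behaviour. */
	struct boot_plan plan = { .reload_segments = true, .subarch = 0 };

	assert(bp != NULL);

	/* Both new fields are only meaningful from version 2.07 on. */
	if (get_le16(bp + BP_VERSION_OFF) < 0x0207)
		return plan;

	plan.reload_segments = !(bp[BP_LOADFLAGS_OFF] & KEEP_SEGMENTS);
	plan.subarch = get_le32(bp + BP_SUBARCH_OFF);
	return plan;
}
```

The version gate comes first in both places for the same reason: an older bootloader may leave arbitrary data in those bytes, so they are ignored unless the version says they are valid.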