Re: CONFIG_IRQBALANCE for 64-bit x86 ?
On Wednesday 21 November 2007 06:07, Arjan van de Ven wrote:
> On Wed, 21 Nov 2007 02:43:46 +1100
> > Of course it is, if you want to effectively use your resources.
> > Imagine if the task balancer only polled once every 10s.
>
> but unlike the task balancer, moving an irq is really expensive.
> (at least for networking and a few other similar systems)
> And no it's not just the cache bouncing, it's the entire reassembly of
> multiple packets etc etc that gets really messy.

Actually a blanket statement like that is just wrong. Moving a network
interrupt is, yes, probably quite expensive, but it is about the worst
case one to move. What's more, moving tasks between NUMA nodes could
easily be many orders of magnitude worse than the transient slowdown of
moving irqs.

Furthermore, what you say doesn't really seem to be an argument for
doing it in userspace or an argument against moving IRQs. It actually
shows that there are complex, hardware and kernel implementation
dependent issues, all of which suggest it is better to be in kernel.

> > Some constants
> > that make assumptions about the machine it is running on and may or
> > may not agree with what the task scheduler is trying to do.
> >
> > Some
> > classification stuff which makes guesses about how a particular bit of
>
> you misunderstood this; the classification stuff is there to spread
> different irqs of similar class (say networking) over multiple
> cores/packages. Doing this is a system resource balancing proposition
> not just a cpu time one.
>
> You may think this spreading based on classification is a mistake, but
> it's based on the following observation:

No, I'm not misunderstanding, nor do I think it is a mistake. But it is
something which the kernel and the devices themselves should have
better knowledge of.

You have a process which is reading off disk and sending to a network
interface? You may well want to put the process and the disk interrupt
and the network interrupt all on the same CPU.

[snip]

> We used to rebalance this frequently in the 2.4-early kernels based on
> a patch from Ingo. Turned out to be a really really bad idea;
> performance really tanked.

To reiterate, I do not think that IRQs should be moved more frequently.
I think the kernel is in the position to know far better than userspace
about irq balancing.

> > hardware or device driver wants to be balanced. Hacks to poll
> > hotplugging and topology changes.
>
> "hacks" as in "rescan".. so falls under the topology code and would
> indeed be changed to hook into hotplug inside the kernel; just
> different complexity.

ie. simpler. All the topology stuff would be far simpler.

> > I'm still convinced. Who isn't?
>
> I know you can do SOME sort of balancing in the kernel. But please
> describe the algorithm you would use; I started out with the same
> thought but when it got down to the algorithm to me at least it became
> clear "we really don't want this complexity in kernel mode".

I'd rather not go this far into handwaving. I'm not saying that I know
exactly how it should work right now. I'm questioning the established
viewpoint that irq balancing belongs in userspace.

For that matter, I guess from the results you get, it's not terribly
bad to do in userspace or anything. But I think it can be done in
kernel.

Policy... I think that's a misused argument. The "policy" of any kernel
code I write is to utilise the hardware as efficiently as possible
within restrictions (eg. fairness, permissions). Setting those
restrictions is the realm of userspace, otherwise IMO it is fine to go
in kernel.

Using the same argument, task balancing and even scheduling is policy,
so is page reclaim, page writeback, filesystem block allocation, etc.
Now many of those things can be directed or restricted somehow from
userspace, and in-kernel irq balancing would be no different.

> > > not needed. (also because on single socket machines, the
> > > irqbalancer basically has a one-shot task because there balancing
> > > is effectively a static setup)
> >
> > I don't think that's a good argument for not having it in kernel.
>
> if you don't care about kernel unpagable memory footprint, fine.
> Others do.

It would be a couple of K, right? I mean it would be probably less than
half the code of irqbalance because of the parsing and topology stuff.

Also, I don't think the one-shot behaviour on single socket machines is
good policy at all, and it can't capture dynamic behaviour at all.

> > > I listed a few;
> > > 1) it's policy
> >
> > I don't think that's such a constructive point. Task balancing is
> > policy in exactly the same way.
>
> not really; CFS has shown that the only real policy in task
> balancing is the fairness part,

Ahh, hate to get off topic, but let's not perpetuate this myth. It
wasn't Con, or CFS, or anything that showed fairness is some great new
idea. Actually I was arguing for fairness first, against both Con and
Ingo, way back when the old scheduler was having so much
Re: High priority tasks break SMP balancer?
On Tue, Nov 20, 2007 at 10:47:52PM +0100, Dmitry Adamushko wrote:
> btw., what's your system? If I recall right, SD_BALANCE_NEWIDLE is on
> by default for all configs, except for NUMA nodes.

It's a dual AMD64 Opteron.

So, I recompiled my 2.6.23.1 kernel without NUMA support, and with your
patch for scheduling domain flags in /proc. It looks like with NUMA
disabled, my test case no longer shows the CPU imbalance problem. Cool.

With NUMA disabled (and my test running smoothly), the flags show that
SD_BALANCE_NEWIDLE is set:

[EMAIL PROTECTED]:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
55

Next I turned it off:

[EMAIL PROTECTED]:~# echo 53 > /proc/sys/kernel/sched_domain/cpu0/domain0/flags
[EMAIL PROTECTED]:~# echo 53 > /proc/sys/kernel/sched_domain/cpu1/domain0/flags

Oddly enough, I still don't observe the CPU imbalance problem.

Now I reboot into a kernel which has NUMA re-enabled but which is
otherwise identical. I verify that now I can reproduce the CPU
imbalance again.

[EMAIL PROTECTED]:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
1101

Now I set cpu[10]/domain0/flags to 1099, and the imbalance immediately
disappears. I can reliably cause the imbalance again by setting it back
to 1101, and remove the imbalance by setting them to 1099.

Do these results make sense? I'm not sure I understand how
SD_BALANCE_NEWIDLE could be the whole story, since my /proc/schedstat
graphs do show that we continuously try to balance on idle, but we
can't successfully do so because the idle CPU has a much higher load
than the non-idle CPU. I don't understand how the problem I'm seeing
could be related to the time at which we run the balancer, rather than
being related to the load average calculation.

Assuming the CPU imbalance I'm seeing is actually related to
SD_BALANCE_NEWIDLE being unset, I have a couple of questions:

 - Is this intended/expected behaviour for a machine without
   NEWIDLE set? I'm not familiar with the rationale for disabling
   this flag on NUMA systems.

 - Is there a good way to detect, without any kernel debug flags
   set, whether the current machine has any scheduling domains
   that are missing the SD_BALANCE_NEWIDLE bit? I'm looking for
   a good way to work around the problem I'm seeing with VMware's
   code. Right now the best I can do is disable all thread priority
   elevation when running on an SMP machine with Linux 2.6.20
   or later.

Thank you again for all your help.
--Micah
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
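The flag values in this message can be compared bit by bit. What follows is a minimal sketch, not kernel code: it assumes SD_BALANCE_NEWIDLE is bit value 0x2, which agrees with the message (55 is reported with the flag set, 53 with it cleared) but should be checked against include/linux/sched.h for the kernel in use. The helper names are hypothetical.

```c
#include <stdio.h>

/* Assumption: SD_BALANCE_NEWIDLE is bit value 0x2. This matches the
 * message (55 has it set, 53 does not) but is not taken from a header. */
#define ASSUMED_SD_BALANCE_NEWIDLE 0x2u

/* Hypothetical helper: nonzero if the assumed NEWIDLE bit is set. */
static int newidle_set(unsigned int flags)
{
	return (flags & ASSUMED_SD_BALANCE_NEWIDLE) != 0;
}

/* Hypothetical helper: the bits that differ between two flag values. */
static unsigned int changed_bits(unsigned int before, unsigned int after)
{
	return before ^ after;
}

/* Print one before/after pair as read from .../domain0/flags. */
static void report(unsigned int before, unsigned int after)
{
	printf("%u -> %u: newidle %d -> %d, changed bits 0x%x\n",
	       before, after, newidle_set(before), newidle_set(after),
	       changed_bits(before, after));
}
```

Calling report(55, 53) and report(1101, 1099) shows the two experiments side by side. Under the stated assumption, 55 and 53 differ only in the NEWIDLE bit, while 1101 and 1099 differ in two bits (0x2 and 0x4), so the NUMA experiment may be changing more than one flag at a time.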
Re: Posix file capabilities in 2.6.24rc2; now 2.6.24-rc3
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Serge E. Hallyn wrote:
> The problem is that when you run a setuid binary, its pP and pE are
> fully raised. The following patch fixes it for me. Chris, does it fix
> your problem? Andrew, am I again confusing myself and doing something
> unsafe?

I think this is yet another example of the fragile mess that is UID
emulation with capabilities. Your patch is an example of privilege
escalation - luser can kill a more-capable process. In the kill CONT
case we reached the opposite conclusion to this one.

As was the case then, I didn't disagree then :*). If it meets folk's
expectations, then this is probably a good patch...

> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -543,6 +543,9 @@ int cap_task_kill(struct task_struct *p, struct siginfo
> *info,
> 	if (capable(CAP_KILL))
> 		return 0;
>
> +	if (p->euid==0 && p->uid==current->uid)
> +		return 0;
> +

It's late and I'm obviously tired, but is there any reason not to simply
use:

	if (p->uid == current->uid)
		return 0;

Whatever the case, could you put the new code closer to the
sig == SIGCONT test? The capability tests are at the end of
cap_task_kill() and this new check breaks that pattern.

Cheers

Andrew
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFHRTLlQheEq9QabfIRAt/hAKCJgj2kbuyAWI486LOwwDLdkbcpoQCfQdrQ
J+bcvi+9pGTodFn42PsHJHA=
=cXaG
-----END PGP SIGNATURE-----
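To see where the two candidate checks in this message differ, here is a small userspace model. It is only a sketch: struct model_task and both function names are hypothetical, and it models just the two expressions, not the rest of cap_task_kill() or the generic signal permission checks that run alongside it.

```c
#include <stdbool.h>

/* Simplified stand-in for the credential fields the quoted check reads;
 * this is not the kernel's struct task_struct. */
struct model_task {
	unsigned int uid;   /* real user id */
	unsigned int euid;  /* effective user id */
};

/* The check from the quoted patch: the target runs with euid 0 but
 * has the same real uid as the sender. */
static bool patch_check(const struct model_task *p,
			const struct model_task *cur)
{
	return p->euid == 0 && p->uid == cur->uid;
}

/* The simpler alternative asked about in the reply. */
static bool simple_check(const struct model_task *p,
			 const struct model_task *cur)
{
	return p->uid == cur->uid;
}
```

In this model the two checks agree for a setuid-root target started by the sender, and disagree only when the target has the sender's real uid but a non-zero effective uid: the simpler form allows the signal there, the patch's form does not.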
In /proc/cpuinfo, all processor items show "0"
Build kernel 2.6.24-rc3, cat /proc/cpuinfo, all processor items show "0": processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 6 model name :Genuine Intel(R) CPU 3.40GHz stepping: 8 cpu MHz : 3391.555 cache size : 16384 KB physical id : 0 siblings: 4 core id : 0 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16 xtpr lahf_lm bogomips: 7465.18 clflush size: 64 cache_alignment : 128 address sizes : 40 bits physical, 48 bits virtual power management: processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 6 model name :Genuine Intel(R) CPU 3.40GHz stepping: 8 cpu MHz : 3391.555 cache size : 16384 KB physical id : 2 siblings: 4 core id : 0 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16 xtpr lahf_lm bogomips: 6782.81 clflush size: 64 cache_alignment : 128 address sizes : 40 bits physical, 48 bits virtual power management: processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 6 model name :Genuine Intel(R) CPU 3.40GHz stepping: 8 cpu MHz : 3391.555 cache size : 16384 KB physical id : 3 siblings: 4 core id : 1 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16 xtpr lahf_lm bogomips: 6782.80 clflush size: 64 cache_alignment : 128 address sizes : 40 bits physical, 48 bits virtual power management: processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 
6 model name :Genuine Intel(R) CPU 3.40GHz stepping: 8 cpu MHz : 3391.555 cache size : 16384 KB physical id : 1 siblings: 4 core id : 1 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16 xtpr lahf_lm bogomips: 6782.79 clflush size: 64 cache_alignment : 128 address sizes : 40 bits physical, 48 bits virtual power management: processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 6 model name :Genuine Intel(R) CPU 3.40GHz stepping: 8 cpu MHz : 3391.555 cache size : 16384 KB physical id : 2 siblings: 4 core id : 1 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16 xtpr lahf_lm bogomips: 6782.80 clflush size: 64 cache_alignment : 128 address sizes : 40 bits physical, 48 bits virtual power management: processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 6 model name :Genuine Intel(R) CPU 3.40GHz stepping: 8 cpu MHz : 3391.555 cache size : 16384 KB physical id : 3 siblings: 4 core id : 0 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16 xtpr lahf_lm bogomips: 6782.80 clflush size: 64 cache_alignment : 128 address sizes : 40 bits physical, 48 bits virtual power management: processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 6 model name :Genuine Intel(R) CPU 3.40GHz stepping: 8 cpu MHz : 3391.555 cache size : 16384 KB physical id : 0
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Zachary Amsden wrote:
> On Wed, 2007-11-21 at 09:13 +0200, Avi Kivity wrote:
> > Where the device is implemented is an implementation detail that
> > should be hidden from the guest, isn't that one of the strengths of
> > virtualization? Two examples: a file-based block device implemented
> > in qemu gives you fancy file formats with encryption and compression,
> > while the same device implemented in the kernel gives you a
> > low-overhead path directly to a zillion-disk SAN volume. Or a
> > user-level network device capable of running with the slirp stack and
> > no permissions vs. the kernel device running copyless most of the
> > time and using a dma engine for the rest but requiring you to be good
> > friends with the admin.
> >
> > The user should expect zero reconfigurations moving a VM from one
> > model to the other.
>
> I think that is pretty insightful, and indeed, is probably the only
> reason we would ever consider using a virtio based driver.
>
> But is this really a virtualization problem, and is virtio the right
> place to solve it? Doesn't I/O hotplug with multipathing or NIC
> teaming provide the same infrastructure in a way that is useful in
> more than just a virtualization context?

With the aid of a dictionary I was able to understand about half the
words in the last sentence.

Moving from device to device using hotplug+multipath is complex to
configure, available on only some guests, uses rarely-exercised paths in
the guest OS, and only works for a few types of devices (network and
block). Having host independence in the device means you can change the
device implementation for, say, a display driver (consider, for example,
a vmgl+virtio driver, which can be implemented in userspace or tunneled
via virtio-over-tcp to some remote display without going through
userspace, without the guest knowing about it).
-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 2/9]: Reduce Log I/O latency
On Thu, Nov 22, 2007 at 02:41:06PM +1100, David Chinner wrote:
> On Wed, Nov 21, 2007 at 08:57:27PM -0600, Matt Mackall wrote:
> > On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> > > In all the cases that I know of where ppl are using what could
> > > be considered real-time I/O (e.g. media environments where they
> > > do real-time ingest and playout from the same filesystem) the
> > > real-time ingest processes create the files and do pre-allocation
> > > before doing their I/O. This I/O can get held up behind another
> > > process that is not real time that has issued log I/O.
> > >
> > > Given there is no I/O priority inheritance and having log I/O stall
> > > will stall the entire filesystem, we cannot allow log I/O to
> > > stall in real-time environments. Hence it must have the highest
> > > possible priority to prevent this.
> >
> > I've seen PVRs that would be upset by this. They put media on one
> > filesystem and database/apps/swap/etc. on another, but have everything
> > on a single spindle. Stalling a media filesystem read for a write
> > anywhere else = fail.
>
> Sounds like the PVR is badly designed to me. If a write can cause a
> read to miss a playback deadline, then you haven't built enough
> buffering into your playback application.

Normally it's not a problem. But your proposed change can push a working
system into a non-working system by making non-critical I/O on an
unrelated filesystem have higher priority than the thing that -actually
has real-time constraints-.

In other words, I/O priority is per-spindle and not per-filesystem and
thus this change has consequences that leak outside the filesystem in
question. That's bad.

I'd further add that the kernel internals probably shouldn't wander into
RT priority levels unless it's actually doing priority inheritance,
otherwise it's quite likely to upset the careful considerations of the RT
system designer's priority schemes. For instance, a log-heavy but
otherwise non-RT load with this patch could possibly completely starve
direct I/O to another partition even though it's marked RT, thus
livelocking the system.

To the general PVR problem: they typically want to work with a minimum of
buffering to maximize responsiveness to user commands (fast forward, jump
30 seconds, play in reverse). Now consider that you're recording and
playing back multiple HD streams on low-margin set-top hardware and
you'll see that making this work -at all- means lots of I/O tuning.

-- 
Mathematics is the supreme nostalgia of our time.
Re: Laptop keyboard unusable when ACPI is active
Hi all,

some updates about this issue (see also bug 9147
http://bugzilla.kernel.org/show_bug.cgi?id=9147 ).

The 'ac', 'battery' and 'thermal' modules (compiled as stand-alone) do
cause the bug; it suffices that one of them (or any set of them) is
loaded to trigger the bug either immediately or after some time. If none
of them is loaded into memory, the bug does not happen.

Also, the 'battery' module does not generate system messages, although
the problem occurs all the same. The 'thermal' module instead, when
loaded with 'modprobe thermal', causes the Enter keypress used to run the
command to be repeated indefinitely in any terminal. This is currently a
perfectly reproducible testcase for bug 9147.

The bug has been confirmed by at least one other user (with a different
hardware configuration); please reply either to help address the bug or
to confirm it.

The best currently known workaround for this bug is to compile all the
above mentioned ACPI modules as stand-alone and to not (auto)load them
(losing their vital functionality, since we are talking about laptops
here; see http://gentoo-wiki.com/HARDWARE_Maxdata_Pro_7000_DX for an
example of affected hardware).

It is also important to note that this bug always comes with bug 8740
http://bugzilla.kernel.org/show_bug.cgi?id=8740 (also confirmed and also
an ACPI issue).

Best regards,
-- 
Daniele C.

[EMAIL PROTECTED] wrote:

I am posting this message just to say that this bug is being addressed on
the bug tracker: http://bugzilla.kernel.org/show_bug.cgi?id=9147

Regards,
-- 
Daniele C.

[EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] wrote:

Kernel: 2.6.22-r5
Kernel option: i8042.nomux=1

I am now using kernel 2.6.22-r8 (Gentoo) and the following kernel
options: i8042.nomux=1 acpi=off

I have tried kernel 2.6.23-rc9 but the problem is still there.

The problem that remains, which I can neither fix nor work around, is
shown by the dmesg lines below:

-
atkbd.c: Unknown key released (translated set 2, code 0xe0 on isa0060/serio0).
atkbd.c: Use 'setkeycodes e060 ' to make it known.
atkbd.c: Unknown key released (translated set 2, code 0xe0 on isa0060/serio0).
atkbd.c: Use 'setkeycodes e060 ' to make it known.
atkbd.c: Unknown key released (translated set 2, code 0xe0 on isa0060/serio0).
atkbd.c: Use 'setkeycodes e060 ' to make it known.
-

The release event for some keys is never caught, so all sorts of trouble
happen if, for example, I use the Del key and it sticks, or if I use the
Ctrl key and it never gets released... pressing the stuck key again
returns it to the proper state.

With acpi=off the problem is totally worked around.

Can somebody please give me some clues about this issue, and possible
solutions? I have been searching the web for a couple of weeks and it
seems to be a common problem for notebook users, but nobody has yet
published a solution.

I am trying to find a path myself in this issue - which dates back to at
least 2005 and has never been resolved. I would now try some other kernel
parameter in order to preserve ACPI functionality and possibly prevent
ACPI from messing up the keyboard IRQs.

Can somebody please give me instructions regarding the correct tests
(regarding kernel parameters and/or anything else) to perform in order to
better isolate the issue?

Related Gentoo bug tracker item:
http://bugs.gentoo.org/show_bug.cgi?id=194781

Other messages about the same kernel bug (many more can be found googling
around, and no solution yet):
https://lists.linux-foundation.org/pipermail/bugme-new/2005-January/011736.html
http://dev.laptop.org/ticket/2401

Regards,
-- 
Daniele C.
Re: [PATCH] sata_nv: fix ADMA ATAPI issues with memory over 4GB (v2)
Hello, Robert.

Robert Hancock wrote:
> This fixes some problems with ATAPI devices on nForce4 controllers in ADMA
> mode on systems with memory located above 4GB. We need to delay setting the
> 64-bit DMA mask until the PRD table and padding buffer are allocated so that
> they don't get allocated above 4GB and break legacy mode (which is needed for
> ATAPI devices). Also, explicitly set a 32-bit DMA mask before allocating the
> legacy buffers since setting the DMA mask affects both ports and we need to
> ensure the second port's buffers are allocated properly (fixes a problem
> with the previous version of this patch).
>
> Signed-off-by: Robert Hancock <[EMAIL PROTECTED]>
>
> +	/* Ensure DMA mask is set to 32-bit before allocating legacy PRD and
> +	   pad buffers */
> +	pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
> +	pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));

[--snip--]

> +	pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
> +	pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));

I'm probably being paranoid here but please add error checks. Just
checking return value and returning error suffices.

Thanks.

-- 
tejun
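The requested pattern (check each return value and pass the error up) can be sketched in userspace as follows. This is an illustration under stated assumptions, not the driver code: stub_set_dma_mask() is a hypothetical stand-in for the two PCI mask calls, which report 0 on success and a negative errno on failure, and set_both_masks() is a made-up wrapper name.

```c
#include <errno.h>

/* Hypothetical stand-in for pci_set_dma_mask() and
 * pci_set_consistent_dma_mask(): returns 0 on success or a negative
 * errno. Here it simply rejects any width other than 32 or 64 bits. */
static int stub_set_dma_mask(int bits)
{
	return (bits == 32 || bits == 64) ? 0 : -EIO;
}

/* Set both masks to the same width, propagating the first failure
 * to the caller instead of ignoring it. */
static int set_both_masks(int bits)
{
	int rc;

	rc = stub_set_dma_mask(bits);   /* streaming DMA mask */
	if (rc)
		return rc;

	rc = stub_set_dma_mask(bits);   /* consistent DMA mask */
	if (rc)
		return rc;

	return 0;
}
```

A caller would then bail out of its own setup path when set_both_masks() returns non-zero, which is all the review comment asks for.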
Re: Problem with ufs nextstep in 2.6.18 (debian)
On Tue, Nov 20, 2007 at 12:29:03PM -0800, Dave Bailey wrote:
> This problem has been around since kernel 2.6.16, and I see it in
> 2.6.23.1-10.fc7. It occurs in the ufs_check_page function of ufs/dir.c
> at the Espan test, which seems unnecessary for NextStep/OpenStep
> files systems. The following patch preserves the test for other file
> systems and makes the mount useful for NextStep/OpenStep:
> (against the 2.6.23.1-10.fc7 source tree)
>
> [EMAIL PROTECTED] diff dir.c dir.c.orig
> 108,110d107
> < unsigned mnext = UFS_SB(sb)->s_mount_opt &
> < (UFS_MOUNT_UFSTYPE_NEXTSTEP || UFS_MOUNT_UFSTYPE_NEXTSTEP_CD ||
> < UFS_MOUNT_UFSTYPE_OPENSTEP);
> 131c128
> < if ((mnext == 0) & (((offs + rec_len - 1) ^ offs) & ~chunk_mask))
> ---
> > if (((offs + rec_len - 1) ^ offs) & ~chunk_mask)

This fixes only the symptom, not the illness. This check represents what
the code thinks about the filesystem layout.

On what kind of UFS filesystem did you actually test this patch?

When I fixed a similar issue for openstep ufs some time ago (actually
this was darwin's ufs, which has the same layout), I just set
s_dirblksize to the right value. Maybe you need to do the same for
UFS_MOUNT_UFSTYPE_NEXTSTEP and UFS_MOUNT_UFSTYPE_NEXTSTEP_CD; see the
TODO items in fs/ufs/super.c.

-- 
/Evgeniy
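Separately from the layout question, the quoted patch builds its mask with the logical operator `||` where a flag mask needs the bitwise `|`. A minimal sketch of the difference, using made-up flag values passed as parameters (the real UFS_MOUNT_UFSTYPE_* constants are defined in fs/ufs and are not reproduced here); all function names are hypothetical:

```c
/* Logical OR: any nonzero operand makes the whole expression 1,
 * so a mask built this way only ever tests bit 0. */
static unsigned int mask_logical(unsigned int a, unsigned int b,
				 unsigned int c)
{
	return a || b || c;
}

/* Bitwise OR: combines the individual flag bits into one mask. */
static unsigned int mask_bitwise(unsigned int a, unsigned int b,
				 unsigned int c)
{
	return a | b | c;
}

/* Nonzero if any flag in `mask` is set in the mount options. */
static int has_any_flag(unsigned int mount_opt, unsigned int mask)
{
	return (mount_opt & mask) != 0;
}
```

With three distinct single-bit flags, mask_logical() returns 1 while mask_bitwise() returns the union of the bits, so only the bitwise form can recognise a mount option that carries one of those flags.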
Re: [PATCH 49/59] fs/ufs: Add missing "space"
On Mon, Nov 19, 2007 at 05:53:36PM -0800, Joe Perches wrote:
>
> Signed-off-by: Joe Perches <[EMAIL PROTECTED]>
> ---
> fs/ufs/dir.c |2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
> index 30f8c2b..d19dfe8 100644
> --- a/fs/ufs/dir.c
> +++ b/fs/ufs/dir.c
> @@ -180,7 +180,7 @@ bad_entry:
> Eend:
> p = (struct ufs_dir_entry *)(kaddr + offs);
> ufs_error (sb, "ext2_check_page",

If you touch this code, it would be good if you also replaced
"ext2_check_page" with something like __FUNCTION__.

> -"entry in directory #%lu spans the page boundary"
> +"entry in directory #%lu spans the page boundary "
> "offset=%lu",
> dir->i_ino, (page->index< fail:
> --
> 1.5.3.5.652.gf192c

-- 
/Evgeniy
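The suggestion can be illustrated outside the kernel. In this sketch, report_error() is a hypothetical stand-in for ufs_error() and example_check_page() is a made-up function name; it uses `__func__`, the C99 spelling of what GCC also accepts as `__FUNCTION__`.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for ufs_error(): formats
 * "<function>: <message>" into buf. */
static void report_error(char *buf, size_t len,
			 const char *function, const char *msg)
{
	snprintf(buf, len, "%s: %s", function, msg);
}

/* __func__ expands to the enclosing function's name, so the label
 * cannot go stale when code is copied from another filesystem. */
static void example_check_page(char *buf, size_t len)
{
	report_error(buf, len, __func__, "entry spans the page boundary");
}
```

A hard-coded string such as "ext2_check_page" keeps reporting the old name after a copy or rename; the compiler-provided identifier always matches the function it appears in.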
Re: [PATCH 2/9]: Reduce Log I/O latency
On Thu, 2007-11-22 at 12:12 +1100, David Chinner wrote:
> In all the cases that I know of where ppl are using what could
> be considered real-time I/O (e.g. media environments where they
> do real-time ingest and playout from the same filesystem) the
> real-time ingest processes create the files and do pre-allocation
> before doing their I/O. This I/O can get held up behind another
> process that is not real time that has issued log I/O.
>
> Given there is no I/O priority inheritance and having log I/O stall
> will stall the entire filesystem, we cannot allow log I/O to
> stall in real-time environments. Hence it must have the highest
> possible priority to prevent this.

FWIW from a "real time" database POV this seems to make sense to me...
in fact, we probably rely on filesystem metadata way too much
(historically it's just "worked" although we do seem to get issues on
ext3).

I have a (casually stupid) simulation program... although I've observed
little to no problems on all my XFS tests using it.

-- 
Stewart Smith, Senior Software Engineer (MySQL Cluster)
MySQL AB, www.mysql.com
Office: +14082136540 Ext: 6616
VoIP: [EMAIL PROTECTED]
Mobile: +61 4 3 8844 332
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
On Thursday 22 November 2007 13:43:06 Andi Kleen wrote:
> There seems to be rough consensus that the kernel currently has too many
> exported symbols. A lot of these exports are generally usable utility
> functions or important driver interfaces; but another large part are
> functions intended by only one or two very specific modules for a very
> specific purpose.

Hi Andi,

This is an interesting idea, thanks for the code! My only question is
whether we can get most of this benefit by dropping the indirection of
namespaces and have something like "EXPORT_SYMBOL_TO(sym, modname)"? It
doesn't work so well for exporting to a group of modules, but that seems
a reasonable line to draw anyway.

Cheers,
Rusty.

PS. Probably better to use the standard warnx and errx in modpost, too.
freeze vs freezer
It seems that a process blocked in a write to an xfs filesystem due to
xfs_freeze cannot be frozen by the freezer.

I see this if I suspend my laptop while doing something xfs-filesystem
intensive, like a kernel build. My suspend scripts freeze the XFS
filesystem (as Dave said I should), which presumably blocks some writer,
and then the freezer times out and fails to complete.

Here's part of the process dump the freezer does when it times out:

cc1 D 0 18138 18137
 dd5f1e24 00200082 0002 ecdeeb00 ecdeec64 c200f280 0001 009c09a0
 dd5f1e0c dd5f1e0c 000f dd5f1e74 c7beb480 dd5f1e88 dd5f1ea8 c0228d97
 e8889540 dd5f1e38 c015b75d dd5f1e44
Call Trace:
 [] xfs_write+0xf4/0x6d9
 [] xfs_file_aio_write+0x53/0x5b
 [] do_sync_write+0xae/0xec
 [] vfs_write+0xa4/0x120
 [] sys_write+0x3b/0x60
 [] sysenter_past_esp+0x6b/0xa1
 ===

I haven't looked at how to fix this yet. I only just worked out why I
was getting suspend failures.

J
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
On Thu, Nov 22, 2007 at 03:43:06AM +0100, Andi Kleen wrote:
> There seems to be rough consensus that the kernel currently has too many
> exported symbols. A lot of these exports are generally usable utility
> functions or important driver interfaces; but another large part are
> functions
> intended by only one or two very specific modules for a very specific
> purpose.
> One example is the TCP code. It has most of its internals exported, but
> only for use by tcp_ipv6.c (and now a few more by the TCP/IP congestion
> modules)
> But it doesn't make sense to include these exported for a specific module
> functions into a broader "kernel interface". External modules assume
> they can use these functions, but they were never intended for that.
>
> This patch allows to export symbols only for specific modules by
> introducing symbol name spaces. A module name space has a white
> list of modules that are allowed to import symbols for it; all others
> can't use the symbols.

I really like this patchset. Definitely a step in the right direction
imo. Looks like some nits there that checkpatch will probably pick up
on, but otherwise, looks very straightforward too.

Kudos.

Dave

-- 
http://www.codemonkey.org.uk
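The white-list idea in the quoted description can be modelled in a few lines of userspace C. This is only a sketch of the concept: struct model_namespace and may_import() are hypothetical names, and nothing here reflects the data layout or the macros the patchset actually uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a symbol namespace: a name plus the list of
 * modules allowed to import from it. */
struct model_namespace {
	const char *name;
	const char *const *allowed;  /* NULL-terminated module names */
};

/* True if `module` appears on the namespace's allow list. */
static bool may_import(const struct model_namespace *ns, const char *module)
{
	for (size_t i = 0; ns->allowed[i] != NULL; i++)
		if (strcmp(ns->allowed[i], module) == 0)
			return true;
	return false;
}
```

A loader following this model would refuse to resolve a namespaced symbol for any module that may_import() rejects, while symbols outside any namespace stay generally usable.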
[PATCH 1/1] mm: add dirty_highmem option
On Thu, Nov 15, 2007 at 01:14:32PM -0800, Linus Torvalds wrote:
> Examples of non-broken solutions:
> (a) always use lowmem sizes (what we do now)
> (b) always use total mem sizes (sane but potentially dangerous: but the
> VM pressure should work! It has serious bounce-buffer issues, though,
> which is why I think it's crazy even if it's otherwise consistent)
>
> Btw, I actually suspect that while (a) is what we do now, for the specific
> case that Bron has, we could have a /proc/sys/vm option to just enable
> (b). So we don't have to have just one consistent model, we can allow odd
> users (and Bron sounds like one - sorry Bron ;) to just force other, odd,
> but consistent models.

A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file
of approximately 2Gb size which contains a hash format that is written
"randomly" by the dbclean process. On 2.6.16 this process took a few
minutes. With lowmem only accounting of dirty ratios, this takes about
12 hours of 100% disk IO, all random writes.

This patch includes some code cleanup from Linus and a toggle in
/proc/sys/vm/dirty_highmem which can be set to 1 to add the highmem
back to the total available memory count.
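The effect of the toggle on the dirty threshold can be worked through with a small userspace model. It is a sketch under stated assumptions: the function names and page counts are made up for illustration, and the real calculation in mm/page-writeback.c takes more inputs than shown here.

```c
/* Hypothetical model of the toggle: when dirty_highmem is 0, highmem
 * pages are subtracted from the dirtyable total (lowmem-only
 * accounting); when it is 1 they are left in. */
static unsigned long dirtyable_pages(unsigned long total_pages,
				     unsigned long highmem_pages,
				     int dirty_highmem)
{
	unsigned long x = total_pages;

	if (!dirty_highmem)
		x -= highmem_pages;

	return x + 1;  /* never return 0 */
}

/* Dirty limit in pages for a given percentage ratio. */
static unsigned long dirty_limit(unsigned long dirtyable, int ratio)
{
	return dirtyable * (unsigned long)ratio / 100;
}
```

With 1,000,000 total pages of which 800,000 are highmem and a ratio of 10, the model gives a limit of 20,000 pages with the toggle off and 100,000 pages with it on, which is the kind of gap that decides whether a large randomly written mapping can stay dirty in memory.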
Signed-off-by: Bron Gondwana <[EMAIL PROTECTED]> Index: linux-2.6.23.8-reiserfix-fai-vmdirty/mm/page-writeback.c === --- linux-2.6.23.8-reiserfix-fai-vmdirty.orig/mm/page-writeback.c 2007-11-22 01:48:20.0 + +++ linux-2.6.23.8-reiserfix-fai-vmdirty/mm/page-writeback.c2007-11-22 02:42:04.0 + @@ -70,6 +70,12 @@ static inline long sync_writeback_pages( int dirty_background_ratio = 5; /* + * free highmem will not be subtracted from the total free memory + * for calculating free ratios if vm_dirty_highmem is true + */ +int vm_dirty_highmem; + +/* * The generator of dirty data starts writeback at this percentage */ int vm_dirty_ratio = 10; @@ -153,7 +159,8 @@ static unsigned long determine_dirtyable x = global_page_state(NR_FREE_PAGES) + global_page_state(NR_INACTIVE) + global_page_state(NR_ACTIVE); - x -= highmem_dirtyable_memory(x); + if (!vm_dirty_highmem) + x -= highmem_dirtyable_memory(x); return x + 1; /* Ensure that we never return 0 */ } @@ -163,20 +170,12 @@ get_dirty_limits(long *pbackground, long { int background_ratio; /* Percentages */ int dirty_ratio; - int unmapped_ratio; long background; long dirty; unsigned long available_memory = determine_dirtyable_memory(); struct task_struct *tsk; - unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) + - global_page_state(NR_ANON_PAGES)) * 100) / - available_memory; - dirty_ratio = vm_dirty_ratio; - if (dirty_ratio > unmapped_ratio / 2) - dirty_ratio = unmapped_ratio / 2; - if (dirty_ratio < 5) dirty_ratio = 5; Index: linux-2.6.23.8-reiserfix-fai-vmdirty/include/linux/writeback.h === --- linux-2.6.23.8-reiserfix-fai-vmdirty.orig/include/linux/writeback.h 2007-10-09 20:31:38.0 + +++ linux-2.6.23.8-reiserfix-fai-vmdirty/include/linux/writeback.h 2007-11-22 01:48:21.0 + @@ -92,6 +92,7 @@ void throttle_vm_writeout(gfp_t gfp_mask /* These are exported to sysctl. 
*/ extern int dirty_background_ratio; +extern int vm_dirty_highmem; extern int vm_dirty_ratio; extern int dirty_writeback_interval; extern int dirty_expire_interval; Index: linux-2.6.23.8-reiserfix-fai-vmdirty/kernel/sysctl.c === --- linux-2.6.23.8-reiserfix-fai-vmdirty.orig/kernel/sysctl.c 2007-10-09 20:31:38.0 + +++ linux-2.6.23.8-reiserfix-fai-vmdirty/kernel/sysctl.c2007-11-22 01:48:21.0 + @@ -776,6 +776,7 @@ static ctl_table kern_table[] = { /* Constants for minimum and maximum testing in vm_table. We use these as one-element integer vectors. */ static int zero; +static int one = 1; static int two = 2; static int one_hundred = 100; @@ -1066,6 +1067,19 @@ static ctl_table vm_table[] = { .extra1 = , }, #endif +#ifdef CONFIG_HIGHMEM + { + .ctl_name = CTL_UNNUMBERED, + .procname = "dirty_highmem", + .data = _dirty_highmem, + .maxlen = sizeof(vm_dirty_highmem), + .mode = 0644, + .proc_handler = _dointvec_minmax, + .strategy = _intvec, + .extra1 = , + .extra2 = , + }, +#endif /* * NOTE: do not add new entries to this table unless you have read * Documentation/sysctl/ctl_unnumbered.txt
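As a rough user-space model of what the toggle changes (not the kernel code itself), the sketch below takes the page counts as plain parameters instead of reading them from global_page_state(). The function and parameter names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/*
 * User-space model of determine_dirtyable_memory() with the new toggle.
 * All names and parameters here are illustrative, not kernel API.
 */
static unsigned long model_dirtyable_pages(unsigned long free_pages,
                                           unsigned long inactive_pages,
                                           unsigned long active_pages,
                                           unsigned long highmem_pages,
                                           int dirty_highmem)
{
    unsigned long x = free_pages + inactive_pages + active_pages;

    /* highmem can never account for more than the total */
    assert(highmem_pages <= x);

    if (!dirty_highmem)
        x -= highmem_pages;   /* lowmem-only accounting */

    return x + 1;             /* never return 0, as in the patch */
}

/* Dirty threshold in pages for a given percentage of dirtyable memory. */
static unsigned long model_dirty_limit(unsigned long dirtyable_pages,
                                       unsigned int ratio_percent)
{
    return dirtyable_pages * ratio_percent / 100;
}
```

With made-up numbers (6000 pages in total, 5000 of them highmem, ratio 10), the modelled limit is 100 pages with the toggle off and 600 pages with it on, which is the kind of difference the description above attributes to lowmem-only accounting.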
Re: [PATCH 2/9]: Reduce Log I/O latency
On Wed, Nov 21, 2007 at 08:57:27PM -0600, Matt Mackall wrote:
> On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> > In all the cases that I know of where ppl are using what could
> > be considered real-time I/O (e.g. media environments where they
> > do real-time ingest and playout from the same filesystem) the
> > real-time ingest processes create the files and do pre-allocation
> > before doing their I/O. This I/O can get held up behind another
> > process that is not real time that has issued log I/O.
> >
> > Given there is no I/O priority inheritance and having log I/O stall
> > will stall the entire filesystem, we cannot allow log I/O to
> > stall in real-time environments. Hence it must have the highest
> > possible priority to prevent this.
>
> I've seen PVRs that would be upset by this. They put media on one
> filesystem and database/apps/swap/etc. on another, but have everything
> on a single spindle. Stalling a media filesystem read for a write
> anywhere else = fail.

Sounds like the PVR is badly designed to me. If a write can cause a read
to miss a playback deadline, then you haven't built enough buffering into
your playback application.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
[PATCH] Linux Kernel Markers - Support Multiple Probes
RCU style multiple probes support for the Linux Kernel Markers. Common case (one probe) is still fast and does not require dynamic allocation or a supplementary pointer dereference on the fast path. - Move preempt disable from the marker site to the callback. Since we now have an internal callback, move the preempt disable/enable to the callback instead of the marker site. Since the callback change is done asynchronously (passing from a handler that supports arguments to a handler that does not setup the arguments is no arguments are passed), we can safely update it even if it is outside the preempt disable section. - Move probe arm to probe connection. Now, a connected probe is automatically armed. Remove MARK_MAX_FORMAT_LEN, unused. This patch modifies the Linux Kernel Markers API : it removes the probe "arm/disarm" and changes the probe function prototype : it now expects a va_list * instead of a "...". It applies on top of 2.6.24-rc3-git1. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: Christoph Hellwig <[EMAIL PROTECTED]> CC: Andrew Morton <[EMAIL PROTECTED]> CC: Mike Mason <[EMAIL PROTECTED]> CC: Dipankar Sarma <[EMAIL PROTECTED]> --- include/linux/marker.h | 59 ++- include/linux/module.h |2 kernel/marker.c | 671 +--- kernel/module.c |7 samples/markers/probe-example.c | 25 - 5 files changed, 548 insertions(+), 216 deletions(-) Index: linux-2.6-lttng/include/linux/marker.h === --- linux-2.6-lttng.orig/include/linux/marker.h 2007-11-21 19:01:02.0 -0500 +++ linux-2.6-lttng/include/linux/marker.h 2007-11-21 19:17:30.0 -0500 @@ -19,16 +19,23 @@ struct marker; /** * marker_probe_func - Type of a marker probe function - * @mdata: pointer of type struct marker - * @private_data: caller site private data + * @probe_private: probe private data + * @call_private: call site private data * @fmt: format string - * @...: variable argument list + * @args: variable argument list pointer. 
Use a pointer to overcome C's + *inability to pass this around as a pointer in a portable manner in + *the callee otherwise. * * Type of marker probe functions. They receive the mdata and need to parse the * format string to recover the variable argument list. */ -typedef void marker_probe_func(const struct marker *mdata, - void *private_data, const char *fmt, ...); +typedef void marker_probe_func(void *probe_private, void *call_private, + const char *fmt, va_list *args); + +struct marker_probe_closure { + marker_probe_func *func;/* Callback */ + void *probe_private;/* Private probe data */ +}; struct marker { const char *name; /* Marker name */ @@ -36,8 +43,11 @@ struct marker { * variable argument list. */ char state; /* Marker state. */ - marker_probe_func *call;/* Probe handler function pointer */ - void *private; /* Private probe data */ + char ptype; /* probe type : 0 : single, 1 : multi */ + void (*call)(const struct marker *mdata,/* Probe wrapper */ + void *call_private, const char *fmt, ...); + struct marker_probe_closure single; + struct marker_probe_closure *multi; } __attribute__((aligned(8))); #ifdef CONFIG_MARKERS @@ -49,7 +59,7 @@ struct marker { * not add unwanted padding between the beginning of the section and the * structure. Force alignment to the same alignment as the section start. */ -#define __trace_mark(name, call_data, format, args...) \ +#define __trace_mark(name, call_private, format, args...) 
\ do {\ static const char __mstrtab_name_##name[] \ __attribute__((section("__markers_strings"))) \ @@ -60,24 +70,23 @@ struct marker { static struct marker __mark_##name \ __attribute__((section("__markers"), aligned(8))) = \ { __mstrtab_name_##name, __mstrtab_format_##name, \ - 0, __mark_empty_function, NULL }; \ + 0, 0, marker_probe_cb, \ + { __mark_empty_function, NULL}, NULL }; \ __mark_check_format(format, ## args); \ if (unlikely(__mark_##name.state)) {\ - preempt_disable(); \ (*__mark_##name.call) \ - (&__mark_##name, call_data, \ + (&__mark_##name,
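The prototype change described above (a probe now receives a va_list * instead of "...") can be sketched in user space as follows. This only illustrates the calling convention: call_probes() and sum_probe() are hypothetical names, and the real dispatcher in kernel/marker.c may be structured differently.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Probe prototype in the style the patch describes: va_list passed by pointer. */
typedef void probe_func(void *probe_private, void *call_private,
                        const char *fmt, va_list *args);

struct probe_closure {
    probe_func *func;       /* callback */
    void *probe_private;    /* per-probe data */
};

/*
 * Hypothetical dispatcher: the variadic arguments are captured once per
 * probe and handed over as a va_list pointer, so each probe sees the
 * argument list from the start.
 */
static void call_probes(const struct probe_closure *probes, size_t nr,
                        void *call_private, const char *fmt, ...)
{
    for (size_t i = 0; i < nr; i++) {
        va_list args;

        va_start(args, fmt);
        probes[i].func(probes[i].probe_private, call_private, fmt, &args);
        va_end(args);
    }
}

/* Example probe: expects the format "%d %d" and adds both ints to a counter. */
static void sum_probe(void *probe_private, void *call_private,
                      const char *fmt, va_list *args)
{
    long *total = probe_private;

    (void)call_private;
    (void)fmt;
    *total += va_arg(*args, int);
    *total += va_arg(*args, int);
}
```

Restarting the va_list for every probe is one way to let several probes consume the same arguments independently; a single "..." prototype cannot be forwarded to more than one callee in portable C.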
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
> I like this concept in general; I have one minor comment; right now
> your namespace argument is like
>
> EXPORT_SYMBOL_NS(foo, some_symbol);
>
> from a language-like pov I kinda wonder if it's nicer to do
>
> EXPORT_SYMBOL_NS("foo", some_symbol);
>
> because foo isn't something in C scope, but more a string-like
> identifier...

That wouldn't work for MODULE_ALLOW() because it appends the namespace to
other identifiers. I don't know of a way in the C preprocessor to get back
from a string to a ## concatenable identifier.

For EXPORT_SYMBOL_NS it would in theory be possible, but making it
asymmetric to MODULE_ALLOW would be ugly imho.

-Andi
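The asymmetry mentioned in this reply can be seen in a small stand-alone sketch. The macro and table names below are made up for illustration and are not the macros from the patch set; the point is only that an identifier can be turned into a string or pasted into another identifier, while a string literal cannot be pasted.

```c
#include <assert.h>
#include <string.h>

/* Identifier -> identifier: ## pastes tokens into a new name. */
#define NS_TABLE(ns)        ns##_allow_table
/* Identifier -> string: # turns the same token into a literal. */
#define NS_STRING(ns)       #ns

/* Hypothetical per-namespace object whose name is built by pasting. */
static const char *NS_TABLE(foo)[] = { "some_module", NULL };

static const char *foo_namespace_name(void)
{
    return NS_STRING(foo);          /* expands to "foo" */
}

static const char *foo_first_allowed(void)
{
    return foo_allow_table[0];      /* the pasted identifier exists */
}

/*
 * The reverse direction has no operator: NS_TABLE("foo") would try to
 * paste "foo" and _allow_table, which does not form a valid token.
 */
```

So a macro that needs both a string and a derived identifier has to start from the bare identifier, which is the form EXPORT_SYMBOL_NS(foo, some_symbol) uses.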
Re: [PATCH RFC][try 2] IA64 signal : remove redundant code in setup_sigcontext()
On Thu, Nov 22, 2007 at 11:15:55AM +0800, Shi Weihua wrote:
> This patch removes some redundant code in the function setup_sigcontext().
>
> The registers ar.ccv,b7,r14,ar.csd,ar.ssd,r2-r3 and r16-r31 are not restored
> in restore_sigcontext() when (flags & IA64_SC_FLAG_IN_SYSCALL) is true.
> So we don't need to zero those variables in setup_sigcontext().

Erm, couldn't those registers contain information the process shouldn't see?

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
[PATCH RFC][try 2] IA64 signal : remove redundant code in setup_sigcontext()
This patch removes some redundant code in the function setup_sigcontext(). The registers ar.ccv,b7,r14,ar.csd,ar.ssd,r2-r3 and r16-r31 are not restored in restore_sigcontext() when (flags & IA64_SC_FLAG_IN_SYSCALL) is true. So we don't need to zero those variables in setup_sigcontext(). Signed-off-by: Shi Weihua <[EMAIL PROTECTED]> --- diff -urp linux-2.6.24-rc3-git1.orig/arch/ia64/kernel/signal.c linux-2.6.24-rc3-git1/arch/ia64/kernel/signal.c --- linux-2.6.24-rc3-git1.orig/arch/ia64/kernel/signal.c2007-11-17 13:16:36.0 +0800 +++ linux-2.6.24-rc3-git1/arch/ia64/kernel/signal.c 2007-11-22 11:02:27.0 +0800 @@ -280,15 +280,7 @@ setup_sigcontext (struct sigcontext __us err |= __copy_to_user(>sc_gr[15], >pt.r15, 8); /* r15 */ err |= __put_user(scr->pt.cr_iip + ia64_psr(>pt)->ri, >sc_ip); - if (flags & IA64_SC_FLAG_IN_SYSCALL) { - /* Clear scratch registers if the signal interrupted a system call. */ - err |= __put_user(0, >sc_ar_ccv); /* ar.ccv */ - err |= __put_user(0, >sc_br[7]); /* b7 */ - err |= __put_user(0, >sc_gr[14]); /* r14 */ - err |= __clear_user(>sc_ar25, 2*8); /* ar.csd & ar.ssd */ - err |= __clear_user(>sc_gr[2], 2*8); /* r2-r3 */ - err |= __clear_user(>sc_gr[16], 16*8); /* r16-r31 */ - } else { + if (!(flags & IA64_SC_FLAG_IN_SYSCALL)) { /* Copy scratch regs to sigcontext if the signal didn't interrupt a syscall. */ err |= __put_user(scr->pt.ar_ccv, >sc_ar_ccv); /* ar.ccv */ err |= __put_user(scr->pt.b7, >sc_br[7]); /* b7 */ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc3-mm1 (sync is slow ?)
On Wed, 21 Nov 2007 00:49:09 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Wed, 21 Nov 2007 17:42:15 +0900 KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:
>
> > Hi, Andrew
> >
> > I got following result in 'sync' command.
> > It was too slow. (memory controller config is off ;)
> > I attaches my .config.
> > ==
> > [2.6.24-rc3-mm1]
> > [EMAIL PROTECTED] ~]$ dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > 100000+0 records in
> > 100000+0 records out
> > 409600000 bytes (410 MB) copied, 1.46706 seconds, 279 MB/s
> > [EMAIL PROTECTED] ~]$ time sync
> >
> > real    3m6.440s
> > user    0m0.000s
> > sys     0m0.133s
>
> Well I wonder how we did that.
>
> It seems OK here from a quick test (i386, ext3-on-IDE).
>
> Maybe device driver/block breakage?
>

I confirmed that this slowdown is caused by git-scsi-misc.patch.
I'm sorry that I can't chase it further; I will be offline this weekend.

This is the scsi_mod information in /proc/modules
=
scsi_mod 409416 8 mptctl,sg,lpfc,scsi_transport_fc,mptspi,mptscsih,scsi_transport_spi,sd_mod, Live 0xa00202818000
=

What more information should I provide?

Thanks,
-Kame
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
On Thu, 22 Nov 2007 03:43:06 +0100 (CET) Andi Kleen <[EMAIL PROTECTED]> wrote: > > There seems to be rough consensus that the kernel currently has too > many exported symbols. A lot of these exports are generally usable > utility functions or important driver interfaces; but another large > part are functions intended by only one or two very specific modules > for a very specific purpose. One example is the TCP code. It has most > of its internals exported, but only for use by tcp_ipv6.c (and now a > few more by the TCP/IP congestion modules) But it doesn't make sense > to include these exported for a specific module functions into a > broader "kernel interface". External modules assume they can use > these functions, but they were never intended for that. > > This patch allows to export symbols only for specific modules by > introducing symbol name spaces. A module name space has a white > list of modules that are allowed to import symbols for it; all others > can't use the symbols. > > It adds two new macros: > > MODULE_NAMESPACE_ALLOW(namespace, module); > > Allow module to import symbols from namespace. module is the module > name without .ko as displayed by lsmod. Must be in the same module > as the export (and be duplicated if there are multiple modules > exporting symbols to a namespace). Multiple allows for the same name > space are allowed. > > EXPORT_SYMBOL_NS(namespace, symbol); > Hi, I like this concept in general; I have one minor comment; right now your namespace argument is like EXPORT_SYMBOL_NS(foo, some_symbol); from a language-like pov I kinda wonder if it's nicer to do EXPORT_SYMBOL_NS("foo", some_symbol); because foo isn't something in C scope, but more a string-like identifier... 
--
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
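The allow-list rule quoted from the patch description (a namespace lists the modules that may import from it, and everyone else is refused) can be modelled in a few lines of user-space C. This is only a sketch of the semantics: the table layout, the names and the may_import() helper are hypothetical and say nothing about how the module loader in the patch set stores or checks the information.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One (namespace, module) pair, as a MODULE_NAMESPACE_ALLOW() line would record. */
struct ns_allow {
    const char *ns;
    const char *module;
};

/* Hypothetical entries for illustration only. */
static const struct ns_allow allow_table[] = {
    { "foo", "mod_a" },
    { "foo", "mod_b" },   /* multiple allows for one namespace are fine */
    { "bar", "mod_a" },
};

/* Returns 1 if 'module' is on the allow list of namespace 'ns', else 0. */
static int may_import(const struct ns_allow *table, size_t nr,
                      const char *ns, const char *module)
{
    for (size_t i = 0; i < nr; i++)
        if (strcmp(table[i].ns, ns) == 0 &&
            strcmp(table[i].module, module) == 0)
            return 1;
    return 0;
}
```

In this model a namespace with no entries admits nobody, which matches the "all others can't use the symbols" wording.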
Re: [PATCH 2/9]: Reduce Log I/O latency
On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> On Thu, Nov 22, 2007 at 01:49:25AM +0100, Andi Kleen wrote:
> > David Chinner <[EMAIL PROTECTED]> writes:
> >
> > > To ensure that log I/O is issued as the highest priority I/O, set
> > > the I/O priority of the log I/O to the highest possible. This will
> > > ensure that log I/O is not held up behind bulk data or other
> > > metadata I/O as delaying log I/O can pause the entire transaction
> > > subsystem. Introduce a new buffer flag to allow us to tag the log
> > > buffers so we can discriminate when issuing the I/O.
> >
> > Won't that possibly disturb other RT priority users that do not need
> > log IO (e.g. working on preallocated files)? Seems a little
> > dangerous.
>
> In all the cases that I know of where ppl are using what could
> be considered real-time I/O (e.g. media environments where they
> do real-time ingest and playout from the same filesystem) the
> real-time ingest processes create the files and do pre-allocation
> before doing their I/O. This I/O can get held up behind another
> process that is not real time that has issued log I/O.
>
> Given there is no I/O priority inheritance and having log I/O stall
> will stall the entire filesystem, we cannot allow log I/O to
> stall in real-time environments. Hence it must have the highest
> possible priority to prevent this.

I've seen PVRs that would be upset by this. They put media on one
filesystem and database/apps/swap/etc. on another, but have everything
on a single spindle. Stalling a media filesystem read for a write
anywhere else = fail.

--
Mathematics is the supreme nostalgia of our time.
Re: Possible bug from kernel 2.6.22 and above
Simon Holm Thøgersen wrote:

Wed, 21 11 2007 at 20:52 -0500, Jie Chen wrote:

There is a backport of the CFS scheduler to 2.6.21, see
http://lkml.org/lkml/2007/11/19/127

Hi, Simon:

I will try that after the thanksgiving holiday to find out whether the
odd behavior will show up using 2.6.21 with back ported CFS.

Kernel 2.6.21
Number of Threads              2          4          6          8
SpinLock (Time micro second)   10.5618    10.58538   10.5915    10.643
         (Overhead)            0.073      0.05746    0.102805   0.154563
Barrier  (Time micro second)   11.020410  11.678125  11.9889    12.38002
         (Overhead)            0.531660   1.1502     1.500112   1.891617

Each thread is bound to a particular core using pthread_setaffinity_np.

Kernel 2.6.23.8
Number of Threads              2          4          6          8
SpinLock (Time micro second)   14.849915  17.117603  14.4496    10.5990
         (Overhead)            4.345417   6.617207   3.949435   0.110985
Barrier  (Time micro second)   19.462255  20.285117  16.19395   12.37662
         (Overhead)            8.957755   9.784722   5.699590   1.869518

Simon Holm Thøgersen

I just ran a simple test to prove that the problem may be related to
load balance of the scheduler. I first started 6 processes using
"taskset -c 2 donothing&; taskset -c 3 donothing&; ..., taskset -c 7
donothing". These 6 processes will run on core 2 to 7. Then I started my
test program using two threads bound to core 0 and 1. Here is the result:

Two threads on Kernel 2.6.23.8:
SpinLock (Time micro second)   10.558255
         (Overhead)            0.068965
Barrier  (Time micro second)   10.865520
         (Overhead)            0.376230

Similarly, I started 4 donothing processes on core 4, 5, 6 and 7, and
ran the test program. I have the following result:

Four threads on Kernel 2.6.23.8:
SpinLock (Time micro second)   10.579413
         (Overhead)            0.090023
Barrier  (Time micro second)   11.363193
         (Overhead)            0.873803

Finally, here is the result for 6 threads with two donothing processes
running on core 6 and 7:

Six threads on Kernel 2.6.23.8:
SpinLock (Time micro second)   10.590030
         (Overhead)            0.100940
Barrier  (Time micro second)   11.977548
         (Overhead)            1.488458

Now the above results are very much similar to the results obtained for
the kernel 2.6.21. I hope this helps you guys in some ways. Thank you.

--
#
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [EMAIL PROTECTED]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#
Re: Where is the interrupt going?
On Wed, 21 Nov 2007 17:08:30 -0800 Al Niessner <[EMAIL PROTECTED]> wrote:
>
> Lastly, I would be happy to give out the entire module to anyone who
> requests it, but it is about 550 lines so I did not want to attach it
> to this already long post.
>

Can you send it to me, or even better, post it somewhere online?
There is something I'd like to check to see if you do it correctly, but
I can't without the code...

--
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Re: [RFC/PATCH] SO_NO_CHECK for IPv6
In article <[EMAIL PROTECTED]> (at Thu, 22 Nov 2007 10:34:03 +0800), Herbert Xu <[EMAIL PROTECTED]> says:

> On Wed, Nov 21, 2007 at 07:17:40PM -0500, Jeff Garzik wrote:
> >
> > For those interested, I am dealing with a UDP app that already does very
> > strong checksumming and encryption, so additional software checksumming
> > at the lower layers is quite simply a waste of CPU cycles. Hardware
> > checksumming is fine, as long as its "free."
>
> No matter how strong your underlying checksumming is it's not
> going to protect the IPv6 header is it :)

In that sense, we should use AH.

--yoshfuji
Re: [PATCH] alpha: kill deprecated virt_to_bus
On Wed, 21 Nov 2007 12:26:55 +0100 Jens Axboe <[EMAIL PROTECTED]> wrote:
> On Tue, Nov 20 2007, FUJITA Tomonori wrote:
> > pci-noop.c doesn't use DMA mappings.
>
> you should send that one to the alpha maintainers, it needs to go in
> that way.

Yeah, I'll do that.
[PATCH -mm 3/4] swiotlb: respect the segment boundary limits
This patch makes swiotlb not allocate a memory area spanning LLD's segment boundary. is_span_boundary() judges whether a memory area spans LLD's segment boundary. If map_single finds such a area, map_single tries to find the next available memory area. Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]> --- lib/swiotlb.c | 41 +++-- 1 files changed, 35 insertions(+), 6 deletions(-) diff --git a/lib/swiotlb.c b/lib/swiotlb.c index 1a8050a..4bb5a11 100644 --- a/lib/swiotlb.c +++ b/lib/swiotlb.c @@ -282,6 +282,15 @@ address_needs_mapping(struct device *hwdev, dma_addr_t addr) return (addr & ~mask) != 0; } +static inline unsigned int is_span_boundary(unsigned int index, + unsigned int nslots, + unsigned long offset_slots, + unsigned long max_slots) +{ + unsigned long offset = (offset_slots + index) & (max_slots - 1); + return offset + nslots > max_slots; +} + /* * Allocates bounce buffer and returns its kernel virtual address. */ @@ -292,6 +301,16 @@ map_single(struct device *hwdev, char *buffer, size_t size, int dir) char *dma_addr; unsigned int nslots, stride, index, wrap; int i; + unsigned long start_dma_addr; + unsigned long mask; + unsigned long offset_slots; + unsigned long max_slots; + + mask = dma_get_seg_boundary(hwdev); + start_dma_addr = virt_to_bus(io_tlb_start) & mask; + + offset_slots = ALIGN(start_dma_addr, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT; + max_slots = ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT; /* * For mappings greater than a page, we limit the stride (and @@ -311,10 +330,17 @@ map_single(struct device *hwdev, char *buffer, size_t size, int dir) */ spin_lock_irqsave(_tlb_lock, flags); { - wrap = index = ALIGN(io_tlb_index, stride); - + index = ALIGN(io_tlb_index, stride); if (index >= io_tlb_nslabs) - wrap = index = 0; + index = 0; + + while (is_span_boundary(index, nslots, offset_slots, + max_slots)) { + index += stride; + if (index >= io_tlb_nslabs) + index = 0; + } + wrap = index; do { /* @@ -341,9 +367,12 @@ map_single(struct device 
*hwdev, char *buffer, size_t size, int dir) goto found; } - index += stride; - if (index >= io_tlb_nslabs) - index = 0; + do { + index += stride; + if (index >= io_tlb_nslabs) + index = 0; + } while (is_span_boundary(index, nslots, offset_slots, + max_slots)); } while (index != wrap); spin_unlock_irqrestore(_tlb_lock, flags); -- 1.5.3.4 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
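The boundary test in this patch is easy to try outside the kernel. The sketch below copies the arithmetic of is_span_boundary() and wraps it in a hypothetical helper, next_nonspanning_index(), that steps through slots the way the new loops in map_single() do. Unlike the patch, the helper gives up after nslabs attempts and returns -1, so the sketch cannot loop forever on impossible requests.

```c
#include <assert.h>

/* Same test as the patch's is_span_boundary(); max_slots must be a power of two. */
static int is_span_boundary(unsigned int index, unsigned int nslots,
                            unsigned long offset_slots,
                            unsigned long max_slots)
{
    unsigned long offset = (offset_slots + index) & (max_slots - 1);

    return offset + nslots > max_slots;
}

/*
 * Hypothetical helper: advance from 'index' in steps of 'stride', wrapping at
 * 'nslabs', until the candidate no longer spans a boundary. Unlike the loops
 * in the patch, this gives up after nslabs probes and returns -1.
 */
static long next_nonspanning_index(unsigned int index, unsigned int stride,
                                   unsigned int nslabs, unsigned int nslots,
                                   unsigned long offset_slots,
                                   unsigned long max_slots)
{
    assert(stride > 0);
    assert(max_slots != 0 && (max_slots & (max_slots - 1)) == 0);

    for (unsigned int tries = 0; tries < nslabs; tries++) {
        if (!is_span_boundary(index, nslots, offset_slots, max_slots))
            return index;
        index += stride;
        if (index >= nslabs)
            index = 0;
    }
    return -1;
}
```

For example, with a boundary every 16 slots, a 4-slot request at index 14 would cover slots 14 to 17 and is rejected, while the same request at index 12 or 16 fits inside one window.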
[PATCH -mm 0/4] fix iommu segment boundary problems
This is the latter half of my iommu work to make the IOMMUs respect
LLDs' restrictions.

IOMMUs allocate memory areas without considering a low level driver's
segment boundary limits. So we have some workarounds: splitting sg
segments again in LLDs; reserving all I/O space spanning the 4GB boundary
in IOMMUs (with the assumption that all the LLDs have 4GB boundary
restrictions). The goal is killing all the workarounds.

This patchset adds new accessors for segment_boundary_mask in the
device_dma_parameters structure in the same way as the first half of my
work did for max_segment_size.

Currently, I fixed only swiotlb. Next, I'll generalize swiotlb's free
area management and convert all the IOMMUs to use it. Or I'll
generalize a free area management to use the bitmap that most of the
IOMMUs use and convert them to use it.

This is against 2.6.24-rc3-mm1.

The first half of my iommu work is:

http://thread.gmane.org/gmane.linux.scsi/35602
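To make "segment boundary" concrete: a driver with a boundary mask of 0xffffffff cannot handle a single segment whose first and last byte lie on different sides of a 4GB line. A minimal user-space sketch of that check follows; crosses_seg_boundary() is a made-up name for illustration, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if the byte range [addr, addr + len) crosses a boundary described
 * by 'mask' (e.g. 0xffff for 64KB, 0xffffffff for 4GB). Illustrative helper,
 * not a kernel function.
 */
static int crosses_seg_boundary(uint64_t addr, uint64_t len, uint64_t mask)
{
    assert(len > 0);
    /* First and last byte lie in the same window iff OR-ing in the mask
     * yields the same value for both. */
    return (addr | mask) != ((addr + len - 1) | mask);
}
```

An allocator that respects the limit has to skip any candidate area for which a check like this is true, instead of leaving the driver to split the segment afterwards.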
[PATCH -mm 1/4] add accessors for segment_boundary_mask in device_dma_parameters
This adds new accessors for segment_boundary_mask in the
device_dma_parameters structure in the same way I did for
max_segment_size. So we can easily change where to place struct
device_dma_parameters in the future.

dma_get_seg_boundary returns 0xffffffff if dma_parms in struct device
isn't set up properly. 0xffffffff is the default value used in the block
layer and the scsi mid layer.

Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
---
 include/linux/dma-mapping.h |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 71972ca..7d157ed 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -75,6 +75,21 @@ static inline unsigned int dma_set_max_seg_size(struct device *dev,
 		return -EIO;
 }
 
+static inline unsigned long dma_get_seg_boundary(struct device *dev)
+{
+	return dev->dma_parms ?
+		dev->dma_parms->segment_boundary_mask : 0xffffffff;
+}
+
+static inline int dma_set_seg_boundary(struct device *dev, unsigned long mask)
+{
+	if (dev->dma_parms) {
+		dev->dma_parms->segment_boundary_mask = mask;
+		return 0;
+	} else
+		return -EIO;
+}
+
 /* flags for the coherent memory api */
 #define	DMA_MEMORY_MAP			0x01
 #define DMA_MEMORY_IO			0x02
-- 
1.5.3.4
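The behaviour of the two accessors can be exercised in user space with cut-down stand-ins for the kernel structures. The sketch below assumes a fallback mask of 0xffffffff (a 4GB boundary) and keeps only the fields the accessors touch; it is an illustration, not the kernel header.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures; only the fields used here. */
struct device_dma_parameters {
    unsigned int max_segment_size;
    unsigned long segment_boundary_mask;
};

struct device {
    struct device_dma_parameters *dma_parms;
};

/* Assumed fallback for devices without dma_parms: a 4GB boundary. */
#define DEFAULT_SEG_BOUNDARY_MASK 0xffffffffUL

static unsigned long dma_get_seg_boundary(const struct device *dev)
{
    return dev->dma_parms ?
        dev->dma_parms->segment_boundary_mask : DEFAULT_SEG_BOUNDARY_MASK;
}

static int dma_set_seg_boundary(struct device *dev, unsigned long mask)
{
    if (!dev->dma_parms)
        return -EIO;    /* nowhere to store the mask */
    dev->dma_parms->segment_boundary_mask = mask;
    return 0;
}
```

A device without dma_parms therefore reads back the fallback and refuses to be configured, so callers can always use the getter but have to check the setter's return value.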
[PATCH -mm 4/4] call dma_set_seg_boundary in __scsi_alloc_queue
This is a one-line patch to add the following to __scsi_alloc_queue():

	dma_set_seg_boundary(dev, shost->dma_boundary);

This is the simplest approach but the result looks odd;
__scsi_alloc_queue() does:

	blk_queue_segment_boundary(q, shost->dma_boundary);
	dma_set_seg_boundary(dev, shost->dma_boundary);
	blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));

I think that it would be better to set up the segment boundary in the
same way as we did for the maximum segment size. That is, removing
shost->dma_boundary and having LLDs call pci_set_dma_seg_boundary (or
its friends). Then __scsi_alloc_queue() can set up both limits in the
same way:

	blk_queue_segment_boundary(q, dma_get_seg_boundary(dev));
	blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));

Killing dma_boundary in scsi_host_template needs a large patch for
libata (dma_boundary is used only by libata and sym53c8xx). I'll send a
patch to do that if it is acceptable. James and Jeff?

Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
---
 drivers/scsi/scsi_lib.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 733176d..2a15a3b 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1767,6 +1767,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
 	blk_queue_max_sectors(q, shost->max_sectors);
 	blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
 	blk_queue_segment_boundary(q, shost->dma_boundary);
+	dma_set_seg_boundary(dev, shost->dma_boundary);
 	blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));
-- 
1.5.3.4
[PATCH -mm 2/4] PCI: add dma segment boundary support
This adds PCI's accessor for segment_boundary_mask in device_dma_parameters. The default segment_boundary is set to 0x, same to the block layer's default value (and the scsi mid layer uses the same value). Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]> --- drivers/pci/pci.c |8 drivers/pci/probe.c |1 + include/linux/pci.h |2 ++ 3 files changed, 11 insertions(+), 0 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index de623cf..3b7e0e0 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1435,6 +1435,14 @@ int pci_set_dma_max_seg_size(struct pci_dev *dev, unsigned int size) EXPORT_SYMBOL(pci_set_dma_max_seg_size); #endif +#ifndef HAVE_ARCH_PCI_SET_DMA_SEGMENT_BOUNDARY +int pci_set_dma_seg_boundary(struct pci_dev *dev, unsigned long mask) +{ + return dma_set_seg_boundary(>dev, mask); +} +EXPORT_SYMBOL(pci_set_dma_seg_boundary); +#endif + /** * pcix_get_max_mmrbc - get PCI-X maximum designed memory read byte count * @dev: PCI device to query diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index aa343e1..2e8b539 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -987,6 +987,7 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus) dev->dev.coherent_dma_mask = 0xull; pci_set_dma_max_seg_size(dev, 65536); + pci_set_dma_seg_boundary(dev, 0x); /* Fix up broken headers */ pci_fixup_device(pci_fixup_header, dev); diff --git a/include/linux/pci.h b/include/linux/pci.h index d56d0b6..a05a843 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -567,6 +567,7 @@ void pci_msi_off(struct pci_dev *dev); int pci_set_dma_mask(struct pci_dev *dev, u64 mask); int pci_set_consistent_dma_mask(struct pci_dev *dev, u64 mask); int pci_set_dma_max_seg_size(struct pci_dev *dev, unsigned int size); +int pci_set_dma_seg_boundary(struct pci_dev *dev, unsigned long mask); int pcix_get_max_mmrbc(struct pci_dev *dev); int pcix_get_mmrbc(struct pci_dev *dev); int pcix_set_mmrbc(struct pci_dev *dev, int mmrbc); @@ -753,6 +754,7 @@ static 
inline int pci_enable_device(struct pci_dev *dev) { return -EIO; } static inline void pci_disable_device(struct pci_dev *dev) { } static inline int pci_set_dma_mask(struct pci_dev *dev, u64 mask) { return -EIO; } static inline int pci_set_dma_max_seg_size(struct pci_dev *dev, unsigned int size) { return -EIO; } +static inline int pci_set_dma_seg_boundary(struct pci_dev *dev, unsigned long mask) { return -EIO; } static inline int pci_assign_resource(struct pci_dev *dev, int i) { return -EBUSY;} static inline int __pci_register_driver(struct pci_driver *drv, struct module *owner) { return 0;} static inline int pci_register_driver(struct pci_driver *drv) { return 0;} -- 1.5.3.4 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC] [7/9] Convert TCP exports into namespaces
I defined two namespaces: tcp for TCP internals which are only used by tcp_ipv6.ko And tcpcong for exports used by the TCP congestion modules No need to export any TCP internals to anybody else. So express this in a namespace. I admit I'm not 100% sure tcpcong makes sense -- there might be a legitimate need to have external out of tree congestion modules. They seem nearly like drivers, but only nearly. If that was deemed the case it would be possible to remove tcpcong again to allow everybody to access this. This implicitely turns all exports into GPL only, but that won't matter because all modules allowed to import TCP functions are GPLed. --- net/ipv4/tcp.c | 71 +++ net/ipv4/tcp_cong.c | 14 - net/ipv4/tcp_input.c | 12 +++ net/ipv4/tcp_ipv4.c | 38 - net/ipv4/tcp_minisocks.c | 12 +++ net/ipv4/tcp_output.c| 12 +++ net/ipv4/tcp_timer.c |2 - 7 files changed, 87 insertions(+), 74 deletions(-) Index: linux/net/ipv4/tcp.c === --- linux.orig/net/ipv4/tcp.c +++ linux/net/ipv4/tcp.c @@ -275,21 +275,21 @@ DEFINE_SNMP_STAT(struct tcp_mib, tcp_sta atomic_t tcp_orphan_count = ATOMIC_INIT(0); -EXPORT_SYMBOL_GPL(tcp_orphan_count); +EXPORT_SYMBOL_NS(tcp, tcp_orphan_count); int sysctl_tcp_mem[3] __read_mostly; int sysctl_tcp_wmem[3] __read_mostly; int sysctl_tcp_rmem[3] __read_mostly; -EXPORT_SYMBOL(sysctl_tcp_mem); -EXPORT_SYMBOL(sysctl_tcp_rmem); -EXPORT_SYMBOL(sysctl_tcp_wmem); +EXPORT_SYMBOL_NS(tcp, sysctl_tcp_mem); +EXPORT_SYMBOL_NS(tcp, sysctl_tcp_rmem); +EXPORT_SYMBOL_NS(tcp, sysctl_tcp_wmem); atomic_t tcp_memory_allocated; /* Current allocated memory. */ atomic_t tcp_sockets_allocated;/* Current number of TCP sockets. */ -EXPORT_SYMBOL(tcp_memory_allocated); -EXPORT_SYMBOL(tcp_sockets_allocated); +EXPORT_SYMBOL_NS(tcp, tcp_memory_allocated); +EXPORT_SYMBOL_NS(tcp, tcp_sockets_allocated); /* * Pressure flag: try to collapse. 
@@ -299,7 +299,7 @@ EXPORT_SYMBOL(tcp_sockets_allocated); */ int tcp_memory_pressure __read_mostly; -EXPORT_SYMBOL(tcp_memory_pressure); +EXPORT_SYMBOL_NS(tcp, tcp_memory_pressure); void tcp_enter_memory_pressure(void) { @@ -309,7 +309,7 @@ void tcp_enter_memory_pressure(void) } } -EXPORT_SYMBOL(tcp_enter_memory_pressure); +EXPORT_SYMBOL_NS(tcp, tcp_enter_memory_pressure); /* * Wait for a TCP event. @@ -1995,7 +1995,7 @@ int compat_tcp_setsockopt(struct sock *s return do_tcp_setsockopt(sk, level, optname, optval, optlen); } -EXPORT_SYMBOL(compat_tcp_setsockopt); +EXPORT_SYMBOL_NS(tcp, compat_tcp_setsockopt); #endif /* Return information about state of tcp endpoint in API format. */ @@ -2061,7 +2061,7 @@ void tcp_get_info(struct sock *sk, struc info->tcpi_total_retrans = tp->total_retrans; } -EXPORT_SYMBOL_GPL(tcp_get_info); +EXPORT_SYMBOL_NS(tcp, tcp_get_info); static int do_tcp_getsockopt(struct sock *sk, int level, int optname, char __user *optval, int __user *optlen) @@ -2174,7 +2174,7 @@ int compat_tcp_getsockopt(struct sock *s return do_tcp_getsockopt(sk, level, optname, optval, optlen); } -EXPORT_SYMBOL(compat_tcp_getsockopt); +EXPORT_SYMBOL_NS(tcp, compat_tcp_getsockopt); #endif struct sk_buff *tcp_tso_segment(struct sk_buff *skb, int features) @@ -2262,7 +2262,7 @@ struct sk_buff *tcp_tso_segment(struct s out: return segs; } -EXPORT_SYMBOL(tcp_tso_segment); +EXPORT_SYMBOL_NS(tcp, tcp_tso_segment); #ifdef CONFIG_TCP_MD5SIG static unsigned long tcp_md5sig_users; @@ -2298,7 +2298,7 @@ void tcp_free_md5sig_pool(void) __tcp_free_md5sig_pool(pool); } -EXPORT_SYMBOL(tcp_free_md5sig_pool); +EXPORT_SYMBOL_NS(tcp, tcp_free_md5sig_pool); static struct tcp_md5sig_pool **__tcp_alloc_md5sig_pool(void) { @@ -2371,7 +2371,7 @@ retry: return pool; } -EXPORT_SYMBOL(tcp_alloc_md5sig_pool); +EXPORT_SYMBOL_NS(tcp, tcp_alloc_md5sig_pool); struct tcp_md5sig_pool *__tcp_get_md5sig_pool(int cpu) { @@ -2384,14 +2384,14 @@ struct tcp_md5sig_pool *__tcp_get_md5sig return (p ? 
*per_cpu_ptr(p, cpu) : NULL); } -EXPORT_SYMBOL(__tcp_get_md5sig_pool); +EXPORT_SYMBOL_NS(tcp, __tcp_get_md5sig_pool); void __tcp_put_md5sig_pool(void) { tcp_free_md5sig_pool(); } -EXPORT_SYMBOL(__tcp_put_md5sig_pool); +EXPORT_SYMBOL_NS(tcp, __tcp_put_md5sig_pool); #endif void tcp_done(struct sock *sk) @@ -2409,7 +2409,7 @@ void tcp_done(struct sock *sk) else inet_csk_destroy_sock(sk); } -EXPORT_SYMBOL_GPL(tcp_done); +EXPORT_SYMBOL_NS(tcp, tcp_done); extern void __skb_cb_too_small_for_tcp(int, int); extern struct tcp_congestion_ops tcp_reno; @@ -2524,15 +2524,28 @@ void __init tcp_init(void) tcp_register_congestion_control(_reno); } -EXPORT_SYMBOL(tcp_close); -EXPORT_SYMBOL(tcp_disconnect);
[PATCH RFC] [9/9] Add a inet namespace
Shared by IP, IPv6, DCCP, UDPLITE, SCTP. The symbols used by tunnel modules weren't put into any name space because there are quite a lot of them. --- net/core/fib_rules.c|9 -- net/ipv4/af_inet.c | 52 net/ipv4/arp.c |1 net/ipv4/icmp.c | 10 +++ net/ipv4/inet_connection_sock.c | 40 +++--- net/ipv4/inet_diag.c|4 +-- net/ipv4/inet_hashtables.c |8 +++--- net/ipv4/inet_timewait_sock.c | 12 - net/ipv4/ip_input.c |2 - net/ipv4/ip_output.c|7 +++-- net/ipv4/ip_sockglue.c | 10 +++ 11 files changed, 86 insertions(+), 69 deletions(-) Index: linux/net/ipv4/af_inet.c === --- linux.orig/net/ipv4/af_inet.c +++ linux/net/ipv4/af_inet.c @@ -218,7 +218,7 @@ out: } u32 inet_ehash_secret __read_mostly; -EXPORT_SYMBOL(inet_ehash_secret); +EXPORT_SYMBOL_NS(inet, inet_ehash_secret); /* * inet_ehash_secret must be set exactly once @@ -235,7 +235,7 @@ void build_ehash_secret(void) inet_ehash_secret = rnd; spin_unlock_bh(_lock); } -EXPORT_SYMBOL(build_ehash_secret); +EXPORT_SYMBOL_NS(inet, build_ehash_secret); /* * Create an inet socket. @@ -1127,7 +1127,7 @@ int inet_sk_rebuild_header(struct sock * return err; } -EXPORT_SYMBOL(inet_sk_rebuild_header); +EXPORT_SYMBOL_NS(inet,inet_sk_rebuild_header); static int inet_gso_send_check(struct sk_buff *skb) { @@ -1235,6 +1235,8 @@ unsigned long snmp_fold_field(void *mib[ } return res; } +/* AK: Not in inet namespace because they're a generic facility. Probably + should be in another file though. 
*/ EXPORT_SYMBOL_GPL(snmp_fold_field); int snmp_mib_init(void *ptr[2], size_t mibsize, size_t mibalign) @@ -1499,20 +1501,30 @@ static int __init ipv4_proc_init(void) MODULE_ALIAS_NETPROTO(PF_INET); -EXPORT_SYMBOL(inet_accept); -EXPORT_SYMBOL(inet_bind); -EXPORT_SYMBOL(inet_dgram_connect); -EXPORT_SYMBOL(inet_dgram_ops); -EXPORT_SYMBOL(inet_getname); -EXPORT_SYMBOL(inet_ioctl); -EXPORT_SYMBOL(inet_listen); -EXPORT_SYMBOL(inet_register_protosw); -EXPORT_SYMBOL(inet_release); -EXPORT_SYMBOL(inet_sendmsg); -EXPORT_SYMBOL(inet_shutdown); -EXPORT_SYMBOL(inet_sock_destruct); -EXPORT_SYMBOL(inet_stream_connect); -EXPORT_SYMBOL(inet_stream_ops); -EXPORT_SYMBOL(inet_unregister_protosw); -EXPORT_SYMBOL(net_statistics); -EXPORT_SYMBOL(sysctl_ip_nonlocal_bind); +MODULE_NAMESPACE_ALLOW(inet, ipv6); +MODULE_NAMESPACE_ALLOW(inet, udplite); +MODULE_NAMESPACE_ALLOW(inet, dccp_ipv6); +MODULE_NAMESPACE_ALLOW(inet, dccp_ipv4); +MODULE_NAMESPACE_ALLOW(inet, dccp); +MODULE_NAMESPACE_ALLOW(inet, sctp); + +/* RED-PEN: would be better to fix wanrouter */ +MODULE_NAMESPACE_ALLOW(inet, wanrouter); + +EXPORT_SYMBOL_NS(inet,inet_accept); +EXPORT_SYMBOL_NS(inet,inet_bind); +EXPORT_SYMBOL_NS(inet,inet_dgram_connect); +EXPORT_SYMBOL_NS(inet,inet_dgram_ops); +EXPORT_SYMBOL_NS(inet,inet_getname); +EXPORT_SYMBOL_NS(inet,inet_ioctl); +EXPORT_SYMBOL_NS(inet,inet_listen); +EXPORT_SYMBOL_NS(inet,inet_register_protosw); +EXPORT_SYMBOL_NS(inet,inet_release); +EXPORT_SYMBOL_NS(inet,inet_sendmsg); +EXPORT_SYMBOL_NS(inet,inet_shutdown); +EXPORT_SYMBOL_NS(inet,inet_sock_destruct); +EXPORT_SYMBOL_NS(inet,inet_stream_connect); +EXPORT_SYMBOL_NS(inet,inet_stream_ops); +EXPORT_SYMBOL_NS(inet,inet_unregister_protosw); +EXPORT_SYMBOL_NS(inet,net_statistics); +EXPORT_SYMBOL_NS(inet,sysctl_ip_nonlocal_bind); Index: linux/net/ipv4/arp.c === --- linux.orig/net/ipv4/arp.c +++ linux/net/ipv4/arp.c @@ -1406,6 +1406,7 @@ static int __init arp_proc_init(void) #endif /* CONFIG_PROC_FS */ +/* No namespace because those are 
used by various drivers */ EXPORT_SYMBOL(arp_broken_ops); EXPORT_SYMBOL(arp_find); EXPORT_SYMBOL(arp_create); Index: linux/net/ipv4/icmp.c === --- linux.orig/net/ipv4/icmp.c +++ linux/net/ipv4/icmp.c @@ -1101,7 +1101,7 @@ void __init icmp_init(struct net_proto_f } } -EXPORT_SYMBOL(icmp_err_convert); -EXPORT_SYMBOL(icmp_send); -EXPORT_SYMBOL(icmp_statistics); -EXPORT_SYMBOL(xrlim_allow); +EXPORT_SYMBOL_NS(inet, icmp_err_convert); +EXPORT_SYMBOL_NS(inet, icmp_send); +EXPORT_SYMBOL_NS(inet, icmp_statistics); +EXPORT_SYMBOL_NS(inet, xrlim_allow); Index: linux/net/ipv4/inet_connection_sock.c === --- linux.orig/net/ipv4/inet_connection_sock.c +++ linux/net/ipv4/inet_connection_sock.c @@ -26,7 +26,7 @@ #ifdef INET_CSK_DEBUG const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n"; -EXPORT_SYMBOL(inet_csk_timer_bug_msg); +EXPORT_SYMBOL_NS(inet, inet_csk_timer_bug_msg); #endif /* @@ -73,7 +73,7 @@ int
[PATCH RFC] [8/9] Put UDP exports into a namespace
The UDP exports are only used by UDPv6 and UDP lite. They are internal
functions not supposed to be used by anybody else. So turn them into
a name space that only allows those.

---
 net/ipv4/udp.c     |   27 +++++++++++++++------------
 net/ipv4/udplite.c |    6 +++---
 2 files changed, 18 insertions(+), 15 deletions(-)

Index: linux/net/ipv4/udp.c
===================================================================
--- linux.orig/net/ipv4/udp.c
+++ linux/net/ipv4/udp.c
@@ -105,6 +105,9 @@
 #include
 #include "udp_impl.h"

+MODULE_NAMESPACE_ALLOW(udp, udplite);
+MODULE_NAMESPACE_ALLOW(udp, ipv6);
+
 /*
  *	Snmp MIB for the UDP layer
  */
@@ -1641,18 +1644,18 @@ void udp4_proc_exit(void)
 }
 #endif /* CONFIG_PROC_FS */

-EXPORT_SYMBOL(udp_disconnect);
-EXPORT_SYMBOL(udp_hash);
-EXPORT_SYMBOL(udp_hash_lock);
-EXPORT_SYMBOL(udp_ioctl);
-EXPORT_SYMBOL(udp_get_port);
-EXPORT_SYMBOL(udp_prot);
-EXPORT_SYMBOL(udp_sendmsg);
-EXPORT_SYMBOL(udp_lib_getsockopt);
-EXPORT_SYMBOL(udp_lib_setsockopt);
-EXPORT_SYMBOL(udp_poll);
+EXPORT_SYMBOL_NS(udp, udp_disconnect);
+EXPORT_SYMBOL_NS(udp, udp_hash);
+EXPORT_SYMBOL_NS(udp, udp_hash_lock);
+EXPORT_SYMBOL_NS(udp, udp_ioctl);
+EXPORT_SYMBOL_NS(udp, udp_get_port);
+EXPORT_SYMBOL_NS(udp, udp_prot);
+EXPORT_SYMBOL_NS(udp, udp_sendmsg);
+EXPORT_SYMBOL_NS(udp, udp_lib_getsockopt);
+EXPORT_SYMBOL_NS(udp, udp_lib_setsockopt);
+EXPORT_SYMBOL_NS(udp, udp_poll);

 #ifdef CONFIG_PROC_FS
-EXPORT_SYMBOL(udp_proc_register);
-EXPORT_SYMBOL(udp_proc_unregister);
+EXPORT_SYMBOL_NS(udp, udp_proc_register);
+EXPORT_SYMBOL_NS(udp, udp_proc_unregister);
 #endif
Index: linux/net/ipv4/udplite.c
===================================================================
--- linux.orig/net/ipv4/udplite.c
+++ linux/net/ipv4/udplite.c
@@ -113,6 +113,6 @@ out_register_err:
 	printk(KERN_CRIT "%s: Cannot add UDP-Lite protocol.\n", __FUNCTION__);
 }

-EXPORT_SYMBOL(udplite_hash);
-EXPORT_SYMBOL(udplite_prot);
-EXPORT_SYMBOL(udplite_get_port);
+EXPORT_SYMBOL_NS(udp, udplite_hash);
+EXPORT_SYMBOL_NS(udp, udplite_prot);
+EXPORT_SYMBOL_NS(udp, udplite_get_port);
[PATCH RFC] [5/9] modpost: Fix a buffer overflow in modpost
When passing a file name longer than 1k, the stack could be overflowed.
Not really a security issue, but still better plugged.

---
 scripts/mod/modpost.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/scripts/mod/modpost.c
===================================================================
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -1656,7 +1656,6 @@ int main(int argc, char **argv)
 {
 	struct module *mod;
 	struct buffer buf = { };
-	char fname[SZ];
 	char *kernel_read = NULL, *module_read = NULL;
 	char *dump_write = NULL;
 	int opt;
@@ -1709,6 +1708,8 @@ int main(int argc, char **argv)
 	err = 0;

 	for (mod = modules; mod; mod = mod->next) {
+		char fname[strlen(mod->name) + 10];
+
 		if (mod->skip)
 			continue;
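The idea behind the fix, sizing the buffer from the actual name instead of a fixed constant, can be sketched in user space as follows. This is only an illustration under assumptions: build_mod_c_name() is a hypothetical helper, the ".mod.c" suffix is assumed for the example, and it uses malloc() where the patch uses a variable-length array.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from modpost: returns a freshly
 * allocated "<name>.mod.c" string whose size is derived from the
 * length of name, so an over-long name cannot overflow a fixed
 * buffer. The caller frees the result. Returns NULL if malloc fails.
 */
static char *build_mod_c_name(const char *name)
{
	static const char suffix[] = ".mod.c";	/* assumed suffix */
	size_t len = strlen(name) + sizeof(suffix); /* sizeof counts the NUL */
	char *fname = malloc(len);
	int written;

	if (!fname)
		return NULL;
	written = snprintf(fname, len, "%s%s", name, suffix);
	/* snprintf returns the length it needed; it must fit exactly. */
	assert(written >= 0 && (size_t)written == len - 1);
	return fname;
}
```

The same length computation would work for a stack array as in the patch; heap allocation is used here only because it keeps very long names off the stack.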
[PATCH RFC] [6/9] Implement namespace checking in modpost
This checks the namespaces at build time in modpost --- scripts/mod/modpost.c | 344 ++ 1 file changed, 317 insertions(+), 27 deletions(-) Index: linux/scripts/mod/modpost.c === --- linux.orig/scripts/mod/modpost.c +++ linux/scripts/mod/modpost.c @@ -1,8 +1,9 @@ -/* Postprocess module symbol versions +/* Postprocess module symbol versions and do various other module checks. * * Copyright 2003 Kai Germaschewski * Copyright 2002-2004 Rusty Russell, IBM Corporation * Copyright 2006 Sam Ravnborg + * Copyright 2007 Andi Kleen, SUSE Labs (changes licensed GPLv2 only) * Based in part on module-init-tools/depmod.c,file2alias * * This software may be used and distributed according to the terms @@ -12,9 +13,13 @@ */ #include +#include #include "modpost.h" #include "../../include/linux/license.h" +#define NS_SEPARATOR '.' +#define NS_SEPARATOR_STRING "." + /* Are we using CONFIG_MODVERSIONS? */ int modversions = 0; /* Warn about undefined symbols? (do so if we have vmlinux) */ @@ -27,6 +32,9 @@ static int external_module = 0; static int vmlinux_section_warnings = 1; /* Only warn about unresolved symbols */ static int warn_unresolved = 0; +/* Fixing those would cause too many ifdefs -- off by default. 
*/ +static int warn_missing_modules = 0; + /* How a symbol is exported */ enum export { export_plain, export_unused, export_gpl, @@ -105,19 +113,43 @@ static struct module *find_module(char * return mod; } -static struct module *new_module(char *modname) +static const char *basename(const char *s) +{ + char *p = strrchr(s, '/'); + if (p) + return p + 1; + return s; +} + +static struct module *find_module_base(char *modname) { struct module *mod; - char *p, *s; - mod = NOFAIL(malloc(sizeof(*mod))); - memset(mod, 0, sizeof(*mod)); - p = NOFAIL(strdup(modname)); + for (mod = modules; mod; mod = mod->next) { + if (strcmp(basename(mod->name), modname) == 0) + break; + } + return mod; +} +static void strip_o(char *p) +{ + char *s; /* strip trailing .o */ if ((s = strrchr(p, '.')) != NULL) if (strcmp(s, ".o") == 0) *s = '\0'; +} + +static struct module *new_module(char *modname) +{ + struct module *mod; + char *p; + + mod = NOFAIL(malloc(sizeof(*mod))); + memset(mod, 0, sizeof(*mod)); + p = NOFAIL(strdup(modname)); + strip_o(p); /* add to list */ mod->name = p; @@ -132,10 +164,12 @@ static struct module *new_module(char *m * struct symbol is also used for lists of unresolved symbols */ #define SYMBOL_HASH_SIZE 1024 +#define NSALLOW_HASH_SIZE 64 struct symbol { struct symbol *next; struct module *module; + const char *namespace; unsigned int crc; int crc_valid; unsigned int weak:1; @@ -147,10 +181,19 @@ struct symbol { char name[0]; }; +struct nsallow { + struct nsallow *next; + struct module *mod; + struct module *orig; + int ref; + char name[0]; +}; + static struct symbol *symbolhash[SYMBOL_HASH_SIZE]; +static struct nsallow *nsallowhash[NSALLOW_HASH_SIZE]; /* This is based on the hash agorithm from gdbm, via tdb */ -static inline unsigned int tdb_hash(const char *name) +static unsigned int tdb_hash(const char *name) { unsigned value; /* Used to compute the hash value. */ unsigned i; /* Used to cycle through random values. 
*/ @@ -192,21 +235,67 @@ static struct symbol *new_symbol(const c return new; } -static struct symbol *find_symbol(const char *name) +static struct symbol *find_symbol(const char *name, const char *ns) { - struct symbol *s; + struct symbol *s, *match; /* For our purposes, .foo matches foo. PPC64 needs this. */ if (name[0] == '.') name++; + match = NULL; for (s = symbolhash[tdb_hash(name) % SYMBOL_HASH_SIZE]; s; s=s->next) { + if (strcmp(s->name, name) == 0) { + match = s; + if (ns && s->namespace && strcmp(s->namespace, ns)) + continue; + return s; + } + } + return ns ? NULL : match; +} + +static struct nsallow *find_nsallow(const char *name, struct module *mod) +{ + struct nsallow *s; + + for (s = nsallowhash[tdb_hash(name)%NSALLOW_HASH_SIZE]; s; s=s->next) { + if (strcmp(s->name, name) == 0 && s->mod == mod) + return s; + } + return NULL; +} + +static struct nsallow *find_nsallow_name(const char *name) +{ + struct nsallow *s; + + for (s =
[PATCH RFC] [4/9] modpost: Fix format string warnings
Fix wrong format strings in modpost exposed by the previous patch,
including one missing argument: some random data was printed instead.

---
 scripts/mod/modpost.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux/scripts/mod/modpost.c
===================================================================
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -388,7 +388,7 @@ static int parse_elf(struct elf_info *in

 	/* Check if file offset is correct */
 	if (hdr->e_shoff > info->size) {
-		fatal("section header offset=%u in file '%s' is bigger then filesize=%lu\n", hdr->e_shoff, filename, info->size);
+		fatal("section header offset=%lu in file '%s' is bigger then filesize=%lu\n", (unsigned long)hdr->e_shoff, filename, info->size);
 		return 0;
 	}

@@ -409,7 +409,7 @@ static int parse_elf(struct elf_info *in
 		const char *secname;

 		if (sechdrs[i].sh_offset > info->size) {
-			fatal("%s is truncated. sechdrs[i].sh_offset=%u > sizeof(*hrd)=%ul\n", filename, (unsigned int)sechdrs[i].sh_offset, sizeof(*hdr));
+			fatal("%s is truncated. sechdrs[i].sh_offset=%lu > sizeof(*hrd)=%lu\n", filename, (unsigned long)sechdrs[i].sh_offset, sizeof(*hdr));
 			return 0;
 		}
 		secname = secstrings + sechdrs[i].sh_name;
@@ -907,7 +907,8 @@ static void warn_sec_mismatch(const char
 		     "before '%s' (at offset -0x%llx)\n",
 		     modname, fromsec, (unsigned long long)r.r_offset,
 		     secname, refsymname,
-		     elf->strtab + after->st_name);
+		     elf->strtab + after->st_name,
+		     (unsigned long long)r.r_offset);
 	} else {
 		warn("%s(%s+0x%llx): Section mismatch: reference to %s:%s\n",
 		     modname, fromsec, (unsigned long long)r.r_offset,
[PATCH RFC] [3/9] modpost: Declare the modpost error functions as printf like
This way gcc can warn about wrong format strings.

---
 scripts/mod/modpost.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux/scripts/mod/modpost.c
===================================================================
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -33,7 +33,9 @@ enum export {
 	export_unused_gpl, export_gpl_future, export_unknown
 };

-void fatal(const char *fmt, ...)
+#define PRINTF __attribute__ ((format (printf, 1, 2)))
+
+PRINTF void fatal(const char *fmt, ...)
 {
 	va_list arglist;

@@ -46,7 +48,7 @@ void fatal(const char *fmt, ...)
 	exit(1);
 }

-void warn(const char *fmt, ...)
+PRINTF void warn(const char *fmt, ...)
 {
 	va_list arglist;

@@ -57,7 +59,7 @@ void warn(const char *fmt, ...)
 	va_end(arglist);
 }

-void merror(const char *fmt, ...)
+PRINTF void merror(const char *fmt, ...)
 {
 	va_list arglist;
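For readers unfamiliar with the attribute, here is a small stand-alone sketch of how it is applied to a variadic function. The names PRINTF_LIKE and format_msg() are made up for this example and are not part of modpost; the two numbers are the 1-based positions of the format string and of the first variadic argument.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* fmt_idx: position of the format string; arg_idx: first "..." argument. */
#define PRINTF_LIKE(fmt_idx, arg_idx) \
	__attribute__((format(printf, fmt_idx, arg_idx)))

/*
 * Hypothetical helper for illustration: formats into buf and returns
 * the number of characters the complete message needs (as vsnprintf
 * does). Here the format string is argument 3 and the variadic
 * arguments start at argument 4.
 */
PRINTF_LIKE(3, 4)
static int format_msg(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);
	return n;
}
```

With the attribute in place, a call such as format_msg(buf, sizeof(buf), "%s", 42) should draw a format warning from gcc when -Wformat (included in -Wall) is enabled, which is the class of bug the follow-up patch fixes by hand.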
[PATCH RFC] [2/9] Fix duplicate symbol check to also check future gpl and unused symbols
This seems to have been forgotten earlier. Right now it was possible for a normal symbol to override a future gpl symbol and similar. I restructured the code a bit to avoid too much duplicated code. --- kernel/module.c | 45 - 1 file changed, 24 insertions(+), 21 deletions(-) Index: linux/kernel/module.c === --- linux.orig/kernel/module.c +++ linux/kernel/module.c @@ -1430,33 +1430,36 @@ EXPORT_SYMBOL_GPL(do_symbol_get); * Ensure that an exported symbol [global namespace] does not already exist * in the kernel or in some other module's exported symbol table. */ -static int verify_export_symbols(struct module *mod) + +static int check_duplicate(const struct kernel_symbol *syms, int num, struct module *owner) { - const char *name = NULL; - unsigned long i, ret = 0; - struct module *owner; + int i; const unsigned long *crc; - for (i = 0; i < mod->num_syms; i++) - if (find_symbol(mod->syms[i].name, , , 1, mod)) { - name = mod->syms[i].name; - ret = -ENOEXEC; - goto dup; - } - - for (i = 0; i < mod->num_gpl_syms; i++) - if (find_symbol(mod->gpl_syms[i].name, , , 1, mod)) { - name = mod->gpl_syms[i].name; - ret = -ENOEXEC; - goto dup; + for (i = 0; i < num; i++) + if (find_symbol(syms[i].name, , , 1, owner)) { + printk(KERN_ERR "%s: exports duplicate symbol %s (owned by %s)\n", + owner->name, syms[i].name, module_name(owner)); + return -ENOEXEC; } + return 0; +} -dup: +static int verify_export_symbols(struct module *mod) +{ + int ret = check_duplicate(mod->syms, mod->num_syms, mod); if (ret) - printk(KERN_ERR "%s: exports duplicate symbol %s (owned by %s)\n", - mod->name, name, module_name(owner)); - - return ret; + return ret; + ret = check_duplicate(mod->gpl_syms, mod->num_gpl_syms, mod); + if (ret) + return ret; + ret = check_duplicate(mod->unused_syms, mod->num_unused_syms, mod); + if (ret) + return ret; + ret = check_duplicate(mod->unused_gpl_syms, mod->num_unused_gpl_syms, mod); + if (ret) + return ret; + return check_duplicate(mod->gpl_future_syms, 
mod->num_gpl_future_syms, mod); } /* Change all symbols so that sh_value encodes the pointer directly. */
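The restructuring in this patch, one helper run over every export table instead of one copied loop per table, can be shown with a toy model. Everything below is an assumption made for illustration: the toy_* types, already_exported() and verify_exports() are not kernel code, and the lookup is a plain array scan rather than the kernel's find_symbol().

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the kernel's struct kernel_symbol (name only). */
struct toy_symbol {
	const char *name;
};

/* One export table of a module: plain, GPL, unused, future-GPL, ... */
struct toy_table {
	const struct toy_symbol *syms;
	size_t num;
};

/* Hypothetical lookup: is name already exported by someone else? */
static int already_exported(const char *name,
			    const struct toy_symbol *known, size_t nknown)
{
	size_t i;

	for (i = 0; i < nknown; i++)
		if (strcmp(known[i].name, name) == 0)
			return 1;
	return 0;
}

/* Shared helper: the same check, whichever table the symbols come from. */
static int check_duplicate(const struct toy_table *t,
			   const struct toy_symbol *known, size_t nknown)
{
	size_t i;

	for (i = 0; i < t->num; i++)
		if (already_exported(t->syms[i].name, known, nknown))
			return -1;	/* the patch returns -ENOEXEC here */
	return 0;
}

/* Run the helper over every table and stop at the first duplicate. */
static int verify_exports(const struct toy_table *tables, size_t ntables,
			  const struct toy_symbol *known, size_t nknown)
{
	size_t i;

	for (i = 0; i < ntables; i++) {
		int ret = check_duplicate(&tables[i], known, nknown);

		if (ret)
			return ret;
	}
	return 0;
}
```

In this form, adding a new class of exports only means adding one more table to the list, which is how the forgotten unused and future-GPL tables get covered.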
[PATCH RFC] [1/9] Core module symbol namespaces code and intro.
There seems to be rough consensus that the kernel currently has too many
exported symbols. A lot of these exports are generally usable utility
functions or important driver interfaces, but another large part are
functions intended for use by only one or two very specific modules, for
a very specific purpose. One example is the TCP code. It has most of its
internals exported, but only for use by tcp_ipv6.c (and now a few more by
the TCP/IP congestion modules). But it doesn't make sense to include these
functions, which are exported for one specific module, in a broader
"kernel interface". External modules assume they can use these functions,
but they were never intended for that.

This patch allows exporting symbols only to specific modules by
introducing symbol name spaces. A module name space has a white list of
modules that are allowed to import symbols from it; all others can't use
the symbols.

It adds two new macros:

MODULE_NAMESPACE_ALLOW(namespace, module);

Allow module to import symbols from namespace. module is the module name
without .ko, as displayed by lsmod. Must be in the same module as the
export (and be duplicated if there are multiple modules exporting symbols
to a namespace). Multiple allows for the same name space are allowed.

EXPORT_SYMBOL_NS(namespace, symbol);

Export symbol into namespace. Only modules allowed for the namespace will
be able to use them. EXPORT_SYMBOL_NS implies GPL only because it is only
for "internal" interfaces.

The name spaces only work for module loading. I didn't find a nice way to
make them work inside the main kernel binary. This means the name space is
not enforced for modules that are built in.

The biggest amount of work is of course still open: to go over all the
existing exports and figure out for which ones it makes sense to define a
namespace. I did it for TCP and UDP so far, but the kernel right now has
nearly 10k exports (with some dups) that would need to be checked and
turned into name spaces.
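The white-list rule described above can be modelled in a few lines of user-space C. This is a sketch of the semantics only, not of the loader code in this series: struct ns_allow, struct toy_export and may_import() are hypothetical names, and an export with no namespace is assumed to stay importable by everyone.

```c
#include <stddef.h>
#include <string.h>

/* One (namespace, module) pair, as a MODULE_NAMESPACE_ALLOW line records. */
struct ns_allow {
	const char *ns;
	const char *module;
};

/* Toy export: ns == NULL models an ordinary export with no namespace. */
struct toy_export {
	const char *name;
	const char *ns;
};

/*
 * Returns 1 if module importer may use exp, 0 otherwise.
 * Exports without a namespace are open to everybody; namespaced
 * exports require a matching entry in the white list.
 */
static int may_import(const struct toy_export *exp, const char *importer,
		      const struct ns_allow *allow, size_t nallow)
{
	size_t i;

	if (!exp->ns)
		return 1;
	for (i = 0; i < nallow; i++)
		if (strcmp(allow[i].ns, exp->ns) == 0 &&
		    strcmp(allow[i].module, importer) == 0)
			return 1;
	return 0;
}
```

With a white list holding ("udp", "udplite") and ("udp", "ipv6"), a symbol exported into the udp namespace is usable by ipv6 but not by an arbitrary third module, which matches the intent of the UDP patch in this series.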
I would expect any symbol that is only used by one or two other modules is a strong candidate for a namespace; in some cases even more with modules that are tightly coupled. I am optimistic that in the end we will have a much more manageable kernel interface. Caveats: Exports need one long word more memory. I had to add some alignment magic to the existing EXPORT_SYMBOLs to get the sections right. Tested on i386/x86-64, but I hope it also still works on architectures with stricter alignment requirements like ARM. Any testers for that? --- arch/arm/kernel/armksyms.c|2 include/asm-generic/vmlinux.lds.h |7 + include/linux/module.h| 71 +++ kernel/module.c | 137 +++--- 4 files changed, 177 insertions(+), 40 deletions(-) Index: linux/include/linux/module.h === --- linux.orig/include/linux/module.h +++ linux/include/linux/module.h @@ -34,6 +34,7 @@ struct kernel_symbol { unsigned long value; const char *name; + const char *namespace; }; struct modversion_info @@ -167,49 +168,80 @@ struct notifier_block; #ifdef CONFIG_MODULES /* Get/put a kernel symbol (calls must be symmetric) */ -void *__symbol_get(const char *symbol); -void *__symbol_get_gpl(const char *symbol); +extern void *do_symbol_get(const char *symbol, struct module *caller); +#define __symbol_get(sym) do_symbol_get(sym, THIS_MODULE) #define symbol_get(x) ((typeof())(__symbol_get(MODULE_SYMBOL_PREFIX #x))) +struct module_ns { + char *name; + char *allowed; +}; + +#define NS_SEPARATOR "." + +/* + * Allow module MODULE to reference namespace NS. + * MODULE is just the base module name with suffix or path. + * This must be declared in the module (or main kernel) as where the + * symbols are defined. When multiple modules export symbols from + * a single namespace all modules need to contain a full set + * of MODULE_NAMESPACE_ALLOWs. 
+ */ +#define MODULE_NAMESPACE_ALLOW(ns, module) \ + static const struct module_ns __knamespace_##module##_##_##ns \ + asm("__knamespace_" #module NS_SEPARATOR #ns) \ + __attribute_used__ \ + __attribute__((section("__knamespace"), unused))\ + = { #ns, #module } + #ifndef __GENKSYMS__ #ifdef CONFIG_MODVERSIONS /* Mark the CRC weak since genksyms apparently decides not to * generate a checksums for some symbols */ -#define __CRC_SYMBOL(sym, sec) \ +#define __CRC_SYMBOL(sym, sec, post, post2)\ extern void *__crc_##sym __attribute__((weak)); \ - static const unsigned long __kcrctab_##sym \ + static const unsigned long __kcrctab_##sym##post\ + asm("__kcrctab_" #sym post2)\ __attribute_used__ \
[PATCH] Allow changing O_SYNC with fcntl().
Is there a reason why this isn't allowed now?

---
 fs/fcntl.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 8685263..fc0c92e 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -203,7 +203,7 @@ asmlinkage long sys_dup(unsigned int fildes)
 	return ret;
 }

-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME | O_SYNC)

 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
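From user space, the effect of SETFL_MASK can be probed with a small check like the one below. It is a sketch, and try_set_osync() is a made-up name: it asks for O_SYNC via F_SETFL and then reads the flags back, since a flag outside the mask is expected to be dropped silently rather than rejected with an error.

```c
#include <fcntl.h>

/*
 * Tries to turn on O_SYNC for an already-open descriptor.
 * Returns 1 if the flag is set afterwards, 0 if the request was
 * accepted but the flag did not stick, -1 if fcntl() itself failed.
 */
static int try_set_osync(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	if (fcntl(fd, F_SETFL, flags | O_SYNC) == -1)
		return -1;
	flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;
	/* O_SYNC may consist of several bits, so compare the full mask. */
	return (flags & O_SYNC) == O_SYNC;
}
```

A return value of 0 on a kernel without this change and 1 on a kernel with it would be the expected difference; which one a given system reports depends on the kernel it runs.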
Re: [RFC/PATCH] SO_NO_CHECK for IPv6
On Wed, Nov 21, 2007 at 07:17:40PM -0500, Jeff Garzik wrote:
>
> For those interested, I am dealing with a UDP app that already does very
> strong checksumming and encryption, so additional software checksumming
> at the lower layers is quite simply a waste of CPU cycles. Hardware
> checksumming is fine, as long as its "free."

No matter how strong your underlying checksumming is it's not going
to protect the IPv6 header is it :)

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Possible bug from kernel 2.6.22 and above
On Wed, 21 Nov 2007 at 20:52 -0500, Jie Chen wrote:
> Eric Dumazet wrote:
> > Jie Chen wrote:
> >> Hi, there:
> >>
> >> We have a simple pthread program that measures the synchronization
> >> overheads for various synchronization mechanisms such as spin locks,
> >> barriers (the barrier is implemented using queue-based barrier
> >> algorithm) and so on. We have dual quad-core AMD opterons (barcelona)
> >> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7
> >> distribution. Before we moved to this kernel, we had kernel 2.6.21.
> >> These two kernels are configured identical and compiled with the same
> >> gcc 4.1.2 compiler. Under the old kernel, we observed that the
> >> performance of these overheads increases as the number of threads
> >> increases from 2 to 8. The following are the values of total time and
> >> overhead for all threads acquiring a pthread spin lock and all threads
> >> executing a barrier synchronization call.
> >
> > Could you post the source of your test program ?
> >
> >
> Hi, Eric:
>
> Thank you for the quick response. You can get the source code containing
> the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz . This is a
> data parallel threading package for physics calculation. The test code
> is pthread_sync in the src directory once you unpack the gz file. To
> configure and build this package is very simple: configure and make. The
> test program is built by make check. The number of threads is
> controlled by QMT_NUM_THREADS. The package is using pthread spin lock,
> but the barrier is implemented using a queue-based barrier algorithm
> proposed by J. B. Carter of University of Utah (2005).
>
> >
> > spinlock are ... spining and should not call linux scheduler, so I have
> > no idea why a kernel change could modify your results.
> >
> > Also I suspect you'll have better results with Fedora Core 8 (since
> > glibc was updated to use private futexes in v 2.7), at least for the
> > barrier ops.
> >
> >
>
> I am not sure what the biggest change between kernel 2.6.21 and 2.6.22
> (23) is? Is the scheduler the biggest change between these versions? Can
> the scheduler of kernel somehow effect the performance? I know the
> scheduler is trying to do load balance and so on. Can the scheduler move
> threads to different cores according to the load balance algorithm even
> though the threads are bound to cores using pthread_setaffinity_np call
> when the number of threads is fewer than the number of cores? I am
> thinking about this because the performance of our test code is roughly
> the same for both kernels when the number of threads equals to the
> number of cores.
>

There is a backport of the CFS scheduler to 2.6.21, see
http://lkml.org/lkml/2007/11/19/127

> >>
> >> Kernel 2.6.21
> >> Number of Threads              2          4          6          8
> >> SpinLock (Time micro second)   10.5618    10.58538   10.5915    10.643
> >>          (Overhead)            0.073      0.05746    0.102805   0.154563
> >> Barrier  (Time micro second)   11.020410  11.678125  11.9889    12.38002
> >>          (Overhead)            0.531660   1.1502     1.500112   1.891617
> >>
> >> Each thread is bound to a particular core using pthread_setaffinity_np.
> >>
> >> Kernel 2.6.23.8
> >> Number of Threads              2          4          6          8
> >> SpinLock (Time micro second)   14.849915  17.117603  14.4496    10.5990
> >>          (Overhead)            4.345417   6.617207   3.949435   0.110985
> >> Barrier  (Time micro second)   19.462255  20.285117  16.19395   12.37662
> >>          (Overhead)            8.957755   9.784722   5.699590   1.869518
> >>
> >> It is clearly that the synchronization overhead increases as the
> >> number of threads increases in the kernel 2.6.21. But the
> >> synchronization overhead actually decreases as the number of threads
> >> increases in the kernel 2.6.23.8 (We observed the same behavior on
> >> kernel 2.6.22 as well). This certainly is not a correct behavior. The
> >> kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
> >> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
> >> configuration file is in the attachment of this e-mail.
> >>
> >> From what we have read, there was a new scheduler (CFS) appeared from
> >> 2.6.22. We are not sure whether the above behavior is caused by the
> >> new scheduler.
> >>
> >> Finally, our machine cpu information is listed in the following:
> >>
> >> processor       : 0
> >> vendor_id       : AuthenticAMD
> >> cpu family      : 16
> >> model           : 2
> >> model name      : Quad-Core AMD Opteron(tm) Processor 2347
> >> stepping        : 10
> >> cpu MHz         : 1909.801
> >> cache size      : 512 KB
> >> physical id     : 0
> >> siblings        : 4
> >> core id         : 0
> >> cpu cores       : 4
> >> fpu             : yes
> >> fpu_exception   : yes
> >> cpuid level     : 5
> >> wp
Re: Where is the interrupt going?
On Wed, Nov 21, 2007 at 05:08:30PM -0800, Al Niessner wrote:
> On with the detailed technical information. I developed a kernel module
> for a PCI card back in 2.4, moved it to 2.6.3, then 2.6.11 or so and
> now I am trying to move it to 2.6.22. When I began the move to
> 2.6.22, I changed all of the deprecated calls for finding the card on
> the PCI bus, modified the interrupt handler prototype, and changed my
> readv/writev to aio_read/aio_write following
> http://lwn.net/Articles/202449/. So initialization looks like this:
>

Hi Al,

From the sounds of it, you might have an interrupt routing problem. Can
you describe the machine you have this plugged into? Possibly attaching
a copy of "dmesg" and "/proc/interrupts"?

Feel free to attach the driver source to your email if the size is
reasonable (which it sounds like it is.)

As a "big hammer" in case it is an APIC problem, please try booting the
kernel with the "noapic" parameter.

cheers, Kyle
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Where is the interrupt going?
On 22/11/2007, Al Niessner <[EMAIL PROTECTED]> wrote:
>
> Quickly stated, I have a piece of hardware on the PCI bus that is
> generating an interrupt (can watch it with a scope) but my handler is
> not being called (no printk in /var/log/messages). So, where has the
> interrupt gone?
>

Just to rule out the trivial causes. Could it be that you've simply not
configured your system to log messages at the loglevel that your
printk() is using?

-- 
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html
Re: mmap dirty limits on 32 bit kernels (Was: [BUG] New Kernel Bugs)
On Thu, Nov 22, 2007 at 10:51:15AM +1100, Bron Gondwana wrote:
> On Thu, Nov 15, 2007 at 08:32:22AM -0800, Linus Torvalds wrote:
> > If this patch makes a difference, please holler. I think it's the correct
> > thing to do, but I'm not going to actually commit it without somebody
> > saying that it makes a difference (and preferably Peter Zijlstra and
> > Andrew acking it too).
>
> mmap: mmap call failed: errno: 12 errmsg: Cannot allocate memory
>
> Yep, that's "fixed" the problem alright! No way this puppy is
> dirtying 2Gb of memory any more.
>
> http://linux.brong.fastmail.fm/2007-11-22/bmtest.pl

Alternatively perhaps I'm just a moron who used a config file with:

CONFIG_PAGE_OFFSET=0x8000

set to build the new kernel (I hadn't committed it because it turned
out not to solve the issue it was there for). That would explain a few
things.

[EMAIL PROTECTED] perl]$ free
             total       used       free     shared    buffers     cached
Mem:       4150620    2272284    1878336          0      11212    2066536
-/+ buffers/cache:     194536    3956084
Swap:      2096472          0    2096472

That's more the usage I would expect to see.

Now for the downside. It works again, but it still runs slow. Seems to
hit (and this is totally unscientific, I'm just watching the numbers
scroll by) at about 12 writes rather than 7 writes, but that's still not
fitting the whole file dirty.

I notice that PF_LESS_THROTTLE gets set by nfsd to get an extra 25%
bonus free space allocated. Potentially dcc could use similar tricks
to claim extra space if that knob is available up in userspace.

I'm happy to patch dcc as well if I have to, I'm already backporting
it, so adding another little quilt directory and applying it is pretty
trivial (must try guilt/stgit one of these days)

Bron.
Re: Where is the interrupt going?
On Thu, Nov 22, 2007 at 01:56:25AM +, Alan Cox wrote:
> > status = request_irq (apcsi[i].board_irq,
> >                       apc8620_handler,
> >                       IRQF_DISABLED,
>
> You set IRQF_DISABLED
>
> Do you then enable the interrupt anywhere later on ?
>

IRQF_DISABLED just means that the handler is atomic wrt other local
interrupts. Shouldn't be the cause of this.

cheers, Kyle
Re: Where is the interrupt going?
> status = request_irq (apcsi[i].board_irq,
>                       apc8620_handler,
>                       IRQF_DISABLED,

You set IRQF_DISABLED

Do you then enable the interrupt anywhere later on ?

Alan
Re: Possible bug from kernel 2.6.22 and above
Eric Dumazet wrote:
> Jie Chen wrote:
> > Hi, there:
> >
> > We have a simple pthread program that measures the synchronization
> > overheads for various synchronization mechanisms such as spin locks,
> > barriers (the barrier is implemented using queue-based barrier
> > algorithm) and so on. We have dual quad-core AMD opterons (barcelona)
> > clusters running 2.6.23.8 kernel at this moment using Fedora Core 7
> > distribution. Before we moved to this kernel, we had kernel 2.6.21.
> > These two kernels are configured identically and compiled with the same
> > gcc 4.1.2 compiler. Under the old kernel, we observed that the
> > performance of these overheads increases as the number of threads
> > increases from 2 to 8. The following are the values of total time and
> > overhead for all threads acquiring a pthread spin lock and all threads
> > executing a barrier synchronization call.
>
> Could you post the source of your test program ?

Hi, Eric:

Thank you for the quick response. You can get the source code containing
the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz . This is a data
parallel threading package for physics calculation. The test code is
pthread_sync in the src directory once you unpack the gz file. To
configure and build this package is very simple: configure and make. The
test program is built by make check. The number of threads is controlled
by QMT_NUM_THREADS. The package is using pthread spin lock, but the
barrier is implemented using a queue-based barrier algorithm proposed by
J. B. Carter of University of Utah (2005).

> spinlock are ... spinning and should not call linux scheduler, so I have
> no idea why a kernel change could modify your results.
>
> Also I suspect you'll have better results with Fedora Core 8 (since glibc
> was updated to use private futexes in v 2.7), at least for the barrier
> ops.

I am not sure what the biggest change between kernel 2.6.21 and 2.6.22
(23) is? Is the scheduler the biggest change between these versions? Can
the scheduler of kernel somehow affect the performance? I know the
scheduler is trying to do load balance and so on. Can the scheduler move
threads to different cores according to the load balance algorithm even
though the threads are bound to cores using pthread_setaffinity_np call
when the number of threads is fewer than the number of cores? I am
thinking about this because the performance of our test code is roughly
the same for both kernels when the number of threads equals the number
of cores.

> > Kernel 2.6.21
> > Number of Threads               2          4          6          8
> > SpinLock (Time micro second)    10.5618    10.58538   10.5915    10.643
> >          (Overhead)             0.073      0.05746    0.102805   0.154563
> > Barrier  (Time micro second)    11.020410  11.678125  11.9889    12.38002
> >          (Overhead)             0.531660   1.1502     1.500112   1.891617
> >
> > Each thread is bound to a particular core using pthread_setaffinity_np.
> >
> > Kernel 2.6.23.8
> > Number of Threads               2          4          6          8
> > SpinLock (Time micro second)    14.849915  17.117603  14.4496    10.5990
> >          (Overhead)             4.345417   6.617207   3.949435   0.110985
> > Barrier  (Time micro second)    19.462255  20.285117  16.19395   12.37662
> >          (Overhead)             8.957755   9.784722   5.699590   1.869518
> >
> > It is clear that the synchronization overhead increases as the
> > number of threads increases in the kernel 2.6.21. But the
> > synchronization overhead actually decreases as the number of threads
> > increases in the kernel 2.6.23.8 (We observed the same behavior on
> > kernel 2.6.22 as well). This certainly is not a correct behavior. The
> > kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
> > CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
> > configuration file is in the attachment of this e-mail.
> >
> > From what we have read, there was a new scheduler (CFS) appeared from
> > 2.6.22. We are not sure whether the above behavior is caused by the
> > new scheduler.
> >
> > Finally, our machine cpu information is listed in the following:
> >
> > processor       : 0
> > vendor_id       : AuthenticAMD
> > cpu family      : 16
> > model           : 2
> > model name      : Quad-Core AMD Opteron(tm) Processor 2347
> > stepping        : 10
> > cpu MHz         : 1909.801
> > cache size      : 512 KB
> > physical id     : 0
> > siblings        : 4
> > core id         : 0
> > cpu cores       : 4
> > fpu             : yes
> > fpu_exception   : yes
> > cpuid level     : 5
> > wp              : yes
> > flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> >   mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
> >   fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni
> >   cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm
> >   sse4a misalignsse 3dnowprefetch osvw
> > bogomips        : 3822.95
> > TLB size        : 1024 4K pages
> > clflush size    : 64
> > cache_alignment : 64
> > address sizes   : 48 bits physical, 48 bits virtual
> > power management: ts ttp tm stc 100mhzsteps hwpstate
> >
> > In addition, we
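A note on reading the two tables: Time minus Overhead comes out at roughly 10.5 microseconds in every column on both kernels. That suggests (this is an inference from the posted numbers, not something the post states) that Overhead is measured against a fixed unsynchronized workload, so the regression sits entirely in the synchronization term. A minimal sketch of that check for the SpinLock rows; the struct and function names are made up for illustration:

```c
#include <stddef.h>

/* One table cell pair: total time and reported overhead, in microseconds. */
struct sample {
	double time_us;
	double overhead_us;
};

/* SpinLock rows from the post, for 2, 4, 6 and 8 threads. */
const struct sample spinlock_2_6_21[4] = {
	{ 10.5618, 0.073 }, { 10.58538, 0.05746 },
	{ 10.5915, 0.102805 }, { 10.643, 0.154563 },
};
const struct sample spinlock_2_6_23_8[4] = {
	{ 14.849915, 4.345417 }, { 17.117603, 6.617207 },
	{ 14.4496, 3.949435 }, { 10.5990, 0.110985 },
};

/* Time left once the reported overhead is subtracted. */
double implied_baseline(struct sample s)
{
	return s.time_us - s.overhead_us;
}

/* Return 1 if every implied baseline is within tol of the first one. */
int baseline_is_stable(const struct sample *s, size_t n, double tol)
{
	double first, diff;
	size_t i;

	if (n == 0)
		return 1;
	first = implied_baseline(s[0]);
	for (i = 1; i < n; i++) {
		diff = implied_baseline(s[i]) - first;
		if (diff < 0)
			diff = -diff;
		if (diff > tol)
			return 0;
	}
	return 1;
}
```

With a tolerance of 0.05 microseconds both kernels pass, which is why the Overhead rows alone are enough to compare them.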
Where is the interrupt going?
Quickly stated, I have a piece of hardware on the PCI bus that is
generating an interrupt (can watch it with a scope) but my handler is
not being called (no printk in /var/log/messages). So, where has the
interrupt gone?

Obligatory information:
1) I have done the google search and mailing list search finding lots of
   ancillary information but not what I needed.
2) Yes, it is my fault, but I need some help from people more directly
   involved in the kernel than myself to point out what I am doing wrong.
3) Thanks for any and all help in advance.

On with the detailed technical information. I developed a kernel module
for a PCI card back in 2.4, moved it to 2.6.3, then 2.6.11 or so and
now I am trying to move it to 2.6.22. When I began the move to
2.6.22, I changed all of the deprecated calls for finding the card on
the PCI bus, modified the interrupt handler prototype, and changed my
readv/writev to aio_read/aio_write following
http://lwn.net/Articles/202449/. So initialization looks like this:

    p8620 = pci_get_device (APC8620_VENDOR_ID, APC8620_DEVICE_ID, p8620);
    <... fail if p8620 is 0 ...>
    apcsi[i].ret_val = register_chrdev (MAJOR_NUM, DEVICE_NAME, _ops);
    <... fail if ret_val < 0 ...>
    apcsi[i].board_irq = p8620->irq;
    status = request_irq (apcsi[i].board_irq,
                          apc8620_handler,
                          IRQF_DISABLED,
                          DEVICE_NAME,
                          (void*)[i]);
    <... fail if status != 0 ...>

I do check all of the return values to verify the call happened
successfully. There are some memory mapping calls that I have left out
since they are working while the interrupt is not. Things seem to work
for the most part because I can read/write data through a memory map and
verify the IndustryPack modules on the carrier through their header. The
memory map is still working sufficiently well that I can program up one
of the IndustryPack modules to generate an interrupt every 2 seconds or
so. Prior to my changes for 2.6.22 this worked quite well.

Since it is the interrupt portion of this game that is giving me grief,
let's stick with just that. apc8620_handler is:

    static irqreturn_t apc8620_handler (int irq, void *did)
    {
      printk (KERN_NOTICE "apc8620: did (0x%lx)\n", (unsigned long)did);
      <... other irrelevant steps ...>
      return IRQ_HANDLED;
    }

I would then expect that every two seconds or so I would see a message
from apc8620_handler pop up. Instead I see nothing. Poking around I see
that the kernel module is loaded and attached to my devices and set for
IRQ 10:

    lsmod:               -> acromag8620 4207556 0
    cat /proc/devices    -> 46 apc8620
    cat /proc/interrupts -> 10: 0 IO-APIC-edge apc8620

With /proc/interrupts, LOC keeps growing at a rate faster than what my
hardware is generating and I have no idea what LOC means, but ERR and
MIS (I take it to mean error and missed respectively) are both 0 and
remain 0 indefinitely. In /var/log/messages, I do not see any missing
interrupt messages or any other report indicating that there is some
trouble.

Assuming no one sees the error I am making right off the bat and would
like me to probe the interrupt system a little bit more, please give me
a suggestion as to where to poke. There is lots of code there and I
would prefer to have a guided poke over a random one. Anyway, I read
through linux/interrupt.h looking for some bit, flag, or call that I
have omitted but found nothing. I understand why the interrupt handlers
have changed, but the changes made should not be causing this problem.

Again, any and all help in finding my lost interrupt is much
appreciated. Lastly, I would be happy to give out the entire module to
anyone who requests it, but it is about 550 lines so I did not want to
attach it to this already long post.

-- 
Al Niessner
818.354.0859

All opinions stated above are mine and do not necessarily
reflect those of JPL or NASA.

| dS | >= 0
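One place to poke that does not depend on printk or syslog settings is the counter column in /proc/interrupts: it should increase whenever the kernel takes that IRQ, whether or not anything is logged. A minimal user-space sketch for sampling it, assuming the usual "N: count [count ...] chip name" line layout; irq_count_from_line() and irq_count() are illustrative helper names, not an existing API:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Sum the per-CPU counters on one line of /proc/interrupts.
 * Returns -1 if the line does not describe interrupt number 'irq'.
 */
long irq_count_from_line(const char *line, int irq)
{
	char *end;
	long n, total = 0;

	n = strtol(line, &end, 10);
	if (end == line || *end != ':' || n != irq)
		return -1;
	end++;				/* step over the ':' */

	for (;;) {
		const char *p = end;

		while (*p == ' ' || *p == '\t')
			p++;
		if (!isdigit((unsigned char)*p))
			break;		/* reached the chip name */
		n = strtol(p, &end, 10);
		if (*end != '\0' && !isspace((unsigned char)*end))
			break;		/* token such as "8259A": not a counter */
		total += n;
	}
	return total;
}

/*
 * Look up 'irq' in /proc/interrupts. Returns the summed count, or -1 if
 * the file cannot be read or the IRQ is not listed. Assumes each line
 * fits in the buffer.
 */
long irq_count(int irq)
{
	char buf[4096];
	long count = -1;
	FILE *f = fopen("/proc/interrupts", "r");

	if (f == NULL)
		return -1;
	while (count < 0 && fgets(buf, sizeof(buf), f) != NULL)
		count = irq_count_from_line(buf, irq);
	fclose(f);
	return count;
}
```

Calling irq_count(10) twice a few seconds apart shows whether the count moves. A count that stays at 0, as in the output quoted above, would point at delivery or routing rather than at the handler or the log level.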
Re: [PATCH 2/9]: Reduce Log I/O latency
On Thu, Nov 22, 2007 at 01:49:25AM +0100, Andi Kleen wrote:
> David Chinner <[EMAIL PROTECTED]> writes:
>
> > To ensure that log I/O is issued as the highest priority I/O, set
> > the I/O priority of the log I/O to the highest possible. This will
> > ensure that log I/O is not held up behind bulk data or other
> > metadata I/O as delaying log I/O can pause the entire transaction
> > subsystem. Introduce a new buffer flag to allow us to tag the log
> > buffers so we can discriminate when issuing the I/O.
>
> Won't that possibly disturb other RT priority users that do not need
> log IO (e.g. working on preallocated files)? Seems a little
> dangerous.

In all the cases that I know of where people are using what could be
considered real-time I/O (e.g. media environments where they do
real-time ingest and playout from the same filesystem) the real-time
ingest processes create the files and do pre-allocation before doing
their I/O.

This I/O can get held up behind another process that is not real time
that has issued log I/O. Given there is no I/O priority inheritance and
having log I/O stall will stall the entire filesystem, we cannot allow
log I/O to stall in real-time environments. Hence it must have the
highest possible priority to prevent this.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [patch] 0/4 Support for Toshiba TMIO multifunction devices
Hi Ian,

Personally I very much appreciate your patches; they will help with
submitting HP iPaqs SOCs/MFDs, you know... ;-) Thus, many thanks in
advance.

A few comments...

On Wed, Nov 21, 2007 at 03:54:15AM +, ian wrote:
> On Wed, 2007-11-21 at 10:23 +0800, eric miao wrote:
> > Roughly went through the patch, looks good, here comes the remind,
> > though :-)
> >
> > 1. is it possible to use some name other than "soc_core", maybe
> > "tmio_core" so that other multifunction chips sharing a core base
> > will live easier.
>
> It's (soc-core) not tmio MFD specific - its already used by other MFD
> chips (although obviously not ones in mainline (yet!)
>
> it might be better named 'mfd-core' though, as thats its intended use...
>
> > 2. those C++ style comments "//" are not so pleasant...
>
> Should I clean them up and resubmit?

I'd resubmit a cleaned up version. I think four or even more
resubmissions are inevitable for such a patch set (new general code + a
lot of drivers).

About the patches themselves... I think soc_add_devices could be split
into two small functions, thus you'll get rid of the high indentation
level and the code will be more reader friendly.

Ideally, checkpatch.pl should be happy. If it is, then there will be
fewer nitpicks somebody can pull. ;-) Here it is:

- - - -
~/linux-2.6$ scripts/checkpatch.pl ~/0001-Reuseable-SOC-core-code-suitable-for-multifunction-c.patch
WARNING: line over 80 characters
#57: FILE: drivers/mfd/soc-core.c:37:
+#define SIGNED_SHIFT(val, shift) ((shift) >= 0 ? ((val) << (shift)) : ((val) >> -(shift)))

WARNING: line over 80 characters
#60: FILE: drivers/mfd/soc-core.c:40:
+	struct soc_device_data *soc, int nr_devs,

WARNING: line over 80 characters
#84: FILE: drivers/mfd/soc-core.c:64:
+		res = kzalloc(blk->num_resources * sizeof (struct resource), GFP_KERNEL);

WARNING: no space between function name and open parenthesis '('
#84: FILE: drivers/mfd/soc-core.c:64:
+		res = kzalloc(blk->num_resources * sizeof (struct resource), GFP_KERNEL);

ERROR: do not use C99 // comments
#89: FILE: drivers/mfd/soc-core.c:69:
+			res[r].name = blk->res[r].name; // Fixme - should copy

WARNING: braces {} are not necessary for single statement blocks
#93: FILE: drivers/mfd/soc-core.c:73:
+			if (blk->res[r].flags & IORESOURCE_MEM) {
+				base = mem->start;
+			} else if ((blk->res[r].flags & IORESOURCE_IRQ) &&

WARNING: braces {} are not necessary for single statement blocks
#95: FILE: drivers/mfd/soc-core.c:75:
+			} else if ((blk->res[r].flags & IORESOURCE_IRQ) &&
+				(blk->res[r].flags & IORESOURCE_IRQ_SOC_SUBDEVICE)) {
+				base = irq_base;
+			}

WARNING: line over 80 characters
#96: FILE: drivers/mfd/soc-core.c:76:
+				(blk->res[r].flags & IORESOURCE_IRQ_SOC_SUBDEVICE)) {

WARNING: line over 80 characters
#103: FILE: drivers/mfd/soc-core.c:83:
+			res[r].start = base + SIGNED_SHIFT(blk->res[r].start, relative_addr_shift);

WARNING: line over 80 characters
#104: FILE: drivers/mfd/soc-core.c:84:
+			res[r].end = base + SIGNED_SHIFT(blk->res[r].end, relative_addr_shift);

ERROR: Missing Signed-off-by: line(s)

total: 2 errors, 9 warnings, 145 lines checked
Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
- - - -

There is a false positive though:

	if (...) {
		single_stmt;
	} else {
		one;
		two;
	}

^^^ is perfectly OK and preferred, IIRC. checkpatch isn't ideal, but
it's mostly good.

> More to the point, who should I be submitting them to? the files under
> arm/ are obviously for RMK to peruse, but I couldnt find an entry for
> drivers/mfd in MAINTAINERS...

Well, I don't know about drivers/mfd/*. Probably there simply isn't any
[official] maintainer, thus lkml is the right place.

There is one not so obvious thing though: you should not submit patches
with To/Cc'ing lkml (open list) and linux-arm-kernel (subscribers-only).
Russell King will probably point to the linux-arm-kernel etiquette
article (http://www.arm.linux.org.uk/mailinglists/etiquette.php
"Cross-posting between linux-arm* lists and other lists.")

So, either place linux-arm-kernel into Bcc:, or duplicate stuff for lkml
and linux-arm-kernel separately, thus they'll not see each others'
To/Cc.

Looking forward to your patches!

-- 
Anton Vorontsov
email: [EMAIL PROTECTED]
backup email: [EMAIL PROTECTED]
irc://irc.freenode.net/bd2
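For the first checkpatch warning, the macro can be brought under 80 columns with a line continuation, without changing what it does. A small stand-alone sketch; scale_offset() is a made-up wrapper for illustration and is not part of the patch:

```c
/* Shift left for a non-negative count, right for a negative one. */
#define SIGNED_SHIFT(val, shift) \
	((shift) >= 0 ? ((val) << (shift)) : ((val) >> -(shift)))

/* Hypothetical helper: scale a sub-device offset by a signed shift. */
unsigned long scale_offset(unsigned long offset, int relative_addr_shift)
{
	return SIGNED_SHIFT(offset, relative_addr_shift);
}
```

Only the branch matching the sign of the shift is evaluated, so a negative count never reaches the left-shift operator.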
Re: [UPDATED PATCH] Support for Toshiba TMIO multifunction devices
On Thu, Nov 22, 2007 at 12:34:09AM +, ian wrote:
> +void mfd_free_devices(struct platform_device *devices, int nr_devs)
> +{
> +	struct platform_device *dev = devices;
> +	int i;
> +
> +	for (i = 0; i < nr_devs; i++) {
> +		struct resource *res = dev->resource;
> +		platform_device_unregister(dev++);
> +		kfree(res);
> +	}
> +	kfree(devices);
> +}
> +EXPORT_SYMBOL_GPL(mfd_free_devices);

Unfortunately, this is broken as designed (in fact this whole file is.)
I'm not sure why people just don't get it. sysfs. devices. device
tree. It has object lifetime rules.

You can _not_ go around unregistering things and then immediately
freeing them - something else might _still_ be using stuff even after
the call to unregister returns. It's a potential OOPS just waiting to
happen.

That's why we have a proper management API for platform devices. Please
use it, I didn't add the code just for fun.

See platform_device_alloc() + platform_device_add_resources() +
platform_device_add_data() + platform_device_add() to create, and
platform_device_unregister() to destroy.

(Not looked at the rest because you really really need to get this
right first.)
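The lifetime rule can be shown with a small user-space model. This is only a sketch of reference counting with a release callback, not kernel code: struct obj, obj_get(), obj_put() and the objects_released counter are invented for the illustration. The point it makes is that the owner dropping its reference ("unregister") does not mean the memory may be freed, because another holder may still have a reference; the free belongs in the release callback.

```c
#include <stdlib.h>

struct obj {
	int refcount;
	void (*release)(struct obj *);
};

/* Demonstration only: counts how many objects have really been freed. */
int objects_released;

static void obj_release(struct obj *o)
{
	objects_released++;
	free(o);
}

/* Allocate an object holding one reference, owned by the caller. */
struct obj *obj_alloc(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (o == NULL)
		return NULL;
	o->refcount = 1;
	o->release = obj_release;
	return o;
}

/* Take an extra reference, e.g. on behalf of another user of the object. */
void obj_get(struct obj *o)
{
	o->refcount++;
}

/* Drop a reference; the last one out runs the release callback. */
void obj_put(struct obj *o)
{
	if (--o->refcount == 0)
		o->release(o);
}
```

In this model, freeing the object right after the owner's obj_put() would pull it out from under the second holder, which is the same shape of bug as calling kfree() straight after platform_device_unregister().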
Re: [stable] null pointer dereference during restart autofs (was: Linux 2.6.22.12)
On Wed, Nov 21, 2007 at 04:45:10AM +0100, Tomasz Kłoczko wrote:
>
> BUG: unable to handle kernel NULL pointer dereference at virtual address
> 0014

Did this happen with older versions of 2.6.22.y?

Have you asked the autofs people about this?

thanks,

greg k-h
Re: [PATCH 2/9]: Reduce Log I/O latency
David Chinner <[EMAIL PROTECTED]> writes:

> To ensure that log I/O is issued as the highest priority I/O, set
> the I/O priority of the log I/O to the highest possible. This will
> ensure that log I/O is not held up behind bulk data or other
> metadata I/O as delaying log I/O can pause the entire transaction
> subsystem. Introduce a new buffer flag to allow us to tag the log
> buffers so we can discriminate when issuing the I/O.

Won't that possibly disturb other RT priority users that do not need
log IO (e.g. working on preallocated files)? Seems a little
dangerous.

I suspect you want a "higher than bulk but lower than RT" priority
for this really unless there is any block RT priority task waiting
for log IO (but keeping track of the latter might be tricky)

-Andi
Re: [PATCH 1/2] sata_nv: don't use legacy DMA in ADMA mode
Robert Hancock wrote:
> Tejun Heo wrote:
>> Tejun Heo wrote:
>>> If so, can you please add that switching into register mode is okay as
>>> long as there's no other ADMA commands in flight and add
>>> WARN_ON((qc->flags & ATA_QCFLAG_RESULT_TF) && link->sactive)?
>>
>> More accurately, link->sactive test can be substituted with
>> (ap->qc_allocated & ~(1 << qc->tag)).
>
> Unfortunately we only get the ata_port and ata_taskfile in the tf_read
> callback, so I'm not sure if we can do the equivalent of the qc->flags &
> ATA_QCFLAG_RESULT_TF test (i.e. distinguishing between the
> error-handling case where we care if we abort outstanding commands and
> the normal case with a RESULT_TF command where we do)..

You can test it in ->qc_issue(), no?

-- 
tejun
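The bitmap test suggested above reads as "is any tag other than this command's own still allocated". A stand-alone sketch of just that expression, with the port's allocation bitmap modelled as a plain unsigned int and the function name chosen for illustration:

```c
/* Non-zero if any tag other than 'tag' is set in the allocation bitmap. */
int other_commands_allocated(unsigned int qc_allocated, unsigned int tag)
{
	return (qc_allocated & ~(1U << tag)) != 0;
}
```

For example, with tags 0 and 2 allocated (bitmap 0x5), the command holding tag 0 sees another command outstanding, while a bitmap of 0x4 seen from tag 2 does not.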
[PATCH 9/9] Clean up open coded inode dirty checks
Use xfs_inode_clean() in more places.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/xfs_inode.c      |   27 +--
 fs/xfs/xfs_inode_item.h |    8
 fs/xfs/xfs_vnodeops.c   |    4 +---
 3 files changed, 14 insertions(+), 25 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c	2007-11-22 10:33:57.728849000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c	2007-11-22 10:33:59.692597965 +1100
@@ -2158,13 +2158,6 @@ xfs_iunlink_remove(
 	return 0;
 }
 
-STATIC_INLINE int xfs_inode_clean(xfs_inode_t *ip)
-{
-	return (((ip->i_itemp == NULL) ||
-		!(ip->i_itemp->ili_format.ilf_fields & XFS_ILOG_ALL)) &&
-		(ip->i_update_core == 0));
-}
-
 /* lookup all the inodes in the cluster */
 STATIC int
 xfs_icluster_lookup(
@@ -3067,7 +3060,6 @@ xfs_iflush_cluster(
 	int			ilist_size;
 	xfs_inode_t		**ilist;
 	xfs_inode_t		*iq;
-	xfs_inode_log_item_t	*iip;
 	int			nr_found;
 	int			clcount = 0;
 	int			bufwasdelwri;
@@ -3094,13 +3086,8 @@ xfs_iflush_cluster(
 		 * is a candidate for flushing. These checks will be repeated
 		 * later after the appropriate locks are acquired.
 		 */
-		iip = iq->i_itemp;
-		if ((iq->i_update_core == 0) &&
-		    ((iip == NULL) ||
-		     !(iip->ili_format.ilf_fields & XFS_ILOG_ALL)) &&
-		      xfs_ipincount(iq) == 0) {
+		if (xfs_inode_clean(iq) && xfs_ipincount(iq) == 0)
 			continue;
-		}
 
 		/*
 		 * Try to get locks. If any are unavailable or it is pinned,
@@ -3123,10 +3110,8 @@ xfs_iflush_cluster(
 		 * arriving here means that this inode can be flushed. First
 		 * re-check that it's dirty before flushing.
 		 */
-		iip = iq->i_itemp;
-		if ((iq->i_update_core != 0) || ((iip != NULL) &&
-		     (iip->ili_format.ilf_fields & XFS_ILOG_ALL))) {
-			int error;
+		if (!xfs_inode_clean(iq)) {
+			int	error;
 			error = xfs_iflush_int(iq, bp);
 			if (error) {
 				xfs_iunlock(iq, XFS_ILOCK_SHARED);
@@ -3230,8 +3215,7 @@ xfs_iflush(
 	 * If the inode isn't dirty, then just release the inode
 	 * flush lock and do nothing.
 	 */
-	if ((ip->i_update_core == 0) &&
-	    ((iip == NULL) || !(iip->ili_format.ilf_fields & XFS_ILOG_ALL))) {
+	if (xfs_inode_clean(ip)) {
 		ASSERT((iip != NULL) ?
 			 !(iip->ili_item.li_flags & XFS_LI_IN_AIL) : 1);
 		xfs_ifunlock(ip);
@@ -3398,8 +3382,7 @@ xfs_iflush_int(
 	 * If the inode isn't dirty, then just release the inode
 	 * flush lock and do nothing.
 	 */
-	if ((ip->i_update_core == 0) &&
-	    ((iip == NULL) || !(iip->ili_format.ilf_fields & XFS_ILOG_ALL))) {
+	if (xfs_inode_clean(ip)) {
 		xfs_ifunlock(ip);
 		return 0;
 	}
Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c	2007-11-22 10:33:57.732848488 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c	2007-11-22 10:33:59.696597454 +1100
@@ -3532,7 +3532,6 @@ xfs_inode_flush(
 	int		flags)
 {
 	xfs_mount_t	*mp = ip->i_mount;
-	xfs_inode_log_item_t *iip = ip->i_itemp;
 	int		error = 0;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
@@ -3542,8 +3541,7 @@ xfs_inode_flush(
 	 * Bypass inodes which have already been cleaned by
 	 * the inode flush clustering code inside xfs_iflush
 	 */
-	if ((ip->i_update_core == 0) &&
-	    ((iip == NULL) || !(iip->ili_format.ilf_fields & XFS_ILOG_ALL)))
+	if (xfs_inode_clean(ip))
 		return 0;
 
 	/*
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode_item.h
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode_item.h	2007-11-22 10:25:23.286572511 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode_item.h	2007-11-22 10:33:59.696597454 +1100
@@ -168,6 +168,14 @@ static inline int xfs_ilog_fext(int w)
 	return (w == XFS_DATA_FORK ? XFS_ILOG_DEXT : XFS_ILOG_AEXT);
 }
 
+STATIC_INLINE int xfs_inode_clean(xfs_inode_t *ip)
+{
+	return (((ip->i_itemp == NULL) ||
+		!(ip->i_itemp->ili_format.ilf_fields & XFS_ILOG_ALL)) &&
+		(ip->i_update_core == 0));
+}
+
+
 #ifdef __KERNEL__
 
 extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
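The patch replaces both "is clean" and "is dirty" open-coded tests with the one helper, which relies on the dirty test being the exact negation of the clean test. A user-space sketch that checks this over every combination; the structures are cut-down stand-ins for the XFS ones and ILOG_ALL is a placeholder value, not the real XFS_ILOG_ALL:

```c
#include <stddef.h>

#define ILOG_ALL 0x7	/* placeholder mask, not the real XFS_ILOG_ALL */

struct log_item {
	unsigned int fields;
};

struct inode {
	struct log_item *itemp;
	int update_core;
};

/* Same shape as xfs_inode_clean() in the patch. */
int inode_clean(const struct inode *ip)
{
	return (ip->itemp == NULL || !(ip->itemp->fields & ILOG_ALL)) &&
	       ip->update_core == 0;
}

/* Same shape as the open-coded dirty test removed from xfs_iflush_cluster(). */
int inode_dirty_open_coded(const struct inode *ip)
{
	return ip->update_core != 0 ||
	       (ip->itemp != NULL && (ip->itemp->fields & ILOG_ALL));
}

/* Try every combination; returns 1 if dirty == !clean in all of them. */
int forms_agree(void)
{
	struct log_item clean_item = { 0 }, dirty_item = { 0x1 };
	struct log_item *items[3] = { NULL, &clean_item, &dirty_item };
	int i, core;

	for (i = 0; i < 3; i++) {
		for (core = 0; core < 2; core++) {
			struct inode ip = { items[i], core };

			if (inode_dirty_open_coded(&ip) != !inode_clean(&ip))
				return 0;
		}
	}
	return 1;
}
```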
[PATCH 8/9] Convert inode cache locking to RCU
Use RCU locking on the inode radix trees

To make use of the efficient radix tree gang lookups for inode cluster
operations we had to increase the time we hold the radix tree read lock
for. This will affect performance somewhat.

Given that all the lookups are done on a radix tree and we already have
mechanisms to determine if an inode is valid or not during lookup, we
can pretty easily move this across to lockless lookups using RCU.

The wrinkle is that the current read lock is used to synchronise inode
reclaim and lookup. Luckily, we have the inode flags lock which is used
in the same places as we need for this synchronisation and hence the
code can be easily changed to use this lock for reclaim/lookup
synchronisation.

Also, we can avoid growing the xfs_inode structure to place the rcuhead
structure for the call_rcu() on inode destruction by reusing the
reclaim list listhead structure. We can safely do this because the
inode has been removed from the reclaim list before the reclaim code
calls xfs_idestroy(). This is effectively the same trick as used in the
dentry cache to avoid growing the dentry structure.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/xfs_ag.h       |    2
 fs/xfs/xfs_iget.c     |  107 ++
 fs/xfs/xfs_inode.c    |   47 -
 fs/xfs/xfs_inode.h    |   14 +-
 fs/xfs/xfs_mount.c    |    2
 fs/xfs/xfs_vnodeops.c |    8 ---
 6 files changed, 108 insertions(+), 72 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-22 10:33:53.993326524 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c	2007-11-22 10:33:57.724849511 +1100
@@ -40,6 +40,37 @@
 #include "xfs_utils.h"
 
 /*
+ * Attempt to move the inode out of the IRECLAIMABLE state.
+ * Must be called under rcu_read_lock() and with the ip->i_flags_lock
+ * held for synchronisation with xfs_ireclaim_finish().
+ */
+STATIC int
+xfs_iget_reclaim_check(
+	xfs_inode_t	*ip,
+	int		flags)
+{
+	/*
+	 * If IRECLAIM is set this inode is on its way out of the system, we
+	 * need to pause and try again.
+	 */
+	if (__xfs_iflags_test(ip, XFS_IRECLAIM))
+		return EAGAIN;
+	ASSERT(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+
+	/*
+	 * If lookup is racing with unlink, then we should return an error
+	 * immediately so we don't remove it from the reclaim list and
+	 * potentially leak the inode.
+	 */
+
+	if ((ip->i_d.di_mode == 0) && !(flags & XFS_IGET_CREATE))
+		return ENOENT;
+
+	__xfs_iflags_clear(ip, XFS_IRECLAIMABLE);
+	return 0;
+}
+
+/*
  * Look up an inode by number in the given file system.
  * The inode is looked up in the cache held in each AG.
  * If the inode is found in the cache, attach it to the provided
@@ -94,7 +125,7 @@ xfs_iget_core(
 	agino = XFS_INO_TO_AGINO(mp, ino);
 
 again:
-	read_lock(&pag->pag_ici_lock);
+	rcu_read_lock();
 	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
 
 	if (ip != NULL) {
@@ -103,52 +134,44 @@ again:
 		 * we need to pause and try again.
 		 */
 		if (xfs_iflags_test(ip, XFS_INEW)) {
-			read_unlock(&pag->pag_ici_lock);
+			rcu_read_unlock();
 			delay(1);
 			XFS_STATS_INC(xs_ig_frecycle);
 
 			goto again;
 		}
 
+		/*
+		 * Determine if the inode is queued for reclaim or being
+		 * reclaimed. This is trickier now we are under RCU locking.
+		 *
+		 * Basically, xfs_ireclaim_finish() uses the i_flags_lock to
+		 * atomically move the inode out of the IRECLAIMABLE state and
+		 * into the IRECLAIM state, so we have to use the same lock to
+		 * do an equivalent set of tests and move the inode out of the
+		 * IRECLAIMABLE state.
+		 */
 		old_inode = ip->i_vnode;
 		if (old_inode == NULL) {
-			/*
-			 * If IRECLAIM is set this inode is
-			 * on its way out of the system,
-			 * we need to pause and try again.
-			 */
-			if (xfs_iflags_test(ip, XFS_IRECLAIM)) {
-				read_unlock(&pag->pag_ici_lock);
-				delay(1);
+			spin_lock(&ip->i_flags_lock);
+			error = xfs_iget_reclaim_check(ip, flags);
+			spin_unlock(&ip->i_flags_lock);
+			rcu_read_unlock();
+			if (error) {
 				XFS_STATS_INC(xs_ig_frecycle);
-
-				goto again;
-
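The three outcomes of xfs_iget_reclaim_check() can be tried out in user space with a cut-down model. This is a sketch only: the flag values and the inode_model structure are placeholders, and the locking the real function relies on (rcu_read_lock() plus the i_flags_lock) is deliberately left out.

```c
#include <errno.h>

/* Placeholder values, not the real XFS flag definitions. */
#define IRECLAIM	0x1
#define IRECLAIMABLE	0x2
#define IGET_CREATE	0x1

struct inode_model {
	unsigned int flags;	/* IRECLAIM / IRECLAIMABLE state bits */
	unsigned int mode;	/* 0 models an unlinked inode */
};

/*
 * Mirrors the decision order in the patch:
 *   being reclaimed          -> EAGAIN (caller retries the lookup)
 *   unlinked and not create  -> ENOENT (stays on the reclaim list)
 *   otherwise                -> clear IRECLAIMABLE and return 0
 */
int iget_reclaim_check(struct inode_model *ip, int iget_flags)
{
	if (ip->flags & IRECLAIM)
		return EAGAIN;
	if (ip->mode == 0 && !(iget_flags & IGET_CREATE))
		return ENOENT;
	ip->flags &= ~IRECLAIMABLE;
	return 0;
}
```

Note that on the ENOENT path the IRECLAIMABLE bit is left set, which is what keeps the inode on the reclaim list instead of leaking it.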
[PATCH 7/9] Use radix_tree_gang_lookup_range for cluster lookups
Use radix_tree_gang_lookup_range() for inode cluster lookups Now that we have an efficent lookup method for the radix tree, convert cluster lookups to use it. Factor out the common lookup, add some debug checking to it and call it where needed. For sanity, we need to hold the radix tree lock in read mode across the entire set of locking operations done to ensure we can operate on the inodes. This does increase the length of time we hold the lock in read mode, but we'll correct that with another patch. Signed-off-by: Dave Chinner <[EMAIL PROTECTED]> --- fs/xfs/xfs_inode.c | 83 +++-- 1 file changed, 56 insertions(+), 27 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c === --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2007-11-22 10:33:53.993326524 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c2007-11-22 10:33:55.877085717 +1100 @@ -2165,6 +2165,37 @@ STATIC_INLINE int xfs_inode_clean(xfs_in (ip->i_update_core == 0)); } +/* lookup all the inodes in the cluster */ +STATIC int +xfs_icluster_lookup( + xfs_mount_t *mp, + xfs_perag_t *pag, + xfs_ino_t ino, + xfs_inode_t **ilist, + int clsize) +{ + unsigned long first_index, last_index, mask; + int nr_found; + + mask = ~(clsize - 1); + first_index = XFS_INO_TO_AGINO(mp, ino) & mask; + last_index = (XFS_INO_TO_AGINO(mp, ino + clsize) & mask) - 1; + nr_found = radix_tree_gang_lookup_range(>pag_ici_root, + (void**)ilist, first_index, last_index, + clsize); + ASSERT(nr_found <= clsize); +#ifdef DEBUG +{ int i; + xfs_inode_t *iq; + for (i = 0; i < nr_found; i++) { + iq = ilist[i]; + ASSERT((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index); + } +} +#endif + return nr_found; +} + STATIC void xfs_ifree_cluster( xfs_inode_t *free_ip, @@ -2178,7 +2209,7 @@ xfs_ifree_cluster( int i, j, found, pre_flushed; xfs_daddr_t blkno; xfs_buf_t *bp; - xfs_inode_t *ip, **ip_found; + xfs_inode_t *ip, **ip_found, **ilist; xfs_inode_log_item_t*iip; xfs_log_item_t *lip; xfs_perag_t *pag = xfs_get_perag(mp, inum); @@ -2195,8 +2226,10 @@ 
xfs_ifree_cluster( } ip_found = kmem_alloc(ninodes * sizeof(xfs_inode_t *), KM_NOFS); + ilist = kmem_alloc(ninodes * sizeof(xfs_inode_t *), KM_NOFS); for (j = 0; j < nbufs; j++, inum += ninodes) { + int nr_found; blkno = XFS_AGB_TO_DADDR(mp, XFS_INO_TO_AGNO(mp, inum), XFS_INO_TO_AGBNO(mp, inum)); @@ -2213,24 +2246,23 @@ xfs_ifree_cluster( * and fail, we need some other form of interlock * here. */ + read_lock(>pag_ici_lock); + nr_found = xfs_icluster_lookup(mp, pag, inum, ilist, ninodes); + if (nr_found == 0) { + read_unlock(>pag_ici_lock); + continue; + } found = 0; - for (i = 0; i < ninodes; i++) { - read_lock(>pag_ici_lock); - ip = radix_tree_lookup(>pag_ici_root, - XFS_INO_TO_AGINO(mp, (inum + i))); + for (i = 0; i < nr_found; i++) { + ip = ilist[i]; /* Inode not in memory or we found it already, * nothing to do */ - if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) { - read_unlock(>pag_ici_lock); + if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) continue; - } - - if (xfs_inode_clean(ip)) { - read_unlock(>pag_ici_lock); + if (xfs_inode_clean(ip)) continue; - } /* If we can get the locks then add it to the * list, otherwise by the time we get the bp lock @@ -2251,7 +2283,6 @@ xfs_ifree_cluster( ip_found[found++] = ip; } } - read_unlock(>pag_ici_lock); continue; } @@ -2269,8 +2300,8 @@ xfs_ifree_cluster( xfs_iunlock(ip, XFS_ILOCK_EXCL); }
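The index arithmetic in xfs_icluster_lookup() can be sketched in isolation. This is only an illustration: `cluster_range` and `cluster_index_range()` are hypothetical names, plain integers stand in for AG inode numbers, and the last index is computed as `first + clsize - 1`, which should match the patch's `((agino + clsize) & mask) - 1` as long as the cluster size is a power of two and the addition does not cross an AG boundary.

```c
#include <assert.h>

/* Hypothetical helper type: inclusive index range of one inode cluster. */
struct cluster_range {
	unsigned long first;
	unsigned long last;
};

/*
 * Return the radix tree index range covered by the cluster that
 * contains agino. clsize is the number of inodes per cluster.
 */
static struct cluster_range cluster_index_range(unsigned long agino,
						unsigned long clsize)
{
	struct cluster_range r;
	unsigned long mask;

	/* the mask trick only works for power-of-two cluster sizes */
	assert(clsize != 0 && (clsize & (clsize - 1)) == 0);

	mask = ~(clsize - 1);
	r.first = agino & mask;		/* round down to the cluster start */
	r.last = r.first + clsize - 1;	/* last index still in the cluster */
	return r;
}
```

With a 64-inode cluster, index 70 falls in the range 64..127, so a range-limited lookup would stop at 127 instead of walking on into later, sparsely populated clusters.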
[PATCH 6/9] Remove xfs_icluster
Remove the xfs_icluster structure and replace with a radix tree lookup. We don't need to keep a list of inodes in each cluster around anymore as we can look them up quickly when we need to. The only time we need to do this now is during inode writeback. Factor the inode cluster writeback code out of xfs_iflush and convert it to use radix_tree_gang_lookup() instead of walking a list of inodes built when we first read in the inodes. This remove 3 pointers from each xfs_inode structure and the xfs_icluster structure per inode cluster. Hence we reduce the cache footprint of the xfs_inodes by between 5-10% depending on cluster sparseness. To be truly efficient we need a radix_tree_gang_lookup_range() call to stop searching once we are past the end of the cluster instead of trying to find a full cluster's worth of inodes. Before (ia64): $ cat /sys/slab/xfs_inode/object_size 536 After: $ cat /sys/slab/xfs_inode/object_size 512 Signed-off-by: Dave Chinner <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_ksyms.c |1 fs/xfs/xfs_iget.c| 49 --- fs/xfs/xfs_inode.c | 266 --- fs/xfs/xfs_inode.h | 16 -- fs/xfs/xfs_vfsops.c |5 fs/xfs/xfsidbg.c |4 6 files changed, 154 insertions(+), 187 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c === --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c2007-11-22 10:25:24.178458638 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-22 10:33:53.993326524 +1100 @@ -78,7 +78,6 @@ xfs_iget_core( xfs_inode_t *ip; xfs_inode_t *iq; int error; - xfs_icluster_t *icl, *new_icl = NULL; unsigned long first_index, mask; xfs_perag_t *pag; xfs_agino_t agino; @@ -229,11 +228,9 @@ finish_inode: } /* -* This is a bit messy - we preallocate everything we _might_ -* need before we pick up the ici lock. That way we don't have to -* juggle locks and go all the way back to the start. +* Preload the radix tree so we can insert safely under the +* write spinlock. 
*/ - new_icl = kmem_zone_alloc(xfs_icluster_zone, KM_SLEEP); if (radix_tree_preload(GFP_KERNEL)) { delay(1); goto again; @@ -241,17 +238,6 @@ finish_inode: mask = ~(((XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog)) - 1); first_index = agino & mask; write_lock(>pag_ici_lock); - - /* -* Find the cluster if it exists -*/ - icl = NULL; - if (radix_tree_gang_lookup(>pag_ici_root, (void**), - first_index, 1)) { - if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index) - icl = iq->i_cluster; - } - /* * insert the new inode */ @@ -266,30 +252,13 @@ finish_inode: } /* -* These values _must_ be set before releasing ihlock! +* These values _must_ be set before releasing the radix tree lock! */ ip->i_udquot = ip->i_gdquot = NULL; xfs_iflags_set(ip, XFS_INEW); - ASSERT(ip->i_cluster == NULL); - - if (!icl) { - spin_lock_init(_icl->icl_lock); - INIT_HLIST_HEAD(_icl->icl_inodes); - icl = new_icl; - new_icl = NULL; - } else { - ASSERT(!hlist_empty(>icl_inodes)); - } - spin_lock(>icl_lock); - hlist_add_head(>i_cnode, >icl_inodes); - ip->i_cluster = icl; - spin_unlock(>icl_lock); - write_unlock(>pag_ici_lock); radix_tree_preload_end(); - if (new_icl) - kmem_zone_free(xfs_icluster_zone, new_icl); /* * Link ip to its mount and thread it on the mount's inode list. @@ -528,18 +497,6 @@ xfs_iextract( xfs_put_perag(mp, pag); /* -* Remove from cluster list -*/ - mp = ip->i_mount; - spin_lock(>i_cluster->icl_lock); - hlist_del(>i_cnode); - spin_unlock(>i_cluster->icl_lock); - - /* was last inode in cluster? */ - if (hlist_empty(>i_cluster->icl_inodes)) - kmem_zone_free(xfs_icluster_zone, ip->i_cluster); - - /* * Remove from mount's inode list. 
*/ XFS_MOUNT_ILOCK(mp); Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c === --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2007-11-22 10:33:51.037704348 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c2007-11-22 10:33:53.993326524 +1100 @@ -53,7 +53,6 @@ kmem_zone_t *xfs_ifork_zone; kmem_zone_t *xfs_inode_zone; -kmem_zone_t *xfs_icluster_zone; /* * Used in xfs_itruncate(). This is the maximum number of extents @@ -3014,6 +3013,151 @@ xfs_iflush_fork(
[PATCH 5/9] Don't block pdflush when flushing inodes
When pdflush is writing back inodes, it can get stuck on inode cluster buffers that are currently under I/O. This occurs when we write data to multiple inodes in the same inode cluster at the same time. Effectively, delayed allocation marks the inode dirty during the data writeback. Hence if the inode cluster was flushed during the writeback of the first inode, the writeback of the second inode will block waiting for the inode cluster write to complete before writing it again for the newly dirtied inode. Basically, we want to avoid this from happening so we don't block pdflush and slow down all of writeback. Hence we introduce a non-blocking async inode flush flag that pdflush uses. If this flag is set, we use non-blocking operations (e.g. try locks) where-ever we can to avoid blocking or extra I/O being issued. Signed-off-by: Dave Chinner <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_super.c |3 + fs/xfs/linux-2.6/xfs_vnode.h |5 -- fs/xfs/xfs_inode.c | 82 +-- fs/xfs/xfs_inode.h |8 +++- fs/xfs/xfs_trans_buf.c |3 + fs/xfs/xfs_vnodeops.c| 55 ++-- 6 files changed, 79 insertions(+), 77 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c === --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2007-11-22 10:33:43.014729931 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c2007-11-22 10:33:51.037704348 +1100 @@ -183,12 +183,20 @@ xfs_imap_to_bp( int ni; xfs_buf_t *bp; + if (buf_flags == 0) + buf_flags = XFS_BUF_LOCK; + error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno, - (int)imap->im_len, XFS_BUF_LOCK, ); + (int)imap->im_len, buf_flags, ); if (error) { - cmn_err(CE_WARN, "xfs_imap_to_bp: xfs_trans_read_buf()returned " + if (error != EAGAIN) { + cmn_err(CE_WARN, + "xfs_imap_to_bp: xfs_trans_read_buf()returned " "an error %d on %s. Returning error.", error, mp->m_fsname); + } else { + ASSERT(buf_flags & XFS_BUF_TRYLOCK); + } return error; } @@ -306,14 +314,15 @@ xfs_inotobp( * 0 for the disk block address. 
*/ int -xfs_itobp( +xfs_itobp_flags( xfs_mount_t *mp, xfs_trans_t *tp, xfs_inode_t *ip, xfs_dinode_t**dipp, xfs_buf_t **bpp, xfs_daddr_t bno, - uintimap_flags) + uintimap_flags, + uintbuf_flags) { xfs_imap_t imap; xfs_buf_t *bp; @@ -344,10 +353,17 @@ xfs_itobp( } ASSERT(bno == 0 || bno == imap.im_blkno); - error = xfs_imap_to_bp(mp, tp, , , XFS_BUF_LOCK, imap_flags); + error = xfs_imap_to_bp(mp, tp, , , buf_flags, imap_flags); if (error) return error; + if (!bp) { + ASSERT(buf_flags & XFS_BUF_TRYLOCK); + ASSERT(tp == NULL); + *bpp = NULL; + return EAGAIN; + } + *dipp = (xfs_dinode_t *)xfs_buf_offset(bp, imap.im_boffset); *bpp = bp; return 0; @@ -3023,6 +3039,7 @@ xfs_iflush( int bufwasdelwri; struct hlist_node *entry; enum { INT_DELWRI = (1 << 0), INT_ASYNC = (1 << 1) }; + int noblock = (flags == XFS_IFLUSH_ASYNC_NOBLOCK); XFS_STATS_INC(xs_iflush_count); @@ -3047,11 +3064,22 @@ xfs_iflush( } /* -* We can't flush the inode until it is unpinned, so -* wait for it. We know noone new can pin it, because -* we are holding the inode lock shared and you need -* to hold it exclusively to pin the inode. +* We can't flush the inode until it is unpinned, so wait for it if we +* are allowed to block. We know noone new can pin it, because we are +* holding the inode lock shared and you need to hold it exclusively to +* pin the inode. +* +* If we are not allowed to block, force the log out asynchronously so +* that when we come back the inode will be unpinned. If other inodes +* in the same cluster are dirty, they will probably write the inode +* out for us if they occur after the log force completes. */ + + if (noblock && xfs_ipincount(ip)) { + xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE); + xfs_ifunlock(ip); + return EAGAIN; + } xfs_iunpin_wait(ip); /* @@ -3068,15 +3096,6 @@ xfs_iflush( } /* -* Get the buffer containing the on-disk inode. -*/ -
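The non-blocking flush idea described in this patch (try the lock, and return EAGAIN rather than sleep) can be sketched in userspace. This is not the XFS code: `fake_buf` and `flush_buf()` are hypothetical, and a C11 `atomic_flag` stands in for the buffer lock. The positive errno return follows the convention visible in the patch.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an inode cluster buffer and its lock. */
struct fake_buf {
	atomic_flag locked;
	int flushes;
};

/*
 * Flush the buffer under its lock. If noblock is set and the lock is
 * already held (for example because the buffer is under I/O), give up
 * with EAGAIN so the caller can move on and retry on a later pass.
 */
static int flush_buf(struct fake_buf *bp, int noblock)
{
	while (atomic_flag_test_and_set(&bp->locked)) {
		if (noblock)
			return EAGAIN;
		/* blocking caller: keep spinning until the lock is free */
	}

	bp->flushes++;			/* stands in for issuing the write */
	atomic_flag_clear(&bp->locked);
	return 0;
}
```

A pdflush-like caller would pass noblock = 1 and treat EAGAIN as "skip this inode for now", which is the behaviour the changelog asks for.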
[UPDATED PATCH] Support for Toshiba TMIO multifunction devices
On Wed, 2007-11-21 at 12:05 +0800, eric miao wrote:
> On Nov 21, 2007 11:54 AM, ian <[EMAIL PROTECTED]> wrote:
> > On Wed, 2007-11-21 at 10:23 +0800, eric miao wrote:
> > > Roughly went through the patch, looks good, here comes the remind, though
> > > :-)
> > >
> > > 1. is it possible to use some name other than "soc_core", maybe
> > > "tmio_core" so that other multifunction chips sharing a core base
> > > will live easier.
> >
> > It's (soc-core) not tmio MFD specific - its already used by other MFD
> > chips (although obviously not ones in mainline (yet!)

I've renamed soc-core to mfd-core in the patches attached to this message.

> > > 2. those C++ style comments "//" are not so pleasant...
> >
> > Should I clean them up and resubmit?
>
> Will be nice then, anyway, could you inline them so others can comment?

All done.

> Well, I briefly went through the git history, looks like Russell is the proper
> one you could sent them to (probably not) :-)

I've added RMK to the CC.

I've omitted the platform support for e-series - I'll send that to RMK
once this is merged.

Patches follow:

>From 9c4ffb764ae2366368a0038a6fbdd9a19ce430c4 Mon Sep 17 00:00:00 2001
From: Ian Molton <[EMAIL PROTECTED]>
Date: Wed, 21 Nov 2007 23:32:37 +
Subject: [PATCH] Reuseable MFD core code suitable for multifunction chips with
 built in IRQ multiplexing and local RAM.

---
 drivers/mfd/Kconfig    |   25
 drivers/mfd/Makefile   |    3 +
 drivers/mfd/mfd-core.c |  102
 drivers/mfd/mfd-core.h |   26
 include/linux/ioport.h |    3 +
 5 files changed, 159 insertions(+), 0 deletions(-)
 create mode 100644 drivers/mfd/mfd-core.c
 create mode 100644 drivers/mfd/mfd-core.h

diff --git a/drivers/mfd/Kconfig b/drivers/mfd/Kconfig
index 2571619..38edfdc 100644
--- a/drivers/mfd/Kconfig
+++ b/drivers/mfd/Kconfig
@@ -15,6 +15,31 @@ config MFD_SM501
 	  interface. The device may be connected by PCI or local bus with
 	  varying functions enabled.
 
+config MFD_T7L66XB + bool "Toshiba T7L66XB SoC support" + ---help--- + This driver supports the T7L66XB, which incorporates SD/MMC, and + USB host functionality. associated subdevices are: + tmio_mmc + tmio_ohci + +config MFD_TC6387XB + bool "Toshiba TC6387XB SoC support" + ---help--- + This driver supports the TC6393XB, which incorporates SD/MMC, NAND, + Video, and USB host functionality. associated subdevices are: + tmio_mmc + +config MFD_TC6393XB + bool "Toshiba TC6393XB SoC support" + ---help--- + This driver supports the TC6393XB, which incorporates SD/MMC, NAND, + Video, and USB host functionality. associated subdevices are: + tmio_mmc + tmio_nand + tmio_fb + tmio_ohci + endmenu menu "Multimedia Capabilities Port drivers" diff --git a/drivers/mfd/Makefile b/drivers/mfd/Makefile index 5143209..5ae3877 100644 --- a/drivers/mfd/Makefile +++ b/drivers/mfd/Makefile @@ -3,6 +3,9 @@ # obj-$(CONFIG_MFD_SM501)+= sm501.o +obj-$(CONFIG_MFD_T7L66XB) += t7l66xb.o mfd-core.o +obj-$(CONFIG_MFD_TC6387XB) += tc6387xb.o mfd-core.o +obj-$(CONFIG_MFD_TC6393XB) += tc6393xb.o mfd-core.o obj-$(CONFIG_MCP) += mcp-core.o obj-$(CONFIG_MCP_SA11X0) += mcp-sa11x0.o diff --git a/drivers/mfd/mfd-core.c b/drivers/mfd/mfd-core.c new file mode 100644 index 000..e668c92 --- /dev/null +++ b/drivers/mfd/mfd-core.c @@ -0,0 +1,102 @@ +/* + * drivers/mfd/mfd-core.c + * + * core MFD support + * Copyright (c) 2006 Ian Molton + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ * + */ + +#include +#include +#include +#include +#include "mfd-core.h" + +void mfd_free_devices(struct platform_device *devices, int nr_devs) +{ + struct platform_device *dev = devices; + int i; + + for (i = 0; i < nr_devs; i++) { + struct resource *res = dev->resource; + platform_device_unregister(dev++); + kfree(res); + } + kfree(devices); +} +EXPORT_SYMBOL_GPL(mfd_free_devices); + +#define SIGNED_SHIFT(val, shift) ((shift) >= 0 ? ((val) << (shift)) : ((val) >> -(shift))) + +struct platform_device *mfd_add_devices(struct platform_device *dev, + struct mfd_device_data *mfd, int nr_devs, + struct resource *mem, + int relative_addr_shift, int irq_base) +{ + struct platform_device *devices; + int i, r, base; + + devices = kzalloc(nr_devs * sizeof(struct platform_device), GFP_KERNEL); + if (!devices) + return NULL; + + for (i = 0; i < nr_devs; i++) { +
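The SIGNED_SHIFT() macro in mfd-core.c shifts left for a non-negative shift count and right for a negative one. The loop that uses it is cut off above, so the following is only a guess at the intent: `scale_offset()` is a hypothetical wrapper showing how a signed `relative_addr_shift` might scale a register offset in either direction.

```c
/* Macro as it appears in the patch, split over two lines for width. */
#define SIGNED_SHIFT(val, shift) \
	((shift) >= 0 ? ((val) << (shift)) : ((val) >> -(shift)))

/*
 * Hypothetical wrapper: scale an offset by a signed shift count.
 * Positive counts multiply by a power of two, negative counts divide.
 */
static unsigned long scale_offset(unsigned long offset,
				  int relative_addr_shift)
{
	return SIGNED_SHIFT(offset, relative_addr_shift);
}
```

For example, an offset of 0x100 becomes 0x1000 with a shift of 4 and 0x10 with a shift of -4.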
[PATCH 3/9] Use _META bio I/O types for metadata I/O
Improve metadata I/O merging in the elevator Change all async metadata buffers to use [READ|WRITE]_META I/O types so that the I/O doesn't get issued immediately. This allows merging of adjacent metadata requests but still prioritises them over bulk data. This shows a 10-15% improvement in sequential create speed of small files. Don't include the log buffers in this classification - leave them as sync types so they are issued immediately. Signed-off-by: Dave Chinner <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_buf.c |6 +- include/linux/fs.h |1 + 2 files changed, 6 insertions(+), 1 deletion(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-11-22 10:53:11.556186722 +1100 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c2007-11-22 10:53:43.748024392 +1100 @@ -1175,10 +1175,14 @@ _xfs_buf_ioapply( if (bp->b_flags & XBF_ORDERED) { ASSERT(!(bp->b_flags & XBF_READ)); rw = WRITE_BARRIER; - } else if (bp->b_flags & _XBF_RUN_QUEUES) { + } else if (bp->b_flags & XBF_LOG_BUFFER) { ASSERT(!(bp->b_flags & XBF_READ_AHEAD)); bp->b_flags &= ~_XBF_RUN_QUEUES; rw = (bp->b_flags & XBF_WRITE) ? WRITE_SYNC : READ_SYNC; + } else if (bp->b_flags & _XBF_RUN_QUEUES) { + ASSERT(!(bp->b_flags & XBF_READ_AHEAD)); + bp->b_flags &= ~_XBF_RUN_QUEUES; + rw = (bp->b_flags & XBF_WRITE) ? WRITE_META : READ_META; } else { rw = (bp->b_flags & XBF_WRITE) ? WRITE : (bp->b_flags & XBF_READ_AHEAD) ? 
READA : READ; Index: 2.6.x-xfs-new/include/linux/fs.h === --- 2.6.x-xfs-new.orig/include/linux/fs.h 2007-11-22 10:47:21.965392742 +1100 +++ 2.6.x-xfs-new/include/linux/fs.h2007-11-22 10:53:43.748024392 +1100 @@ -83,6 +83,7 @@ extern int dir_notify_enable; #define READ_SYNC (READ | (1 << BIO_RW_SYNC)) #define READ_META (READ | (1 << BIO_RW_META)) #define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNC)) +#define WRITE_META (WRITE | (1 << BIO_RW_META)) #define WRITE_BARRIER ((1 << BIO_RW) | (1 << BIO_RW_BARRIER)) #define SEL_IN 1 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
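The ordering of the branches in _xfs_buf_ioapply() after this patch can be sketched as a pure function. The names below are hypothetical: the `SK_` flags only mirror the roles of the XBF flags (the bit positions for the log, ordered and read-ahead flags are copied from patch 2/9, the rest are made up), and an enum stands in for the real bio read/write types.

```c
/* Sketch flags; only bits 9, 11 and 12 are taken from the patches. */
enum {
	SK_READ		= 1 << 0,	/* hypothetical value */
	SK_WRITE	= 1 << 1,	/* hypothetical value */
	SK_LOG_BUFFER	= 1 << 9,
	SK_ORDERED	= 1 << 11,
	SK_READ_AHEAD	= 1 << 12,
	SK_RUN_QUEUES	= 1 << 20	/* hypothetical value */
};

/* Stand-ins for the bio I/O types selected by the real code. */
enum io_type {
	IO_READ, IO_READA, IO_WRITE,
	IO_READ_SYNC, IO_WRITE_SYNC,
	IO_READ_META, IO_WRITE_META,
	IO_WRITE_BARRIER
};

static enum io_type pick_io_type(unsigned int flags)
{
	if (flags & SK_ORDERED)
		return IO_WRITE_BARRIER;
	/* log buffers stay sync so they are issued immediately */
	if (flags & SK_LOG_BUFFER)
		return (flags & SK_WRITE) ? IO_WRITE_SYNC : IO_READ_SYNC;
	/* other "run queues" metadata becomes mergeable META I/O */
	if (flags & SK_RUN_QUEUES)
		return (flags & SK_WRITE) ? IO_WRITE_META : IO_READ_META;
	if (flags & SK_WRITE)
		return IO_WRITE;
	return (flags & SK_READ_AHEAD) ? IO_READA : IO_READ;
}
```

The point of the ordering is that a buffer carrying both the log flag and the run-queues flag is still treated as sync I/O, while all other metadata drops to the META type.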
[PATCH 4/9] Factor common inode cluster buffer lookup code
Factor xfs_itobp() and xfs_inotobp(). The only difference between the functions is one passes an inode for the lookup, the other passes an inode number. However, they don't do the same validity checking or set all the same state on the buffer that is returned yet they should. Factor the functions into a common implementation. Signed-off-by: Dave Chinner <[EMAIL PROTECTED]> --- fs/xfs/xfs_inode.c | 283 - 1 file changed, 129 insertions(+), 154 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c === --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2007-11-22 10:31:44.0 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c2007-11-22 10:33:43.014729931 +1100 @@ -124,6 +124,126 @@ xfs_inobp_check( #endif /* + * Simple wrapper for calling xfs_imap() that includes error + * and bounds checking + */ +STATIC int +xfs_ino_to_imap( + xfs_mount_t *mp, + xfs_trans_t *tp, + xfs_ino_t ino, + xfs_imap_t *imap, + uintimap_flags) +{ + int error; + + error = xfs_imap(mp, tp, ino, imap, imap_flags); + if (error) { + cmn_err(CE_WARN, "xfs_ino_to_imap: xfs_imap() returned an " + "error %d on %s. Returning error.", + error, mp->m_fsname); + return error; + } + + /* +* If the inode number maps to a block outside the bounds +* of the file system then return NULL rather than calling +* read_buf and panicing when we get an error from the +* driver. +*/ + if ((imap->im_blkno + imap->im_len) > + XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) { + xfs_fs_cmn_err(CE_ALERT, mp, "xfs_ino_to_imap: " + "(imap->im_blkno (0x%llx) + imap->im_len (0x%llx)) > " + " XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) (0x%llx)", + (unsigned long long) imap->im_blkno, + (unsigned long long) imap->im_len, + XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)); + return XFS_ERROR(EINVAL); + } + return 0; +} + +/* + * Find the buffer associated with the given inode map + * We do basic validation checks on the buffer once it has been + * retrieved from disk. 
+ */ +STATIC int +xfs_imap_to_bp( + xfs_mount_t *mp, + xfs_trans_t *tp, + xfs_imap_t *imap, + xfs_buf_t **bpp, + uintbuf_flags, + uintimap_flags) +{ + int error; + int i; + int ni; + xfs_buf_t *bp; + + error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno, + (int)imap->im_len, XFS_BUF_LOCK, ); + if (error) { + cmn_err(CE_WARN, "xfs_imap_to_bp: xfs_trans_read_buf()returned " + "an error %d on %s. Returning error.", + error, mp->m_fsname); + return error; + } + + /* +* Validate the magic number and version of every inode in the buffer +* (if DEBUG kernel) or the first inode in the buffer, otherwise. +*/ +#ifdef DEBUG + ni = BBTOB(imap->im_len) >> mp->m_sb.sb_inodelog; +#else /* usual case */ + ni = 1; +#endif + + for (i = 0; i < ni; i++) { + int di_ok; + xfs_dinode_t*dip; + + dip = (xfs_dinode_t *)xfs_buf_offset(bp, + (i << mp->m_sb.sb_inodelog)); + di_ok = be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC && + XFS_DINODE_GOOD_VERSION(dip->di_core.di_version); + if (unlikely(XFS_TEST_ERROR(!di_ok, mp, + XFS_ERRTAG_ITOBP_INOTOBP, + XFS_RANDOM_ITOBP_INOTOBP))) { + if (imap_flags & XFS_IMAP_BULKSTAT) { + xfs_trans_brelse(tp, bp); + return XFS_ERROR(EINVAL); + } + XFS_CORRUPTION_ERROR("xfs_imap_to_bp", + XFS_ERRLEVEL_HIGH, mp, dip); +#ifdef DEBUG + cmn_err(CE_PANIC, + "Device %s - bad inode magic/vsn " + "daddr %lld #%d (magic=%x)", + XFS_BUFTARG_NAME(mp->m_ddev_targp), + (unsigned long long)imap->im_blkno, i, + be16_to_cpu(dip->di_core.di_magic)); +#endif + xfs_trans_brelse(tp, bp); + return XFS_ERROR(EFSCORRUPTED); + } +
tun device supplementary group ownership
Hi,

It seems to me that supplementary groups should be taken into account
when checking for permissions on a tun device. Can someone comment on my
patch below; is it a reasonable approach? If so, I'd like to submit it
for inclusion in the kernel under the GPL. Please forward any responses
to me directly in addition to the lkml.

Mike

--- tun.c	2007-11-16 10:14:27.0 -0800
+++ tun.c.new	2007-11-21 16:12:15.0 -0800
@@ -471,7 +471,8 @@
 			if (((tun->owner != -1 &&
 			      current->euid != tun->owner) ||
 			     (tun->group != -1 &&
-			      current->egid != tun->group)) &&
+			      (current->egid != tun->group &&
+			       !groups_search(current->group_info, tun->group &&
 			     !capable(CAP_NET_ADMIN))
 				return -EPERM;
 		}
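The intended check, as I read the proposal, can be sketched in userspace C. Everything here is hypothetical: `cred_sketch` stands in for the task credentials, `in_group()` for the egid test combined with groups_search(), and `net_admin` for capable(CAP_NET_ADMIN). It is a model of the logic, not the tun driver's code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the calling task's credentials. */
struct cred_sketch {
	int euid;
	int egid;
	const int *groups;	/* supplementary groups */
	size_t ngroups;
	int net_admin;		/* models capable(CAP_NET_ADMIN) */
};

/* True if gid is the effective group or one of the supplementary groups. */
static int in_group(const struct cred_sketch *c, int gid)
{
	size_t i;

	if (c->egid == gid)
		return 1;
	for (i = 0; i < c->ngroups; i++)
		if (c->groups[i] == gid)
			return 1;
	return 0;
}

/* Returns 1 if attaching is allowed, 0 for the -EPERM case. */
static int tun_may_attach(const struct cred_sketch *c, int owner, int group)
{
	if (((owner != -1 && c->euid != owner) ||
	     (group != -1 && !in_group(c, group))) &&
	    !c->net_admin)
		return 0;
	return 1;
}
```

With this shape, a user whose effective gid differs from the tun group but who has that group among the supplementary groups is allowed, which is the behaviour change the patch is after.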
[PATCH 2/9]: Reduce Log I/O latency
Reduce log I/O latency

To ensure that log I/O is issued as the highest priority I/O, set
the I/O priority of the log I/O to the highest possible. This will
ensure that log I/O is not held up behind bulk data or other
metadata I/O as delaying log I/O can pause the entire transaction
subsystem. Introduce a new buffer flag to allow us to tag the log
buffers so we can discriminate when issuing the I/O.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_buf.c |    3 +++
 fs/xfs/linux-2.6/xfs_buf.h |    5 -
 fs/xfs/xfs_log.c           |    2 ++
 3 files changed, 9 insertions(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-11-22 10:47:21.937396362 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c	2007-11-22 10:53:11.556186722 +1100
@@ -1255,6 +1255,9 @@ next_chunk:
 
 submit_io:
 	if (likely(bio->bi_size)) {
+		/* log I/O should not be delayed by anything. */
+		if (bp->b_flags & XBF_LOG_BUFFER)
+			bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
 		submit_bio(rw, bio);
 		if (size)
 			goto next_chunk;
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.h
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.h	2007-11-22 10:47:21.945395328 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.h	2007-11-22 10:53:11.556186722 +1100
@@ -53,7 +53,8 @@ typedef enum {
 	XBF_DELWRI = (1 << 6), /* buffer has dirty pages */
 	XBF_STALE = (1 << 7), /* buffer has been staled, do not find it */
 	XBF_FS_MANAGED = (1 << 8), /* filesystem controls freeing memory */
-	XBF_ORDERED = (1 << 11),/* use ordered writes */
+	XBF_LOG_BUFFER = (1 << 9), /* Buffer issued by the log*/
+	XBF_ORDERED = (1 << 11),/* use ordered writes */
 	XBF_READ_AHEAD = (1 << 12), /* asynchronous read-ahead */
 
 	/* flags used only as arguments to access routines */
@@ -340,6 +341,8 @@ extern void xfs_buf_trace(xfs_buf_t *, c
 #define XFS_BUF_TARGET(bp)		((bp)->b_target)
 #define XFS_BUFTARG_NAME(target)	xfs_buf_target_name(target)
 
+#define XFS_BUF_SET_LOGBUF(bp)		((bp)->b_flags |= XBF_LOG_BUFFER)
+
 static inline int xfs_bawrite(void *mp, xfs_buf_t *bp)
 {
 	bp->b_fspriv3 = mp;
Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c	2007-11-22 10:47:21.945395328 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_log.c	2007-11-22 10:53:11.556186722 +1100
@@ -1443,6 +1443,8 @@ xlog_sync(xlog_t *log,
 	XFS_BUF_ZEROFLAGS(bp);
 	XFS_BUF_BUSY(bp);
 	XFS_BUF_ASYNC(bp);
+	XFS_BUF_SET_LOGBUF(bp);
+
 	/*
 	 * Do an ordered write for the log block.
 	 * Its unnecessary to flush the first split block in the log wrap case.
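For readers unfamiliar with the priority encoding used in the bio_set_prio() call above, here is a small sketch of how an I/O priority value packs a class and a level. The shift and class numbering are written from my recollection of include/linux/ioprio.h and should be treated as assumptions; `log_io_prio()` is a hypothetical helper.

```c
/* Assumed to match include/linux/ioprio.h; verify against your tree. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, data) \
	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_PRIO_CLASS(value)	((value) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(value) \
	((value) & ((1 << IOPRIO_CLASS_SHIFT) - 1))

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,	/* realtime: served before best-effort */
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE
};

/* Hypothetical helper: the priority the patch assigns to log buffers. */
static unsigned int log_io_prio(void)
{
	/* level 0 is the highest level within a class */
	return IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0);
}
```

Whether an I/O scheduler honours the value depends on the scheduler in use; the sketch only shows the encoding.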
[PATCH 1/9]: introduce radix_tree_gang_lookup_range
Introduce radix_tree_gang_lookup_range() The inode clustering in XFS requires a gang lookup on the radix tree to find all the inodes in the cluster. The gang lookup has to set the maximum items to that of a fully populated cluster so we get all the inodes in the cluster, but we only populate the radix tree sparsely (on demand). As a result, the gang lookup can search way, way past the index of end of the cluster because it is looking for a fixed number of entries to return. We know we want to terminate the search at either a specific index or a maximum number of items, so we need to add a "last_index" parameter to the lookup. Furthermore, the existing radix_tree_gang_lookup() can use this same function if we define a RADIX_TREE_MAX_INDEX value so the search is not limited by the last_index. Signed-off-by: Dave Chinner <[EMAIL PROTECTED]> --- include/linux/radix-tree.h |7 - lib/radix-tree.c | 55 - 2 files changed, 51 insertions(+), 11 deletions(-) Index: 2.6.x-xfs-new/include/linux/radix-tree.h === --- 2.6.x-xfs-new.orig/include/linux/radix-tree.h 2007-11-22 10:25:23.834502553 +1100 +++ 2.6.x-xfs-new/include/linux/radix-tree.h2007-11-22 10:31:46.689597763 +1100 @@ -98,10 +98,11 @@ do { \ * radix_tree_lookup * radix_tree_tag_get * radix_tree_gang_lookup + * radix_tree_gang_lookup_range * radix_tree_gang_lookup_tag * radix_tree_tagged * - * The first 4 functions are able to be called locklessly, using RCU. The + * The first 5 functions are able to be called locklessly, using RCU. The * caller must ensure calls to these functions are made within rcu_read_lock() * regions. Other readers (lock-free or otherwise) and modifications may be * running concurrently. 
@@ -155,6 +156,10 @@ void *radix_tree_delete(struct radix_tre unsigned int radix_tree_gang_lookup(struct radix_tree_root *root, void **results, unsigned long first_index, unsigned int max_items); +unsigned int +radix_tree_gang_lookup_range(struct radix_tree_root *root, void **results, + unsigned long first_index, unsigned long last_index, + unsigned int max_items); int radix_tree_preload(gfp_t gfp_mask); void radix_tree_init(void); void *radix_tree_tag_set(struct radix_tree_root *root, Index: 2.6.x-xfs-new/lib/radix-tree.c === --- 2.6.x-xfs-new.orig/lib/radix-tree.c 2007-11-22 10:31:24.564425190 +1100 +++ 2.6.x-xfs-new/lib/radix-tree.c 2007-11-22 10:31:46.693597252 +1100 @@ -62,6 +62,8 @@ struct radix_tree_path { #define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long)) #define RADIX_TREE_MAX_PATH (RADIX_TREE_INDEX_BITS/RADIX_TREE_MAP_SHIFT + 2) +#define RADIX_TREE_MAX_KEY ~0UL + static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH] __read_mostly; /* @@ -599,7 +601,8 @@ EXPORT_SYMBOL(radix_tree_tag_get); static unsigned int __lookup(struct radix_tree_node *slot, void **results, unsigned long index, - unsigned int max_items, unsigned long *next_index) + unsigned long last_index, unsigned int max_items, + unsigned long *next_index) { unsigned int nr_found = 0; unsigned int shift, height; @@ -640,6 +643,8 @@ __lookup(struct radix_tree_node *slot, v if (nr_found == max_items) goto out; } + if (index > last_index) + goto out; } out: *next_index = index; @@ -647,27 +652,29 @@ out: } /** - * radix_tree_gang_lookup - perform multiple lookup on a radix tree + * radix_tree_gang_lookup_range - perform multiple lookup on a radix tree * @root: radix tree root * @results: where the results of the lookup are placed * @first_index: start the lookup from this key + * @last_index:end the lookup at this key * @max_items: place up to this many items at *results * - * Performs an index-ascending scan of the tree for present items. 
Places - * them at [EMAIL PROTECTED] and returns the number of items which were placed at - * [EMAIL PROTECTED] + * Performs an index-ascending scan of the tree for present items up to + * @last_index in the tree. Places them at [EMAIL PROTECTED] and returns the + * number of items which were placed at [EMAIL PROTECTED] * * The implementation is naive. * - * Like radix_tree_lookup, radix_tree_gang_lookup may be called under + * Like radix_tree_lookup, radix_tree_gang_lookup_range may be called under * rcu_read_lock. In this case, rather than the returned results being * an atomic snapshot of the tree at a single point in time, the semantics * of an
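The effect of the added last_index bound can be shown on a toy structure. This sketch does not touch the radix tree at all: a sorted array of keys stands in for the sparsely populated tree, and `gang_lookup_range()` is a hypothetical function that stops on whichever comes first, max_items results or a key past last_index.

```c
#include <stddef.h>

/*
 * Copy up to max_items keys from the sorted array keys[] into
 * results[], taking only keys in [first_index, last_index].
 * Returns the number of keys copied.
 */
static size_t gang_lookup_range(const unsigned long *keys, size_t nkeys,
				unsigned long *results,
				unsigned long first_index,
				unsigned long last_index,
				size_t max_items)
{
	size_t i, nr_found = 0;

	for (i = 0; i < nkeys && nr_found < max_items; i++) {
		if (keys[i] < first_index)
			continue;
		if (keys[i] > last_index)
			break;	/* past the range: stop searching */
		results[nr_found++] = keys[i];
	}
	return nr_found;
}
```

Passing ~0UL as last_index gives the old unbounded behaviour, which is how the changelog proposes to keep radix_tree_gang_lookup() as a thin wrapper.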
[PATCH 0/9]: Various XFS inode clustering improvements
Normally I wouldn't bother cc'ing lkml on XFS changes, however a couple of these patches touch generic code. The changes to generic code are introducing a WRITE_META bio type and radix_tree_gang_lookup_range() and hence the wider distribution. This patch set is against the current xfs-dev tree so bits of it may not apply to current mainline. Overall, the patch set is focussed on improving the XFS inode cache and clustering code. It reduces memory usage of the cache by 5-10% and improves performance on some workloads by 10-15%. Comments welcome.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC/PATCH] SO_NO_CHECK for IPv6
YOSHIFUJI Hideaki / 吉藤英明 wrote: In article <[EMAIL PROTECTED]> (at Wed, 21 Nov 2007 07:45:32 -0500), Jeff Garzik <[EMAIL PROTECTED]> says: SO_NO_CHECK support for IPv6 appeared to be missing. This is presented, based on a reading of net/ipv4/udp.c. Disagree. UDP checksum is mandatory in IPv6.

Ah, you mean that I need to turn off UDP checksum on receive end as well in IPv6... true.

For those interested, I am dealing with a UDP app that already does very strong checksumming and encryption, so additional software checksumming at the lower layers is quite simply a waste of CPU cycles. Hardware checksumming is fine, as long as it's "free."

Jeff
Re: radeonfb i2c regression post-2.6.18.
On Wed, 2007-11-21 at 23:56 +, Roger Leigh wrote: > Fantastic, thanks! I've copied this to Debian bugs 433236 and 426124 > which were about this problem. > > BTW, the framebuffer penguin logo looked a little weird (low number of > colours, odd colours), though on my powerpc it has always looked odd > (wrong colours). Could there be some endianness bug in the fblogo > code? I'll check it with other video options when I next have a few > minutes. Strange, it's always been working fine for me. Ben.
Re: radeonfb i2c regression post-2.6.18.
Benjamin Herrenschmidt <[EMAIL PROTECTED]> writes: >> > Can you try the patch from Jean that I pasted below and let us know if >> > it helps ? It looks like the releasing of the i2c lines may have been >> > done backward. >> >> This patch fixes the problem. The monitor stays powered on during the >> switch to the framebuffer. > > Excellent ! That saves me having to test myself :-) > > As far as I'm concerned, that's an Ack for the patch. Fantastic, thanks! I've copied this to Debian bugs 433236 and 426124 which were about this problem. BTW, the framebuffer penguin logo looked a little weird (low number of colours, odd colours), though on my powerpc it has always looked odd (wrong colours). Could there be some endianness bug in the fblogo code? I'll check it with other video options when I next have a few minutes. Thanks again, Roger -- .''`. Roger Leigh : :' : Debian GNU/Linux http://people.debian.org/~rleigh/ `. `' Printing on GNU/Linux? http://gutenprint.sourceforge.net/ `-GPG Public Key: 0x25BFB848 Please GPG sign your mail.
Re: network driver usage count
Wagner Ferenc <[EMAIL PROTECTED]> : [...] > So why can I remove a driver serving live network traffic? Why not ? It is quite common to remove physically a network/storage device. -- Ueimor
Re: mmap dirty limits on 32 bit kernels (Was: [BUG] New Kernel Bugs)
On Thu, Nov 15, 2007 at 08:32:22AM -0800, Linus Torvalds wrote: > On Thu, 15 Nov 2007, Bron Gondwana wrote: > > > > I guess we'll be doing the one-liner kernel mod and testing > > that then. > > The thing to look at is "get_dirty_limits()" in mm/page-writeback.c, and > in this particular case it's the > > unsigned long available_memory = determine_dirtyable_memory(); > > that's going to bite you. In particular, note the > > x -= highmem_dirtyable_memory(x); > > that we do in determine_dirtyable_memory(). > > So in this case, if you basically remove that line, it will allow all of > memory to be dirtied (including highmem), and then the background_ratio > will work on the whole 6GB. > > HOWEVER! It's worth noting that we also have some other old legacy cruft > there that may interfere with your code. In particular, if you look at the > top of "get_dirty_limits()", it *also* does a > > unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) + > global_page_state(NR_ANON_PAGES)) * 100) / > available_memory; > > dirty_ratio = vm_dirty_ratio; > if (dirty_ratio > unmapped_ratio / 2) > dirty_ratio = unmapped_ratio / 2; > > and that whole "unmapped_ratio" comparison is probably bogus these days, > since we now take the mapped dirty pages into account. That code harks > back to the days before we did that, and dirty ratios only affected > non-mapped pages. > > And in particular, now that I look at it, I wonder if it can even go > negative (because "available_memory" may be *smaller* than the > NR_FILE_MAPPED|ANON_PAGES sum!). > > We'll fix up a negative value anyway (because of the clamping of > dirty_ratio to no less than 5), but the point is that the whole > "unmapped_ratio" thing probably doesn't make sense any more, and may well > make the dirty_ratio not work for you, because you may have a very small > unmapped_ratio that effectively makes all dirty limits always clamp to a > very small value. 
>
> So regardless, I think you may want to try the appended patch *first*.
>
> If this patch makes a difference, please holler. I think it's the correct
> thing to do, but I'm not going to actually commit it without somebody
> saying that it makes a difference (and preferably Peter Zijlstra and
> Andrew acking it too).

mmap: mmap call failed: errno: 12 errmsg: Cannot allocate memory

Yep, that's "fixed" the problem alright! No way this puppy is dirtying 2Gb of memory any more.

http://linux.brong.fastmail.fm/2007-11-22/bmtest.pl

That said, pushing the size down to 1700 rather than 2000 in that file makes it run, and the behaviour matches the 2000 Mb case on 2.6.16.55 rather than 2.6.20.20 or 2.6.23.1 (my other test case kernels that happened to be pre-built on that machine).

[EMAIL PROTECTED] ~]$ free
             total       used       free     shared    buffers     cached
Mem:       4149836    2073056    2076780          0      22036    1846096
-/+ buffers/cache:     204924    3944912
Swap:      2096472          0    2096472

That's after running the 1700Mb version. You can see this machine is our one remaining 4Gb machine (it's not running any production services unlike the 6Gb machine, so it's better for testing).

Anyway - looks like this may be a "good enough" solution for out1 if it can manage an ~2Gb file with 6Gb of memory available. I'll test that later today - but I should drag myself into the office now...

Bron. (patch left attached below for reference)

> Only *after* testing this change is it probably a good idea to test the
> real hack of then removing the highmem_dirtyable_memory() thing.
>
> Peter? Andrew?
>
> Linus
>
> ---
>  mm/page-writeback.c |    8 --------
>  1 files changed, 0 insertions(+), 8 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 81a91e6..d55cfca 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -297,20 +297,12 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
>  {
>  	int background_ratio;		/* Percentages */
>  	int dirty_ratio;
> -	int unmapped_ratio;
>  	long background;
>  	long dirty;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
> -				global_page_state(NR_ANON_PAGES)) * 100) /
> -					available_memory;
> -
>  	dirty_ratio = vm_dirty_ratio;
> -	if (dirty_ratio > unmapped_ratio / 2)
> -		dirty_ratio = unmapped_ratio / 2;
> -
>  	if (dirty_ratio < 5)
>  		dirty_ratio = 5;
>
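The ratio arithmetic discussed in this message can be tried out in userspace. The following is only a sketch under stated assumptions: the function names are made up for illustration, page counts are passed in as plain numbers instead of being read with global_page_state(), and signed long arithmetic is used so that the "can it go negative" case is visible directly (the kernel code quoted above uses unsigned long for available_memory).

```c
#include <assert.h>

/*
 * Userspace model of the old logic: clamp dirty_ratio to half of the
 * "unmapped" percentage, then enforce the floor of 5.
 * All parameters are illustrative stand-ins for the kernel counters.
 */
static int dirty_ratio_with_unmapped_clamp(long mapped_pages, long anon_pages,
					   long available_pages,
					   int vm_dirty_ratio)
{
	long unmapped_ratio;
	int dirty_ratio = vm_dirty_ratio;

	assert(available_pages > 0);

	/* Goes negative once mapped + anon exceeds the dirtyable memory. */
	unmapped_ratio = 100 - ((mapped_pages + anon_pages) * 100) / available_pages;

	if (dirty_ratio > unmapped_ratio / 2)
		dirty_ratio = (int)(unmapped_ratio / 2);
	if (dirty_ratio < 5)
		dirty_ratio = 5;
	return dirty_ratio;
}

/* Model of the logic with the unmapped_ratio comparison removed. */
static int dirty_ratio_without_unmapped_clamp(int vm_dirty_ratio)
{
	int dirty_ratio = vm_dirty_ratio;

	if (dirty_ratio < 5)
		dirty_ratio = 5;
	return dirty_ratio;
}
```

With mapped plus anonymous pages larger than the dirtyable total, the first function always ends up at the floor of 5 no matter what vm_dirty_ratio is set to, which is the "always clamp to a very small value" effect described above; the second function simply honours the configured ratio.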
Re: [PATCH] [2.6.24-rc3-mm1] loop cleanup in fs/namespace.c - repost
Andrew Morton пишет: > On Thu, 22 Nov 2007 01:49:19 +0300 > Dmitri Vorobiev <[EMAIL PROTECTED]> wrote: > >> Zach Brown пишет: > This doesn't look fine. Did you test this? Oops, my fault. Of course, I tested the patch, but kernel modules are disabled in my test setup, so I missed the error. >>> :) >>> Enclosed to this message is a new patch, which replaces the goto-loop by the while-based one, but leaves the EXPORT_SYMBOL macro intact. >>> It certainly looks OK to me now, for whatever that's worth. >> Zach, thank you for the code review and suggestions. >> >>> You probably want to wait 'till the next merge window to get it in, >>> though. It's just a cleanup and so shouldn't go in this late in the -rc >>> line. >>> >>> Maybe Andrew will be willing to queue it until that time in -mm. >> I am enclosing the patch against current -mm tree and adding Andrew to the >> Cc: list. >> >> Thanks, >> >> Dmitri >> >>> - z >>> >> > [loop-cleanup-fs-namespace-mm.diff text/x-patch (742B)] > Signed-off-by: Dmitri Vorobiev <[EMAIL PROTECTED]> > --- > diff --git a/fs/namespace.c b/fs/namespace.c > index 79883fe..b098b63 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -606,19 +606,17 @@ static inline void __mntput(struct vfsmo > > void mntput_no_expire(struct vfsmount *mnt) > { > -repeat: > - if (atomic_dec_and_lock(>mnt_count, _lock)) { > + while (atomic_dec_and_lock(>mnt_count, _lock)) { > if (likely(!mnt->mnt_pinned)) { > spin_unlock(_lock); > __mntput(mnt); > - return; > + break; > } > atomic_add(mnt->mnt_pinned + 1, >mnt_count); > mnt->mnt_pinned = 0; > spin_unlock(_lock); > acct_auto_close_mnt(mnt); > security_sb_umount_close(mnt); > - goto repeat; > } > } > > This patch has no changelog which I can use. > > Andrew, thanks for the quick reply. I believe that a couple of sentences is enough for the changelog entry, so here it goes... From: Dmitri Vorobiev <[EMAIL PROTECTED]> The mntput_no_expire() routine implements a simple loop using the goto-based construct. 
Replace this with an equivalent while-based loop, which looks much cleaner in C code.

Signed-off-by: Dmitri Vorobiev <[EMAIL PROTECTED]>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 79883fe..b098b63 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -606,19 +606,17 @@ static inline void __mntput(struct vfsmo
 
 void mntput_no_expire(struct vfsmount *mnt)
 {
-repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
+	while (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
 		if (likely(!mnt->mnt_pinned)) {
 			spin_unlock(&vfsmount_lock);
 			__mntput(mnt);
-			return;
+			break;
 		}
 		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
 		mnt->mnt_pinned = 0;
 		spin_unlock(&vfsmount_lock);
 		acct_auto_close_mnt(mnt);
 		security_sb_umount_close(mnt);
-		goto repeat;
 	}
 }
 
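To convince oneself that the goto form and the while form behave the same, the control flow can be modelled in userspace. This is a simplified sketch, not the kernel code: struct mnt_model, dec_and_test(), put_goto() and put_while() are hypothetical names, the counter is a plain int rather than an atomic_t, and locking and the accounting/security hooks are left out.

```c
#include <assert.h>

/* Toy stand-in for the fields of struct vfsmount that the loop touches. */
struct mnt_model {
	int count;	/* models mnt_count */
	int pinned;	/* models mnt_pinned */
	int released;	/* set where the kernel would call __mntput() */
};

/* Models atomic_dec_and_lock(): true when the count drops to zero. */
static int dec_and_test(struct mnt_model *m)
{
	assert(m->count > 0);
	return --m->count == 0;
}

/* Shape of the original code: if + goto repeat. */
static void put_goto(struct mnt_model *m)
{
repeat:
	if (dec_and_test(m)) {
		if (!m->pinned) {
			m->released = 1;
			return;
		}
		m->count += m->pinned + 1;
		m->pinned = 0;
		goto repeat;
	}
}

/* Shape of the patched code: while + break. */
static void put_while(struct mnt_model *m)
{
	while (dec_and_test(m)) {
		if (!m->pinned) {
			m->released = 1;
			break;
		}
		m->count += m->pinned + 1;
		m->pinned = 0;
	}
}
```

Running both functions on identical starting states (for example count 1 with 2 pins, or count 1 with no pins) leaves the two models in the same final state, which is the equivalence the changelog claims.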
Re: [PATCH] [2.6.24-rc3-mm1] loop cleanup in fs/namespace.c - repost
On Thu, 22 Nov 2007 01:49:19 +0300 Dmitri Vorobiev <[EMAIL PROTECTED]> wrote:

> Zach Brown wrote:
> >>> This doesn't look fine. Did you test this?
> >> Oops, my fault. Of course, I tested the patch, but kernel modules are
> >> disabled in my test setup, so I missed the error.
> >
> > :)
> >
> >> Enclosed to this message is a new patch, which replaces the goto-loop by
> >> the while-based one, but leaves the EXPORT_SYMBOL macro intact.
> >
> > It certainly looks OK to me now, for whatever that's worth.
>
> Zach, thank you for the code review and suggestions.
>
> >
> > You probably want to wait 'till the next merge window to get it in,
> > though. It's just a cleanup and so shouldn't go in this late in the -rc
> > line.
> >
> > Maybe Andrew will be willing to queue it until that time in -mm.
>
> I am enclosing the patch against current -mm tree and adding Andrew to the
> Cc: list.
>
> Thanks,
>
> Dmitri
>
> >
> > - z
> >

[loop-cleanup-fs-namespace-mm.diff text/x-patch (742B)]
Signed-off-by: Dmitri Vorobiev <[EMAIL PROTECTED]>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 79883fe..b098b63 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -606,19 +606,17 @@ static inline void __mntput(struct vfsmo
 
 void mntput_no_expire(struct vfsmount *mnt)
 {
-repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
+	while (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
 		if (likely(!mnt->mnt_pinned)) {
 			spin_unlock(&vfsmount_lock);
 			__mntput(mnt);
-			return;
+			break;
 		}
 		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
 		mnt->mnt_pinned = 0;
 		spin_unlock(&vfsmount_lock);
 		acct_auto_close_mnt(mnt);
 		security_sb_umount_close(mnt);
-		goto repeat;
 	}
 }
 

This patch has no changelog which I can use.
Re: [PATCH 3/3] PNP cleanups - Version 2 - Pass struct pnp_dev to pnp_clean_resource_table for cleanup reasons
On Wed, 2007-11-21 at 18:13 +, Alan Cox wrote: > > > in the pnp_dev. That is, the resources are tied to the device, with > > > struct > > > pnp_resource_table being no more than a handy container to group them > > > under > > > a single name. > > Putting the count into struct resource does not make sense. > > Can you explain that claim ? The additional variable would only make sense for the pnp layer, or only for the pnp resource table in the pnp layer, but struct resource is used at much more places... It is meant for System Memory and IO port resources in general, why waste bytes and an additional name at all places it is used, just for the pnp resource table? > > The idea is to not rely on the exact pnp resource table structure and > > abstracting this to macros. If krealloc approach works, > > dev->res.port_resource[i].start would even still work, if not, it's > > easier to alter the pnp resource table and the macros internally. > > Externally in drivers yes. Internally in code no - it makes the code > harder to work with. > > > > Yes, I dont know how he intends to deal with this (nor, in fact, just how > > > dynamic things are supposed to end up to begin with) so over to Thomas. > > Krealloc should only get used at early pnp init time, when the BIOS > > structures are parsed. The devices shouldn't be active then... > > A bit of a problem, as said, could be the sysfs interfaces, there it > > must be insured krealloc is not used anymore. > > I don't think its that simple but that can be dealt with one the changes > are in place if the objects are sensibly laid out. I hope it is, stay tuned there will come something soon... If it's not that easy, another structure would be needed and every dev->res.port_resource[i].start and friends need to be touched (I don't see how this could still be resolved in a simple array then...). 
Thomas
Re: 2.6.24-rc3-mm1- powerpc link failure
On Wed, 21 Nov 2007 13:36:30 +0530 Kamalesh Babulal <[EMAIL PROTECTED]> wrote:
>
> The kernel build fails on powerpc while linking,

Only for allyesconfig (or maybe some other config that builds a lot of stuff in).

>   AS      .tmp_kallsyms3.o
>   LD      vmlinux.o
> ld: TOC section size exceeds 64k
> make: *** [vmlinux.o] Error 1
>
> The patch posted at http://lkml.org/lkml/2007/11/13/414, solves this
> failure.

However, that patch needs more testing, especially to figure out what performance effects it has, i.e. it is not for merging yet.

--
Cheers,
Stephen Rothwell [EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/
Re: [PATCH] [2.6.24-rc3-mm1] loop cleanup in fs/namespace.c - repost
Zach Brown wrote:
>>> This doesn't look fine. Did you test this?
>> Oops, my fault. Of course, I tested the patch, but kernel modules are
>> disabled in my test setup, so I missed the error.
>
> :)
>
>> Enclosed to this message is a new patch, which replaces the goto-loop by
>> the while-based one, but leaves the EXPORT_SYMBOL macro intact.
>
> It certainly looks OK to me now, for whatever that's worth.

Zach, thank you for the code review and suggestions.

>
> You probably want to wait 'till the next merge window to get it in,
> though. It's just a cleanup and so shouldn't go in this late in the -rc
> line.
>
> Maybe Andrew will be willing to queue it until that time in -mm.

I am enclosing the patch against current -mm tree and adding Andrew to the Cc: list.

Thanks,

Dmitri

>
> - z
>

Signed-off-by: Dmitri Vorobiev <[EMAIL PROTECTED]>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 79883fe..b098b63 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -606,19 +606,17 @@ static inline void __mntput(struct vfsmo
 
 void mntput_no_expire(struct vfsmount *mnt)
 {
-repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
+	while (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
 		if (likely(!mnt->mnt_pinned)) {
 			spin_unlock(&vfsmount_lock);
 			__mntput(mnt);
-			return;
+			break;
 		}
 		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
 		mnt->mnt_pinned = 0;
 		spin_unlock(&vfsmount_lock);
 		acct_auto_close_mnt(mnt);
 		security_sb_umount_close(mnt);
-		goto repeat;
 	}
 }
 
Error returns not handled correctly by sysfs.c:subsys_attr_store()
The buf in fs/sysfs.c:subsys_attr_store() does not seem to be updated correctly when a negative value (indicating that an error condition has occurred) is returned. If a negative value is returned, the next call to subsys_attr_store sees a buf that still holds the contents from the previous call, with the new data appended.

Example:

I have modified sd.c:sd_store_allow_restart to always print out the contents of buf and return an error, using the following patch:

--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -183,6 +183,9 @@ static ssize_t sd_store_allow_restart(struct class_device *c
 	struct scsi_disk *sdkp = to_scsi_disk(cdev);
 	struct scsi_device *sdp = sdkp->device;
 
+	printk(KERN_ERR "buf_ptr = 0x%p, buf = %s, count = %u\n", buf, buf, coun
+	return -EINVAL;
+
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
 

I get the following output when writing invalid values to the allow_restart sysfs file:

# echo x > /sys/class/scsi_disk/4:0:0:0/allow_restart
bash: echo: write error: Invalid argument
# echo y > /sys/class/scsi_disk/4:0:0:0/allow_restart
bash: echo: write error: Invalid argument
# echo z > /sys/class/scsi_disk/4:0:0:0/allow_restart
bash: echo: write error: Invalid argument

And the console output shows:

buf_ptr = 0xe1004bdb, buf = x , count = 2
buf_ptr = 0xe1004bdb, buf = x , count = 2
buf_ptr = 0xe1004bdb, buf = x y , count = 4
buf_ptr = 0xe1004bdb, buf = x y , count = 4
buf_ptr = 0xe1004caf, buf = x y z , count = 6
buf_ptr = 0xe1004caf, buf = x y z , count = 6

and the same append problem occurs when using another sysfs file:

# echo xyzzy > /sys/class/scsi_disk/4:0:1:0/allow_restart
bash: echo: write error: Invalid argument

buf_ptr = 0xe1004caf, buf = x y z xyzzy , count = 12

I found this problem in 2.6.24-rc3 and an earlier version of 2.6.23. This seems to work correctly on 2.6.18 (at least with the RHEL5 kernel I did some testing with), i.e. the contents of buf from the previous failed call are thrown away/overwritten.
I looked through sysfs.c to see if I could find anything obvious but could not see anything. Perhaps this is handled at a higher level.

--
Andrew Patterson
Hewlett-Packard
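The report uses the shell's echo for every write, so it does not by itself show at which level the old data is being retained. One way to narrow that down would be to issue each store as exactly one write(2) from a small C program. The helper below is a hypothetical sketch (write_once is not an existing API); the sysfs path in the usage note is the one from the report and is an assumption about the test machine.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path`, issue a single write(2) of the
 * NUL-terminated string `data`, and return the number of bytes written
 * or a negative errno value. No stdio buffering is involved.
 */
static long write_once(const char *path, const char *data)
{
	size_t len;
	ssize_t ret;
	int fd, saved_errno;

	assert(path != NULL && data != NULL);
	len = strlen(data);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	ret = write(fd, data, len);
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);

	return ret < 0 ? -saved_errno : (long)ret;
}
```

Calling write_once("/sys/class/scsi_disk/4:0:0:0/allow_restart", "x\n") and then the same with "y\n" on the instrumented kernel, and comparing the printk output with the echo case, should indicate whether the accumulation still happens when each write is independent.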
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On Wed, 21 Nov 2007 22:45:22 +0100 Laurent Riffard <[EMAIL PROTECTED]> wrote: > Le 21.11.2007 05:45, Andrew Morton a écrit : > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ > > Hello, > > My system hangs shortly after I logged in Gnome desktop. SysRq-W shows > that a bunch of task are blocked in "D" state, they seem to wait for > some I/O completion. I can try to hand-copy some data if requested. > > I found these messages in dmesg: > > ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 > EXT3-fs: mounted filesystem with ordered data mode. > sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT > driverbyte=DRIVER_OK,SUGGEST_OK > end_request: I/O error, dev sda, sector 16460 > ReiserFS: sda7: found reiserfs format "3.6" with standard journal > ReiserFS: sda7: using ordered data mode > -- > ReiserFS: sda7: Using r5 hash to sort names > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > driverbyte=DRIVER_OK,SUGGEST_OK > end_request: I/O error, dev sdb, sector 19632 > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > driverbyte=DRIVER_OK,SUGGEST_OK > end_request: I/O error, dev sdb, sector 40037363 > Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 > across:1048568k > lp0: using parport0 (interrupt-driven). > > These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible. > 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. > > Maybe something is broken in pata_via driver ? > Could be - libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch touch pata_via.c. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[replacement PATCH 5/6] x86: tls32 moved
This renames arch/x86/ia32/tls32.c to arch/x86/kernel/tls.c, which does nothing now but paves the way to consolidate this code for 32-bit too. Signed-off-by: Roland McGrath <[EMAIL PROTECTED]> --- arch/x86/ia32/Makefile |2 +- arch/x86/ia32/tls32.c | 158 --- arch/x86/kernel/Makefile_64 |2 + arch/x86/kernel/tls.c | 158 +++ 4 files changed, 161 insertions(+), 159 deletions(-) diff --git a/arch/x86/ia32/Makefile b/arch/x86/ia32/Makefile index e2edda2..3c8b746 100644 --- a/arch/x86/ia32/Makefile +++ b/arch/x86/ia32/Makefile @@ -2,7 +2,7 @@ # Makefile for the ia32 kernel emulation subsystem. # -obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_signal.o tls32.o \ +obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_signal.o \ ia32_binfmt.o fpu32.o ptrace32.o syscall32.o syscall32_syscall.o \ mmap32.o diff --git a/arch/x86/ia32/tls32.c b/arch/x86/ia32/tls32.c deleted file mode 100644 index 5291596..000 --- a/arch/x86/ia32/tls32.c +++ /dev/null @@ -1,158 +0,0 @@ -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include - -/* - * sys_alloc_thread_area: get a yet unused TLS descriptor index. 
- */ -static int get_free_idx(void) -{ - struct thread_struct *t = >thread; - int idx; - - for (idx = 0; idx < GDT_ENTRY_TLS_ENTRIES; idx++) - if (desc_empty((struct n_desc_struct *)(t->tls_array) + idx)) - return idx + GDT_ENTRY_TLS_MIN; - return -ESRCH; -} - -/* - * Set a given TLS descriptor: - * When you want addresses > 32bit use arch_prctl() - */ -int do_set_thread_area(struct thread_struct *t, struct user_desc __user *u_info) -{ - struct user_desc info; - struct n_desc_struct *desc; - int cpu, idx; - - if (copy_from_user(, u_info, sizeof(info))) - return -EFAULT; - - idx = info.entry_number; - - /* -* index -1 means the kernel should try to find and -* allocate an empty descriptor: -*/ - if (idx == -1) { - idx = get_free_idx(); - if (idx < 0) - return idx; - if (put_user(idx, _info->entry_number)) - return -EFAULT; - } - - if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) - return -EINVAL; - - desc = ((struct n_desc_struct *)t->tls_array) + idx - GDT_ENTRY_TLS_MIN; - - /* -* We must not get preempted while modifying the TLS. 
-*/ - cpu = get_cpu(); - - if (LDT_empty()) { - desc->a = 0; - desc->b = 0; - } else { - desc->a = LDT_entry_a(); - desc->b = LDT_entry_b(); - } - if (t == >thread) - load_TLS(t, cpu); - - put_cpu(); - return 0; -} - -asmlinkage long sys32_set_thread_area(struct user_desc __user *u_info) -{ - return do_set_thread_area(>thread, u_info); -} - - -/* - * Get the current Thread-Local Storage area: - */ - -#define GET_LIMIT(desc) ( \ - ((desc)->a & 0x0) | \ -((desc)->b & 0xf) ) - -#define GET_32BIT(desc)(((desc)->b >> 22) & 1) -#define GET_CONTENTS(desc) (((desc)->b >> 10) & 3) -#define GET_WRITABLE(desc) (((desc)->b >> 9) & 1) -#define GET_LIMIT_PAGES(desc) (((desc)->b >> 23) & 1) -#define GET_PRESENT(desc) (((desc)->b >> 15) & 1) -#define GET_USEABLE(desc) (((desc)->b >> 20) & 1) -#define GET_LONGMODE(desc) (((desc)->b >> 21) & 1) - -int do_get_thread_area(struct thread_struct *t, struct user_desc __user *u_info) -{ - struct user_desc info; - struct n_desc_struct *desc; - int idx; - - if (get_user(idx, _info->entry_number)) - return -EFAULT; - if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) - return -EINVAL; - - desc = ((struct n_desc_struct *)t->tls_array) + idx - GDT_ENTRY_TLS_MIN; - - memset(, 0, sizeof(struct user_desc)); - info.entry_number = idx; - info.base_addr = get_desc_base(desc); - info.limit = GET_LIMIT(desc); - info.seg_32bit = GET_32BIT(desc); - info.contents = GET_CONTENTS(desc); - info.read_exec_only = !GET_WRITABLE(desc); - info.limit_in_pages = GET_LIMIT_PAGES(desc); - info.seg_not_present = !GET_PRESENT(desc); - info.useable = GET_USEABLE(desc); - info.lm = GET_LONGMODE(desc); - - if (copy_to_user(u_info, , sizeof(info))) - return -EFAULT; - return 0; -} - -asmlinkage long sys32_get_thread_area(struct user_desc __user *u_info) -{ - return do_get_thread_area(>thread, u_info); -} - - -int ia32_child_tls(struct task_struct *p, struct pt_regs *childregs) -{ - struct n_desc_struct *desc; - struct user_desc info; - struct user_desc __user
[replacement PATCH 6/6] x86: TLS cleanup
This consolidates the four different places that implemented the same encoding magic for the GDT-slot 32-bit TLS support. The old tls32.c was renamed and is now only slightly modified to be the shared implementation. Signed-off-by: Roland McGrath <[EMAIL PROTECTED]> --- arch/x86/ia32/ia32entry.S|4 +- arch/x86/kernel/Makefile_32 |1 + arch/x86/kernel/process_32.c | 143 ++ arch/x86/kernel/process_64.c |3 +- arch/x86/kernel/ptrace_32.c | 91 +++ arch/x86/kernel/ptrace_64.c | 26 +++- arch/x86/kernel/tls.c| 97 +++- include/asm-x86/ia32.h |6 -- include/asm-x86/ptrace.h | 11 +++ 9 files changed, 80 insertions(+), 302 deletions(-) diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index df588f0..0a71fa9 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -644,8 +644,8 @@ ia32_sys_call_table: .quad compat_sys_futex /* 240 */ .quad compat_sys_sched_setaffinity .quad compat_sys_sched_getaffinity - .quad sys32_set_thread_area - .quad sys32_get_thread_area + .quad sys_set_thread_area + .quad sys_get_thread_area .quad compat_sys_io_setup /* 245 */ .quad sys_io_destroy .quad compat_sys_io_getevents diff --git a/arch/x86/kernel/Makefile_32 b/arch/x86/kernel/Makefile_32 index a7bc93c..e660584 100644 --- a/arch/x86/kernel/Makefile_32 +++ b/arch/x86/kernel/Makefile_32 @@ -10,6 +10,7 @@ obj-y := process_32.o signal_32.o entry_32.o traps_32.o irq_32.o \ pci-dma_32.o i386_ksyms_32.o i387_32.o bootflag.o e820_32.o\ quirks.o i8237.o topology.o alternative.o i8253.o tsc_32.o +obj-y += tls.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-y += cpu/ obj-y += acpi/ diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 7b89958..ebbbfc5 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -480,33 +480,16 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long esp, set_tsk_thread_flag(p, TIF_IO_BITMAP); } + err = 0; + /* * Set a new TLS for the child thread? 
*/ - if (clone_flags & CLONE_SETTLS) { - struct desc_struct *desc; - struct user_desc info; - int idx; - - err = -EFAULT; - if (copy_from_user(, (void __user *)childregs->esi, sizeof(info))) - goto out; - err = -EINVAL; - if (LDT_empty()) - goto out; - - idx = info.entry_number; - if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) - goto out; - - desc = p->thread.tls_array + idx - GDT_ENTRY_TLS_MIN; - desc->a = LDT_entry_a(); - desc->b = LDT_entry_b(); - } + if (clone_flags & CLONE_SETTLS) + err = do_set_thread_area(p, -1, + (struct user_desc __user *)childregs->esi, 0); - err = 0; - out: - if (err && p->thread.io_bitmap_ptr) { + if (err && p->thread.io_bitmap_ptr) { kfree(p->thread.io_bitmap_ptr); p->thread.io_bitmap_max = 0; } @@ -851,120 +834,6 @@ unsigned long get_wchan(struct task_struct *p) return 0; } -/* - * sys_alloc_thread_area: get a yet unused TLS descriptor index. - */ -static int get_free_idx(void) -{ - struct thread_struct *t = >thread; - int idx; - - for (idx = 0; idx < GDT_ENTRY_TLS_ENTRIES; idx++) - if (desc_empty(t->tls_array + idx)) - return idx + GDT_ENTRY_TLS_MIN; - return -ESRCH; -} - -/* - * Set a given TLS descriptor: - */ -asmlinkage int sys_set_thread_area(struct user_desc __user *u_info) -{ - struct thread_struct *t = >thread; - struct user_desc info; - struct desc_struct *desc; - int cpu, idx; - - if (copy_from_user(, u_info, sizeof(info))) - return -EFAULT; - idx = info.entry_number; - - /* -* index -1 means the kernel should try to find and -* allocate an empty descriptor: -*/ - if (idx == -1) { - idx = get_free_idx(); - if (idx < 0) - return idx; - if (put_user(idx, _info->entry_number)) - return -EFAULT; - } - - if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) - return -EINVAL; - - desc = t->tls_array + idx - GDT_ENTRY_TLS_MIN; - - /* -* We must not get preempted while modifying the TLS. -*/ - cpu = get_cpu(); - - if (LDT_empty()) { - desc->a = 0; - desc->b
Re: [PATCH 5/5] x86: TLS cleanup
> I had a bit of trouble verifying correctness here because of much
> brownian motion. Any possibility of a pure movement / fixup separation
> to make it easier on the eyes?

Yeah, sorry about that. It was late and the whole TLS thing was a sudden afterthought while I was in the middle of doing something else, so I didn't feel like slicing up the patch any more. And in my tree, GIT didn't even do so great a job with noticing this rename. I'll send a replacement.

Thanks,
Roland
Re: 2.6.24-rc3-mm1
On Wed, 21 Nov 2007 20:35:13 +0200 "Kirill A. Shutemov" <[EMAIL PROTECTED]> wrote:

> Symbol init_level4_pgt is needed by nvidia module. Is it really need to
> unexport it?

It's our clever way of reducing the tester base so we don't get so many bug reports.
Re: 2.6.24-rc3-mm1: usb mouse doesn't work
On Wed, 21 Nov 2007 20:23:46 +0200 "Kirill A. Shutemov" <[EMAIL PROTECTED]> wrote: > USB mouse(Logitech M-BT58) doesn't work. TouchPad works. > dmesg after rmmod usbcore && modprobe uhci_hcd: > > usbcore: registered new interface driver usbfs > usbcore: registered new interface driver hub > usbcore: registered new device driver usb > USB Universal Host Controller Interface driver v3.0 > ACPI: PCI Interrupt :00:1d.0[A] -> Link [LNKE] -> GSI 10 (level, low) > -> IRQ 10 > PCI: Setting latency timer of device :00:1d.0 to 64 > uhci_hcd :00:1d.0: UHCI Host Controller > uhci_hcd :00:1d.0: new USB bus registered, assigned bus number 1 > uhci_hcd :00:1d.0: irq 10, io base 0xbf80 > usb usb1: configuration #1 chosen from 1 choice > hub 1-0:1.0: USB hub found > hub 1-0:1.0: 2 ports detected > usb usb1: new device found, idVendor=, idProduct= > usb usb1: new device strings: Mfr=3, Product=2, SerialNumber=1 > usb usb1: Product: UHCI Host Controller > usb usb1: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd > usb usb1: SerialNumber: :00:1d.0 > ACPI: PCI Interrupt :00:1d.1[B] -> Link [LNKF] -> GSI 11 (level, low) > -> IRQ 11 > PCI: Setting latency timer of device :00:1d.1 to 64 > uhci_hcd :00:1d.1: UHCI Host Controller > uhci_hcd :00:1d.1: new USB bus registered, assigned bus number 2 > uhci_hcd :00:1d.1: irq 11, io base 0xbf60 > usb usb2: configuration #1 chosen from 1 choice > hub 2-0:1.0: USB hub found > hub 2-0:1.0: 2 ports detected > usb usb2: new device found, idVendor=, idProduct= > usb usb2: new device strings: Mfr=3, Product=2, SerialNumber=1 > usb usb2: Product: UHCI Host Controller > usb usb2: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd > usb usb2: SerialNumber: :00:1d.1 > ACPI: PCI Interrupt :00:1d.2[C] -> Link [LNKG] -> GSI 9 (level, low) > -> IRQ 9 > PCI: Setting latency timer of device :00:1d.2 to 64 > uhci_hcd :00:1d.2: UHCI Host Controller > uhci_hcd :00:1d.2: new USB bus registered, assigned bus number 3 > uhci_hcd :00:1d.2: irq 9, io base 0xbf40 > usb usb3: 
configuration #1 chosen from 1 choice
> hub 3-0:1.0: USB hub found
> hub 3-0:1.0: 2 ports detected
> usb usb3: new device found, idVendor=, idProduct=
> usb usb3: new device strings: Mfr=3, Product=2, SerialNumber=1
> usb usb3: Product: UHCI Host Controller
> usb usb3: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd
> usb usb3: SerialNumber: :00:1d.2
> ACPI: PCI Interrupt :00:1d.3[D] -> Link [LNKH] -> GSI 7 (level, low)
> -> IRQ 7
> PCI: Setting latency timer of device :00:1d.3 to 64
> uhci_hcd :00:1d.3: UHCI Host Controller
> uhci_hcd :00:1d.3: new USB bus registered, assigned bus number 4
> uhci_hcd :00:1d.3: irq 7, io base 0xbf20
> usb usb4: configuration #1 chosen from 1 choice
> hub 4-0:1.0: USB hub found
> hub 4-0:1.0: 2 ports detected
> usb usb4: new device found, idVendor=, idProduct=
> usb usb4: new device strings: Mfr=3, Product=2, SerialNumber=1
> usb usb4: Product: UHCI Host Controller
> usb usb4: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd
> usb usb4: SerialNumber: :00:1d.3
> uhci_hcd :00:1d.3: FGR not stopped yet!
>

I've had some strangenesses with USB lately. Sometimes running `lsusb' makes the USB system notice a newly attached device.

Is that "FGR not stopped yet!" message new behaviour?
Re: [linux-usb-devel] USB deadlock after resume
On 11/21/07, Laurent Pinchart <[EMAIL PROTECTED]> wrote: > On Wednesday 21 November 2007, Markus Rechberger wrote: > > On 11/21/07, Alan Stern <[EMAIL PROTECTED]> wrote: > > > On Wed, 21 Nov 2007, Markus Rechberger wrote: > > > > > > it's not just usb_set_interface that hangs actually. > > > > > > It seems to hang at > > > > > > > > > > > > wait_event(usb_kill_urb_queue, atomic_read(>use_count) == 0); > > > > > > > > > > > > in drivers/usb/core/urb.c after resuming. I disabled access to the > > > > > > usb subsystem in the uvc driver, although connecting any other usb > > > > > > storage fails too, just at the same point. > > > > > > > > > > Which URB is usb_kill_urb() called for? > > > > > > > > it's the usb_control_message which calls usb_kill_urb if I haven't got > > > > it wrong. (if you're looking for some other information please let me > > > > know) > > > > Although, I got a bit further with it. The error seems to happen > > > > earlier already. > > > > If I load the driver, and do not access the device after suspending > > > > all usb_control commands fail with -71 eproto. > > > > > > That's very strange. Getting -71 errors is understandable; it > > > indicates that the device can't handle being suspended. But the > > > wait_event() line still shouldn't hang. If it does, it indicates that > > > there's something wrong with the USB host controller, not just the > > > device. > > > > > > Can you try testing this on a different sort of computer? > > > > Not really, suspending doesn't work at all on my other notebook it > > just freezes.. > > I'm basically trying to get that driver work on my eee PC [1], it's > > cheap and tiny so I don't expect anything special in there.. > > The system is preloaded with Xandros (it's debian etch with a few > > custom applications) and linux 2.6.21.4. > > If I'm not mistaken, the EeePC ACPI bios plays tricks with the USB ports > during suspend/resume. 
You should really test suspend/resume with the same > camera chipset on a proper computer. If the camera still crashes, we have a > buggy chipset which needs a reset quirk. If it doesn't, the EeePC ACPI bios > is probably at fault. Adding quirks and hacks to the Linux kernel (either in > the USB stack or the uvcvideo driver) is pretty pointless if the bios tries > to make the system crash. The ACPI code should be fixed in that case. > With ACPI it seems to be possible to disconnect the uvc device. I tested the suspend/resume functions by adding a proc interface to it, and it worked properly. Although the eee PC also suspends the underlying bus where the usb controller is connected to (which is PCI or PCIe) > > The system still locks up, although only if I leave the video > > application running during suspending. I don't have to reload the > > driver anymore after resuming if the video node doesn't get accessed > > (I'm looking for races in the uvc driver at the moment). > The current state I revealed is that after suspend if the video node isn't used it's not necessary to reconnect the device nor to reload the driver again if that reset is implemented. That eee PC comes with 2.6.21.3 which has no such reset quirk feature in the usbcore (that's what I initially meant actually). If a videoapplication accesses the nodes during suspend the notebook won't come back again. I also think it's faulty hardware in that case but I'm moreover looking for a solution for it. My other intel notebook doesn't even awake from suspend to ram, and for some reason suspend to disk just didn't work as expected either (Acer Travelmate 660). thanks for the feedback, Markus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: network driver usage count
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> On Wed, 21 Nov 2007 20:45:11 +0100
> Ferenc Wagner <[EMAIL PROTECTED]> wrote:
>
>> Under 2.6.23.1, my lsmod output shows this:
>>
>> $ lsmod | grep tg3
>> tg3 100580 0
>>
>> The usage count is zero, even though it drives my two physical
>> interfaces:
>>
>> $ ls -l /sys/class/net/eth-gb?/device/driver
>> lrwxrwxrwx 1 root root 0 2007-11-21 19:58
>> /sys/class/net/eth-gb1/device/driver -> ../../../bus/pci/drivers/tg3
>> lrwxrwxrwx 1 root root 0 2007-11-21 19:58
>> /sys/class/net/eth-gb2/device/driver -> ../../../bus/pci/drivers/tg3
>>
>> These interfaces are up and bonded together, but that doesn't seem to
>> matter at all. I also checked other machines, the network driver
>> (tg3, e1000) usage counts are always zero under various recent 2.6
>> kernels, but nonzero under 2.4.21 for example.
>>
>> And really, the module could be removed, cutting my ssh session. :)
>>
>> Was this made possible intentionally? If yes, why?
>
> Yes, so devices can be removed at anytime.

Hmm, that would warrant nuking all the reference counts on every driver.
I must be missing something, since I really feel it goes against common
sense. Can you point me to some discussion of this change? I mean, I
couldn't remove the driver of a mounted filesystem. So why can I remove
a driver serving live network traffic?
--
Thanks,
Feri.
Re: Possible bug from kernel 2.6.22 and above
Jie Chen wrote:
> Hi, there:
>
> We have a simple pthread program that measures the synchronization
> overheads for various synchronization mechanisms such as spin locks,
> barriers (the barrier is implemented using queue-based barrier
> algorithm) and so on. We have dual quad-core AMD opterons (barcelona)
> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7
> distribution. Before we moved to this kernel, we had kernel 2.6.21.
> These two kernels are configured identically and compiled with the same
> gcc 4.1.2 compiler. Under the old kernel, we observed that the
> performance of these overheads increases as the number of threads
> increases from 2 to 8. The following are the values of total time and
> overhead for all threads acquiring a pthread spin lock and all threads
> executing a barrier synchronization call.

Could you post the source of your test program ?

Spinlocks are ... spinning and should not call the linux scheduler, so I
have no idea why a kernel change could modify your results.

Also I suspect you'll have better results with Fedora Core 8 (since glibc
was updated to use private futexes in v 2.7), at least for the barrier ops.

> Kernel 2.6.21
> Number of Threads                 2          4          6          8
> SpinLock (Time, microseconds)     10.5618    10.58538   10.5915    10.643
>          (Overhead)               0.073      0.05746    0.102805   0.154563
> Barrier  (Time, microseconds)     11.020410  11.678125  11.9889    12.38002
>          (Overhead)               0.531660   1.1502     1.500112   1.891617
>
> Each thread is bound to a particular core using pthread_setaffinity_np.
>
> Kernel 2.6.23.8
> Number of Threads                 2          4          6          8
> SpinLock (Time, microseconds)     14.849915  17.117603  14.4496    10.5990
>          (Overhead)               4.345417   6.617207   3.949435   0.110985
> Barrier  (Time, microseconds)     19.462255  20.285117  16.19395   12.37662
>          (Overhead)               8.957755   9.784722   5.699590   1.869518
>
> It is clear that the synchronization overhead increases as the number
> of threads increases in the kernel 2.6.21. But the synchronization
> overhead actually decreases as the number of threads increases in the
> kernel 2.6.23.8 (We observed the same behavior on kernel 2.6.22 as
> well). This certainly is not a correct behavior. The kernels are
> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
> configuration file is in the attachment of this e-mail.
>
> From what we have read, a new scheduler (CFS) appeared in 2.6.22. We
> are not sure whether the above behavior is caused by the new scheduler.
>
> Finally, our machine cpu information is listed in the following:
>
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 16
> model           : 2
> model name      : Quad-Core AMD Opteron(tm) Processor 2347
> stepping        : 10
> cpu MHz         : 1909.801
> cache size      : 512 KB
> physical id     : 0
> siblings        : 4
> core id         : 0
> cpu cores       : 4
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 5
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
> bogomips        : 3822.95
> TLB size        : 1024 4K pages
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate
>
> In addition, we have schedstat and sched_debug files in the /proc
> directory.
>
> Thank you for all your help to solve this puzzle. If you need more
> information, please let us know.
>
> P.S. I would like to be cc'ed on the discussions related to this problem.
>
> ###
> Jie Chen
> Scientific Computing Group
> Thomas Jefferson National Accelerator Facility
> 12000, Jefferson Ave.
> Newport News, VA 23606
>
> (757)269-5046 (office) (757)269-6248 (fax)
> [EMAIL PROTECTED]
> ###
Re: [linux-usb-devel] USB deadlock after resume
On Wed, 21 Nov 2007, Laurent Pinchart wrote:

> > > When you suspend, you cut off vbus (afaik, correct me if I'm wrong),
> > > which means your device will get disconnected. One way to avoid this is
> > > enabling CONFIG_USB_PERSIST and trying with that on.
> >
> > Suspend may or may not cut off power.
>
> I've always been confused by this.
>
> If I'm not mistaken, there are three kinds of suspend modes: autosuspend,

You mean runtime (AKA dynamic) suspend -- autosuspend is merely one type
of runtime suspend.

> suspend to RAM and suspend to disk.

The nomenclature du jour is just plain "suspend" for suspend-to-RAM and
"hibernation" for suspend-to-disk.

> In the first case I expect the USB hub
> (either root hub or external hub) to make the bus idle but not power it down.

Correct.

> In the last case I suspect the USB bus to be powered down.

Usually, but not always! Some Macs have been known to keep USB suspend
current available during hibernation.

> What controls the USB bus power on suspended ports ? Is it handled by the
> system (BIOS, ...) ? Is it allowed to power down the ports or keep them
> powered as it chooses ? What are the rules set in stone ?

There are no rules set in stone. :-) Systems are _supposed_ to keep the
ports powered during suspend, but some may fail to do so. It depends on
the firmware (i.e., BIOS for PCs) and the motherboard design.

> > If it does cut off power, resume() will never be called, instead either
> > disconnect() or reset_resume().
>
> What is reset_resume() for ? Which one will be called on resume after a bus
> power down ?

This is explained in Documentation/usb/power-management.txt. If the USB
Persist facility has been enabled for a device then reset_resume will be
called, to indicate that the device had to be reset as part of the resume
procedure. If USB Persist isn't enabled then the disconnect method will
be called and the device will be re-enumerated, exactly as though it had
been unplugged and then plugged back in.

Alan Stern
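
The rule described in the message above can be written down as a small decision function. This is only a userspace sketch of the reasoning, not kernel code: the enum, the function name and the two boolean inputs are hypothetical names chosen for illustration, and real drivers simply provide resume(), reset_resume() and disconnect() callbacks and let usbcore choose.

```c
#include <stdbool.h>

/* Hypothetical model of the callback choice described above; not kernel code. */
enum resume_action {
	ACTION_RESUME,        /* bus power was kept: ordinary resume() */
	ACTION_RESET_RESUME,  /* power was lost, USB Persist enabled */
	ACTION_DISCONNECT     /* power was lost, no persist: re-enumerate */
};

/* Pick the callback a driver should expect after the system wakes up. */
enum resume_action pick_resume_action(bool power_was_lost, bool persist_enabled)
{
	if (!power_was_lost)
		return ACTION_RESUME;
	return persist_enabled ? ACTION_RESET_RESUME : ACTION_DISCONNECT;
}
```

Under these assumptions a driver that cannot restore device state after a reset would be expected to see ACTION_DISCONNECT handling, while one that implements reset_resume() can keep its open file handles alive.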
Re: [patch 1/1] selinux: do not clear f_op when removing entries
On Wed, 21 Nov 2007, Stephen Smalley wrote:

> Do not clear f_op when removing entries since it isn't safe to do.
>
> Signed-off-by: Stephen Smalley <[EMAIL PROTECTED]>

Applied to
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6.git#for-akpm

--
James Morris
<[EMAIL PROTECTED]>
Possible bug from kernel 2.6.22 and above
Hi, there:

We have a simple pthread program that measures the synchronization
overheads for various synchronization mechanisms such as spin locks,
barriers (the barrier is implemented using queue-based barrier
algorithm) and so on. We have dual quad-core AMD opterons (barcelona)
clusters running 2.6.23.8 kernel at this moment using Fedora Core 7
distribution. Before we moved to this kernel, we had kernel 2.6.21.
These two kernels are configured identically and compiled with the same
gcc 4.1.2 compiler. Under the old kernel, we observed that the
performance of these overheads increases as the number of threads
increases from 2 to 8. The following are the values of total time and
overhead for all threads acquiring a pthread spin lock and all threads
executing a barrier synchronization call.

Kernel 2.6.21
Number of Threads                 2          4          6          8
SpinLock (Time, microseconds)     10.5618    10.58538   10.5915    10.643
         (Overhead)               0.073      0.05746    0.102805   0.154563
Barrier  (Time, microseconds)     11.020410  11.678125  11.9889    12.38002
         (Overhead)               0.531660   1.1502     1.500112   1.891617

Each thread is bound to a particular core using pthread_setaffinity_np.

Kernel 2.6.23.8
Number of Threads                 2          4          6          8
SpinLock (Time, microseconds)     14.849915  17.117603  14.4496    10.5990
         (Overhead)               4.345417   6.617207   3.949435   0.110985
Barrier  (Time, microseconds)     19.462255  20.285117  16.19395   12.37662
         (Overhead)               8.957755   9.784722   5.699590   1.869518

It is clear that the synchronization overhead increases as the number
of threads increases in the kernel 2.6.21. But the synchronization
overhead actually decreases as the number of threads increases in the
kernel 2.6.23.8 (We observed the same behavior on kernel 2.6.22 as
well). This certainly is not a correct behavior. The kernels are
configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
configuration file is in the attachment of this e-mail.

From what we have read, a new scheduler (CFS) appeared in 2.6.22. We
are not sure whether the above behavior is caused by the new scheduler.

Finally, our machine cpu information is listed in the following:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : Quad-Core AMD Opteron(tm) Processor 2347
stepping        : 10
cpu MHz         : 1909.801
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 3822.95
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

In addition, we have schedstat and sched_debug files in the /proc
directory.

Thank you for all your help to solve this puzzle. If you need more
information, please let us know.

P.S. I would like to be cc'ed on the discussions related to this problem.

###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###

CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y
CONFIG_CPUSETS=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_SYSCTL=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
Modules: Fold percpu_modcopy into module.c and get rid of the macro from hell
percpu_modcopy is defined multiple times in arch files. However, the only
use is in module.c. Put a static definition into module.c and remove
the definitions from the arch files.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 arch/ia64/kernel/module.c    |   10 ----------
 include/asm-generic/percpu.h |    8 --------
 include/asm-ia64/percpu.h    |    5 -----
 include/asm-powerpc/percpu.h |    9 ---------
 include/asm-s390/percpu.h    |    9 ---------
 include/asm-sparc64/percpu.h |    8 --------
 include/asm-x86/percpu_32.h  |    9 ---------
 include/asm-x86/percpu_64.h  |    9 ---------
 kernel/module.c              |    8 ++++++++
 9 files changed, 8 insertions(+), 67 deletions(-)

Index: linux-2.6/include/asm-generic/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-generic/percpu.h	2007-11-21 13:11:18.430858642 -0800
+++ linux-2.6/include/asm-generic/percpu.h	2007-11-21 13:11:42.871108294 -0800
@@ -26,14 +26,6 @@ extern unsigned long __per_cpu_offset[NR
 #define __get_cpu_var(var) per_cpu(var, smp_processor_id())
 #define __raw_get_cpu_var(var) per_cpu(var, raw_smp_processor_id())
 
-/* A macro to avoid #include hell... */
-#define percpu_modcopy(pcpudst, src, size)			\
-do {								\
-	unsigned int __i;					\
-	for_each_possible_cpu(__i)				\
-		memcpy((pcpudst)+__per_cpu_offset[__i],		\
-		       (src), (size));				\
-} while (0)
 #else /* ! SMP */
 
 #define DEFINE_PER_CPU(type, name) \
Index: linux-2.6/arch/ia64/kernel/module.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/module.c	2007-11-21 13:13:06.587858751 -0800
+++ linux-2.6/arch/ia64/kernel/module.c	2007-11-21 13:13:19.527309025 -0800
@@ -941,13 +941,3 @@ module_arch_cleanup (struct module *mod)
 		unw_remove_unwind_table(mod->arch.core_unw_table);
 }
 
-#ifdef CONFIG_SMP
-void
-percpu_modcopy (void *pcpudst, const void *src, unsigned long size)
-{
-	unsigned int i;
-	for_each_possible_cpu(i) {
-		memcpy(pcpudst + __per_cpu_offset[i], src, size);
-	}
-}
-#endif /* CONFIG_SMP */
Index: linux-2.6/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/percpu.h	2007-11-21 13:12:37.140358213 -0800
+++ linux-2.6/include/asm-ia64/percpu.h	2007-11-21 13:12:55.271731039 -0800
@@ -39,10 +39,6 @@
 	DEFINE_PER_CPU(type, name)
 #endif
 
-/*
- * Pretty much a literal copy of asm-generic/percpu.h, except that percpu_modcopy() is an
- * external routine, to avoid include-hell.
- */
 #ifdef CONFIG_SMP
 
 extern unsigned long __per_cpu_offset[NR_CPUS];
@@ -55,7 +51,6 @@ DECLARE_PER_CPU(unsigned long, local_per
 #define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
 #define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
 
-extern void percpu_modcopy(void *pcpudst, const void *src, unsigned long size);
 extern void setup_per_cpu_areas (void);
 extern void *per_cpu_init(void);
 
Index: linux-2.6/include/asm-powerpc/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/percpu.h	2007-11-21 13:14:21.754859049 -0800
+++ linux-2.6/include/asm-powerpc/percpu.h	2007-11-21 13:14:33.651108379 -0800
@@ -30,15 +30,6 @@
 #define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __my_cpu_offset()))
 #define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, local_paca->data_offset))
 
-/* A macro to avoid #include hell... */
-#define percpu_modcopy(pcpudst, src, size)			\
-do {								\
-	unsigned int __i;					\
-	for_each_possible_cpu(__i)				\
-		memcpy((pcpudst)+__per_cpu_offset(__i),		\
-		       (src), (size));				\
-} while (0)
-
 extern void setup_per_cpu_areas(void);
 
 #else /* ! SMP */
Index: linux-2.6/include/asm-s390/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-s390/percpu.h	2007-11-21 13:14:39.835108493 -0800
+++ linux-2.6/include/asm-s390/percpu.h	2007-11-21 13:14:48.590858137 -0800
@@ -51,15 +51,6 @@ extern unsigned long __per_cpu_offset[NR
 #define per_cpu(var,cpu) __reloc_hide(var,__per_cpu_offset[cpu])
 #define per_cpu_offset(x) (__per_cpu_offset[x])
 
-/* A macro to avoid #include hell... */
-#define percpu_modcopy(pcpudst, src, size)			\
-do {								\
-	unsigned int __i;					\
-
Re: mmap dirty limits on 32 bit kernels (Was: [BUG] New Kernel Bugs)
On Thu, 15 Nov 2007 13:47:54 -0800 (PST) Linus Torvalds wrote:
>
>But quite frankly, I refuse to even care about anything past that. If
>you have 12G (or heaven forbid, even more) in your machine, and you
>can't be bothered to just upgrade to a 64-bit CPU, then quite frankly,
>*I* personally can't be bothered to care.
>
>If they have that much RAM (and bought it a few years ago when a 64-bit
>CPU wasn't an option), they can't be poor.
>
>So the _only_ explanation today for 12GB on a 32-bit machine is
> (a) insanity
>or
> (b) being so lazy as to not bother to upgrade
>

Just around the corner...

$ ftp ftp
Connected to ftp.gwdg.de.
220-
220-Gesellschaft fuer wissenschaftliche Datenverarbeitung mbH Goettingen
220-
220-This is a Linux PC (Dell PE-2650, 2 CPUs P4/2800, 12 GB RAM)
220-running SuSE-Linux-8.2 with SuSE kernel 2.4.20-64GB-SMP.

There is no reason to upgrade the hardware - if it works, hey good then.
And I am pretty sure that a few 2 GB sticks are cheaper than a big
opteron (if you only go by that). It sure is now - and probably even
back then.
Re: __rcu_process_callbacks() in Linux 2.6
On Wed, Nov 21, 2007 at 11:57:29AM -0800, James Huang wrote:
> Paul,
>
>    I am not sure I understand your answer about using test_and_set_bit()
> in tasklet_schedule() as a memory barrier in this case.
>
>    Yes, tasklet_schedule() includes a
> test_and_set_bit(TASKLET_STATE_SCHED, &t->state) on a tasklet, but
> in this case the tasklet is a per CPU tasklet.

Memory barriers are memory barriers, regardless of what type of data
is being processed.

>    According to documentation/atomic_ops.txt, atomic op that returns a
> value has the semantics of
> "explicit memory barriers performed before and after the operation".

And test_and_set_bit() does return a value, namely the value of the
affected bit before the operation. Therefore, any correct implementation
for a CONFIG_SMP build must include memory barriers before and after.

Extracting the relevant passage from Documentation/atomic_ops.txt
between the pair of dashed lines:

------------------------------------------------------------------------

	int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

Like the above, except that these routines return a boolean which
indicates whether the changed bit was set _BEFORE_ the atomic bit
operation.

WARNING! It is incredibly important that the value be a boolean,
ie. "0" or "1". Do not try to be fancy and save a few instructions by
declaring the above to return "long" and just returning something like
"old_val & mask" because that will not work.

For one thing, this return value gets truncated to int in many code
paths using these interfaces, so on 64-bit if the bit is set in the
upper 32-bits then testers will never see that.

One great example of where this problem crops up are the thread_info
flag operations. Routines such as test_and_set_ti_thread_flag() chop
the return value into an int. There are other places where things
like this occur as well.

These routines, like the atomic_t counter operations returning values,
require explicit memory barrier semantics around their execution. All
memory operations before the atomic bit operation call must be made
visible globally before the atomic bit operation is made visible.
Likewise, the atomic bit operation must be visible globally before any
subsequent memory operation is made visible. For example:

	obj->dead = 1;
	if (test_and_set_bit(0, &obj->flags))
		/* ... */;
	obj->killed = 1;

The implementation of test_and_set_bit() must guarantee that
"obj->dead = 1;" is visible to cpus before the atomic memory operation
done by test_and_set_bit() becomes visible. Likewise, the atomic
memory operation done by test_and_set_bit() must become visible before
"obj->killed = 1;" is visible.

------------------------------------------------------------------------

> If I understand it correctly, this means that, for example,
>
>    atomic_t aa = ATOMIC_INIT(0);
>    int X = 1;
>    int Y = 2;
>
> CPU 0:
>    update X=100;
>    atomic_inc_return(&aa);
>    update Y=200;

But, yes, atomic_inc_return() does indeed force ordering.

> Then CPU 1 will always see X=100 before it sees the new value of aa (1), and
> CPU 1 will always
> see the new value of aa (1) before it sees Y=200.

Yep. And CPU 1 will also see any preceding unrelated assignment
prior to the new value of aa as well. And it is not just preceding
stores. See the sentence from Documentation/atomic_ops.txt:

	All memory operations before the atomic bit operation call
	must be made visible globally before the atomic bit operation
	is made visible.

Both stores -and- loads.

> This ordering semantics does not apply to the scenario in our discussion.
> For one thing, the rcu tasklet is a per CPU tasklet. So obviously no other
> CPU's will even read its t->state.
>
> Am I still missing something?

Yep -- the test_and_set_bit() operation has no clue who else might or
might not be reading t->state. Besides, tasklets need not be allocated
on a per-CPU basis, and therefore tasklet_schedule() must be prepared to
deal with other CPUs concurrently manipulating t->state, for example,
via the tasklet_disable() interface.

Another thing that might help is to fill in the RCU read-side critical
section that CPU 2 would have to execute (after all the stuff you
currently have it executing), along with the RCU update that would need
to precede CPU 2's call_rcu() call. I have done this in your example
code below.

Note that in order for a failure to occur, CPU 1 must reach /* A */
before CPU 2 reaches /* B */.

One key point is that tasklet_schedule()'s memory ordering affects this
preceding code for CPU 2. The second key point is that acquiring and
releasing a lock acts as a barrier as well (though a limited one). The
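
For readers without a kernel tree at hand, the obj->dead / obj->killed example quoted from Documentation/atomic_ops.txt can be approximated in userspace with C11 atomics. This is a sketch under stated assumptions, not the kernel implementation: the struct layout and the two function names are hypothetical, and a sequentially consistent atomic_fetch_or() is used as a stand-in for the kernel's fully ordered test_and_set_bit(). The C11 memory model is not identical to the kernel's, so treat the ordering comments as an approximation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object mirroring the obj->dead / obj->flags / obj->killed example. */
struct obj {
	int dead;
	int killed;
	atomic_ulong flags;
};

/*
 * Userspace stand-in for test_and_set_bit(): sets bit nr and returns the
 * old bit as a strict boolean (0 or 1), as the quoted WARNING requires.
 * atomic_fetch_or() defaults to memory_order_seq_cst, which has both
 * acquire and release semantics for this read-modify-write.
 */
bool sketch_test_and_set_bit(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;
	unsigned long old = atomic_fetch_or(addr, mask);

	return (old & mask) != 0;
}

/* The store to dead is ordered before the bit operation, killed after it. */
bool sketch_mark_dead(struct obj *o)
{
	bool was_set;

	o->dead = 1;
	was_set = sketch_test_and_set_bit(0, &o->flags);
	o->killed = 1;
	return was_set;
}
```

Another thread that observes bit 0 set through an acquire (or stronger) load of flags is then guaranteed by C11 to also observe dead == 1, which is the userspace analogue of the "visible before" wording in the passage.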
Re: Identifying a specific affected file on Ext3 on a top of raid0 of raid1s
On Wed, Nov 21, 2007 at 03:57:53PM -0500, [EMAIL PROTECTED] wrote:
> I have a rather nasty situation developing on one of the big 24x7
> production database servers. It seems that a batch of drives in one of the
> servers started to fail.
>
> The file servers are ext3fs on a top of raid-0 over a pair of raid-1 mirrors,
> with each of raid-1 mirrors having two drives. The issue is that all four
> drives are developing errors.
>
> Is there a way to determine what files are affected if I know the LBA# of the
> errors on individual drives?
>
> It is running under 2.6.20.x series of kernels.

Most likely debugfs can tell you what filesystem block is used by a file
and which file is using a given filesystem block, although you would
still have to translate the disk block number through the raid layers
and partitions to find where it is relative to the beginning of the
partition the filesystem is on.

--
Len Sorensen
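
The translation Len describes is mostly integer arithmetic, and can be sketched as below. Everything here rests on assumptions that are not stated in the thread and would have to be checked against the real arrays: the raid-1 data area starts at sector 0 of each component partition (as with old-style metadata stored at the end of the device), the raid-0 has equal-sized members so plain striping applies, sectors are 512 bytes, and the filesystem sits directly on the raid-0 device. The function names and parameters are hypothetical.

```c
#include <stdint.h>

/* Sector inside the raid-1 component, given the failing LBA on the whole disk. */
uint64_t member_sector_from_lba(uint64_t disk_lba, uint64_t partition_start_lba)
{
	return disk_lba - partition_start_lba;
}

/*
 * Map a sector on raid-0 member member_index to a sector of the striped
 * array, assuming n_members equal-sized members and a fixed chunk size.
 * A raid-1 mirror presents the same sector number as its components,
 * so the raid-1 layer needs no arithmetic under the assumptions above.
 */
uint64_t raid0_array_sector(uint64_t member_sector, unsigned int member_index,
			    unsigned int n_members, uint64_t chunk_sectors)
{
	uint64_t stripe = member_sector / chunk_sectors;
	uint64_t offset = member_sector % chunk_sectors;

	return (stripe * n_members + member_index) * chunk_sectors + offset;
}

/* Filesystem block containing a given 512-byte array sector. */
uint64_t fs_block_of_sector(uint64_t array_sector, uint32_t fs_block_bytes)
{
	return array_sector * 512 / fs_block_bytes;
}
```

With the filesystem block in hand, the debugfs commands `icheck <block>` (block to inode) and `ncheck <inode>` (inode to path name) can then name the affected file. For example, with 64 KiB chunks (128 sectors), a bad LBA 363 on a disk whose partition starts at LBA 63, belonging to the second mirror, maps to member sector 300, array sector 684 and, with 4 KiB blocks, filesystem block 85.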
Modules: Include sections.h to avoid defining linker variables explicitly
Module.c should not define linker variables on its own. We have an include
file for that.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 kernel/module.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c	2007-11-21 13:02:39.415358527 -0800
+++ linux-2.6/kernel/module.c	2007-11-21 13:03:16.534858271 -0800
@@ -46,6 +46,7 @@
 #include
 #include
 #include
+#include
 
 extern int module_sysfs_initialized;
 
@@ -338,9 +339,6 @@ static inline unsigned int block_size(in
 	return val;
 }
 
-/* Created by linker magic */
-extern char __per_cpu_start[], __per_cpu_end[];
-
 static void *percpu_modalloc(unsigned long size, unsigned long align,
 			     const char *name)
 {
Re: network driver usage count
On Wed, 21 Nov 2007 20:45:11 +0100
Ferenc Wagner <[EMAIL PROTECTED]> wrote:

> Hi!
>
> Under 2.6.23.1, my lsmod output shows this:
>
> $ lsmod | grep tg3
> tg3 100580 0
>
> The usage count is zero, even though it drives my two physical
> interfaces:
>
> $ ls -l /sys/class/net/eth-gb?/device/driver
> lrwxrwxrwx 1 root root 0 2007-11-21 19:58
> /sys/class/net/eth-gb1/device/driver -> ../../../bus/pci/drivers/tg3
> lrwxrwxrwx 1 root root 0 2007-11-21 19:58
> /sys/class/net/eth-gb2/device/driver -> ../../../bus/pci/drivers/tg3
>
> These interfaces are up and bonded together, but that doesn't seem to
> matter at all. I also checked other machines, the network driver
> (tg3, e1000) usage counts are always zero under various recent 2.6
> kernels, but nonzero under 2.4.21 for example.
>
> And really, the module could be removed, cutting my ssh session. :)
>
> Was this made possible intentionally? If yes, why?

Yes, so devices can be removed at anytime.

--
Stephen Hemminger <[EMAIL PROTECTED]>
Re: Modules: Handle symbols that have a zero value
On Wed, 21 Nov 2007, Mathieu Desnoyers wrote:

> return -ENOENT;
>
> directly ?
>
> (ERR_PTR() in linux/err.h is a simple cast from long to void*).

Right and there is also IS_ERR_VALUE. Thanks for the feedback. New version:


Modules: Handle symbols that have a zero value

The module subsystem cannot handle symbols that are zero. If symbols are
present that have a zero value then the module resolver prints out a
message that these symbols are unresolved.

Use ERR_PTR to return an error code instead of 0.

This is a bit awkward since the addresses are handled as unsigned longs.
So we need to convert them everywhere.

Cc: Mathieu Desnoyers <[EMAIL PROTECTED]>
Cc: Kay Sievers <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 kernel/module.c |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c	2007-11-21 12:58:33.095608448 -0800
+++ linux-2.6/kernel/module.c	2007-11-21 13:00:30.199108674 -0800
@@ -285,7 +285,7 @@ static unsigned long __find_symbol(const
 		}
 	}
 	DEBUGP("Failed to find symbol %s\n", name);
-	return 0;
+	return -ENOENT;
 }
 
 /* Search for module by name: must hold module_mutex. */
@@ -756,7 +756,7 @@ void __symbol_put(const char *symbol)
 	const unsigned long *crc;
 
 	preempt_disable();
-	if (!__find_symbol(symbol, &owner, &crc, 1))
+	if (IS_ERR_VALUE(__find_symbol(symbol, &owner, &crc, 1)))
 		BUG();
 	module_put(owner);
 	preempt_enable();
@@ -902,7 +902,8 @@ static inline int check_modstruct_versio
 	const unsigned long *crc;
 	struct module *owner;
 
-	if (!__find_symbol("struct_module", &owner, &crc, 1))
+	if (IS_ERR_VALUE(__find_symbol("struct_module",
+						&owner, &crc, 1)))
 		BUG();
 	return check_version(sechdrs, versindex, "struct_module", mod,
 			     crc);
@@ -955,7 +956,7 @@ static unsigned long resolve_symbol(Elf_
 		/* use_module can fail due to OOM, or module unloading */
 		if (!check_version(sechdrs, versindex, name, mod, crc) ||
 		    !use_module(mod, owner))
-			ret = 0;
+			ret = -EINVAL;
 	}
 	return ret;
 }
@@ -1348,14 +1349,16 @@ static int verify_export_symbols(struct
 	const unsigned long *crc;
 
 	for (i = 0; i < mod->num_syms; i++)
-		if (__find_symbol(mod->syms[i].name, &owner, &crc, 1)) {
+		if (!IS_ERR_VALUE(__find_symbol(mod->syms[i].name,
+							&owner, &crc, 1))) {
 			name = mod->syms[i].name;
 			ret = -ENOEXEC;
 			goto dup;
 		}
 
 	for (i = 0; i < mod->num_gpl_syms; i++)
-		if (__find_symbol(mod->gpl_syms[i].name, &owner, &crc, 1)) {
+		if (!IS_ERR_VALUE(__find_symbol(mod->gpl_syms[i].name,
+							&owner, &crc, 1))) {
 			name = mod->gpl_syms[i].name;
 			ret = -ENOEXEC;
 			goto dup;
@@ -1405,7 +1408,7 @@ static int simplify_symbols(Elf_Shdr *se
 					   strtab + sym[i].st_name, mod);
 
 			/* Ok if resolved. */
-			if (sym[i].st_value != 0)
+			if (!IS_ERR_VALUE(sym[i].st_value))
 				break;
 			/* Ok if weak. */
 			if (ELF_ST_BIND(sym[i].st_info) == STB_WEAK)
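
The idea behind the patch, returning a small negative errno in an unsigned long so that a legitimate zero address is no longer mistaken for "not found", can be tried out in userspace. The sketch below is not the kernel code: the symbol table, the `sketch_` names and the lookup function are made up for illustration, and the 4095 limit mirrors the MAX_ERRNO convention used by linux/err.h.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Values in the top 4095 of the unsigned long range are treated as -errno. */
#define SKETCH_MAX_ERRNO 4095L

bool sketch_is_err_value(unsigned long x)
{
	return x >= (unsigned long)-SKETCH_MAX_ERRNO;
}

/* Hypothetical symbol table; note the symbol whose address is zero. */
struct sketch_sym {
	const char *name;
	unsigned long value;
};

static const struct sketch_sym sketch_table[] = {
	{ "zero_based_sym", 0UL },
	{ "other_sym", 0x1000UL },
};

/* Returns the symbol value, or (unsigned long)-ENOENT if it is unknown. */
unsigned long sketch_find_symbol(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++)
		if (strcmp(sketch_table[i].name, name) == 0)
			return sketch_table[i].value;
	return (unsigned long)-ENOENT;
}
```

With a plain "return 0 on failure" convention, a caller could not tell zero_based_sym apart from a missing symbol; with the error-value check the two cases are distinct, at the cost of reserving the last 4095 addresses of the range.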
Identifying a specific affected file on Ext3 on a top of raid0 of raid1s
Hi,

I have a rather nasty situation developing on one of the big 24x7
production database servers. It seems that a batch of drives in one of
the servers started to fail.

The file servers are ext3fs on a top of raid-0 over a pair of raid-1
mirrors, with each of raid-1 mirrors having two drives. The issue is that
all four drives are developing errors.

Is there a way to determine what files are affected if I know the LBA#
of the errors on individual drives?

It is running under 2.6.20.x series of kernels.

Alex