Re: CONFIG_IRQBALANCE for 64-bit x86 ?

2007-11-21 Thread Nick Piggin
On Wednesday 21 November 2007 06:07, Arjan van de Ven wrote:
> On Wed, 21 Nov 2007 02:43:46 +1100

> > Of course it is, if you want to effectively use your resources.
> > Imagine if the task balancer only polled once every 10s.
>
> but unlike the task balancer, moving an irq is really expensive.
> (at least for networking and a few other similar systems)
> And no, it's not just the cache bouncing, it's the entire reassembly of
> multiple packets etc etc that gets really messy.

Actually a blanket statement like that is just wrong. Moving a
network interrupt yes is probably quite expensive, but it is
about the worst case one to move. What's more, moving tasks between
NUMA nodes could easily be many orders of magnitude worse than the
transient slowdown of moving irqs.

Furthermore, what you say doesn't really seem to be an argument
for doing it in userspace or an argument against moving IRQs. It
actually shows that there are complex, hardware and kernel
implementation dependent issues, all of which suggest it is better
to be in kernel.


> > Some constants
> > that make assumptions about the machine it is running on and may or
> > may not agree with what the task scheduler is trying to do.
> >
> > Some
> > classification stuff which makes guesses about how a particular bit of
>
> you misunderstood this; the classification stuff is there to spread
> different irqs of similar class (say networking) over multiple
> cores/packages. Doing this is a system resource balancing proposition
> not just a cpu time one.
>
> You may think this spreading based on classification is a mistake, but
> it's based on the following observation:

No, I'm not misunderstanding, nor do I think it is a mistake. But it is
something which the kernel and the devices themselves should have
better knowledge of. You have a process which is reading off disk
and sending to a network interface? You may well want to put the
process and the disk interrupt and the network interrupt all on
the same CPU.

[snip]

> We used to rebalance this frequently in the 2.4-early kernels based on
> a patch from Ingo. Turned out to be a really really bad idea;
> performance really tanked.

To reiterate, I do not think that IRQs should be moved more frequently.
I think the kernel is in the position to know far better than userspace
about irq balancing.


> > hardware or device driver wants to be balanced. Hacks to poll
> > hotplugging and topology changes.
>
> "hacks" as in "rescan".. so falls under the topology code and would
> indeed be changed to hook into hotplug inside the kernel; just
> different complexity.

ie. simpler. All the topology stuff would be far simpler.


> > I'm still convinced. Who isn't?
>
> I know you can do SOME sort of balancing in the kernel. But please
> describe the algorithm you would use; I started out with the same
> thought but when it got down to the algorithm to me at least it became
> clear "we really don't want this complexity in kernel mode".

I'd rather not go this far into handwaving. I'm not saying that
I know exactly how it should work right now. I'm questioning the
established viewpoint that irq balancing belongs in userspace.

For that matter, I guess from the results you get, it's not terribly
bad to do in userspace or anything. But I think it can be done in
kernel.

Policy... I think that's a misused argument. The "policy" of any
kernel code I write is to utilise the hardware as efficiently as
possible within restrictions (eg. fairness, permissions). Setting
those restrictions is the realm of userspace, otherwise IMO it is
fine to go in kernel.

Using the same argument, task balancing and even scheduling is
policy, so is page reclaim, page writeback, filesystem block
allocation, etc. Now many of those things can be directed or
restricted somehow from userspace, and in-kernel irq balancing
would be no different.


> > > not needed. (also because on single socket machines, the
> > > irqbalancer basically has a one-shot task because there balancing
> > > is effectively a static setup)
> >
> > I don't think that's a good argument for not having it in kernel.
>
> if you don't care about kernel unpagable memory footprint, fine.
> Others do.

It would be a couple of K, right? I mean it would be probably less than
half the code of irqbalance because of the parsing and topology stuff.

Also, I don't think the one-shot behaviour on single socket machines is
good policy at all, and it can't capture dynamic behaviour at all.


> > > I listed a few;
> > > 1) it's policy
> >
> > I don't think that's such a constructive point. Task balancing is
> > policy in exactly the same way.
>
> not really; CFS has shown that the only real policy in task
> balancing is the fairness part,

Ahh, hate to get off topic, but let's not perpetuate this myth.
It wasn't Con, or CFS, or anything that showed fairness is some
great new idea. Actually I was arguing for fairness first,
against both Con and Ingo, way back when the old scheduler was
having so much 

Re: High priority tasks break SMP balancer?

2007-11-21 Thread Micah Dowty
On Tue, Nov 20, 2007 at 10:47:52PM +0100, Dmitry Adamushko wrote:
> btw., what's your system? If I recall right, SD_BALANCE_NEWIDLE is on
> by default for all configs, except for NUMA nodes.

It's a dual AMD64 Opteron.

So, I recompiled my 2.6.23.1 kernel without NUMA support, and with
your patch for scheduling domain flags in /proc. It looks like with
NUMA disabled, my test case no longer shows the CPU imbalance
problem. Cool.

With NUMA disabled (and my test running smoothly), the flags show that
SD_BALANCE_NEWIDLE is set:

[EMAIL PROTECTED]:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
55

Next I turned it off:

[EMAIL PROTECTED]:~# echo 53 > /proc/sys/kernel/sched_domain/cpu0/domain0/flags
[EMAIL PROTECTED]:~# echo 53 > /proc/sys/kernel/sched_domain/cpu1/domain0/flags

Oddly enough, I still don't observe the CPU imbalance problem.

Now I reboot into a kernel which has NUMA re-enabled but which is
otherwise identical. I verify that now I can reproduce the CPU
imbalance again.

[EMAIL PROTECTED]:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
1101

Now I set cpu[10]/domain0/flags to 1099, and the imbalance immediately
disappears. I can reliably cause the imbalance again by setting it
back to 1101, and remove the imbalance by setting them to 1099.
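
The flag values quoted above can be compared bit by bit. The sketch below
is only an illustration: it assumes SD_BALANCE_NEWIDLE has the value 2,
which is inferred from the 55 -> 53 change earlier in this message and
should be checked against sched.h for the kernel in use. The helper names
are made up for this example.

```c
#include <stdbool.h>

/* Assumption: SD_BALANCE_NEWIDLE == 2, inferred from the 55 -> 53 change
 * in this message; verify against include/linux/sched.h for your kernel. */
#define ASSUMED_SD_BALANCE_NEWIDLE 0x2u

/* True if a flags value (as printed by the sched_domain flags file)
 * has the assumed NEWIDLE bit set. */
static bool newidle_is_set(unsigned int flags)
{
    return (flags & ASSUMED_SD_BALANCE_NEWIDLE) != 0;
}

/* Bits that differ between two flags values. */
static unsigned int changed_bits(unsigned int before, unsigned int after)
{
    return before ^ after;
}
```

Under that assumption, 55 has the bit set and 53 does not, 1101 does not
have it and 1099 does. Note that changed_bits(1101, 1099) is 6, so that
change flips the bit with value 4 as well as the one with value 2.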

Do these results make sense? I'm not sure I understand how
SD_BALANCE_NEWIDLE could be the whole story, since my /proc/schedstat
graphs do show that we continuously try to balance on idle, but we
can't successfully do so because the idle CPU has a much higher load
than the non-idle CPU. I don't understand how the problem I'm seeing
could be related to the time at which we run the balancer, rather than
being related to the load average calculation.

Assuming the CPU imbalance I'm seeing is actually related to
SD_BALANCE_NEWIDLE being unset, I have a couple of questions:

 - Is this intended/expected behaviour for a machine without
   NEWIDLE set? I'm not familiar with the rationale for disabling
   this flag on NUMA systems.

 - Is there a good way to detect, without any kernel debug flags
   set, whether the current machine has any scheduling domains
   that are missing the SD_BALANCE_NEWIDLE bit? I'm looking for
   a good way to work around the problem I'm seeing with VMware's
   code. Right now the best I can do is disable all thread priority
   elevation when running on an SMP machine with Linux 2.6.20 or
   later.

Thank you again for all your help.
--Micah
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Posix file capabilities in 2.6.24rc2; now 2.6.24-rc3

2007-11-21 Thread Andrew Morgan

Serge E. Hallyn wrote:
> The problem is that when you run a setuid binary, its pP and pE are
> fully raised.  The following patch fixes it for me.  Chris, does it fix
> your problem?  Andrew, am I again confusing myself and doing something
> unsafe?

I think this is yet another example of the fragile mess that is UID
emulation with capabilities. Your patch is an example of privilege
escalation - luser can kill a more-capable process. In the kill CONT
case we reached the opposite conclusion to this one. As was the case
then, I didn't disagree then :*). If it meets folk's expectations, then
this is probably a good patch...

> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -543,6 +543,9 @@ int cap_task_kill(struct task_struct *p, struct siginfo *info,
>   if (capable(CAP_KILL))
>   return 0;
>  
> + if (p->euid==0 && p->uid==current->uid)
> + return 0;
> +

It's late and I'm obviously tired, but is there any reason not to simply use:

 if (p->uid == current->uid)
 return 0;

Whatever the case, could you put the new code closer to the sig ==
SIGCONT test? The capability tests are at the end of cap_task_kill() and
this new check breaks that pattern.
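
To make the difference between the two conditions concrete, here is a
small userspace model. It is only a sketch: struct task_ids and both
function names are hypothetical stand-ins, not kernel code, and it
ignores everything else cap_task_kill() does.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two task fields the checks read. */
struct task_ids {
    unsigned int uid;   /* real user ID */
    unsigned int euid;  /* effective user ID */
};

/* Condition from the quoted patch: the target runs with euid 0 and
 * has the same real UID as the sender. */
static bool patch_check(const struct task_ids *target,
                        const struct task_ids *sender)
{
    return target->euid == 0 && target->uid == sender->uid;
}

/* Simpler condition suggested in this reply: same real UID only. */
static bool simple_check(const struct task_ids *target,
                         const struct task_ids *sender)
{
    return target->uid == sender->uid;
}
```

The two agree whenever the target's effective UID is 0. They differ only
for a target with the sender's real UID and a non-zero effective UID,
which the simpler check also allows.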

Cheers

Andrew



In /proc/cpuinfo, all processor items show "0"

2007-11-21 Thread youquan_song
Build kernel 2.6.24-rc3,   cat /proc/cpuinfo,  all processor items show "0":

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :Genuine Intel(R) CPU 3.40GHz
stepping: 8
cpu MHz : 3391.555
cache size  : 16384 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16
xtpr lahf_lm
bogomips: 7465.18
clflush size: 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :Genuine Intel(R) CPU 3.40GHz
stepping: 8
cpu MHz : 3391.555
cache size  : 16384 KB
physical id : 2
siblings: 4
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16
xtpr lahf_lm
bogomips: 6782.81
clflush size: 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :Genuine Intel(R) CPU 3.40GHz
stepping: 8
cpu MHz : 3391.555
cache size  : 16384 KB
physical id : 3
siblings: 4
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16
xtpr lahf_lm
bogomips: 6782.80
clflush size: 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :Genuine Intel(R) CPU 3.40GHz
stepping: 8
cpu MHz : 3391.555
cache size  : 16384 KB
physical id : 1
siblings: 4
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16
xtpr lahf_lm
bogomips: 6782.79
clflush size: 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :Genuine Intel(R) CPU 3.40GHz
stepping: 8
cpu MHz : 3391.555
cache size  : 16384 KB
physical id : 2
siblings: 4
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16
xtpr lahf_lm
bogomips: 6782.80
clflush size: 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :Genuine Intel(R) CPU 3.40GHz
stepping: 8
cpu MHz : 3391.555
cache size  : 16384 KB
physical id : 3
siblings: 4
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx est tm2 cid cx16
xtpr lahf_lm
bogomips: 6782.80
clflush size: 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :Genuine Intel(R) CPU 3.40GHz
stepping: 8
cpu MHz : 3391.555
cache size  : 16384 KB
physical id : 0
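
A quick way to check for the symptom reported above is to scan
cpuinfo-style text for repeated "processor" numbers. The function below
is a minimal sketch; its name and the MAX_CPU_IDS limit are arbitrary
choices for this example, and it expects the caller to have read the
file contents into a string already.

```c
#include <stdio.h>
#include <string.h>

#define MAX_CPU_IDS 4096  /* arbitrary upper bound for this sketch */

/* Count "processor : N" lines whose N already appeared on an earlier
 * "processor" line in the same text. */
static int count_repeated_processor_ids(const char *text)
{
    unsigned char seen[MAX_CPU_IDS];
    int repeats = 0;
    const char *line = text;

    memset(seen, 0, sizeof(seen));
    while (line != NULL && *line != '\0') {
        unsigned int id;

        /* Only lines that start with "processor" match this format. */
        if (sscanf(line, "processor : %u", &id) == 1 && id < MAX_CPU_IDS) {
            if (seen[id])
                repeats++;
            seen[id] = 1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return repeats;
}
```

On a healthy system every logical CPU has a distinct number, so the
result should be 0; output like the listing above would give a non-zero
count.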

Re: [kvm-devel] [PATCH 3/3] virtio PCI device

2007-11-21 Thread Avi Kivity

Zachary Amsden wrote:
> On Wed, 2007-11-21 at 09:13 +0200, Avi Kivity wrote:
> > Where the device is implemented is an implementation detail that should
> > be hidden from the guest, isn't that one of the strengths of
> > virtualization?  Two examples: a file-based block device implemented in
> > qemu gives you fancy file formats with encryption and compression, while
> > the same device implemented in the kernel gives you a low-overhead path
> > directly to a zillion-disk SAN volume.  Or a user-level network device
> > capable of running with the slirp stack and no permissions vs. the
> > kernel device running copyless most of the time and using a dma engine
> > for the rest but requiring you to be good friends with the admin.
> >
> > The user should expect zero reconfigurations moving a VM from one model
> > to the other.
>
> I think that is pretty insightful, and indeed, is probably the only
> reason we would ever consider using a virtio based driver.
>
> But is this really a virtualization problem, and is virtio the right
> place to solve it?  Doesn't I/O hotplug with multipathing or NIC teaming
> provide the same infrastructure in a way that is useful in more than
> just a virtualization context?

With the aid of a dictionary I was able to understand about half the 
words in the last sentence.  Moving from device to device using 
hotplug+multipath is complex to configure, available on only some 
guests, uses rarely-exercised paths in the guest OS, and only works for 
a few types of devices (network and block).  Having host independence in 
the device means you can change the device implementation for, say, a 
display driver (consider, for example, a vmgl+virtio driver, which can 
be implemented in userspace or tunneled via virtio-over-tcp to some 
remote display without going through userspace, without the guest 
knowing about it).


--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/9]: Reduce Log I/O latency

2007-11-21 Thread Matt Mackall
On Thu, Nov 22, 2007 at 02:41:06PM +1100, David Chinner wrote:
> On Wed, Nov 21, 2007 at 08:57:27PM -0600, Matt Mackall wrote:
> > On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> > > In all the cases that I know of where ppl are using what could
> > > be considered real-time I/O (e.g. media environments where they
> > > do real-time ingest and playout from the same filesystem) the
> > > real-time ingest processes create the files and do pre-allocation
> > > before doing their I/O. This I/O can get held up behind another
> > > process that is not real time that has issued log I/O. 
> > > 
> > > Given there is no I/O priority inheritance and having log I/O stall
> > > will stall the entire filesystem, we cannot allow log I/O to
> > > stall in real-time environments. Hence it must have the highest
> > > possible priority to prevent this.
> > 
> > I've seen PVRs that would be upset by this. They put media on one
> > filesystem and database/apps/swap/etc. on another, but have everything
> > on a single spindle. Stalling a media filesystem read for a write
> > anywhere else = fail.
> 
> Sounds like the PVR is badly designed to me. If a write can cause a
> read to miss a playback deadline, then you haven't built enough
> buffering into your playback application.

Normally it's not a problem. But your proposed change can push a
working system into a non-working system by making non-critical I/O on
an unrelated filesystem have higher priority than the thing that -actually
has real-time constraints-.

In other words, I/O priority is per-spindle and not per-filesystem and
thus this change has consequences that leak outside the filesystem in
question. That's bad.

I'd further add that the kernel internals probably shouldn't wander
into RT priority levels unless it's actually doing priority
inheritance, otherwise it's quite likely to upset the careful
considerations of the RT system designer's priority schemes. For
instance, a log-heavy but otherwise non-RT load with this patch could
possibly completely starve direct I/O to another partition even though
it's marked RT, thus livelocking the system.

To the general PVR problem: they typically want to work with a minimum
of buffering to maximize responsiveness to user commands (fast
forward, jump 30 seconds, play in reverse). Now consider that you're
recording and playing back multiple HD streams on low-margin set-top
hardware and you'll see that making this work -at all- means lots of
I/O tuning.

-- 
Mathematics is the supreme nostalgia of our time.


Re: Laptop keyboard unusable when ACPI is active

2007-11-21 Thread [EMAIL PROTECTED]

Hi all,

Some updates about this issue (see also bug 9147
http://bugzilla.kernel.org/show_bug.cgi?id=9147 ).


The 'ac', 'battery' and 'thermal' modules (compiled as stand-alone) do 
cause the bug; it suffices that one of them (or any set of them) is 
loaded to trigger the bug either immediately or after some time.

If none of them is loaded into memory, the bug does not happen.
Also, the 'battery' module does not generate any system messages, although
the problem occurs just the same.


The 'thermal' module instead, when loaded with 'modprobe thermal',
causes the Enter keypress that ran the command to be repeated
indefinitely in any terminal. This is currently a perfectly reproducible
test case for bug 9147.


The bug has been confirmed by at least one other user (with a different
hardware configuration); please reply if you can either help address the
bug or confirm it.


The best currently known workaround for this bug is to compile all the
above-mentioned ACPI modules as stand-alone modules and not (auto)load them
(losing their vital functionality, since we are talking about laptops
here; see http://gentoo-wiki.com/HARDWARE_Maxdata_Pro_7000_DX for an
example of affected hardware).


It is also important to note that this bug always comes with bug 8740 
http://bugzilla.kernel.org/show_bug.cgi?id=8740 (also confirmed and also 
an ACPI issue).


Best regards,
--
 Daniele C.


[EMAIL PROTECTED] wrote:

I am posting this message just to say that this bug is being addressed
on the bug tracker:
http://bugzilla.kernel.org/show_bug.cgi?id=9147

Regards,
--
  Daniele C.


[EMAIL PROTECTED] wrote:

[EMAIL PROTECTED] wrote:
  


Kernel: 2.6.22-r5
Kernel option: i8042.nomux=1

  

I am now using kernel 2.6.22-r8 (Gentoo) and the following kernel options:

i8042.nomux=1 acpi=off

I have tried kernel 2.6.23-rc9 but the problem is still there.

  


The problem which still remains, and I can't fix or work it around, is
witnessed by the below dmesg lines:
-
atkbd.c: Unknown key released (translated set 2, code 0xe0 on
isa0060/serio0).
atkbd.c: Use 'setkeycodes e060 ' to make it known.
atkbd.c: Unknown key released (translated set 2, code 0xe0 on
isa0060/serio0).
atkbd.c: Use 'setkeycodes e060 ' to make it known.
atkbd.c: Unknown key released (translated set 2, code 0xe0 on
isa0060/serio0).
atkbd.c: Use 'setkeycodes e060 ' to make it known.
-
The release event for some keys is never caught, so all sorts of
trouble happens if, for example, I use the Del key and it gets stuck, or if
I use the Ctrl key and it never gets released... pressing the stuck key
again brings it back to the proper state.

  

With acpi=off the problem is totally worked around.

  


Can somebody please give me some clues about this issue, and possible
solutions? I have been searching the web for a couple of weeks and
seems like it is a common trouble of notebook users, but nobody has
yet published a solution.

  

I am trying to find a path myself in this issue - which dates back to at
least 2005 and has never been resolved.

I would now try some other kernel parameter in order to preserve ACPI
functionality and possibly prevent ACPI from messing up the keyboard IRQs.
Can somebody please give me instructions regarding the correct tests
(regarding kernel parameters and/or anything else) to perform in order
to better isolate the issue?

Related Gentoo bug tracker item:
http://bugs.gentoo.org/show_bug.cgi?id=194781

Other messages about the same kernel bug (many more can be found
googling around, and no solution yet):
https://lists.linux-foundation.org/pipermail/bugme-new/2005-January/011736.html
http://dev.laptop.org/ticket/2401

Regards,
--
  Daniele C.



Re: [PATCH] sata_nv: fix ADMA ATAPI issues with memory over 4GB (v2)

2007-11-21 Thread Tejun Heo
Hello, Robert.

Robert Hancock wrote:
> This fixes some problems with ATAPI devices on nForce4 controllers in ADMA
> mode on systems with memory located above 4GB. We need to delay setting the
> 64-bit DMA mask until the PRD table and padding buffer are allocated so that
> they don't get allocated above 4GB and break legacy mode (which is needed for
> ATAPI devices). Also, explicitly set a 32-bit DMA mask before allocating the
> legacy buffers since setting the DMA mask affects both ports and we need to
> ensure the second port's buffers are allocated properly (fixes a problem
> with the previous version of this patch).
> 
> Signed-off-by: Robert Hancock <[EMAIL PROTECTED]>
> 
> + /* Ensure DMA mask is set to 32-bit before allocating legacy PRD and
> +pad buffers */
> + pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
> + pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
[--snip--]
> + pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
> + pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));

I'm probably being paranoid here but please add error checks.  Just
checking return value and returning error suffices.
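
The pattern being asked for can be sketched in userspace as below. This
is a model only, not driver code: the function-pointer type, the wrapper
and the two stubs are hypothetical names for this example. The one
property borrowed from the kernel helpers is that they return 0 on
success and a negative errno on failure.

```c
#include <errno.h>

/* Hypothetical stand-in for the signature of pci_set_dma_mask() and
 * pci_set_consistent_dma_mask(): 0 on success, negative errno on failure. */
typedef int (*set_mask_fn)(void *pdev, unsigned long long mask);

/* Set both masks and hand the first failure back to the caller
 * instead of ignoring it. */
static int set_dma_masks_checked(void *pdev, unsigned long long mask,
                                 set_mask_fn set_streaming,
                                 set_mask_fn set_consistent)
{
    int rc;

    rc = set_streaming(pdev, mask);
    if (rc)
        return rc;

    rc = set_consistent(pdev, mask);
    if (rc)
        return rc;

    return 0;
}

/* Illustrative stubs only, so the sketch can be exercised. */
static int stub_mask_ok(void *pdev, unsigned long long mask)
{
    (void)pdev;
    (void)mask;
    return 0;
}

static int stub_mask_fail(void *pdev, unsigned long long mask)
{
    (void)pdev;
    (void)mask;
    return -EIO;
}
```

In the driver itself the equivalent would be checking each call's return
value at the point where the mask is switched and returning it from the
enclosing function.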

Thanks.

-- 
tejun


Re: Problem with ufs nextstep in 2.6.18 (debian)

2007-11-21 Thread Evgeniy Dushistov
On Tue, Nov 20, 2007 at 12:29:03PM -0800, Dave Bailey wrote:
> This problem has been around since kernel 2.6.16, and I see it in
> 2.6.23.1-10.fc7. It occurs in the ufs_check_page function of ufs/dir.c
> at the Espan test, which seems unnecessary for NextStep/OpenStep
> files systems. The following patch preserves the test for other file
> systems and makes the mount useful for NextStep/OpenStep:
> (against the 2.6.23.1-10.fc7 source tree)
>
> [EMAIL PROTECTED] diff dir.c dir.c.orig
> 108,110d107
> <   unsigned mnext = UFS_SB(sb)->s_mount_opt &
> < (UFS_MOUNT_UFSTYPE_NEXTSTEP || UFS_MOUNT_UFSTYPE_NEXTSTEP_CD ||
> <  UFS_MOUNT_UFSTYPE_OPENSTEP);
> 131c128
> <   if ((mnext == 0) & (((offs + rec_len - 1) ^ offs) & ~chunk_mask))
> ---
> >   if (((offs + rec_len - 1) ^ offs) & ~chunk_mask)

This fixes only the symptom, not the illness.
This check represents what the code thinks about the filesystem layout.
What kind of UFS system did you actually test this patch on?
When I fixed a similar issue for openstep ufs some time ago
(actually it was darwin's ufs, which has the same layout),
I just set s_dirblksize to the right value; maybe for
UFS_MOUNT_UFSTYPE_NEXTSTEP and UFS_MOUNT_UFSTYPE_NEXTSTEP_CD you need to
do the same, see the TODO items in fs/ufs/super.c.
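
For reference, the span test from the quoted diff can be tried out in
userspace. This is a sketch under assumptions: the function name is made
up, and chunk_mask is taken to be the directory block size minus one,
which the message implies but does not show.

```c
#include <stdbool.h>

/* True if a directory entry occupying [offs, offs + rec_len) crosses a
 * chunk boundary. chunk_size must be a power of two and rec_len must be
 * non-zero. Assumption: chunk_mask is the directory block size minus 1. */
static bool entry_spans_chunk(unsigned int offs, unsigned int rec_len,
                              unsigned int chunk_size)
{
    unsigned int chunk_mask = chunk_size - 1;

    /* XOR of the first and last byte offsets keeps only the bits that
     * differ; any difference above the in-chunk bits means the two
     * bytes live in different chunks. */
    return (((offs + rec_len - 1) ^ offs) & ~chunk_mask) != 0;
}
```

The same entry can pass or fail depending on chunk_size: an entry at
offset 0 with rec_len 1024 spans a boundary for 512-byte chunks but not
for 1024-byte chunks. That is why a wrong s_dirblksize makes the check
fire on a valid directory.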

-- 
/Evgeniy



Re: [PATCH 49/59] fs/ufs: Add missing "space"

2007-11-21 Thread Evgeniy Dushistov
On Mon, Nov 19, 2007 at 05:53:36PM -0800, Joe Perches wrote:
> 
> Signed-off-by: Joe Perches <[EMAIL PROTECTED]>
> ---
>  fs/ufs/dir.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
> index 30f8c2b..d19dfe8 100644
> --- a/fs/ufs/dir.c
> +++ b/fs/ufs/dir.c
> @@ -180,7 +180,7 @@ bad_entry:
>  Eend:
>   p = (struct ufs_dir_entry *)(kaddr + offs);
>   ufs_error (sb, "ext2_check_page",

If you touch this code, it would be good if you also replaced
"ext2_check_page" with something like __FUNCTION__.

> -"entry in directory #%lu spans the page boundary"
> +"entry in directory #%lu spans the page boundary "
>  "offset=%lu",
>  dir->i_ino, (page->index<  fail:
> -- 
> 1.5.3.5.652.gf192c

-- 
/Evgeniy



Re: [PATCH 2/9]: Reduce Log I/O latency

2007-11-21 Thread Stewart Smith
On Thu, 2007-11-22 at 12:12 +1100, David Chinner wrote:
> In all the cases that I know of where ppl are using what could
> be considered real-time I/O (e.g. media environments where they
> do real-time ingest and playout from the same filesystem) the
> real-time ingest processes create the files and do pre-allocation
> before doing their I/O. This I/O can get held up behind another
> process that is not real time that has issued log I/O. 
> 
> Given there is no I/O priority inheritance and having log I/O stall
> will stall the entire filesystem, we cannot allow log I/O to
> stall in real-time environments. Hence it must have the highest
> possible priority to prevent this.

FWIW from a "real time" database POV this seems to make sense to me...
in fact, we probably rely on filesystem metadata way too much
(historically it's just "worked" although we do seem to get issues
on ext3).

I have a (casually stupid) simulation program... although I've observed
little to no problems on all my XFS tests using it.

-- 
Stewart Smith, Senior Software Engineer (MySQL Cluster)
MySQL AB, www.mysql.com
Office: +14082136540 Ext: 6616
VoIP: [EMAIL PROTECTED]
Mobile: +61 4 3 8844 332




Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-21 Thread Rusty Russell
On Thursday 22 November 2007 13:43:06 Andi Kleen wrote:
> There seems to be rough consensus that the kernel currently has too many
> exported symbols. A lot of these exports are generally usable utility
> functions or important driver interfaces; but another large part are
> functions intended by only one or two very specific modules for a very
> specific purpose.

Hi Andi,

This is an interesting idea, thanks for the code!  My only question is 
whether we can get most of this benefit by dropping the indirection of 
namespaces and have something like "EXPORT_SYMBOL_TO(sym, modname)"?  It
doesn't work so well for exporting to a group of modules, but that seems
a reasonable line to draw anyway.
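
As a rough userspace model of what a per-module export could mean at
lookup time (purely hypothetical: the struct, the function and the
symbol names in the usage note are invented for this sketch and say
nothing about how the macro or the module loader would implement it):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one export: a symbol, optionally restricted
 * to a single importing module. */
struct export_entry {
    const char *symbol;
    const char *only_for;  /* NULL means exported to every module */
};

/* May `module` import `symbol` according to the table? */
static bool may_import(const struct export_entry *table, size_t n,
                       const char *symbol, const char *module)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].symbol, symbol) != 0)
            continue;
        return table[i].only_for == NULL ||
               strcmp(table[i].only_for, module) == 0;
    }
    return false;  /* symbol is not exported at all */
}
```

With a table holding {"generic_helper", NULL} and {"tcp_private_helper",
"tcp_ipv6"}, any module may import the first symbol but only tcp_ipv6
may import the second. Exporting to a group of modules would need one
entry per module in this model.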

Cheers,
Rusty.
PS.  Probably better to use the standard warnx and errx in modpost, too.


freeze vs freezer

2007-11-21 Thread Jeremy Fitzhardinge
It seems that a process blocked in a write to an xfs filesystem due to
xfs_freeze cannot be frozen by the freezer.

I see this if I suspend my laptop while doing something xfs-filesystem
intensive, like a kernel build.  My suspend scripts freeze the XFS
filesystem (as Dave said I should), which presumably blocks some writer,
and then the freezer times out and fails to complete.

Here's part of the process dump the freezer does when it times out:

cc1   D  0 18138  18137
   dd5f1e24 00200082 0002  ecdeeb00 ecdeec64 c200f280 0001 
   009c09a0 dd5f1e0c dd5f1e0c 000f    dd5f1e74 
   c7beb480 dd5f1e88 dd5f1ea8 c0228d97 e8889540 dd5f1e38 c015b75d dd5f1e44 
Call Trace:
 [] xfs_write+0xf4/0x6d9
 [] xfs_file_aio_write+0x53/0x5b
 [] do_sync_write+0xae/0xec
 [] vfs_write+0xa4/0x120
 [] sys_write+0x3b/0x60
 [] sysenter_past_esp+0x6b/0xa1
 ===


I haven't looked at how to fix this yet.  I only just worked out why I
was getting suspend failures.

J


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-21 Thread Dave Jones
On Thu, Nov 22, 2007 at 03:43:06AM +0100, Andi Kleen wrote:

 > There seems to be rough consensus that the kernel currently has too many 
 > exported symbols. A lot of these exports are generally usable utility 
 > functions or important driver interfaces; but another large part are 
 > functions
 > intended by only one or two very specific modules for a very specific 
 > purpose.
 > One example is the TCP code. It has most of its internals exported, but 
 > only for use by tcp_ipv6.c (and now a few more by the TCP/IP congestion 
 > modules) 
 > But it doesn't make sense to include these exported for a specific module
 > functions into a broader "kernel interface".   External modules assume
 > they can use these functions, but they were never intended for that.
 > 
 > This patch allows to export symbols only for specific modules by 
 > introducing symbol name spaces. A module name space has a white
 > list of modules that are allowed to import symbols for it; all others
 > can't use the symbols.

I really like this patchset.   Definitely a step in the right direction imo.
Looks like some nits there that checkpatch will probably pick up on,
but otherwise, looks very straightforward too.

Kudos.

Dave

-- 
http://www.codemonkey.org.uk


[PATCH 1/1] mm: add dirty_highmem option

2007-11-21 Thread Bron Gondwana
On Thu, Nov 15, 2007 at 01:14:32PM -0800, Linus Torvalds wrote:
> Examples of non-broken solutions:
>  (a) always use lowmem sizes (what we do now)
>  (b) always use total mem sizes (sane but potentially dangerous: but the 
>  VM pressure should work! It has serious bounce-buffer issues, though, 
>  which is why I think it's crazy even if it's otherwise consistent)
> 
> Btw, I actually suspect that while (a) is what we do now, for the specific 
> case that Bron has, we could have a /proc/sys/vm option to just enable 
> (b). So we don't have to have just one consistent model, we can allow odd 
> users (and Bron sounds like one - sorry Bron ;) to just force other, odd, 
> but consistent models.

A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file
of approximately 2Gb size which contains a hash format that is written
"randomly" by the dbclean process.  On 2.6.16 this process took a few
minutes.  With lowmem only accounting of dirty ratios, this takes about
12 hours of 100% disk IO, all random writes.

This patch includes some code cleanup from Linus and a toggle in
/proc/sys/vm/dirty_highmem which can be set to 1 to add the highmem
back to the total available memory count.
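
The effect of the toggle can be sketched in userspace C. This is only a rough model under stated assumptions: dirtyable_pages() and dirty_limit() are hypothetical stand-ins for determine_dirtyable_memory() and the ratio arithmetic in get_dirty_limits(), and the page counts passed in are made up rather than read from the kernel.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of determine_dirtyable_memory() from the
 * patch: highmem pages are excluded unless the dirty_highmem toggle is set.
 * The parameters stand in for the kernel's global page counters.
 */
static unsigned long dirtyable_pages(unsigned long free_pages,
                                     unsigned long inactive_pages,
                                     unsigned long active_pages,
                                     unsigned long highmem_pages,
                                     bool dirty_highmem)
{
        unsigned long x = free_pages + inactive_pages + active_pages;

        if (highmem_pages > x)
                highmem_pages = x;      /* never subtract more than we have */
        if (!dirty_highmem)
                x -= highmem_pages;     /* lowmem-only accounting */
        return x + 1;                   /* never return 0, as in the patch */
}

/* Dirty threshold in pages for a given percentage, floored at 5 percent. */
static unsigned long dirty_limit(unsigned long dirtyable, int ratio)
{
        if (ratio < 5)
                ratio = 5;
        return (unsigned long)ratio * dirtyable / 100;
}
```

With 6000 candidate pages, 4000 of them in highmem, and a 10 percent ratio, dirty_limit(dirtyable_pages(1000, 2000, 3000, 4000, false), 10) comes to 200 pages, against 600 when the toggle is set.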

Signed-off-by: Bron Gondwana <[EMAIL PROTECTED]>

Index: linux-2.6.23.8-reiserfix-fai-vmdirty/mm/page-writeback.c
===
--- linux-2.6.23.8-reiserfix-fai-vmdirty.orig/mm/page-writeback.c   
2007-11-22 01:48:20.0 +
+++ linux-2.6.23.8-reiserfix-fai-vmdirty/mm/page-writeback.c2007-11-22 
02:42:04.0 +
@@ -70,6 +70,12 @@ static inline long sync_writeback_pages(
 int dirty_background_ratio = 5;
 
 /*
+ * free highmem will not be subtracted from the total free memory
+ * for calculating free ratios if vm_dirty_highmem is true
+ */
+int vm_dirty_highmem;
+
+/*
  * The generator of dirty data starts writeback at this percentage
  */
 int vm_dirty_ratio = 10;
@@ -153,7 +159,8 @@ static unsigned long determine_dirtyable
x = global_page_state(NR_FREE_PAGES)
+ global_page_state(NR_INACTIVE)
+ global_page_state(NR_ACTIVE);
-   x -= highmem_dirtyable_memory(x);
+   if (!vm_dirty_highmem)
+   x -= highmem_dirtyable_memory(x);
return x + 1;   /* Ensure that we never return 0 */
 }
 
@@ -163,20 +170,12 @@ get_dirty_limits(long *pbackground, long
 {
int background_ratio;   /* Percentages */
int dirty_ratio;
-   int unmapped_ratio;
long background;
long dirty;
unsigned long available_memory = determine_dirtyable_memory();
struct task_struct *tsk;
 
-   unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
-   global_page_state(NR_ANON_PAGES)) * 100) /
-   available_memory;
-
dirty_ratio = vm_dirty_ratio;
-   if (dirty_ratio > unmapped_ratio / 2)
-   dirty_ratio = unmapped_ratio / 2;
-
if (dirty_ratio < 5)
dirty_ratio = 5;
 
Index: linux-2.6.23.8-reiserfix-fai-vmdirty/include/linux/writeback.h
===
--- linux-2.6.23.8-reiserfix-fai-vmdirty.orig/include/linux/writeback.h 
2007-10-09 20:31:38.0 +
+++ linux-2.6.23.8-reiserfix-fai-vmdirty/include/linux/writeback.h  
2007-11-22 01:48:21.0 +
@@ -92,6 +92,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
+extern int vm_dirty_highmem;
 extern int vm_dirty_ratio;
 extern int dirty_writeback_interval;
 extern int dirty_expire_interval;
Index: linux-2.6.23.8-reiserfix-fai-vmdirty/kernel/sysctl.c
===
--- linux-2.6.23.8-reiserfix-fai-vmdirty.orig/kernel/sysctl.c   2007-10-09 
20:31:38.0 +
+++ linux-2.6.23.8-reiserfix-fai-vmdirty/kernel/sysctl.c2007-11-22 
01:48:21.0 +
@@ -776,6 +776,7 @@ static ctl_table kern_table[] = {
 /* Constants for minimum and maximum testing in vm_table.
We use these as one-element integer vectors. */
 static int zero;
+static int one = 1;
 static int two = 2;
 static int one_hundred = 100;
 
@@ -1066,6 +1067,19 @@ static ctl_table vm_table[] = {
.extra1 = &zero,
},
 #endif
+#ifdef CONFIG_HIGHMEM
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "dirty_highmem",
+   .data   = &vm_dirty_highmem,
+   .maxlen = sizeof(vm_dirty_highmem),
+   .mode   = 0644,
+   .proc_handler   = &proc_dointvec_minmax,
+   .strategy   = &sysctl_intvec,
+   .extra1 = &zero,
+   .extra2 = &one,
+   },
+#endif
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt

Re: [PATCH 2/9]: Reduce Log I/O latency

2007-11-21 Thread David Chinner
On Wed, Nov 21, 2007 at 08:57:27PM -0600, Matt Mackall wrote:
> On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> > In all the cases that I know of where ppl are using what could
> > be considered real-time I/O (e.g. media environments where they
> > do real-time ingest and playout from the same filesystem) the
> > real-time ingest processes create the files and do pre-allocation
> > before doing their I/O. This I/O can get held up behind another
> > process that is not real time that has issued log I/O. 
> > 
> > Given there is no I/O priority inheritance and having log I/O stall
> > will stall the entire filesystem, we cannot allow log I/O to
> > stall in real-time environments. Hence it must have the highest
> > possible priority to prevent this.
> 
> I've seen PVRs that would be upset by this. They put media on one
> filesystem and database/apps/swap/etc. on another, but have everything
> on a single spindle. Stalling a media filesystem read for a write
> anywhere else = fail.

Sounds like the PVR is badly designed to me. If a write can cause a
read to miss a playback deadline, then you haven't built enough
buffering into your playback application.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


[PATCH] Linux Kernel Markers - Support Multiple Probes

2007-11-21 Thread Mathieu Desnoyers
RCU style multiple probes support for the Linux Kernel Markers.
Common case (one probe) is still fast and does not require dynamic allocation
or a supplementary pointer dereference on the fast path.

- Move preempt disable from the marker site to the callback.

Since we now have an internal callback, move the preempt disable/enable to the
callback instead of the marker site.

Since the callback change is done asynchronously (passing from a handler that
supports arguments to a handler that does not set up the arguments if no
arguments are passed), we can safely update it even if it is outside the preempt
disable section.

- Move probe arm to probe connection. Now, a connected probe is automatically
  armed.

Remove MARK_MAX_FORMAT_LEN, unused.

This patch modifies the Linux Kernel Markers API : it removes the probe
"arm/disarm" and changes the probe function prototype : it now expects a
va_list * instead of a "...".
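
A small userspace sketch of the new calling convention may help. The marker_probe_func typedef is copied from the patch below; probe_wrapper() and sum_probe() are hypothetical names used only for illustration, and the format parsing is deliberately simplistic (it only understands "%d").

```c
#include <stdarg.h>

/* Probe prototype as introduced by the patch: takes a va_list pointer. */
typedef void marker_probe_func(void *probe_private, void *call_private,
                               const char *fmt, va_list *args);

/* Hypothetical probe: sums the int arguments described by "%d" entries. */
static void sum_probe(void *probe_private, void *call_private,
                      const char *fmt, va_list *args)
{
        int *total = probe_private;

        (void)call_private;
        for (; *fmt; fmt++)
                if (fmt[0] == '%' && fmt[1] == 'd')
                        *total += va_arg(*args, int);
}

/* Hypothetical wrapper: the only variadic function on the call path. */
static void probe_wrapper(marker_probe_func *func, void *probe_private,
                          void *call_private, const char *fmt, ...)
{
        va_list args;

        va_start(args, fmt);
        func(probe_private, call_private, fmt, &args);
        va_end(args);
}
```

Because only the wrapper is variadic, the probe itself can be an ordinary function that walks the arguments through the pointer it was handed.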

It applies on top of 2.6.24-rc3-git1.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Christoph Hellwig <[EMAIL PROTECTED]>
CC: Andrew Morton <[EMAIL PROTECTED]>
CC: Mike Mason <[EMAIL PROTECTED]>
CC: Dipankar Sarma <[EMAIL PROTECTED]>
---
 include/linux/marker.h  |   59 ++-
 include/linux/module.h  |2 
 kernel/marker.c |  671 +---
 kernel/module.c |7 
 samples/markers/probe-example.c |   25 -
 5 files changed, 548 insertions(+), 216 deletions(-)

Index: linux-2.6-lttng/include/linux/marker.h
===
--- linux-2.6-lttng.orig/include/linux/marker.h 2007-11-21 19:01:02.0 
-0500
+++ linux-2.6-lttng/include/linux/marker.h  2007-11-21 19:17:30.0 
-0500
@@ -19,16 +19,23 @@ struct marker;
 
 /**
  * marker_probe_func - Type of a marker probe function
- * @mdata: pointer of type struct marker
- * @private_data: caller site private data
+ * @probe_private: probe private data
+ * @call_private: call site private data
  * @fmt: format string
- * @...: variable argument list
+ * @args: variable argument list pointer. Use a pointer to overcome C's
+ *inability to pass this around as a pointer in a portable manner in
+ *the callee otherwise.
  *
  * Type of marker probe functions. They receive the mdata and need to parse the
  * format string to recover the variable argument list.
  */
-typedef void marker_probe_func(const struct marker *mdata,
-   void *private_data, const char *fmt, ...);
+typedef void marker_probe_func(void *probe_private, void *call_private,
+   const char *fmt, va_list *args);
+
+struct marker_probe_closure {
+   marker_probe_func *func;/* Callback */
+   void *probe_private;/* Private probe data */
+};
 
 struct marker {
const char *name;   /* Marker name */
@@ -36,8 +43,11 @@ struct marker {
 * variable argument list.
 */
char state; /* Marker state. */
-   marker_probe_func *call;/* Probe handler function pointer */
-   void *private;  /* Private probe data */
+   char ptype; /* probe type : 0 : single, 1 : multi */
+   void (*call)(const struct marker *mdata,/* Probe wrapper */
+   void *call_private, const char *fmt, ...);
+   struct marker_probe_closure single;
+   struct marker_probe_closure *multi;
 } __attribute__((aligned(8)));
 
 #ifdef CONFIG_MARKERS
@@ -49,7 +59,7 @@ struct marker {
  * not add unwanted padding between the beginning of the section and the
  * structure. Force alignment to the same alignment as the section start.
  */
-#define __trace_mark(name, call_data, format, args...) \
+#define __trace_mark(name, call_private, format, args...)  \
do {\
static const char __mstrtab_name_##name[]   \
__attribute__((section("__markers_strings")))   \
@@ -60,24 +70,23 @@ struct marker {
static struct marker __mark_##name  \
__attribute__((section("__markers"), aligned(8))) = \
{ __mstrtab_name_##name, __mstrtab_format_##name,   \
-   0, __mark_empty_function, NULL };   \
+   0, 0, marker_probe_cb,  \
+   { __mark_empty_function, NULL}, NULL }; \
__mark_check_format(format, ## args);   \
if (unlikely(__mark_##name.state)) {\
-   preempt_disable();  \
(*__mark_##name.call)   \
-   (&__mark_##name, call_data, \
+   (&__mark_##name, 

Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-21 Thread Andi Kleen

> I like this concept in general; I have one minor comment; right now
> your namespace argument is like
> 
> EXPORT_SYMBOL_NS(foo, some_symbol);
> 
> from a language-like pov I kinda wonder if it's nicer to do
> 
> EXPORT_SYMBOL_NS("foo", some_symbol);
> 
> because foo isn't something in C scope, but more a string-like
> identifier...

That wouldn't work for MODULE_ALLOW() because it appends the namespace
to other identifiers. I don't know of a way in the C processor to get
back from a string to a ## concatenable identifier.

For EXPORT_SYMBOL_NS it would be in theory possible, but making 
it asymmetric to MODULE_ALLOW would be ugly imho.
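
To illustrate the preprocessor limitation, here is a minimal sketch. NS_TAG and NS_NAME are hypothetical macros, not the ones from the patch; they only show that an identifier can be turned into a string, while a string cannot be pasted into an identifier.

```c
#include <string.h>

/* Pasting works on identifier tokens: NS_TAG(foo) expands to ns_allow_foo. */
#define NS_TAG(ns)   ns_allow_##ns
/* Stringifying is the one-way direction: NS_NAME(foo) expands to "foo". */
#define NS_NAME(ns)  #ns

/* Hypothetical per-namespace object whose name is built from the namespace. */
static const char NS_TAG(foo)[] = NS_NAME(foo);

/*
 * There is no inverse of '#': pasting a string literal, as in
 * ns_allow_##"foo", does not form a valid token, so a quoted namespace
 * argument could not be used to build identifiers like the one above.
 */
static int ns_matches(const char *name)
{
        return strcmp(ns_allow_foo, name) == 0;
}
```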

-Andi


Re: [PATCH RFC][try 2] IA64 signal : remove redundant code in setup_sigcontext()

2007-11-21 Thread Matthew Wilcox
On Thu, Nov 22, 2007 at 11:15:55AM +0800, Shi Weihua wrote:
> This patch removes some redundant code in the function setup_sigcontext().
> 
> The registers ar.ccv,b7,r14,ar.csd,ar.ssd,r2-r3 and r16-r31 are not restored 
> in restore_sigcontext() when (flags & IA64_SC_FLAG_IN_SYSCALL) is true. 
> So we don't need to zero those variables in setup_sigcontext().

Erm, couldn't those registers contain information the process shouldn't
see?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


[PATCH RFC][try 2] IA64 signal : remove redundant code in setup_sigcontext()

2007-11-21 Thread Shi Weihua
This patch removes some redundant code in the function setup_sigcontext().

The registers ar.ccv,b7,r14,ar.csd,ar.ssd,r2-r3 and r16-r31 are not restored 
in restore_sigcontext() when (flags & IA64_SC_FLAG_IN_SYSCALL) is true. 
So we don't need to zero those variables in setup_sigcontext().

Signed-off-by: Shi Weihua <[EMAIL PROTECTED]> 

---
diff -urp linux-2.6.24-rc3-git1.orig/arch/ia64/kernel/signal.c 
linux-2.6.24-rc3-git1/arch/ia64/kernel/signal.c
--- linux-2.6.24-rc3-git1.orig/arch/ia64/kernel/signal.c2007-11-17 
13:16:36.0 +0800
+++ linux-2.6.24-rc3-git1/arch/ia64/kernel/signal.c 2007-11-22 
11:02:27.0 +0800
@@ -280,15 +280,7 @@ setup_sigcontext (struct sigcontext __us
        err |= __copy_to_user(&sc->sc_gr[15], &scr->pt.r15, 8); /* r15 */
        err |= __put_user(scr->pt.cr_iip + ia64_psr(&scr->pt)->ri, &sc->sc_ip);
 
-       if (flags & IA64_SC_FLAG_IN_SYSCALL) {
-               /* Clear scratch registers if the signal interrupted a system call. */
-               err |= __put_user(0, &sc->sc_ar_ccv);           /* ar.ccv */
-               err |= __put_user(0, &sc->sc_br[7]);            /* b7 */
-               err |= __put_user(0, &sc->sc_gr[14]);           /* r14 */
-               err |= __clear_user(&sc->sc_ar25, 2*8);         /* ar.csd & ar.ssd */
-               err |= __clear_user(&sc->sc_gr[2], 2*8);        /* r2-r3 */
-               err |= __clear_user(&sc->sc_gr[16], 16*8);      /* r16-r31 */
-       } else {
+       if (!(flags & IA64_SC_FLAG_IN_SYSCALL)) {
                /* Copy scratch regs to sigcontext if the signal didn't interrupt a syscall. */
                err |= __put_user(scr->pt.ar_ccv, &sc->sc_ar_ccv);      /* ar.ccv */
                err |= __put_user(scr->pt.b7, &sc->sc_br[7]);           /* b7 */



Re: 2.6.24-rc3-mm1 (sync is slow ?)

2007-11-21 Thread KAMEZAWA Hiroyuki
On Wed, 21 Nov 2007 00:49:09 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Wed, 21 Nov 2007 17:42:15 +0900 KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> 
> wrote:
> 
> > Hi, Andrew
> > 
> > I got following result in 'sync' command.
> > It was too slow. (memory controller config is off ;)
> > I attaches my .config.
> > ==
> > [2.6.24-rc3-mm1]
> > [EMAIL PROTECTED] ~]$ dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > 100000+0 records in
> > 100000+0 records out
> > 409600000 bytes (410 MB) copied, 1.46706 seconds, 279 MB/s
> > [EMAIL PROTECTED] ~]$ time sync
> > 
> > real3m6.440s
> > user0m0.000s
> > sys 0m0.133s

> Well I wonder how we did that.
> 
> It seems OK here from a quick test (i386, ext3-on-IDE).
> 
> Maybe device driver/block breakage?
> 

I confirmed that this slowdown is caused by git-scsi-misc.patch.
I'm sorry that I can't chase it further; I will be offline this weekend.

This is scsi_mod information in /proc/modules
=
scsi_mod 409416 8 
mptctl,sg,lpfc,scsi_transport_fc,mptspi,mptscsih,scsi_transport_spi,sd_mod, 
Live 0xa00202818000
=

What other information should I provide?

Thanks,
-Kame




Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-21 Thread Arjan van de Ven
On Thu, 22 Nov 2007 03:43:06 +0100 (CET)
Andi Kleen <[EMAIL PROTECTED]> wrote:

> 
> There seems to be rough consensus that the kernel currently has too
> many exported symbols. A lot of these exports are generally usable
> utility functions or important driver interfaces; but another large
> part are functions intended by only one or two very specific modules
> for a very specific purpose. One example is the TCP code. It has most
> of its internals exported, but only for use by tcp_ipv6.c (and now a
> few more by the TCP/IP congestion modules) But it doesn't make sense
> to include these exported for a specific module functions into a
> broader "kernel interface".   External modules assume they can use
> these functions, but they were never intended for that.
> 
> This patch allows to export symbols only for specific modules by 
> introducing symbol name spaces. A module name space has a white
> list of modules that are allowed to import symbols for it; all others
> can't use the symbols.
> 
> It adds two new macros: 
> 
> MODULE_NAMESPACE_ALLOW(namespace, module);
> 
> Allow module to import symbols from namespace. module is the module
> name without .ko as displayed by lsmod.  Must be in the same module
> as the export (and be duplicated if there are multiple modules
> exporting symbols to a namespace).  Multiple allows for the same name
> space are allowed.
> 
> EXPORT_SYMBOL_NS(namespace, symbol);
> 

Hi,

I like this concept in general; I have one minor comment; right now
your namespace argument is like

EXPORT_SYMBOL_NS(foo, some_symbol);

from a language-like pov I kinda wonder if it's nicer to do

EXPORT_SYMBOL_NS("foo", some_symbol);

because foo isn't something in C scope, but more a string-like
identifier...



-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


Re: [PATCH 2/9]: Reduce Log I/O latency

2007-11-21 Thread Matt Mackall
On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> On Thu, Nov 22, 2007 at 01:49:25AM +0100, Andi Kleen wrote:
> > David Chinner <[EMAIL PROTECTED]> writes:
> > 
> > > To ensure that log I/O is issued as the highest priority I/O, set
> > > the I/O priority of the log I/O to the highest possible. This will
> > > ensure that log I/O is not held up behind bulk data or other
> > > metadata I/O as delaying log I/O can pause the entire transaction
> > > subsystem. Introduce a new buffer flag to allow us to tag the log
> > > buffers so we can discriminate when issuing the I/O.
> > 
> > Won't that possibly disturb other RT priority users that do not need 
> > log IO (e.g. working on preallocated files)? Seems a little
> > dangerous.
> 
> In all the cases that I know of where ppl are using what could
> be considered real-time I/O (e.g. media environments where they
> do real-time ingest and playout from the same filesystem) the
> real-time ingest processes create the files and do pre-allocation
> before doing their I/O. This I/O can get held up behind another
> process that is not real time that has issued log I/O. 
> 
> Given there is no I/O priority inheritance and having log I/O stall
> will stall the entire filesystem, we cannot allow log I/O to
> stall in real-time environments. Hence it must have the highest
> possible priority to prevent this.

I've seen PVRs that would be upset by this. They put media on one
filesystem and database/apps/swap/etc. on another, but have everything
on a single spindle. Stalling a media filesystem read for a write
anywhere else = fail.

-- 
Mathematics is the supreme nostalgia of our time.


Re: Possible bug from kernel 2.6.22 and above

2007-11-21 Thread Jie Chen

Simon Holm Thøgersen wrote:

ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:



There is a backport of the CFS scheduler to 2.6.21, see
http://lkml.org/lkml/2007/11/19/127


Hi, Simon:

I will try that after the Thanksgiving holiday to find out whether the 
odd behavior shows up on 2.6.21 with the backported CFS.



Kernel 2.6.21
Number of Threads              2          4          6          8
SpinLock (Time micro second)   10.5618    10.58538   10.5915    10.643
         (Overhead)            0.073      0.05746    0.102805   0.154563
Barrier  (Time micro second)   11.020410  11.678125  11.9889    12.38002
         (Overhead)            0.531660   1.1502     1.500112   1.891617

Each thread is bound to a particular core using pthread_setaffinity_np.

Kernel 2.6.23.8
Number of Threads              2          4          6          8
SpinLock (Time micro second)   14.849915  17.117603  14.4496    10.5990
         (Overhead)            4.345417   6.617207   3.949435   0.110985
Barrier  (Time micro second)   19.462255  20.285117  16.19395   12.37662
         (Overhead)            8.957755   9.784722   5.699590   1.869518






Simon Holm Thøgersen


I just ran a simple test to prove that the problem may be related to 
load balance of the scheduler. I first started 6 processes using 
"taskset -c 2 donothing&; taskset -c 3 donothing&; ..., taskset -c 7 
donothing". These 6 processes will run on core 2 to 7. Then I started my 
test program using two threads bound to core 0 and 1. Here is the result:


Two threads on Kernel 2.6.23.8:
SpinLock (Time micro second) 10.558255
 (Overhead)  0.068965
Barrier  (Time micro second) 10.865520
 (Overhead)  0.376230

Similarly, I started 4 donothing processes on core 4, 5, 6 and 7, and 
ran the test program. I have the following result:


Four threads on Kernel 2.6.23.8:
SpinLock (Time micro second) 10.579413
 (Overhead)  0.090023
Barrier  (Time micro second) 11.363193
 (Overhead)  0.873803

Finally, here is the result for 6 threads with two donothing processes 
running on core 6 and 7:


Six threads on Kernel 2.6.23.8:
SpinLock (Time micro second) 10.590030
 (Overhead)  0.100940
Barrier  (Time micro second) 11.977548
 (Overhead)  1.488458

Now the above results are very much similar to the results obtained for 
the kernel 2.6.21. I hope this helps you guys in some ways. Thank you.


--
#
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [EMAIL PROTECTED]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#


Re: Where is the interrupt going?

2007-11-21 Thread Arjan van de Ven
On Wed, 21 Nov 2007 17:08:30 -0800
Al Niessner <[EMAIL PROTECTED]> wrote:
> 
> Lastly, I would be happy to give out the entire module to anyone who
> requests it, but it is about 550 lines so I did not want to attach it
> to this already long post.
> 

can you send it to me, or even better, post it somewhere online ?
I have something I'd like to check to see if you do it correctly, but I
can't without the code...


-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


Re: [RFC/PATCH] SO_NO_CHECK for IPv6

2007-11-21 Thread YOSHIFUJI Hideaki / 吉藤英明
In article <[EMAIL PROTECTED]> (at Thu, 22 Nov 2007 10:34:03 +0800), Herbert Xu 
<[EMAIL PROTECTED]> says:

> On Wed, Nov 21, 2007 at 07:17:40PM -0500, Jeff Garzik wrote:
> >
> > For those interested, I am dealing with a UDP app that already does very 
> > strong checksumming and encryption, so additional software checksumming 
> > at the lower layers is quite simply a waste of CPU cycles.  Hardware 
> > checksumming is fine, as long as its "free."
> 
> No matter how strong your underlying checksumming is it's not
> going to protect the IPv6 header is it :)

In that sense, we should use AH.

--yoshfuji


Re: [PATCH] alpha: kill deprecated virt_to_bus

2007-11-21 Thread FUJITA Tomonori
On Wed, 21 Nov 2007 12:26:55 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Tue, Nov 20 2007, FUJITA Tomonori wrote:
> > pci-noop.c doesn't use DMA mappings.
> 
> you should send that one to the alpha maintainers, it needs to go in
> that way.

Yeah, I will.


[PATCH -mm 3/4] swiotlb: respect the segment boundary limits

2007-11-21 Thread FUJITA Tomonori
This patch makes swiotlb not allocate a memory area spanning LLD's
segment boundary.

is_span_boundary() judges whether a memory area spans LLD's segment
boundary. If map_single finds such an area, map_single tries to find
the next available memory area.
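
As a rough illustration, the check can be exercised in userspace. is_span_boundary() below has the same body as in the patch; first_fit() is a hypothetical helper that mimics the skip loop and assumes that at least one candidate index fits (otherwise it would loop forever).

```c
/* Same check as the patch's is_span_boundary(), lifted into userspace. */
static int is_span_boundary(unsigned int index, unsigned int nslots,
                            unsigned long offset_slots,
                            unsigned long max_slots)
{
        /* Position of the first slot inside its boundary-sized window. */
        unsigned long offset = (offset_slots + index) & (max_slots - 1);

        return offset + nslots > max_slots;
}

/*
 * Hypothetical helper mirroring the patch's skip loop: advance by stride,
 * wrapping at total_slots, until the area no longer spans a boundary.
 * Assumes some index fits; max_slots must be a power of two.
 */
static unsigned int first_fit(unsigned int index, unsigned int nslots,
                              unsigned int stride, unsigned int total_slots,
                              unsigned long offset_slots,
                              unsigned long max_slots)
{
        while (is_span_boundary(index, nslots, offset_slots, max_slots)) {
                index += stride;
                if (index >= total_slots)
                        index = 0;
        }
        return index;
}
```

With 16-slot windows and no start offset, a 4-slot area at index 12 fits exactly, while one at index 14 would cross into the next window, so a search with stride 2 moves on to index 16.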

Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
---
 lib/swiotlb.c |   41 +++--
 1 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 1a8050a..4bb5a11 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -282,6 +282,15 @@ address_needs_mapping(struct device *hwdev, dma_addr_t 
addr)
return (addr & ~mask) != 0;
 }
 
+static inline unsigned int is_span_boundary(unsigned int index,
+   unsigned int nslots,
+   unsigned long offset_slots,
+   unsigned long max_slots)
+{
+   unsigned long offset = (offset_slots + index) & (max_slots - 1);
+   return offset + nslots > max_slots;
+}
+
 /*
  * Allocates bounce buffer and returns its kernel virtual address.
  */
@@ -292,6 +301,16 @@ map_single(struct device *hwdev, char *buffer, size_t 
size, int dir)
char *dma_addr;
unsigned int nslots, stride, index, wrap;
int i;
+   unsigned long start_dma_addr;
+   unsigned long mask;
+   unsigned long offset_slots;
+   unsigned long max_slots;
+
+   mask = dma_get_seg_boundary(hwdev);
+   start_dma_addr = virt_to_bus(io_tlb_start) & mask;
+
+   offset_slots = ALIGN(start_dma_addr, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
+   max_slots = ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
 
/*
 * For mappings greater than a page, we limit the stride (and
@@ -311,10 +330,17 @@ map_single(struct device *hwdev, char *buffer, size_t 
size, int dir)
 */
spin_lock_irqsave(&io_tlb_lock, flags);
{
-   wrap = index = ALIGN(io_tlb_index, stride);
-
+   index = ALIGN(io_tlb_index, stride);
if (index >= io_tlb_nslabs)
-   wrap = index = 0;
+   index = 0;
+
+   while (is_span_boundary(index, nslots, offset_slots,
+   max_slots)) {
+   index += stride;
+   if (index >= io_tlb_nslabs)
+   index = 0;
+   }
+   wrap = index;
 
do {
/*
@@ -341,9 +367,12 @@ map_single(struct device *hwdev, char *buffer, size_t 
size, int dir)
 
goto found;
}
-   index += stride;
-   if (index >= io_tlb_nslabs)
-   index = 0;
+   do {
+   index += stride;
+   if (index >= io_tlb_nslabs)
+   index = 0;
+   } while (is_span_boundary(index, nslots, offset_slots,
+ max_slots));
} while (index != wrap);
 
spin_unlock_irqrestore(&io_tlb_lock, flags);
-- 
1.5.3.4



[PATCH -mm 0/4] fix iommu segment boundary problems

2007-11-21 Thread FUJITA Tomonori
This is the latter half of my iommu work to make the IOMMUs respect
LLDs' restrictions.

IOMMUs allocate memory areas without considering a low level driver's
segment boundary limits. So we have some workarounds: splitting sg
segments again in LLDs; reserving all I/O space spanning 4GB boundary
in IOMMUs (with assumption that all the LLDs have 4GB boundary
restrictions). The goal is killing all the workarounds.

This patchset adds new accessors for segment_boundary_mask in
device_dma_parameters structure in the same way as the first half of
my work did for max_segment_size.

Currently, I fixed only swiotlb. Next, I'll generalize swiotlb's free
area management and convert all the IOMMUs to use it. Or I'll
generalize a free area management to use bitmap that most of the
IOMMUs use and convert them to use it.

This is against 2.6.24-rc3-mm1.

The first half of my iommu work is:

http://thread.gmane.org/gmane.linux.scsi/35602


[PATCH -mm 1/4] add accessors for segment_boundary_mask in device_dma_parameters

2007-11-21 Thread FUJITA Tomonori
This adds new accessors for segment_boundary_mask in
device_dma_parameters structure in the same way I did for
max_segment_size. So we can easily change where to place struct
device_dma_parameters in the future.

dma_get_seg_boundary returns 0xffffffff if dma_parms in struct
device isn't set up properly. 0xffffffff is the default value used in
the block layer and the scsi mid layer.
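
The fallback behaviour can be tried out in userspace with mock structures. This is a sketch only: struct device and struct device_dma_parameters here are hypothetical, stripped-down stand-ins for the kernel types, and the default is assumed to be 0xffffffff, the block layer's default boundary mask.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct device_dma_parameters {
        unsigned int max_segment_size;
        unsigned long segment_boundary_mask;
};

struct device {
        struct device_dma_parameters *dma_parms;
};

/* Falls back to the block layer default when dma_parms is not set up. */
static unsigned long dma_get_seg_boundary(const struct device *dev)
{
        return dev->dma_parms ?
                dev->dma_parms->segment_boundary_mask : 0xffffffff;
}

/* Refuses to store a mask when there is nowhere to put it. */
static int dma_set_seg_boundary(struct device *dev, unsigned long mask)
{
        if (!dev->dma_parms)
                return -EIO;
        dev->dma_parms->segment_boundary_mask = mask;
        return 0;
}
```

A device without dma_parms reports the default mask and rejects a set with -EIO; once dma_parms points at real storage, the set succeeds and the getter returns the stored mask.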

Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
---
 include/linux/dma-mapping.h |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 71972ca..7d157ed 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -75,6 +75,21 @@ static inline unsigned int dma_set_max_seg_size(struct 
device *dev,
return -EIO;
 }
 
+static inline unsigned long dma_get_seg_boundary(struct device *dev)
+{
+   return dev->dma_parms ?
+   dev->dma_parms->segment_boundary_mask : 0xffffffff;
+}
+
+static inline int dma_set_seg_boundary(struct device *dev, unsigned long mask)
+{
+   if (dev->dma_parms) {
+   dev->dma_parms->segment_boundary_mask = mask;
+   return 0;
+   } else
+   return -EIO;
+}
+
 /* flags for the coherent memory api */
 #define DMA_MEMORY_MAP 0x01
 #define DMA_MEMORY_IO  0x02
-- 
1.5.3.4



[PATCH -mm 4/4] call dma_set_seg_boundary in __scsi_alloc_queue

2007-11-21 Thread FUJITA Tomonori
This is a one-line patch to add the following to __scsi_alloc_queue():

dma_set_seg_boundary(dev, shost->dma_boundary);

This is the simplest approach but the result looks odd,
__scsi_alloc_queue() does:

blk_queue_segment_boundary(q, shost->dma_boundary);
dma_set_seg_boundary(dev, shost->dma_boundary);
blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));

I think that it would be better to set up segment boundary in the same
way as we did for the maximum segment size. That is, removing
shost->dma_boundary and LLDs call pci_set_dma_seg_boundary (or its
friends).

Then __scsi_alloc_queue() can set up both limits in the same way:

blk_queue_segment_boundary(q, dma_get_seg_boundary(dev));
blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));

killing dma_boundary in scsi_host_template needs a large patch for
libata (dma_boundary is used by only libata and sym53c8xx). I'll send
a patch to do that if it is acceptable. James and Jeff?

Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
---
 drivers/scsi/scsi_lib.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 733176d..2a15a3b 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1767,6 +1767,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host 
*shost,
blk_queue_max_sectors(q, shost->max_sectors);
blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
blk_queue_segment_boundary(q, shost->dma_boundary);
+   dma_set_seg_boundary(dev, shost->dma_boundary);
 
blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));
 
-- 
1.5.3.4



[PATCH -mm 2/4] PCI: add dma segment boundary support

2007-11-21 Thread FUJITA Tomonori
This adds PCI's accessor for segment_boundary_mask in
device_dma_parameters.

The default segment_boundary is set to 0xffffffff, the same as the block
layer's default value (and the scsi mid layer uses the same value).
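
For readers unfamiliar with the boundary mask, the following sketch shows
what the constraint means for a single DMA segment. It is a userspace
illustration under the assumption that the mask always has the form 2^n - 1;
seg_fits_boundary() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the segment [addr, addr + len) stays inside one
 * (mask + 1)-sized window, i.e. it does not cross a boundary.
 */
static bool seg_fits_boundary(uint64_t addr, uint64_t len, uint64_t mask)
{
	assert(len > 0);
	assert((mask & (mask + 1)) == 0);	/* mask must be 2^n - 1 */
	/* First and last byte must share all address bits above the mask. */
	return (addr | mask) == ((addr + len - 1) | mask);
}
```

For example, with a mask of 0xffff a 16-byte segment at 0xfff0 fits, while a
17-byte segment at the same address would cross the 64k boundary.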

Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
---
 drivers/pci/pci.c   |8 
 drivers/pci/probe.c |1 +
 include/linux/pci.h |2 ++
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index de623cf..3b7e0e0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1435,6 +1435,14 @@ int pci_set_dma_max_seg_size(struct pci_dev *dev, unsigned int size)
 EXPORT_SYMBOL(pci_set_dma_max_seg_size);
 #endif
 
+#ifndef HAVE_ARCH_PCI_SET_DMA_SEGMENT_BOUNDARY
+int pci_set_dma_seg_boundary(struct pci_dev *dev, unsigned long mask)
+{
+   return dma_set_seg_boundary(&dev->dev, mask);
+}
+EXPORT_SYMBOL(pci_set_dma_seg_boundary);
+#endif
+
 /**
  * pcix_get_max_mmrbc - get PCI-X maximum designed memory read byte count
  * @dev: PCI device to query
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index aa343e1..2e8b539 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -987,6 +987,7 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
dev->dev.coherent_dma_mask = 0xffffffffull;
 
pci_set_dma_max_seg_size(dev, 65536);
+   pci_set_dma_seg_boundary(dev, 0xffffffff);
 
/* Fix up broken headers */
pci_fixup_device(pci_fixup_header, dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index d56d0b6..a05a843 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -567,6 +567,7 @@ void pci_msi_off(struct pci_dev *dev);
 int pci_set_dma_mask(struct pci_dev *dev, u64 mask);
 int pci_set_consistent_dma_mask(struct pci_dev *dev, u64 mask);
 int pci_set_dma_max_seg_size(struct pci_dev *dev, unsigned int size);
+int pci_set_dma_seg_boundary(struct pci_dev *dev, unsigned long mask);
 int pcix_get_max_mmrbc(struct pci_dev *dev);
 int pcix_get_mmrbc(struct pci_dev *dev);
 int pcix_set_mmrbc(struct pci_dev *dev, int mmrbc);
@@ -753,6 +754,7 @@ static inline int pci_enable_device(struct pci_dev *dev) { return -EIO; }
 static inline void pci_disable_device(struct pci_dev *dev) { }
 static inline int pci_set_dma_mask(struct pci_dev *dev, u64 mask) { return -EIO; }
 static inline int pci_set_dma_max_seg_size(struct pci_dev *dev, unsigned int size) { return -EIO; }
+static inline int pci_set_dma_seg_boundary(struct pci_dev *dev, unsigned long mask) { return -EIO; }
 static inline int pci_assign_resource(struct pci_dev *dev, int i) { return -EBUSY;}
 static inline int __pci_register_driver(struct pci_driver *drv, struct module *owner) { return 0;}
 static inline int pci_register_driver(struct pci_driver *drv) { return 0;}
-- 
1.5.3.4



[PATCH RFC] [7/9] Convert TCP exports into namespaces

2007-11-21 Thread Andi Kleen

I defined two namespaces: tcp for TCP internals, which are only used by
tcp_ipv6.ko, and tcpcong for exports used by the TCP congestion modules.

No need to export any TCP internals to anybody else. So express this in a 
namespace.

I admit I'm not 100% sure tcpcong makes sense -- there might be a legitimate
need to have external out of tree congestion modules. They seem nearly like
drivers, but only nearly. If that was deemed the case it would be possible to 
remove tcpcong again to allow everybody to access this.

This implicitly turns all exports into GPL only, but that won't matter
because all modules allowed to import TCP functions are GPLed.

---
 net/ipv4/tcp.c   |   71 +++
 net/ipv4/tcp_cong.c  |   14 -
 net/ipv4/tcp_input.c |   12 +++
 net/ipv4/tcp_ipv4.c  |   38 -
 net/ipv4/tcp_minisocks.c |   12 +++
 net/ipv4/tcp_output.c|   12 +++
 net/ipv4/tcp_timer.c |2 -
 7 files changed, 87 insertions(+), 74 deletions(-)

Index: linux/net/ipv4/tcp.c
===
--- linux.orig/net/ipv4/tcp.c
+++ linux/net/ipv4/tcp.c
@@ -275,21 +275,21 @@ DEFINE_SNMP_STAT(struct tcp_mib, tcp_sta
 
 atomic_t tcp_orphan_count = ATOMIC_INIT(0);
 
-EXPORT_SYMBOL_GPL(tcp_orphan_count);
+EXPORT_SYMBOL_NS(tcp, tcp_orphan_count);
 
 int sysctl_tcp_mem[3] __read_mostly;
 int sysctl_tcp_wmem[3] __read_mostly;
 int sysctl_tcp_rmem[3] __read_mostly;
 
-EXPORT_SYMBOL(sysctl_tcp_mem);
-EXPORT_SYMBOL(sysctl_tcp_rmem);
-EXPORT_SYMBOL(sysctl_tcp_wmem);
+EXPORT_SYMBOL_NS(tcp, sysctl_tcp_mem);
+EXPORT_SYMBOL_NS(tcp, sysctl_tcp_rmem);
+EXPORT_SYMBOL_NS(tcp, sysctl_tcp_wmem);
 
 atomic_t tcp_memory_allocated; /* Current allocated memory. */
 atomic_t tcp_sockets_allocated;/* Current number of TCP sockets. */
 
-EXPORT_SYMBOL(tcp_memory_allocated);
-EXPORT_SYMBOL(tcp_sockets_allocated);
+EXPORT_SYMBOL_NS(tcp, tcp_memory_allocated);
+EXPORT_SYMBOL_NS(tcp, tcp_sockets_allocated);
 
 /*
  * Pressure flag: try to collapse.
@@ -299,7 +299,7 @@ EXPORT_SYMBOL(tcp_sockets_allocated);
  */
 int tcp_memory_pressure __read_mostly;
 
-EXPORT_SYMBOL(tcp_memory_pressure);
+EXPORT_SYMBOL_NS(tcp, tcp_memory_pressure);
 
 void tcp_enter_memory_pressure(void)
 {
@@ -309,7 +309,7 @@ void tcp_enter_memory_pressure(void)
}
 }
 
-EXPORT_SYMBOL(tcp_enter_memory_pressure);
+EXPORT_SYMBOL_NS(tcp, tcp_enter_memory_pressure);
 
 /*
  * Wait for a TCP event.
@@ -1995,7 +1995,7 @@ int compat_tcp_setsockopt(struct sock *s
return do_tcp_setsockopt(sk, level, optname, optval, optlen);
 }
 
-EXPORT_SYMBOL(compat_tcp_setsockopt);
+EXPORT_SYMBOL_NS(tcp, compat_tcp_setsockopt);
 #endif
 
 /* Return information about state of tcp endpoint in API format. */
@@ -2061,7 +2061,7 @@ void tcp_get_info(struct sock *sk, struc
info->tcpi_total_retrans = tp->total_retrans;
 }
 
-EXPORT_SYMBOL_GPL(tcp_get_info);
+EXPORT_SYMBOL_NS(tcp, tcp_get_info);
 
 static int do_tcp_getsockopt(struct sock *sk, int level,
int optname, char __user *optval, int __user *optlen)
@@ -2174,7 +2174,7 @@ int compat_tcp_getsockopt(struct sock *s
return do_tcp_getsockopt(sk, level, optname, optval, optlen);
 }
 
-EXPORT_SYMBOL(compat_tcp_getsockopt);
+EXPORT_SYMBOL_NS(tcp, compat_tcp_getsockopt);
 #endif
 
 struct sk_buff *tcp_tso_segment(struct sk_buff *skb, int features)
@@ -2262,7 +2262,7 @@ struct sk_buff *tcp_tso_segment(struct s
 out:
return segs;
 }
-EXPORT_SYMBOL(tcp_tso_segment);
+EXPORT_SYMBOL_NS(tcp, tcp_tso_segment);
 
 #ifdef CONFIG_TCP_MD5SIG
 static unsigned long tcp_md5sig_users;
@@ -2298,7 +2298,7 @@ void tcp_free_md5sig_pool(void)
__tcp_free_md5sig_pool(pool);
 }
 
-EXPORT_SYMBOL(tcp_free_md5sig_pool);
+EXPORT_SYMBOL_NS(tcp, tcp_free_md5sig_pool);
 
 static struct tcp_md5sig_pool **__tcp_alloc_md5sig_pool(void)
 {
@@ -2371,7 +2371,7 @@ retry:
return pool;
 }
 
-EXPORT_SYMBOL(tcp_alloc_md5sig_pool);
+EXPORT_SYMBOL_NS(tcp, tcp_alloc_md5sig_pool);
 
 struct tcp_md5sig_pool *__tcp_get_md5sig_pool(int cpu)
 {
@@ -2384,14 +2384,14 @@ struct tcp_md5sig_pool *__tcp_get_md5sig
return (p ? *per_cpu_ptr(p, cpu) : NULL);
 }
 
-EXPORT_SYMBOL(__tcp_get_md5sig_pool);
+EXPORT_SYMBOL_NS(tcp, __tcp_get_md5sig_pool);
 
 void __tcp_put_md5sig_pool(void)
 {
tcp_free_md5sig_pool();
 }
 
-EXPORT_SYMBOL(__tcp_put_md5sig_pool);
+EXPORT_SYMBOL_NS(tcp, __tcp_put_md5sig_pool);
 #endif
 
 void tcp_done(struct sock *sk)
@@ -2409,7 +2409,7 @@ void tcp_done(struct sock *sk)
else
inet_csk_destroy_sock(sk);
 }
-EXPORT_SYMBOL_GPL(tcp_done);
+EXPORT_SYMBOL_NS(tcp, tcp_done);
 
 extern void __skb_cb_too_small_for_tcp(int, int);
 extern struct tcp_congestion_ops tcp_reno;
@@ -2524,15 +2524,28 @@ void __init tcp_init(void)
tcp_register_congestion_control(&tcp_reno);
 }
 
-EXPORT_SYMBOL(tcp_close);
-EXPORT_SYMBOL(tcp_disconnect);

[PATCH RFC] [9/9] Add a inet namespace

2007-11-21 Thread Andi Kleen

Shared by IP, IPv6, DCCP, UDPLITE, SCTP. 

The symbols used by tunnel modules weren't put into any name space
because there are quite a lot of them.

---
 net/core/fib_rules.c|9 --
 net/ipv4/af_inet.c  |   52 
 net/ipv4/arp.c  |1 
 net/ipv4/icmp.c |   10 +++
 net/ipv4/inet_connection_sock.c |   40 +++---
 net/ipv4/inet_diag.c|4 +--
 net/ipv4/inet_hashtables.c  |8 +++---
 net/ipv4/inet_timewait_sock.c   |   12 -
 net/ipv4/ip_input.c |2 -
 net/ipv4/ip_output.c|7 +++--
 net/ipv4/ip_sockglue.c  |   10 +++
 11 files changed, 86 insertions(+), 69 deletions(-)

Index: linux/net/ipv4/af_inet.c
===
--- linux.orig/net/ipv4/af_inet.c
+++ linux/net/ipv4/af_inet.c
@@ -218,7 +218,7 @@ out:
 }
 
 u32 inet_ehash_secret __read_mostly;
-EXPORT_SYMBOL(inet_ehash_secret);
+EXPORT_SYMBOL_NS(inet, inet_ehash_secret);
 
 /*
  * inet_ehash_secret must be set exactly once
@@ -235,7 +235,7 @@ void build_ehash_secret(void)
inet_ehash_secret = rnd;
spin_unlock_bh(_lock);
 }
-EXPORT_SYMBOL(build_ehash_secret);
+EXPORT_SYMBOL_NS(inet, build_ehash_secret);
 
 /*
  * Create an inet socket.
@@ -1127,7 +1127,7 @@ int inet_sk_rebuild_header(struct sock *
return err;
 }
 
-EXPORT_SYMBOL(inet_sk_rebuild_header);
+EXPORT_SYMBOL_NS(inet,inet_sk_rebuild_header);
 
 static int inet_gso_send_check(struct sk_buff *skb)
 {
@@ -1235,6 +1235,8 @@ unsigned long snmp_fold_field(void *mib[
}
return res;
 }
+/* AK: Not in inet namespace because they're a generic facility. Probably
+   should be in another file though. */
 EXPORT_SYMBOL_GPL(snmp_fold_field);
 
 int snmp_mib_init(void *ptr[2], size_t mibsize, size_t mibalign)
@@ -1499,20 +1501,30 @@ static int __init ipv4_proc_init(void)
 
 MODULE_ALIAS_NETPROTO(PF_INET);
 
-EXPORT_SYMBOL(inet_accept);
-EXPORT_SYMBOL(inet_bind);
-EXPORT_SYMBOL(inet_dgram_connect);
-EXPORT_SYMBOL(inet_dgram_ops);
-EXPORT_SYMBOL(inet_getname);
-EXPORT_SYMBOL(inet_ioctl);
-EXPORT_SYMBOL(inet_listen);
-EXPORT_SYMBOL(inet_register_protosw);
-EXPORT_SYMBOL(inet_release);
-EXPORT_SYMBOL(inet_sendmsg);
-EXPORT_SYMBOL(inet_shutdown);
-EXPORT_SYMBOL(inet_sock_destruct);
-EXPORT_SYMBOL(inet_stream_connect);
-EXPORT_SYMBOL(inet_stream_ops);
-EXPORT_SYMBOL(inet_unregister_protosw);
-EXPORT_SYMBOL(net_statistics);
-EXPORT_SYMBOL(sysctl_ip_nonlocal_bind);
+MODULE_NAMESPACE_ALLOW(inet, ipv6);
+MODULE_NAMESPACE_ALLOW(inet, udplite);
+MODULE_NAMESPACE_ALLOW(inet, dccp_ipv6);
+MODULE_NAMESPACE_ALLOW(inet, dccp_ipv4);
+MODULE_NAMESPACE_ALLOW(inet, dccp);
+MODULE_NAMESPACE_ALLOW(inet, sctp);
+
+/* RED-PEN: would be better to fix wanrouter */
+MODULE_NAMESPACE_ALLOW(inet, wanrouter);
+
+EXPORT_SYMBOL_NS(inet,inet_accept);
+EXPORT_SYMBOL_NS(inet,inet_bind);
+EXPORT_SYMBOL_NS(inet,inet_dgram_connect);
+EXPORT_SYMBOL_NS(inet,inet_dgram_ops);
+EXPORT_SYMBOL_NS(inet,inet_getname);
+EXPORT_SYMBOL_NS(inet,inet_ioctl);
+EXPORT_SYMBOL_NS(inet,inet_listen);
+EXPORT_SYMBOL_NS(inet,inet_register_protosw);
+EXPORT_SYMBOL_NS(inet,inet_release);
+EXPORT_SYMBOL_NS(inet,inet_sendmsg);
+EXPORT_SYMBOL_NS(inet,inet_shutdown);
+EXPORT_SYMBOL_NS(inet,inet_sock_destruct);
+EXPORT_SYMBOL_NS(inet,inet_stream_connect);
+EXPORT_SYMBOL_NS(inet,inet_stream_ops);
+EXPORT_SYMBOL_NS(inet,inet_unregister_protosw);
+EXPORT_SYMBOL_NS(inet,net_statistics);
+EXPORT_SYMBOL_NS(inet,sysctl_ip_nonlocal_bind);
Index: linux/net/ipv4/arp.c
===
--- linux.orig/net/ipv4/arp.c
+++ linux/net/ipv4/arp.c
@@ -1406,6 +1406,7 @@ static int __init arp_proc_init(void)
 
 #endif /* CONFIG_PROC_FS */
 
+/* No namespace because those are used by various drivers */
 EXPORT_SYMBOL(arp_broken_ops);
 EXPORT_SYMBOL(arp_find);
 EXPORT_SYMBOL(arp_create);
Index: linux/net/ipv4/icmp.c
===
--- linux.orig/net/ipv4/icmp.c
+++ linux/net/ipv4/icmp.c
@@ -1101,7 +1101,7 @@ void __init icmp_init(struct net_proto_f
}
 }
 
-EXPORT_SYMBOL(icmp_err_convert);
-EXPORT_SYMBOL(icmp_send);
-EXPORT_SYMBOL(icmp_statistics);
-EXPORT_SYMBOL(xrlim_allow);
+EXPORT_SYMBOL_NS(inet, icmp_err_convert);
+EXPORT_SYMBOL_NS(inet, icmp_send);
+EXPORT_SYMBOL_NS(inet, icmp_statistics);
+EXPORT_SYMBOL_NS(inet, xrlim_allow);
Index: linux/net/ipv4/inet_connection_sock.c
===
--- linux.orig/net/ipv4/inet_connection_sock.c
+++ linux/net/ipv4/inet_connection_sock.c
@@ -26,7 +26,7 @@
 
 #ifdef INET_CSK_DEBUG
 const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
-EXPORT_SYMBOL(inet_csk_timer_bug_msg);
+EXPORT_SYMBOL_NS(inet, inet_csk_timer_bug_msg);
 #endif
 
 /*
@@ -73,7 +73,7 @@ int 

[PATCH RFC] [8/9] Put UDP exports into a namespace

2007-11-21 Thread Andi Kleen

The UDP exports are only used by UDPv6 and UDP lite. They are internal functions
not supposed to be used by anybody else. So turn them into a name space that 
only allows those.

---
 net/ipv4/udp.c |   27 +++
 net/ipv4/udplite.c |6 +++---
 2 files changed, 18 insertions(+), 15 deletions(-)

Index: linux/net/ipv4/udp.c
===
--- linux.orig/net/ipv4/udp.c
+++ linux/net/ipv4/udp.c
@@ -105,6 +105,9 @@
 #include 
 #include "udp_impl.h"
 
+MODULE_NAMESPACE_ALLOW(udp, udplite);
+MODULE_NAMESPACE_ALLOW(udp, ipv6);
+
 /*
  * Snmp MIB for the UDP layer
  */
@@ -1641,18 +1644,18 @@ void udp4_proc_exit(void)
 }
 #endif /* CONFIG_PROC_FS */
 
-EXPORT_SYMBOL(udp_disconnect);
-EXPORT_SYMBOL(udp_hash);
-EXPORT_SYMBOL(udp_hash_lock);
-EXPORT_SYMBOL(udp_ioctl);
-EXPORT_SYMBOL(udp_get_port);
-EXPORT_SYMBOL(udp_prot);
-EXPORT_SYMBOL(udp_sendmsg);
-EXPORT_SYMBOL(udp_lib_getsockopt);
-EXPORT_SYMBOL(udp_lib_setsockopt);
-EXPORT_SYMBOL(udp_poll);
+EXPORT_SYMBOL_NS(udp, udp_disconnect);
+EXPORT_SYMBOL_NS(udp, udp_hash);
+EXPORT_SYMBOL_NS(udp, udp_hash_lock);
+EXPORT_SYMBOL_NS(udp, udp_ioctl);
+EXPORT_SYMBOL_NS(udp, udp_get_port);
+EXPORT_SYMBOL_NS(udp, udp_prot);
+EXPORT_SYMBOL_NS(udp, udp_sendmsg);
+EXPORT_SYMBOL_NS(udp, udp_lib_getsockopt);
+EXPORT_SYMBOL_NS(udp, udp_lib_setsockopt);
+EXPORT_SYMBOL_NS(udp, udp_poll);
 
 #ifdef CONFIG_PROC_FS
-EXPORT_SYMBOL(udp_proc_register);
-EXPORT_SYMBOL(udp_proc_unregister);
+EXPORT_SYMBOL_NS(udp, udp_proc_register);
+EXPORT_SYMBOL_NS(udp, udp_proc_unregister);
 #endif
Index: linux/net/ipv4/udplite.c
===
--- linux.orig/net/ipv4/udplite.c
+++ linux/net/ipv4/udplite.c
@@ -113,6 +113,6 @@ out_register_err:
printk(KERN_CRIT "%s: Cannot add UDP-Lite protocol.\n", __FUNCTION__);
 }
 
-EXPORT_SYMBOL(udplite_hash);
-EXPORT_SYMBOL(udplite_prot);
-EXPORT_SYMBOL(udplite_get_port);
+EXPORT_SYMBOL_NS(udp, udplite_hash);
+EXPORT_SYMBOL_NS(udp, udplite_prot);
+EXPORT_SYMBOL_NS(udp, udplite_get_port);


[PATCH RFC] [5/9] modpost: Fix a buffer overflow in modpost

2007-11-21 Thread Andi Kleen

When passing a file name longer than 1k, the stack buffer could overflow.
Not really a security issue, but still better plugged.
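
As a rough illustration of the idea (size the buffer from the name instead
of using a fixed array), here is a standalone sketch. The patch itself uses a
variable-length array on the stack; this version uses the heap instead, and
both the helper name mod_c_filename() and the ".mod.c" suffix are assumptions
made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build "<modname>.mod.c" in a buffer sized from the name itself, so a
 * long module path cannot overflow a fixed-size array.
 * Returns a malloc()ed string the caller must free(), or NULL on failure.
 */
static char *mod_c_filename(const char *modname)
{
	static const char suffix[] = ".mod.c";
	size_t len;
	char *fname;

	assert(modname != NULL);
	len = strlen(modname) + sizeof(suffix);	/* sizeof counts the NUL */
	fname = malloc(len);
	if (!fname)
		return NULL;
	snprintf(fname, len, "%s%s", modname, suffix);
	return fname;
}
```

The "+ 10" in the patch plays the same role as sizeof(suffix) here: room for
the suffix and the terminating NUL, with a few bytes to spare.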


---
 scripts/mod/modpost.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/scripts/mod/modpost.c
===
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -1656,7 +1656,6 @@ int main(int argc, char **argv)
 {
struct module *mod;
struct buffer buf = { };
-   char fname[SZ];
char *kernel_read = NULL, *module_read = NULL;
char *dump_write = NULL;
int opt;
@@ -1709,6 +1708,8 @@ int main(int argc, char **argv)
err = 0;
 
for (mod = modules; mod; mod = mod->next) {
+   char fname[strlen(mod->name) + 10];
+
if (mod->skip)
continue;
 


[PATCH RFC] [6/9] Implement namespace checking in modpost

2007-11-21 Thread Andi Kleen

This checks the namespaces at build time in modpost
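
The namespace-aware lookup at the heart of this check can be sketched in
isolation as follows. This is a simplified illustration over a flat array:
struct sym and lookup_sym() are hypothetical names, and modpost's real
struct symbol and hash table carry more state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified symbol record. */
struct sym {
	const char *name;
	const char *ns;		/* NULL means "no namespace" */
};

/*
 * Look up name in table[0..n). With ns == NULL any symbol of that name
 * matches. With ns != NULL, a symbol exported into a different namespace
 * is skipped, and NULL is returned if no acceptable symbol is found.
 */
static const struct sym *lookup_sym(const struct sym *table, size_t n,
				    const char *name, const char *ns)
{
	size_t i;

	assert(table != NULL && name != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(table[i].name, name) != 0)
			continue;
		if (ns && table[i].ns && strcmp(table[i].ns, ns) != 0)
			continue;	/* right name, wrong namespace */
		return &table[i];
	}
	return NULL;
}
```

Note that a symbol without a namespace still matches a lookup that names
one, so ordinary exports remain visible to everybody.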

---
 scripts/mod/modpost.c |  344 ++
 1 file changed, 317 insertions(+), 27 deletions(-)

Index: linux/scripts/mod/modpost.c
===
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -1,8 +1,9 @@
-/* Postprocess module symbol versions
+/* Postprocess module symbol versions and do various other module checks.
  *
  * Copyright 2003   Kai Germaschewski
  * Copyright 2002-2004  Rusty Russell, IBM Corporation
  * Copyright 2006   Sam Ravnborg
+ * Copyright 2007  Andi Kleen, SUSE Labs (changes licensed GPLv2 only)
  * Based in part on module-init-tools/depmod.c,file2alias
  *
  * This software may be used and distributed according to the terms
@@ -12,9 +13,13 @@
  */
 
 #include 
+#include 
 #include "modpost.h"
 #include "../../include/linux/license.h"
 
+#define NS_SEPARATOR '.'
+#define NS_SEPARATOR_STRING "."
+
 /* Are we using CONFIG_MODVERSIONS? */
 int modversions = 0;
 /* Warn about undefined symbols? (do so if we have vmlinux) */
@@ -27,6 +32,9 @@ static int external_module = 0;
 static int vmlinux_section_warnings = 1;
 /* Only warn about unresolved symbols */
 static int warn_unresolved = 0;
+/* Fixing those would cause too many ifdefs -- off by default. */
+static int warn_missing_modules = 0;
+
 /* How a symbol is exported */
 enum export {
export_plain,  export_unused, export_gpl,
@@ -105,19 +113,43 @@ static struct module *find_module(char *
return mod;
 }
 
-static struct module *new_module(char *modname)
+static const char *basename(const char *s)
+{
+   char *p = strrchr(s, '/');
+   if (p)
+   return p + 1;
+   return s;
+}
+
+static struct module *find_module_base(char *modname)
 {
struct module *mod;
-   char *p, *s;
 
-   mod = NOFAIL(malloc(sizeof(*mod)));
-   memset(mod, 0, sizeof(*mod));
-   p = NOFAIL(strdup(modname));
+   for (mod = modules; mod; mod = mod->next) {
+   if (strcmp(basename(mod->name), modname) == 0)
+   break;
+   }
+   return mod;
+}
 
+static void strip_o(char *p)
+{
+   char *s;
/* strip trailing .o */
if ((s = strrchr(p, '.')) != NULL)
if (strcmp(s, ".o") == 0)
*s = '\0';
+}
+
+static struct module *new_module(char *modname)
+{
+   struct module *mod;
+   char *p;
+
+   mod = NOFAIL(malloc(sizeof(*mod)));
+   memset(mod, 0, sizeof(*mod));
+   p = NOFAIL(strdup(modname));
+   strip_o(p);
 
/* add to list */
mod->name = p;
@@ -132,10 +164,12 @@ static struct module *new_module(char *m
  * struct symbol is also used for lists of unresolved symbols */
 
 #define SYMBOL_HASH_SIZE 1024
+#define NSALLOW_HASH_SIZE 64
 
 struct symbol {
struct symbol *next;
struct module *module;
+   const char *namespace;
unsigned int crc;
int crc_valid;
unsigned int weak:1;
@@ -147,10 +181,19 @@ struct symbol {
char name[0];
 };
 
+struct nsallow {
+   struct nsallow *next;
+   struct module *mod;
+   struct module *orig;
+   int ref;
+   char name[0];
+};
+
 static struct symbol *symbolhash[SYMBOL_HASH_SIZE];
+static struct nsallow *nsallowhash[NSALLOW_HASH_SIZE];
 
 /* This is based on the hash agorithm from gdbm, via tdb */
-static inline unsigned int tdb_hash(const char *name)
+static unsigned int tdb_hash(const char *name)
 {
unsigned value; /* Used to compute the hash value.  */
unsigned   i;   /* Used to cycle through random values. */
@@ -192,21 +235,67 @@ static struct symbol *new_symbol(const c
return new;
 }
 
-static struct symbol *find_symbol(const char *name)
+static struct symbol *find_symbol(const char *name, const char *ns)
 {
-   struct symbol *s;
+   struct symbol *s, *match;
 
/* For our purposes, .foo matches foo.  PPC64 needs this. */
if (name[0] == '.')
name++;
 
+   match = NULL;
for (s = symbolhash[tdb_hash(name) % SYMBOL_HASH_SIZE]; s; s=s->next) {
+   if (strcmp(s->name, name) == 0) {
+   match = s;
+   if (ns && s->namespace && strcmp(s->namespace, ns))
+   continue;
+   return s;
+   }
+   }
+   return ns ? NULL : match;
+}
+
+static struct nsallow *find_nsallow(const char *name, struct module *mod)
+{
+   struct nsallow *s;
+
+   for (s = nsallowhash[tdb_hash(name)%NSALLOW_HASH_SIZE]; s; s=s->next) {
+   if (strcmp(s->name, name) == 0 && s->mod == mod)
+   return s;
+   }
+   return NULL;
+}
+
+static struct nsallow *find_nsallow_name(const char *name)
+{
+   struct nsallow *s;
+
+   for (s = 

[PATCH RFC] [4/9] modpost: Fix format string warnings

2007-11-21 Thread Andi Kleen

Fix wrong format strings in modpost exposed by the previous patch.
This includes one missing argument; some random data was printed instead.

---
 scripts/mod/modpost.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux/scripts/mod/modpost.c
===
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -388,7 +388,7 @@ static int parse_elf(struct elf_info *in
 
/* Check if file offset is correct */
if (hdr->e_shoff > info->size) {
-   fatal("section header offset=%u in file '%s' is bigger then filesize=%lu\n", hdr->e_shoff, filename, info->size);
+   fatal("section header offset=%lu in file '%s' is bigger then filesize=%lu\n", (unsigned long)hdr->e_shoff, filename, info->size);
return 0;
}
 
@@ -409,7 +409,7 @@ static int parse_elf(struct elf_info *in
const char *secname;
 
if (sechdrs[i].sh_offset > info->size) {
-   fatal("%s is truncated. sechdrs[i].sh_offset=%u > sizeof(*hrd)=%ul\n", filename, (unsigned int)sechdrs[i].sh_offset, sizeof(*hdr));
+   fatal("%s is truncated. sechdrs[i].sh_offset=%lu > sizeof(*hrd)=%lu\n", filename, (unsigned long)sechdrs[i].sh_offset, sizeof(*hdr));
return 0;
}
secname = secstrings + sechdrs[i].sh_name;
@@ -907,7 +907,8 @@ static void warn_sec_mismatch(const char
 "before '%s' (at offset -0x%llx)\n",
 modname, fromsec, (unsigned long long)r.r_offset,
 secname, refsymname,
-elf->strtab + after->st_name);
+elf->strtab + after->st_name,
+(unsigned long long)r.r_offset);
} else {
warn("%s(%s+0x%llx): Section mismatch: reference to %s:%s\n",
 modname, fromsec, (unsigned long long)r.r_offset,


[PATCH RFC] [3/9] modpost: Declare the modpost error functions as printf like

2007-11-21 Thread Andi Kleen

This way gcc can warn about wrong format strings.
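
For illustration, here is a small standalone sketch of the same gcc
attribute on a helper that formats into a buffer. fmt_msg() and PRINTF_3_4
are hypothetical names for this example; the argument indices differ from
the patch because the format string is the third parameter here.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Format string is argument 3, variadic arguments start at argument 4. */
#define PRINTF_3_4 __attribute__((format(printf, 3, 4)))

/*
 * Hypothetical helper: format a message into buf, like the modpost
 * warn()/fatal() functions but without printing or exiting.
 * Returns what vsnprintf() returns.
 */
PRINTF_3_4 static int fmt_msg(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf != NULL && size > 0);
	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);
	return n;
}
```

With the attribute in place, a call such as fmt_msg(buf, sizeof(buf), "%lu", 42)
should draw a -Wformat warning, because an int is passed where %lu expects
an unsigned long.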

---
 scripts/mod/modpost.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux/scripts/mod/modpost.c
===
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -33,7 +33,9 @@ enum export {
export_unused_gpl, export_gpl_future, export_unknown
 };
 
-void fatal(const char *fmt, ...)
+#define PRINTF __attribute__ ((format (printf, 1, 2)))
+
+PRINTF void fatal(const char *fmt, ...)
 {
va_list arglist;
 
@@ -46,7 +48,7 @@ void fatal(const char *fmt, ...)
exit(1);
 }
 
-void warn(const char *fmt, ...)
+PRINTF void warn(const char *fmt, ...)
 {
va_list arglist;
 
@@ -57,7 +59,7 @@ void warn(const char *fmt, ...)
va_end(arglist);
 }
 
-void merror(const char *fmt, ...)
+PRINTF void merror(const char *fmt, ...)
 {
va_list arglist;
 


[PATCH RFC] [2/9] Fix duplicate symbol check to also check future gpl and unused symbols

2007-11-21 Thread Andi Kleen

This seems to have been forgotten earlier. Until now it was possible
for a normal symbol to override a future gpl symbol and similar.
I restructured the code a bit to avoid too much duplicated code.
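
The restructuring boils down to one helper that is run over every export
table in turn. A userspace sketch of that shape, with hypothetical types
(struct export_table) and plain string arrays instead of the kernel's symbol
tables, might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, flattened stand-in for one export table of a module. */
struct export_table {
	const char *const *names;
	size_t num;
};

/* Returns the first name in tbl that is also in known[0..nknown), else NULL. */
static const char *first_duplicate(const struct export_table *tbl,
				   const char *const *known, size_t nknown)
{
	size_t i, j;

	for (i = 0; i < tbl->num; i++)
		for (j = 0; j < nknown; j++)
			if (strcmp(tbl->names[i], known[j]) == 0)
				return tbl->names[i];
	return NULL;
}

/*
 * Run the same check over every table (plain, gpl, unused, ...), so no
 * export class is forgotten. Returns the offending name or NULL.
 */
static const char *verify_exports(const struct export_table *tables,
				  size_t ntables,
				  const char *const *known, size_t nknown)
{
	size_t t;

	assert(tables != NULL);
	for (t = 0; t < ntables; t++) {
		const char *dup = first_duplicate(&tables[t], known, nknown);

		if (dup)
			return dup;
	}
	return NULL;
}
```

Adding a new export class then only means adding one more table to the list
instead of copying the loop again.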

---
 kernel/module.c |   45 -
 1 file changed, 24 insertions(+), 21 deletions(-)

Index: linux/kernel/module.c
===
--- linux.orig/kernel/module.c
+++ linux/kernel/module.c
@@ -1430,33 +1430,36 @@ EXPORT_SYMBOL_GPL(do_symbol_get);
  * Ensure that an exported symbol [global namespace] does not already exist
  * in the kernel or in some other module's exported symbol table.
  */
-static int verify_export_symbols(struct module *mod)
+
+static int check_duplicate(const struct kernel_symbol *syms, int num, struct module *owner)
 {
-   const char *name = NULL;
-   unsigned long i, ret = 0;
-   struct module *owner;
+   int i;
const unsigned long *crc;
 
-   for (i = 0; i < mod->num_syms; i++)
-   if (find_symbol(mod->syms[i].name, , , 1, mod)) {
-   name = mod->syms[i].name;
-   ret = -ENOEXEC;
-   goto dup;
-   }
-
-   for (i = 0; i < mod->num_gpl_syms; i++)
-   if (find_symbol(mod->gpl_syms[i].name, , , 1, mod)) {
-   name = mod->gpl_syms[i].name;
-   ret = -ENOEXEC;
-   goto dup;
+   for (i = 0; i < num; i++)
+   if (find_symbol(syms[i].name, , , 1, owner)) {
+   printk(KERN_ERR "%s: exports duplicate symbol %s (owned by %s)\n",
+   owner->name, syms[i].name, module_name(owner));
+   return -ENOEXEC;
}
+   return 0;
+}
 
-dup:
+static int verify_export_symbols(struct module *mod)
+{
+   int ret = check_duplicate(mod->syms, mod->num_syms, mod);
if (ret)
-   printk(KERN_ERR "%s: exports duplicate symbol %s (owned by %s)\n",
-   mod->name, name, module_name(owner));
-
-   return ret;
+   return ret;
+   ret = check_duplicate(mod->gpl_syms, mod->num_gpl_syms, mod);
+   if (ret)
+   return ret;
+   ret = check_duplicate(mod->unused_syms, mod->num_unused_syms, mod);
+   if (ret)
+   return ret;
+   ret = check_duplicate(mod->unused_gpl_syms, mod->num_unused_gpl_syms, mod);
+   if (ret)
+   return ret;
+   return check_duplicate(mod->gpl_future_syms, mod->num_gpl_future_syms, mod);
 }
 
 /* Change all symbols so that sh_value encodes the pointer directly. */


[PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-21 Thread Andi Kleen

There seems to be rough consensus that the kernel currently has too many
exported symbols. A lot of these exports are generally usable utility
functions or important driver interfaces, but another large part are functions
intended for use by only one or two very specific modules for a very specific
purpose. One example is the TCP code. It has most of its internals exported,
but only for use by tcp_ipv6.c (and now a few more by the TCP/IP congestion
modules). It doesn't make sense to include functions exported for one specific
module in a broader "kernel interface". External modules assume they can use
these functions, but they were never intended for that.

This patch allows exporting symbols only to specific modules by
introducing symbol name spaces. A module name space has a white
list of modules that are allowed to import symbols from it; all others
can't use the symbols.

It adds two new macros: 

MODULE_NAMESPACE_ALLOW(namespace, module);

Allow module to import symbols from namespace. module is the module name without
.ko as displayed by lsmod.  Must be in the same module as the export
(and be duplicated if there are multiple modules exporting symbols
to a namespace).  Multiple allows for the same name space are allowed.

EXPORT_SYMBOL_NS(namespace, symbol);

Export symbol into namespace.  Only modules allowed for the namespace
will be able to use them. EXPORT_SYMBOL_NS implies GPL only
because it is only for "internal" interfaces.

The name spaces only work for module loading. I didn't find
a nice way to make them work inside the main kernel binary. This means
the name space is not enforced for modules that are built in.
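
The white list check described above can be sketched in plain C as follows.
This is only an illustration of the rule, not the loader code from the
patch: struct ns_allow mirrors the (namespace, module) pair that
MODULE_NAMESPACE_ALLOW records, and ns_import_ok() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One (namespace, module) pair, as recorded by MODULE_NAMESPACE_ALLOW. */
struct ns_allow {
	const char *name;	/* namespace */
	const char *allowed;	/* module permitted to import from it */
};

/*
 * Hypothetical check: may module `importer` use a symbol exported into
 * namespace `ns`? Symbols without a namespace (ns == NULL) are open to all.
 */
static bool ns_import_ok(const struct ns_allow *list, size_t n,
			 const char *ns, const char *importer)
{
	size_t i;

	assert(importer != NULL);
	if (!ns)
		return true;
	for (i = 0; i < n; i++)
		if (strcmp(list[i].name, ns) == 0 &&
		    strcmp(list[i].allowed, importer) == 0)
			return true;
	return false;
}
```

Since multiple allows for one name space are permitted, the list is simply
scanned for any matching pair.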

The biggest amount of work is of course still open: to go over all the existing
exports and figure out for which ones it makes sense to define a namespace.
I did it for TCP and UDP so far, but the kernel right now has nearly 10k
exports (with some dups) that would need to be checked and turned into
name spaces. I would expect any symbol that is only used by one or two
other modules is a strong candidate for a namespace; in some cases even more
with modules that are tightly coupled.

I am optimistic that in the end we will have a much more manageable 
kernel interface.

Caveats: 

Exports need one long word more memory.

I had to add some alignment magic to the existing EXPORT_SYMBOLs
to get the sections right. Tested on i386/x86-64, but I hope it also
still works on architectures with stricter alignment requirements
like ARM. Any testers for that?

---
 arch/arm/kernel/armksyms.c|2 
 include/asm-generic/vmlinux.lds.h |7 +
 include/linux/module.h|   71 +++
 kernel/module.c   |  137 +++---
 4 files changed, 177 insertions(+), 40 deletions(-)

Index: linux/include/linux/module.h
===
--- linux.orig/include/linux/module.h
+++ linux/include/linux/module.h
@@ -34,6 +34,7 @@ struct kernel_symbol
 {
unsigned long value;
const char *name;
+   const char *namespace;
 };
 
 struct modversion_info
@@ -167,49 +168,80 @@ struct notifier_block;
 #ifdef CONFIG_MODULES
 
 /* Get/put a kernel symbol (calls must be symmetric) */
-void *__symbol_get(const char *symbol);
-void *__symbol_get_gpl(const char *symbol);
+extern void *do_symbol_get(const char *symbol, struct module *caller);
+#define __symbol_get(sym) do_symbol_get(sym, THIS_MODULE)
 #define symbol_get(x) ((typeof(&x))(__symbol_get(MODULE_SYMBOL_PREFIX #x)))
 
+struct module_ns {
+   char *name;
+   char *allowed;
+};
+
+#define NS_SEPARATOR "."
+
+/*
+ * Allow module MODULE to reference namespace NS.
+ * MODULE is just the base module name without suffix or path.
+ * This must be declared in the same module (or main kernel) where the
+ * symbols are defined. When multiple modules export symbols from
+ * a single namespace all modules need to contain a full set
+ * of MODULE_NAMESPACE_ALLOWs.
+ */
+#define MODULE_NAMESPACE_ALLOW(ns, module) \
+   static const struct module_ns __knamespace_##module##_##_##ns \
+   asm("__knamespace_" #module NS_SEPARATOR #ns)   \
+   __attribute_used__  \
+   __attribute__((section("__knamespace"), unused))\
+   = { #ns,  #module }
+
 #ifndef __GENKSYMS__
 #ifdef CONFIG_MODVERSIONS
 /* Mark the CRC weak since genksyms apparently decides not to
  * generate a checksums for some symbols */
-#define __CRC_SYMBOL(sym, sec) \
+#define __CRC_SYMBOL(sym, sec, post, post2)\
extern void *__crc_##sym __attribute__((weak)); \
-   static const unsigned long __kcrctab_##sym  \
+   static const unsigned long __kcrctab_##sym##post\
+   asm("__kcrctab_" #sym post2)\
__attribute_used__  \

[PATCH] Allow changing O_SYNC with fcntl().

2007-11-21 Thread Timo Sirainen
Is there a reason why this isn't allowed now?
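
For anyone who wants to see the current behaviour from userspace, here is a
small probe. It is a sketch with a hypothetical helper name; on kernels where
SETFL_MASK leaves out O_SYNC the flag is expected to be silently ignored, but
the function only reports what it observes.

```c
#include <assert.h>
#include <fcntl.h>

/*
 * Try to turn on O_SYNC on an already-open descriptor.
 * Returns 1 if the flag is set afterwards, 0 if the kernel ignored it,
 * and -1 if fcntl() itself failed.
 */
static int try_enable_osync(int fd)
{
	int flags;

	assert(fd >= 0);
	flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;
	if (fcntl(fd, F_SETFL, flags | O_SYNC) == -1)
		return -1;
	flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;
	return (flags & O_SYNC) == O_SYNC;
}
```

Calling it on any writable descriptor shows whether F_SETFL accepted the
flag; note that F_SETFL reports success either way, which is what makes the
restriction easy to miss.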

---
 fs/fcntl.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 8685263..fc0c92e 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -203,7 +203,7 @@ asmlinkage long sys_dup(unsigned int fildes)
return ret;
 }
 
-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME | O_SYNC)
 
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {





Re: [RFC/PATCH] SO_NO_CHECK for IPv6

2007-11-21 Thread Herbert Xu
On Wed, Nov 21, 2007 at 07:17:40PM -0500, Jeff Garzik wrote:
>
> For those interested, I am dealing with a UDP app that already does very 
> strong checksumming and encryption, so additional software checksumming 
> at the lower layers is quite simply a waste of CPU cycles.  Hardware 
> checksumming is fine, as long as its "free."

No matter how strong your underlying checksumming is it's not
going to protect the IPv6 header is it :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Possible bug from kernel 2.6.22 and above

2007-11-21 Thread Simon Holm Thøgersen

On Wed, 21 Nov 2007 at 20:52 -0500, Jie Chen wrote:
> Eric Dumazet wrote:
> > Jie Chen wrote:
> >> Hi, there:
> >>
> >> We have a simple pthread program that measures the synchronization 
> >> overheads for various synchronization mechanisms such as spin locks, 
> >> barriers (the barrier is implemented using queue-based barrier 
> >> algorithm) and so on. We have dual quad-core AMD opterons (barcelona) 
> >> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7 
> >> distribution. Before we moved to this kernel, we had kernel 2.6.21. 
> >> These two kernels are configured identical and compiled with the same 
> >> gcc 4.1.2 compiler. Under the old kernel, we observed that the 
> >> performance of these overheads increases as the number of threads 
> >> increases from 2 to 8. The following are the values of total time and 
> >> overhead for all threads acquiring a pthread spin lock and all threads 
> >> executing a barrier synchronization call.
> > 
> > Could you post the source of your test program ?
> > 
> 
> 
> Hi, Eric:
> 
> Thank you for the quick response. You can get the source code containing 
> the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz . This is a 
> data parallel threading package for physics calculation. The test code 
> is pthread_sync in the src directory once you unpack the gz file. To 
> configure and build this package is very simple: configure and make. The 
>   test program is built by make check. The number of threads is 
> controlled by QMT_NUM_THREADS. The package is using pthread spin lock, 
> but the barrier is implemented using a queue-based barrier algorithm 
> proposed by  J. B. Carter of University of Utah (2005).
> 
> 
> 
> 
> 
> > spinlock are ... spinning and should not call linux scheduler, so I have 
> > no idea why a kernel change could modify your results.
> > 
> > Also I suspect you'll have better results with Fedora Core 8 (since 
> > glibc was updated to use private futexes in v 2.7), at least for the 
> > barrier ops.
> > 
> > 
> 
> I am not sure what the biggest change between kernel 2.6.21 and 2.6.22 
> (23) is? Is the scheduler the biggest change between these versions? Can 
> the kernel's scheduler somehow affect the performance? I know the 
> scheduler is trying to do load balance and so on. Can the scheduler move 
> threads to different cores according to the load balance algorithm even 
> though the threads are bound to cores using pthread_setaffinity_np call 
> when the number of threads is fewer than the number of cores? I am 
> thinking about this because the performance of our test code is roughly 
> the same for both kernels when the number of threads equals to the 
> number of cores.
> 
There is a backport of the CFS scheduler to 2.6.21, see
http://lkml.org/lkml/2007/11/19/127

> >>
> >> Kernel 2.6.21
> >> Number of Threads                2          4          6          8
> >> SpinLock (Time micro second)     10.5618    10.58538   10.5915    10.643
> >>          (Overhead)              0.073      0.05746    0.102805   0.154563
> >> Barrier  (Time micro second)     11.020410  11.678125  11.9889    12.38002
> >>          (Overhead)              0.531660   1.1502     1.500112   1.891617
> >>
> >> Each thread is bound to a particular core using pthread_setaffinity_np.
> >>
> >> Kernel 2.6.23.8
> >> Number of Threads                2          4          6          8
> >> SpinLock (Time micro second)     14.849915  17.117603  14.4496    10.5990
> >>          (Overhead)              4.345417   6.617207   3.949435   0.110985
> >> Barrier  (Time micro second)     19.462255  20.285117  16.19395   12.37662
> >>          (Overhead)              8.957755   9.784722   5.699590   1.869518
> >>
> >> It is clearly that the synchronization overhead increases as the 
> >> number of threads increases in the kernel 2.6.21. But the 
> >> synchronization overhead actually decreases as the number of threads 
> >> increases in the kernel 2.6.23.8 (We observed the same behavior on 
> >> kernel 2.6.22 as well). This certainly is not a correct behavior. The 
> >> kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC, 
> >> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel 
> >> configuration file is in the attachment of this e-mail.
> >>
> >>  From what we have read, there was a new scheduler (CFS) appeared from 
> >> 2.6.22. We are not sure whether the above behavior is caused by the 
> >> new scheduler.
> >>
> >> Finally, our machine cpu information is listed in the following:
> >>
> >> processor   : 0
> >> vendor_id   : AuthenticAMD
> >> cpu family  : 16
> >> model   : 2
> >> model name  : Quad-Core AMD Opteron(tm) Processor 2347
> >> stepping: 10
> >> cpu MHz : 1909.801
> >> cache size  : 512 KB
> >> physical id : 0
> >> siblings: 4
> >> core id : 0
> >> cpu cores   : 4
> >> fpu : yes
> >> fpu_exception   : yes
> >> cpuid level : 5
> >> wp  

Re: Where is the interrupt going?

2007-11-21 Thread Kyle McMartin
On Wed, Nov 21, 2007 at 05:08:30PM -0800, Al Niessner wrote:
> On with the detailed technical information. I developed a kernel module
> for a PCI card back in 2.4, moved it to 2.6.3, then 2.6.11 or so and
> now I am trying to move it to 2.6.22. When I began the move to
> 2.6.22, I changed all of the deprecated calls for finding the card on
> the PCI bus, modified the interrupt handler prototype, and changed my
> readv/writev to aio_read/aio_write following
> http://lwn.net/Articles/202449/. So initialization looks like this:
> 

Hi Al,

From the sounds of it, you might have an interrupt routing problem. Can
you describe the machine you have this plugged into? Possibly attaching
a copy of "dmesg" and "/proc/interrupts"?

Feel free to attach the driver source to your email if the size is
reasonable (which it sounds like it is.)

As a "big hammer" in case it is an APIC problem, please try booting the
kernel with the "noapic" parameter.

cheers,
Kyle


Re: Where is the interrupt going?

2007-11-21 Thread Jesper Juhl
On 22/11/2007, Al Niessner <[EMAIL PROTECTED]> wrote:
>
> Quickly stated, I have a piece of hardware on the PCI bus that is
> generating an interrupt (can watch it with a scope) but my handler is
> not being called (no printk in /var/log/messages). So, where has the
> interrupt gone?
>
Just to rule out the trivial causes. Could it be that you've simply
not configured your system to log messages at the loglevel that your
printk() is using?

-- 
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html


Re: mmap dirty limits on 32 bit kernels (Was: [BUG] New Kernel Bugs)

2007-11-21 Thread Bron Gondwana
On Thu, Nov 22, 2007 at 10:51:15AM +1100, Bron Gondwana wrote:
> On Thu, Nov 15, 2007 at 08:32:22AM -0800, Linus Torvalds wrote:
> > If this patch makes a difference, please holler. I think it's the correct 
> > thing to do, but I'm not going to actually commit it without somebody 
> > saying that it makes a difference (and preferably Peter Zijlstra and 
> > Andrew acking it too).
> 
> mmap: mmap call failed: errno: 12 errmsg: Cannot allocate memory
> 
> Yep, that's "fixed" the problem alright!  No way this puppy is
> dirtying 2Gb of memory any more.
> 
> http://linux.brong.fastmail.fm/2007-11-22/bmtest.pl

Alternatively perhaps I'm just a moron who used a config file with:
CONFIG_PAGE_OFFSET=0x8000 set to build the new kernel (I hadn't
committed it because it turned out not to solve the issue it was
there for).  That would explain a few things.

[EMAIL PROTECTED] perl]$ free
             total       used       free     shared    buffers     cached
Mem:       4150620    2272284    1878336          0      11212    2066536
-/+ buffers/cache:     194536    3956084
Swap:      2096472          0    2096472

That's more the usage I would expect to see.
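
For readers checking the numbers, the "-/+ buffers/cache" row is derived from the "Mem:" row. A minimal sketch of that arithmetic, with an invented struct and helper names:

```c
#include <assert.h>

/* All values in KiB, as printed by free(1). */
struct mem_kib {
	long total, used, free, buffers, cached;
};

/* Memory in use once buffers and page cache are discounted. */
static long used_minus_cache(const struct mem_kib *m)
{
	assert(m);
	return m->used - m->buffers - m->cached;
}

/* Memory available if buffers and page cache were reclaimed. */
static long free_plus_cache(const struct mem_kib *m)
{
	assert(m);
	return m->free + m->buffers + m->cached;
}
```

With the figures above, 2272284 - 11212 - 2066536 = 194536 and 1878336 + 11212 + 2066536 = 3956084, matching the second row.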

Now for the downside.  It works again, but it still runs slow.  Seems to
hit (and this is totally unscientific, I'm just watching the numbers
scroll by) at about 12 writes rather than 7 writes, but that's
still not fitting the whole file dirty.

I notice that PF_LESS_THROTTLE gets set by nfsd to get an extra 25%
bonus free space allocated.  Potentially dcc could use similar tricks
to claim extra space if that knob is available up in userspace.  I'm
happy to patch dcc as well if I have to, I'm already backporting it,
so adding another little quilt directory and applying it is pretty
trivial (must try guilt/stgit one of these days)
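
A toy model of the "extra 25%" mentioned above, assuming the bonus is simply a quarter of the base limit added on top. The function name is invented and the real kernel code path is not shown in this message.

```c
#include <assert.h>

/* Dirty limit in pages, optionally raised by a quarter. */
static unsigned long dirty_limit_for(unsigned long limit_pages,
				     int less_throttle)
{
	assert(limit_pages <= (unsigned long)-1 / 2);	/* no overflow */
	if (less_throttle)
		limit_pages += limit_pages / 4;
	return limit_pages;
}
```

So a task with a base limit of 1000 pages would be allowed 1250 under this model.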

Bron.




Re: Where is the interrupt going?

2007-11-21 Thread Kyle McMartin
On Thu, Nov 22, 2007 at 01:56:25AM +, Alan Cox wrote:
> > status = request_irq (apcsi[i].board_irq,
> >   apc8620_handler,
> >   IRQF_DISABLED,
> 
> You set IRQF_DISABLED
> 
> Do you then enable the interrupt anywhere later on ?
> 

IRQF_DISABLED just means that the handler is atomic wrt other local
interrupts. Shouldn't be the cause of this.

cheers,
Kyle


Re: Where is the interrupt going?

2007-11-21 Thread Alan Cox
> status = request_irq (apcsi[i].board_irq,
>   apc8620_handler,
>   IRQF_DISABLED,

You set IRQF_DISABLED

Do you then enable the interrupt anywhere later on ?

Alan


Re: Possible bug from kernel 2.6.22 and above

2007-11-21 Thread Jie Chen

Eric Dumazet wrote:
> Jie Chen wrote:
>> Hi, there:
>>
>> We have a simple pthread program that measures the synchronization
>> overheads for various synchronization mechanisms such as spin locks,
>> barriers (the barrier is implemented using queue-based barrier
>> algorithm) and so on. We have dual quad-core AMD opterons (barcelona)
>> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7
>> distribution. Before we moved to this kernel, we had kernel 2.6.21.
>> These two kernels are configured identical and compiled with the same
>> gcc 4.1.2 compiler. Under the old kernel, we observed that the
>> performance of these overheads increases as the number of threads
>> increases from 2 to 8. The following are the values of total time and
>> overhead for all threads acquiring a pthread spin lock and all threads
>> executing a barrier synchronization call.
>
> Could you post the source of your test program ?




Hi, Eric:

Thank you for the quick response. You can get the source code containing 
the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz . This is a 
data parallel threading package for physics calculation. The test code 
is pthread_sync in the src directory once you unpack the gz file. To 
configure and build this package is very simple: configure and make. The 
 test program is built by make check. The number of threads is 
controlled by QMT_NUM_THREADS. The package is using pthread spin lock, 
but the barrier is implemented using a queue-based barrier algorithm 
proposed by  J. B. Carter of University of Utah (2005).






> spinlock are ... spinning and should not call linux scheduler, so I have
> no idea why a kernel change could modify your results.
>
> Also I suspect you'll have better results with Fedora Core 8 (since
> glibc was updated to use private futexes in v 2.7), at least for the
> barrier ops.





I am not sure what the biggest change between kernel 2.6.21 and 2.6.22 
(23) is? Is the scheduler the biggest change between these versions? Can 
the kernel's scheduler somehow affect the performance? I know the 
scheduler is trying to do load balance and so on. Can the scheduler move 
threads to different cores according to the load balance algorithm even 
though the threads are bound to cores using pthread_setaffinity_np call 
when the number of threads is fewer than the number of cores? I am 
thinking about this because the performance of our test code is roughly 
the same for both kernels when the number of threads equals to the 
number of cores.
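
For reference, a minimal sketch of the kind of test being discussed: threads pinned with pthread_setaffinity_np contending on one pthread spin lock. This is not the qmt test code; the struct, the function names and the round-robin core assignment are assumptions made for illustration, and timing is left out.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* pthread_setaffinity_np, CPU_SET */
#endif
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

#define MAX_THREADS 8

struct worker {
	pthread_t tid;
	int cpu;		/* core this thread asks to run on */
	int bound;		/* 1 if the affinity call succeeded */
	long iters;
};

static pthread_spinlock_t lock;
static long counter;

static void *worker_main(void *arg)
{
	struct worker *w = arg;
	cpu_set_t set;
	long i;

	/* Pin this thread to a single core before touching the lock. */
	CPU_ZERO(&set);
	CPU_SET(w->cpu, &set);
	w->bound = pthread_setaffinity_np(pthread_self(),
					  sizeof(set), &set) == 0;

	for (i = 0; i < w->iters; i++) {
		pthread_spin_lock(&lock);
		counter++;
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

/* Run nthreads pinned workers; returns the final count, or -1 on error. */
static long run_spin_test(int nthreads, long iters)
{
	struct worker w[MAX_THREADS];
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	int i, started;

	assert(nthreads > 0 && nthreads <= MAX_THREADS);
	if (ncpus < 1)
		ncpus = 1;
	counter = 0;
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (i = 0; i < nthreads; i++) {
		w[i].cpu = (int)(i % ncpus);
		w[i].bound = 0;
		w[i].iters = iters;
		if (pthread_create(&w[i].tid, NULL, worker_main, &w[i]) != 0)
			break;
	}
	started = i;
	for (i = 0; i < started; i++)
		pthread_join(w[i].tid, NULL);
	pthread_spin_destroy(&lock);
	return started == nthreads ? counter : -1;
}
```

With fewer threads than cores, each pinned thread has a core to itself, so any difference between kernels would have to come from something other than threads being migrated off their cores.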




>> Kernel 2.6.21
>> Number of Threads                2          4          6          8
>> SpinLock (Time micro second)     10.5618    10.58538   10.5915    10.643
>>          (Overhead)              0.073      0.05746    0.102805   0.154563
>> Barrier  (Time micro second)     11.020410  11.678125  11.9889    12.38002
>>          (Overhead)              0.531660   1.1502     1.500112   1.891617
>>
>> Each thread is bound to a particular core using pthread_setaffinity_np.
>>
>> Kernel 2.6.23.8
>> Number of Threads                2          4          6          8
>> SpinLock (Time micro second)     14.849915  17.117603  14.4496    10.5990
>>          (Overhead)              4.345417   6.617207   3.949435   0.110985
>> Barrier  (Time micro second)     19.462255  20.285117  16.19395   12.37662
>>          (Overhead)              8.957755   9.784722   5.699590   1.869518
>>
>> It is clearly that the synchronization overhead increases as the
>> number of threads increases in the kernel 2.6.21. But the
>> synchronization overhead actually decreases as the number of threads
>> increases in the kernel 2.6.23.8 (We observed the same behavior on
>> kernel 2.6.22 as well). This certainly is not a correct behavior. The
>> kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
>> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
>> configuration file is in the attachment of this e-mail.
>>
>> From what we have read, there was a new scheduler (CFS) appeared from
>> 2.6.22. We are not sure whether the above behavior is caused by the
>> new scheduler.
>>
>> Finally, our machine cpu information is listed in the following:
>>
>> processor       : 0
>> vendor_id       : AuthenticAMD
>> cpu family      : 16
>> model           : 2
>> model name      : Quad-Core AMD Opteron(tm) Processor 2347
>> stepping        : 10
>> cpu MHz         : 1909.801
>> cache size      : 512 KB
>> physical id     : 0
>> siblings        : 4
>> core id         : 0
>> cpu cores       : 4
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 5
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
>> bogomips        : 3822.95
>> TLB size        : 1024 4K pages
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 48 bits physical, 48 bits virtual
>> power management: ts ttp tm stc 100mhzsteps hwpstate

In addition, we 

Where is the interrupt going?

2007-11-21 Thread Al Niessner

Quickly stated, I have a piece of hardware on the PCI bus that is
generating an interrupt (can watch it with a scope) but my handler is
not being called (no printk in /var/log/messages). So, where has the
interrupt gone?

Obligatory information:
1) I have done the google search and mailing list search finding lots of
ancillary information but not what I needed.
2) Yes, it is my fault, but I need some help from people more directly
involved in the kernel than myself to point out what I am doing wrong.
3) Thanks for any and all help in advance.

On with the detailed technical information. I developed a kernel module
for a PCI card back in 2.4, moved it to 2.6.3, then 2.6.11 or so and
now I am trying to move it to 2.6.22. When I began the move to
2.6.22, I changed all of the deprecated calls for finding the card on
the PCI bus, modified the interrupt handler prototype, and changed my
readv/writev to aio_read/aio_write following
http://lwn.net/Articles/202449/. So initialization looks like this:

p8620 = pci_get_device (APC8620_VENDOR_ID, APC8620_DEVICE_ID, p8620);
<... fail if p8620 is 0 ...>
apcsi[i].ret_val = register_chrdev (MAJOR_NUM,
                                    DEVICE_NAME,
                                    _ops);
<... fail if ret_val < 0 ...>
apcsi[i].board_irq = p8620->irq;
status = request_irq (apcsi[i].board_irq,
  apc8620_handler,
  IRQF_DISABLED,
  DEVICE_NAME,
  (void*)[i]);
<... fail if status != 0 ...>

I do check all of the return values to verify the call happened
successfully. There are some memory mapping calls that I have left out
since they are working while the interrupt is not.

Things seem to work for the most part because I can read/write data
through a memory map and verify the IndustryPack modules on the carrier
through their header. The memory map is still working sufficiently well
that I can program up one of the IndustryPack modules to generate an
interrupt every 2 seconds or so. Prior to my changes for 2.6.22 this
worked quite well. Since it is the interrupt portion of this game that
is giving me grief, let's stick with just that. apc8620_handler is:

static irqreturn_t apc8620_handler (int irq, void *did)
{
  printk (KERN_NOTICE "apc8620: did (0x%lx)\n", (unsigned long)did);
  <... other irrelevant steps ...>
  return IRQ_HANDLED;
}

I would then expect that every two seconds or so I would see a message
from apc8620_handler pop up. Instead I see nothing. Poking around I see
that the kernel module is loaded and attached to my devices and set for
IRQ 10:

lsmod:  -> acromag8620  4207556  0 
cat /proc/devices   -> 46 apc8620
cat /proc/interrupts -> 10:  0   IO-APIC-edge  apc8620

With /proc/interrupts, LOC keeps growing at a rate faster than what my
hardware is generating and I have no idea what LOC means, but ERR and
MIS (I take it to mean error and missed respectively) are both 0 and
remain 0 indefinitely.
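
One way to confirm from userspace that the count for an IRQ stays at zero is to sample its /proc/interrupts line twice and compare. A minimal parsing sketch follows; the function name is invented, and the line layout (label, colon, one counter per CPU, then the controller name) is assumed from the listing above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts line for the given
 * IRQ label (e.g. "10"). Returns -1 if the line is for another IRQ.
 */
static long irq_count_from_line(const char *line, const char *irq)
{
	const char *p = line;
	size_t n = strlen(irq);
	long total = 0;

	assert(n > 0);
	while (isspace((unsigned char)*p))
		p++;
	if (strncmp(p, irq, n) != 0 || p[n] != ':')
		return -1;
	p += n + 1;
	for (;;) {
		char *end;
		long v;

		while (*p == ' ' || *p == '\t')
			p++;
		if (!isdigit((unsigned char)*p))
			break;		/* reached the controller name */
		v = strtol(p, &end, 10);
		if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
			break;		/* token only starts with a digit */
		total += v;
		p = end;
	}
	return total;
}
```

If two samples taken a few seconds apart return the same total while the scope shows the line toggling, the interrupt is not reaching this IRQ number at all, which points at routing rather than at the handler.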

In /var/log/messages, I do not see any missing interrupt messages or any
other report indicating that there is some trouble.

Assuming no one sees the error I am making right off the bat and would
like me to probe the interrupt system a little bit more, please give me
a suggestion as to where to poke. There is lots of code there and I
would prefer to have a guided poke over a random one.

Anyway, I read through linux/interrupt.h looking for some bit, flag, or
call that I have omitted but found nothing. I understand why the
interrupt handlers have changed, but the changes made should not be
causing this problem.

Again, any and all help in finding my lost interrupt is much
appreciated.

Lastly, I would be happy to give out the entire module to anyone who
requests it, but it is about 550 lines so I did not want to attach it to
this already long post.

-- 
Al Niessner
818.354.0859

All opinions stated above are mine and do not necessarily reflect those
of JPL or NASA.


|  dS  | >= 0





Re: [PATCH 2/9]: Reduce Log I/O latency

2007-11-21 Thread David Chinner
On Thu, Nov 22, 2007 at 01:49:25AM +0100, Andi Kleen wrote:
> David Chinner <[EMAIL PROTECTED]> writes:
> 
> > To ensure that log I/O is issued as the highest priority I/O, set
> > the I/O priority of the log I/O to the highest possible. This will
> > ensure that log I/O is not held up behind bulk data or other
> > metadata I/O as delaying log I/O can pause the entire transaction
> > subsystem. Introduce a new buffer flag to allow us to tag the log
> > buffers so we can discriminate when issuing the I/O.
> 
> Won't that possibly disturb other RT priority users that do not need 
> log IO (e.g. working on preallocated files)? Seems a little
> dangerous.

In all the cases that I know of where ppl are using what could
be considered real-time I/O (e.g. media environments where they
do real-time ingest and playout from the same filesystem) the
real-time ingest processes create the files and do pre-allocation
before doing their I/O. This I/O can get held up behind another
process that is not real time that has issued log I/O. 

Given there is no I/O priority inheritance and having log I/O stall
will stall the entire filesystem, we cannot allow log I/O to
stall in real-time environments. Hence it must have the highest
possible priority to prevent this.
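
To make "highest possible" concrete, here is a sketch of how an I/O priority value is encoded and how two values compare. The class constants and the shift are local copies of the definitions in linux/ioprio.h as recalled, not taken from the patch; `ioprio_value` and `ioprio_rank` are invented helpers, and treating an unset priority as best-effort level 4 is a simplifying assumption.

```c
#include <assert.h>

/* Local copies of the kernel's I/O priority encoding. */
enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE
};
#define IOPRIO_CLASS_SHIFT 13

/* Pack a scheduling class and a level (0 is highest) into one value. */
static unsigned int ioprio_value(unsigned int cls, unsigned int level)
{
	assert(level < 8);		/* RT and BE each have levels 0..7 */
	return (cls << IOPRIO_CLASS_SHIFT) | level;
}

/* Rank for comparison: a smaller rank is served first. */
static unsigned int ioprio_rank(unsigned int ioprio)
{
	unsigned int cls = ioprio >> IOPRIO_CLASS_SHIFT;
	unsigned int level = ioprio & ((1u << IOPRIO_CLASS_SHIFT) - 1);

	if (cls == IOPRIO_CLASS_NONE) {	/* unset: assume BE level 4 */
		cls = IOPRIO_CLASS_BE;
		level = 4;
	}
	return cls * 8 + level;
}
```

In these terms the approach described above corresponds to `ioprio_value(IOPRIO_CLASS_RT, 0)`, which ranks ahead of every other value, while a "higher than bulk but lower than RT" setting would be something like `ioprio_value(IOPRIO_CLASS_BE, 0)`.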

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [patch] 0/4 Support for Toshiba TMIO multifunction devices

2007-11-21 Thread Anton Vorontsov
Hi Ian,

Personally, I very much appreciate your patches; they will
help with submitting the HP iPaq SoCs/MFDs, you know... ;-)

So, many thanks in advance.

A few comments...

On Wed, Nov 21, 2007 at 03:54:15AM +, ian wrote:
> On Wed, 2007-11-21 at 10:23 +0800, eric miao wrote:
> > Roughly went through the patch, looks good, here comes the remind, though :-)
> > 
> > 1. is it possible to use some name other than "soc_core", maybe
> > "tmio_core" so that other multifunction chips sharing a core base
> > will live easier.
> 
> It's (soc-core) not tmio MFD specific - its already used by other MFD
> chips (although obviously not ones in mainline (yet!)
> 
> it might be better named 'mfd-core' though, as thats its intended use...
> 
> > 2. those C++ style comments "//" are not so pleasant...
> 
> Should I clean them up and resubmit?

I'd resubmit a cleaned-up version. I think four or even more
resubmissions are inevitable for such a patch set (new general code +
a lot of drivers).

About the patches themselves... I think soc_add_devices could be
split into two small functions; that would get rid of the high
indentation level and make the code more reader-friendly.


Ideally, checkpatch.pl should be happy. If it is, there will
be fewer nitpicks for somebody to raise. ;-)

Here it is:
- - - -
~/linux-2.6$ scripts/checkpatch.pl ~/0001-Reuseable-SOC-core-code-suitable-for-multifunction-c.patch
WARNING: line over 80 characters
#57: FILE: drivers/mfd/soc-core.c:37:
+#define SIGNED_SHIFT(val, shift) ((shift) >= 0 ? ((val) << (shift)) : ((val) >> -(shift)))

WARNING: line over 80 characters
#60: FILE: drivers/mfd/soc-core.c:40:
+   struct soc_device_data *soc, int nr_devs,

WARNING: line over 80 characters
#84: FILE: drivers/mfd/soc-core.c:64:
+   res = kzalloc(blk->num_resources * sizeof (struct resource), GFP_KERNEL);

WARNING: no space between function name and open parenthesis '('
#84: FILE: drivers/mfd/soc-core.c:64:
+   res = kzalloc(blk->num_resources * sizeof (struct resource), GFP_KERNEL);

ERROR: do not use C99 // comments
#89: FILE: drivers/mfd/soc-core.c:69:
+   res[r].name = blk->res[r].name; // Fixme - should copy

WARNING: braces {} are not necessary for single statement blocks
#93: FILE: drivers/mfd/soc-core.c:73:
+   if (blk->res[r].flags & IORESOURCE_MEM) {
+   base = mem->start;
+   } else if ((blk->res[r].flags & IORESOURCE_IRQ) &&

WARNING: braces {} are not necessary for single statement blocks
#95: FILE: drivers/mfd/soc-core.c:75:
+   } else if ((blk->res[r].flags & IORESOURCE_IRQ) &&
+   (blk->res[r].flags & IORESOURCE_IRQ_SOC_SUBDEVICE)) {
+   base = irq_base;
+   }

WARNING: line over 80 characters
#96: FILE: drivers/mfd/soc-core.c:76:
+   (blk->res[r].flags & IORESOURCE_IRQ_SOC_SUBDEVICE)) {

WARNING: line over 80 characters
#103: FILE: drivers/mfd/soc-core.c:83:
+   res[r].start = base + SIGNED_SHIFT(blk->res[r].start, relative_addr_shift);

WARNING: line over 80 characters
#104: FILE: drivers/mfd/soc-core.c:84:
+   res[r].end   = base + SIGNED_SHIFT(blk->res[r].end,   relative_addr_shift);

ERROR: Missing Signed-off-by: line(s)

total: 2 errors, 9 warnings, 145 lines checked
Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
- - - -
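
One possible way to deal with the first warning is to turn the macro into a small inline function, which also fits in 80 columns. This is only a sketch: the operand type in soc-core.c is not visible in the output above, so `unsigned long` is an assumption, and the assert() is a userspace stand-in for whatever range check the driver would want.

```c
#include <assert.h>
#include <limits.h>

/* Shift left for a non-negative count, right for a negative one. */
static inline unsigned long signed_shift(unsigned long val, int shift)
{
	/* Shifting by the full width or more is undefined in C. */
	assert(shift > -(int)(sizeof(val) * CHAR_BIT) &&
	       shift < (int)(sizeof(val) * CHAR_BIT));
	return shift >= 0 ? val << shift : val >> -shift;
}
```

A function evaluates each argument once, which the macro does not, so callers that pass expressions with side effects behave more predictably.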

There is a false positive though:

if (...) {
single_stmt;
} else {
one;
two;
}

^^^ is perfectly OK and preferred, IIRC. checkpatch isn't ideal,
but it's mostly good.

> More to the point, who should I be submitting them to? the files under
> arm/ are obviously for RMK to peruse, but I couldnt find an entry for
> drivers/mfd in MAINTAINERS...

Well, don't know about drivers/mfd/*. Probably there simply isn't
any [official] maintainer, thus lkml is the right place.

There is one not-so-obvious thing though: you should not submit patches
that To/Cc both lkml (open list) and linux-arm-kernel (subscribers-only).

Russell King will probably point to the linux-arm-kernel etiquette article
(http://www.arm.linux.org.uk/mailinglists/etiquette.php
"Cross-posting between linux-arm* lists and other lists.")

So, either place linux-arm-kernel into Bcc:, or duplicate stuff for
lkml and linux-arm-kernel separately, thus they'll not see each
others' To/Cc.


Looking forward to your patches!

-- 
Anton Vorontsov
email: [EMAIL PROTECTED]
backup email: [EMAIL PROTECTED]
irc://irc.freenode.net/bd2


Re: [UPDATED PATCH] Support for Toshiba TMIO multifunction devices

2007-11-21 Thread Russell King - ARM Linux
On Thu, Nov 22, 2007 at 12:34:09AM +, ian wrote:
> +void mfd_free_devices(struct platform_device *devices, int nr_devs)
> +{
> + struct platform_device *dev = devices;
> + int i;
> +
> + for (i = 0; i < nr_devs; i++) {
> + struct resource *res = dev->resource;
> + platform_device_unregister(dev++);
> + kfree(res);
> + }
> + kfree(devices);
> +}
> +EXPORT_SYMBOL_GPL(mfd_free_devices);

Unfortunately, this is broken as designed (in fact this whole file is.)
I'm not sure why people just don't get it.  sysfs.  devices.  device tree.
It has object lifetime rules.  You can _not_ go around unregistering things
and then immediately freeing them - something else might _still_ be using
stuff even after the call to unregister returns.

It's a potential OOPS just waiting to happen.

That's why we have a proper management API for platform devices.  Please
use it; I didn't add the code just for fun.

See platform_device_alloc() + platform_device_add_resources() +
platform_device_add_data() + platform_device_add() to create, and
platform_device_unregister() to destroy.

(Not looked at the rest because you really really need to get this
right first.)
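
The lifetime rule can be illustrated with a small userspace model of reference counting. It is not the driver-core API: the struct and function names are invented, the counter is not atomic, and the "unregister" step is represented by the creator dropping its reference.

```c
#include <assert.h>
#include <stdlib.h>

struct obj {
	int refs;
	void (*release)(struct obj *);
};

static int released;		/* counts release() calls, for the demo */

/* Called exactly once, when the last reference is dropped. */
static void obj_release(struct obj *o)
{
	released++;
	free(o);
}

static struct obj *obj_alloc(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (o) {
		o->refs = 1;	/* reference held by the creator */
		o->release = obj_release;
	}
	return o;
}

/* Take an extra reference, e.g. for an open sysfs file. */
static void obj_get(struct obj *o)
{
	assert(o->refs > 0);
	o->refs++;
}

/* Drop one reference; memory is freed only when the last one goes. */
static void obj_put(struct obj *o)
{
	assert(o->refs > 0);
	if (--o->refs == 0)
		o->release(o);
}
```

If another holder has called `obj_get()`, the creator's `obj_put()` leaves the object alive; freeing it by hand at that point, as the quoted mfd_free_devices() does with kfree(), would pull memory out from under that holder.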


Re: [stable] null pointer dereference during restart autofs (was: Linux 2.6.22.12)

2007-11-21 Thread Greg KH
On Wed, Nov 21, 2007 at 04:45:10AM +0100, Tomasz Kłoczko wrote:
>
> BUG: unable to handle kernel NULL pointer dereference at virtual address 0014

Did this happen with older versions of 2.6.22.y?

Have you asked the autofs people about this?

thanks,

greg k-h



Re: [PATCH 2/9]: Reduce Log I/O latency

2007-11-21 Thread Andi Kleen
David Chinner <[EMAIL PROTECTED]> writes:

> To ensure that log I/O is issued as the highest priority I/O, set
> the I/O priority of the log I/O to the highest possible. This will
> ensure that log I/O is not held up behind bulk data or other
> metadata I/O as delaying log I/O can pause the entire transaction
> subsystem. Introduce a new buffer flag to allow us to tag the log
> buffers so we can discriminate when issuing the I/O.

Won't that possibly disturb other RT priority users that do not need 
log IO (e.g. working on preallocated files)? Seems a little
dangerous.

I suspect you want a "higher than bulk but lower than RT" priority
for this really unless there is any block RT priority task waiting
for log IO (but keeping track of the latter might be tricky) 

-Andi


Re: [PATCH 1/2] sata_nv: don't use legacy DMA in ADMA mode

2007-11-21 Thread Tejun Heo
Robert Hancock wrote:
> Tejun Heo wrote:
>> Tejun Heo wrote:
>>> If so, can you please add that switching into register mode is okay as
>>> long as there's no other ADMA commands in flight and add
>>> WARN_ON((qc->flags & ATA_QCFLAG_RESULT_TF) && link->sactive)?
>>
>> More accurately, link->sactive test can be substituted with
>> (ap->qc_allocated & ~(1 << qc->tag)).
> 
> Unfortunately we only get the ata_port and ata_taskfile in the tf_read
> callback, so I'm not sure if we can do the equivalent of the qc->flags &
> ATA_QCFLAG_RESULT_TF test (i.e. distinguishing between the
> error-handling case where we care if we abort outstanding commands and
> the normal case with a RESULT_TF command where we do)..

You can test it in ->qc_issue(), no?
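
The bit test suggested earlier in the thread, `(ap->qc_allocated & ~(1 << qc->tag))`, can be tried out in isolation. A minimal sketch with an invented function name, assuming a 32-bit allocation mask:

```c
#include <assert.h>

/* True if any command other than `tag` is marked allocated. */
static int other_cmds_allocated(unsigned int qc_allocated, unsigned int tag)
{
	assert(tag < 32);
	return (qc_allocated & ~(1u << tag)) != 0;
}
```

Clearing the current command's own bit before testing is what distinguishes "something else is in flight" from "only this command is in flight".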

-- 
tejun


[PATCH 9/9] Clean up open coded inode dirty checks

2007-11-21 Thread David Chinner
Use xfs_inode_clean() in more places.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/xfs_inode.c  |   27 +--
 fs/xfs/xfs_inode_item.h |8 
 fs/xfs/xfs_vnodeops.c   |4 +---
 3 files changed, 14 insertions(+), 25 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c   2007-11-22 10:33:57.728849000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c2007-11-22 10:33:59.692597965 +1100
@@ -2158,13 +2158,6 @@ xfs_iunlink_remove(
return 0;
 }
 
-STATIC_INLINE int xfs_inode_clean(xfs_inode_t *ip)
-{
-   return (((ip->i_itemp == NULL) ||
-   !(ip->i_itemp->ili_format.ilf_fields & XFS_ILOG_ALL)) &&
-   (ip->i_update_core == 0));
-}
-
 /* lookup all the inodes in the cluster */
 STATIC int
 xfs_icluster_lookup(
@@ -3067,7 +3060,6 @@ xfs_iflush_cluster(
int ilist_size;
xfs_inode_t **ilist;
xfs_inode_t *iq;
-   xfs_inode_log_item_t*iip;
int nr_found;
int clcount = 0;
int bufwasdelwri;
@@ -3094,13 +3086,8 @@ xfs_iflush_cluster(
 * is a candidate for flushing.  These checks will be repeated
 * later after the appropriate locks are acquired.
 */
-   iip = iq->i_itemp;
-   if ((iq->i_update_core == 0) &&
-   ((iip == NULL) ||
-!(iip->ili_format.ilf_fields & XFS_ILOG_ALL)) &&
- xfs_ipincount(iq) == 0) {
+   if (xfs_inode_clean(iq) && xfs_ipincount(iq) == 0)
continue;
-   }
 
/*
 * Try to get locks.  If any are unavailable or it is pinned,
@@ -3123,10 +3110,8 @@ xfs_iflush_cluster(
 * arriving here means that this inode can be flushed.  First
 * re-check that it's dirty before flushing.
 */
-   iip = iq->i_itemp;
-   if ((iq->i_update_core != 0) || ((iip != NULL) &&
-(iip->ili_format.ilf_fields & XFS_ILOG_ALL))) {
-   int error;
+   if (!xfs_inode_clean(iq)) {
+   int error;
error = xfs_iflush_int(iq, bp);
if (error) {
xfs_iunlock(iq, XFS_ILOCK_SHARED);
@@ -3230,8 +3215,7 @@ xfs_iflush(
 * If the inode isn't dirty, then just release the inode
 * flush lock and do nothing.
 */
-   if ((ip->i_update_core == 0) &&
-   ((iip == NULL) || !(iip->ili_format.ilf_fields & XFS_ILOG_ALL))) {
+   if (xfs_inode_clean(ip)) {
ASSERT((iip != NULL) ?
 !(iip->ili_item.li_flags & XFS_LI_IN_AIL) : 1);
xfs_ifunlock(ip);
@@ -3398,8 +3382,7 @@ xfs_iflush_int(
 * If the inode isn't dirty, then just release the inode
 * flush lock and do nothing.
 */
-   if ((ip->i_update_core == 0) &&
-   ((iip == NULL) || !(iip->ili_format.ilf_fields & XFS_ILOG_ALL))) {
+   if (xfs_inode_clean(ip)) {
xfs_ifunlock(ip);
return 0;
}
Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c 2007-11-22 10:33:57.732848488 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c 2007-11-22 10:33:59.696597454 +1100
@@ -3532,7 +3532,6 @@ xfs_inode_flush(
int flags)
 {
xfs_mount_t *mp = ip->i_mount;
-   xfs_inode_log_item_t *iip = ip->i_itemp;
int error = 0;
 
if (XFS_FORCED_SHUTDOWN(mp))
@@ -3542,8 +3541,7 @@ xfs_inode_flush(
 * Bypass inodes which have already been cleaned by
 * the inode flush clustering code inside xfs_iflush
 */
-   if ((ip->i_update_core == 0) &&
-   ((iip == NULL) || !(iip->ili_format.ilf_fields & XFS_ILOG_ALL)))
+   if (xfs_inode_clean(ip))
return 0;
 
/*
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode_item.h
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode_item.h  2007-11-22 10:25:23.286572511 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode_item.h   2007-11-22 10:33:59.696597454 +1100
@@ -168,6 +168,14 @@ static inline int xfs_ilog_fext(int w)
return (w == XFS_DATA_FORK ? XFS_ILOG_DEXT : XFS_ILOG_AEXT);
 }
 
+STATIC_INLINE int xfs_inode_clean(xfs_inode_t *ip)
+{
+   return (((ip->i_itemp == NULL) ||
+   !(ip->i_itemp->ili_format.ilf_fields & XFS_ILOG_ALL)) &&
+   (ip->i_update_core == 0));
+}
+
+
 #ifdef __KERNEL__
 
 extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);

[PATCH 8/9] Convert inode cache locking to RCU

2007-11-21 Thread David Chinner
Use RCU locking on the inode radix trees

To make use of the efficient radix tree gang lookups for
inode cluster operations we had to increase the time we hold
the radix tree read lock for. This will affect performance
somewhat.

Given that all the lookups are done on a radix tree and we
already have mechanisms to determine if an inode is valid
or not during lookup, we can pretty easily move this across
to lockless lookups using RCU.

The wrinkle is that the current read lock is used to synchronise
inode reclaim and lookup. Luckily, we have the inode flags lock
which is used in the same places as we need for this synchronisation
and hence the code can be easily changed to use this lock for
reclaim/lookup synchronisation.

Also, we can avoid growing the xfs_inode structure to place the
rcu_head structure for the call_rcu() on inode destruction by
reusing the reclaim list list_head structure. We can safely do
this because the inode has been removed from the reclaim list
before the reclaim code calls xfs_idestroy(). This is effectively
the same trick as used in the dentry cache to avoid growing the
dentry structure.
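
To make the reclaim/lookup synchronisation easier to follow, here is a minimal userspace model of the check the lookup side performs while holding the flags lock. It is only a sketch: the flag values and the struct are invented for illustration, and the locking itself is left to the caller.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values; the real XFS_* constants differ. */
enum { IRECLAIM = 1 << 0, IRECLAIMABLE = 1 << 1 };
enum { IGET_CREATE = 1 << 0 };

struct inode {
    unsigned int flags;   /* protected by the flags lock in the kernel */
    unsigned int mode;    /* 0 models an inode that has been freed */
};

/* Caller is assumed to hold the flags lock. */
static int reclaim_check(struct inode *ip, int iget_flags)
{
    /* Reclaim has already started: back off and retry the lookup. */
    if (ip->flags & IRECLAIM)
        return EAGAIN;
    assert(ip->flags & IRECLAIMABLE);

    /* Racing with unlink: fail without touching the reclaim state. */
    if (ip->mode == 0 && !(iget_flags & IGET_CREATE))
        return ENOENT;

    /* Lookup wins: take the inode back out of the reclaimable state. */
    ip->flags &= ~IRECLAIMABLE;
    return 0;
}
```

The point of doing all three steps under one lock is that reclaim cannot move the inode from IRECLAIMABLE to IRECLAIM between the test and the clear.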

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/xfs_ag.h   |2 
 fs/xfs/xfs_iget.c |  107 ++
 fs/xfs/xfs_inode.c|   47 -
 fs/xfs/xfs_inode.h|   14 +-
 fs/xfs/xfs_mount.c|2 
 fs/xfs/xfs_vnodeops.c |8 ---
 6 files changed, 108 insertions(+), 72 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c    2007-11-22 10:33:53.993326524 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-22 10:33:57.724849511 +1100
@@ -40,6 +40,37 @@
 #include "xfs_utils.h"
 
 /*
+ * Attempt to move the inode out of the IRECLAIMABLE state.
+ * Must be called under rcu_read_lock() and with the ip->i_flags_lock
+ * held for synchronisation with xfs_ireclaim_finish().
+ */
+STATIC int
+xfs_iget_reclaim_check(
+   xfs_inode_t *ip,
+   int flags)
+{
+   /*
+* If IRECLAIM is set this inode is on its way out of the system, we
+* need to pause and try again.
+*/
+   if (__xfs_iflags_test(ip, XFS_IRECLAIM))
+   return EAGAIN;
+   ASSERT(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+
+   /*
+* If lookup is racing with unlink, then we should return an error
+* immediately so we don't remove it from the reclaim list and
+* potentially leak the inode.
+*/
+
+   if ((ip->i_d.di_mode == 0) && !(flags & XFS_IGET_CREATE))
+   return ENOENT;
+
+   __xfs_iflags_clear(ip, XFS_IRECLAIMABLE);
+   return 0;
+}
+
+/*
  * Look up an inode by number in the given file system.
  * The inode is looked up in the cache held in each AG.
  * If the inode is found in the cache, attach it to the provided
@@ -94,7 +125,7 @@ xfs_iget_core(
agino = XFS_INO_TO_AGINO(mp, ino);
 
 again:
-   read_lock(&pag->pag_ici_lock);
+   rcu_read_lock();
ip = radix_tree_lookup(&pag->pag_ici_root, agino);
 
if (ip != NULL) {
@@ -103,52 +134,44 @@ again:
 * we need to pause and try again.
 */
if (xfs_iflags_test(ip, XFS_INEW)) {
-   read_unlock(&pag->pag_ici_lock);
+   rcu_read_unlock();
delay(1);
XFS_STATS_INC(xs_ig_frecycle);
 
goto again;
}
 
+   /*
+* Determine if the inode is queued for reclaim or being
+* reclaimed.  This is trickier now we are under RCU locking.
+*
+* Basically, xfs_ireclaim_finish() uses the i_flags_lock to
+* atomically move the inode out of the IRECLAIMABLE state and
+* into the IRECLAIM state, so we have to use the same lock to
+* do an equivalent set of tests and move the inode out of the
+* IRECLAIMABLE state.
+*/
old_inode = ip->i_vnode;
if (old_inode == NULL) {
-   /*
-* If IRECLAIM is set this inode is
-* on its way out of the system,
-* we need to pause and try again.
-*/
-   if (xfs_iflags_test(ip, XFS_IRECLAIM)) {
-   read_unlock(&pag->pag_ici_lock);
-   delay(1);
+   spin_lock(&ip->i_flags_lock);
+   error = xfs_iget_reclaim_check(ip, flags);
+   spin_unlock(&ip->i_flags_lock);
+   rcu_read_unlock();
+   if (error) {
XFS_STATS_INC(xs_ig_frecycle);
-
-   goto again;
-

[PATCH 7/9] Use radix_tree_gang_lookup_range for cluster lookups

2007-11-21 Thread David Chinner
Use radix_tree_gang_lookup_range() for inode cluster lookups

Now that we have an efficient lookup method for the radix tree,
convert cluster lookups to use it. Factor out the common
lookup, add some debug checking to it and call it where needed.

For sanity, we need to hold the radix tree lock in read
mode across the entire set of locking operations done to
ensure we can operate on the inodes. This does increase
the length of time we hold the lock in read mode, but we'll
correct that with another patch.
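
As a rough illustration of the index arithmetic behind the cluster lookup, here is a userspace sketch that computes the first and last radix tree index of the cluster containing a given inode. It simplifies the patch's calculation by assuming the cluster does not wrap past the end of the AG; the struct and function names are made up for this example.

```c
#include <assert.h>

struct range { unsigned long first, last; };

/* The mask arithmetic only works when clsize is a power of two. */
static struct range cluster_range(unsigned long agino, unsigned long clsize)
{
    struct range r;
    unsigned long mask;

    assert(clsize != 0 && (clsize & (clsize - 1)) == 0);
    mask = ~(clsize - 1);
    r.first = agino & mask;          /* round down to the cluster start */
    r.last = r.first + clsize - 1;   /* last index still in the cluster */
    return r;
}
```

With 32 inodes per cluster, inode 37 falls in the cluster covering indexes 32 to 63, so a ranged gang lookup never needs to look beyond index 63.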

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/xfs_inode.c |   83 +++--
 1 file changed, 56 insertions(+), 27 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c   2007-11-22 10:33:53.993326524 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c        2007-11-22 10:33:55.877085717 +1100
@@ -2165,6 +2165,37 @@ STATIC_INLINE int xfs_inode_clean(xfs_in
(ip->i_update_core == 0));
 }
 
+/* lookup all the inodes in the cluster */
+STATIC int
+xfs_icluster_lookup(
+   xfs_mount_t *mp,
+   xfs_perag_t *pag,
+   xfs_ino_t   ino,
+   xfs_inode_t **ilist,
+   int clsize)
+{
+   unsigned long   first_index, last_index, mask;
+   int nr_found;
+
+   mask = ~(clsize - 1);
+   first_index = XFS_INO_TO_AGINO(mp, ino) & mask;
+   last_index = (XFS_INO_TO_AGINO(mp, ino + clsize) & mask) - 1;
+   nr_found = radix_tree_gang_lookup_range(&pag->pag_ici_root,
+   (void**)ilist, first_index, last_index,
+   clsize);
+   ASSERT(nr_found <= clsize);
+#ifdef DEBUG
+{  int i;
+   xfs_inode_t *iq;
+   for (i = 0; i < nr_found; i++) {
+   iq = ilist[i];
+   ASSERT((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index);
+   }
+}
+#endif
+   return nr_found;
+}
+
 STATIC void
 xfs_ifree_cluster(
xfs_inode_t *free_ip,
@@ -2178,7 +2209,7 @@ xfs_ifree_cluster(
int i, j, found, pre_flushed;
xfs_daddr_t blkno;
xfs_buf_t   *bp;
-   xfs_inode_t *ip, **ip_found;
+   xfs_inode_t *ip, **ip_found, **ilist;
xfs_inode_log_item_t*iip;
xfs_log_item_t  *lip;
xfs_perag_t *pag = xfs_get_perag(mp, inum);
@@ -2195,8 +2226,10 @@ xfs_ifree_cluster(
}
 
ip_found = kmem_alloc(ninodes * sizeof(xfs_inode_t *), KM_NOFS);
+   ilist = kmem_alloc(ninodes * sizeof(xfs_inode_t *), KM_NOFS);
 
for (j = 0; j < nbufs; j++, inum += ninodes) {
+   int nr_found;
blkno = XFS_AGB_TO_DADDR(mp, XFS_INO_TO_AGNO(mp, inum),
 XFS_INO_TO_AGBNO(mp, inum));
 
@@ -2213,24 +2246,23 @@ xfs_ifree_cluster(
 * and fail, we need some other form of interlock
 * here.
 */
+   read_lock(&pag->pag_ici_lock);
+   nr_found = xfs_icluster_lookup(mp, pag, inum, ilist, ninodes);
+   if (nr_found == 0) {
+   read_unlock(&pag->pag_ici_lock);
+   continue;
+   }
found = 0;
-   for (i = 0; i < ninodes; i++) {
-   read_lock(&pag->pag_ici_lock);
-   ip = radix_tree_lookup(&pag->pag_ici_root,
-   XFS_INO_TO_AGINO(mp, (inum + i)));
+   for (i = 0; i < nr_found; i++) {
+   ip = ilist[i];
 
/* Inode not in memory or we found it already,
 * nothing to do
 */
-   if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) {
-   read_unlock(&pag->pag_ici_lock);
+   if (!ip || xfs_iflags_test(ip, XFS_ISTALE))
continue;
-   }
-
-   if (xfs_inode_clean(ip)) {
-   read_unlock(&pag->pag_ici_lock);
+   if (xfs_inode_clean(ip))
continue;
-   }
 
/* If we can get the locks then add it to the
 * list, otherwise by the time we get the bp lock
@@ -2251,7 +2283,6 @@ xfs_ifree_cluster(
ip_found[found++] = ip;
}
}
-   read_unlock(&pag->pag_ici_lock);
continue;
}
 
@@ -2269,8 +2300,8 @@ xfs_ifree_cluster(
xfs_iunlock(ip, XFS_ILOCK_EXCL);
}
  

[PATCH 6/9] Remove xfs_icluster

2007-11-21 Thread David Chinner
Remove the xfs_icluster structure and replace with a radix tree lookup.

We don't need to keep a list of inodes in each cluster around anymore
as we can look them up quickly when we need to. The only time we need
to do this now is during inode writeback.

Factor the inode cluster writeback code out of xfs_iflush and convert
it to use radix_tree_gang_lookup() instead of walking a list of
inodes built when we first read in the inodes.

This removes 3 pointers from each xfs_inode structure and the xfs_icluster
structure per inode cluster. Hence we reduce the cache footprint of the
xfs_inodes by between 5-10% depending on cluster sparseness.

To be truly efficient we need a radix_tree_gang_lookup_range() call
to stop searching once we are past the end of the cluster instead
of trying to find a full cluster's worth of inodes.

Before (ia64):

$ cat /sys/slab/xfs_inode/object_size
536

After:

$ cat /sys/slab/xfs_inode/object_size
512

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_ksyms.c |1 
 fs/xfs/xfs_iget.c|   49 ---
 fs/xfs/xfs_inode.c   |  266 ---
 fs/xfs/xfs_inode.h   |   16 --
 fs/xfs/xfs_vfsops.c  |5 
 fs/xfs/xfsidbg.c |4 
 6 files changed, 154 insertions(+), 187 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c    2007-11-22 10:25:24.178458638 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-22 10:33:53.993326524 +1100
@@ -78,7 +78,6 @@ xfs_iget_core(
xfs_inode_t *ip;
xfs_inode_t *iq;
int error;
-   xfs_icluster_t  *icl, *new_icl = NULL;
unsigned long   first_index, mask;
xfs_perag_t *pag;
xfs_agino_t agino;
@@ -229,11 +228,9 @@ finish_inode:
}
 
/*
-* This is a bit messy - we preallocate everything we _might_
-* need before we pick up the ici lock. That way we don't have to
-* juggle locks and go all the way back to the start.
+* Preload the radix tree so we can insert safely under the
+* write spinlock.
 */
-   new_icl = kmem_zone_alloc(xfs_icluster_zone, KM_SLEEP);
if (radix_tree_preload(GFP_KERNEL)) {
delay(1);
goto again;
@@ -241,17 +238,6 @@ finish_inode:
mask = ~(((XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog)) - 1);
first_index = agino & mask;
write_lock(&pag->pag_ici_lock);
-
-   /*
-* Find the cluster if it exists
-*/
-   icl = NULL;
-   if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
-   first_index, 1)) {
-   if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
-   icl = iq->i_cluster;
-   }
-
/*
 * insert the new inode
 */
@@ -266,30 +252,13 @@ finish_inode:
}
 
/*
-* These values _must_ be set before releasing ihlock!
+* These values _must_ be set before releasing the radix tree lock!
 */
ip->i_udquot = ip->i_gdquot = NULL;
xfs_iflags_set(ip, XFS_INEW);
 
-   ASSERT(ip->i_cluster == NULL);
-
-   if (!icl) {
-   spin_lock_init(&new_icl->icl_lock);
-   INIT_HLIST_HEAD(&new_icl->icl_inodes);
-   icl = new_icl;
-   new_icl = NULL;
-   } else {
-   ASSERT(!hlist_empty(&icl->icl_inodes));
-   }
-   spin_lock(&icl->icl_lock);
-   hlist_add_head(&ip->i_cnode, &icl->icl_inodes);
-   ip->i_cluster = icl;
-   spin_unlock(&icl->icl_lock);
-
write_unlock(&pag->pag_ici_lock);
radix_tree_preload_end();
-   if (new_icl)
-   kmem_zone_free(xfs_icluster_zone, new_icl);
 
/*
 * Link ip to its mount and thread it on the mount's inode list.
@@ -528,18 +497,6 @@ xfs_iextract(
xfs_put_perag(mp, pag);
 
/*
-* Remove from cluster list
-*/
-   mp = ip->i_mount;
-   spin_lock(&ip->i_cluster->icl_lock);
-   hlist_del(&ip->i_cnode);
-   spin_unlock(&ip->i_cluster->icl_lock);
-
-   /* was last inode in cluster? */
-   if (hlist_empty(&ip->i_cluster->icl_inodes))
-   kmem_zone_free(xfs_icluster_zone, ip->i_cluster);
-
-   /*
 * Remove from mount's inode list.
 */
XFS_MOUNT_ILOCK(mp);
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c   2007-11-22 10:33:51.037704348 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c        2007-11-22 10:33:53.993326524 +1100
@@ -53,7 +53,6 @@
 
 kmem_zone_t *xfs_ifork_zone;
 kmem_zone_t *xfs_inode_zone;
-kmem_zone_t *xfs_icluster_zone;
 
 /*
  * Used in xfs_itruncate().  This is the maximum number of extents
@@ -3014,6 +3013,151 @@ xfs_iflush_fork(
   

[PATCH 5/9] Don't block pdflush when flushing inodes

2007-11-21 Thread David Chinner
When pdflush is writing back inodes, it can get stuck on inode cluster
buffers that are currently under I/O. This occurs when we write data to
multiple inodes in the same inode cluster at the same time.

Effectively, delayed allocation marks the inode dirty during the data
writeback. Hence if the inode cluster was flushed during the writeback
of the first inode, the writeback of the second inode will block waiting
for the inode cluster write to complete before writing it again for the
newly dirtied inode.

We want to prevent this so that we don't block pdflush and slow down
all of writeback. Hence we introduce a non-blocking async inode flush
flag that pdflush uses. If this flag is set, we use non-blocking
operations (e.g. try locks) wherever we can to avoid blocking or
issuing extra I/O.
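
The general shape of a "try the lock, return EAGAIN if busy" flush path can be sketched in userspace as below. This is an illustration only: the cluster_buf struct and flush_cluster() are hypothetical, and a C11 atomic_flag stands in for the buffer lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct cluster_buf {
    atomic_flag locked;   /* stands in for the buffer lock */
    int writes;           /* number of times the buffer was written */
};

/* Flush one cluster buffer; never waits when noblock is set. */
static int flush_cluster(struct cluster_buf *cb, bool noblock)
{
    assert(cb != NULL);
    if (noblock) {
        if (atomic_flag_test_and_set(&cb->locked))
            return EAGAIN;      /* busy: let the caller come back later */
    } else {
        while (atomic_flag_test_and_set(&cb->locked))
            ;                   /* blocking path: wait until it is free */
    }
    cb->writes++;               /* models issuing the write */
    atomic_flag_clear(&cb->locked);
    return 0;
}
```

A caller in the position of pdflush would pass noblock and treat EAGAIN as "skip this inode for now" rather than as an error.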

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_super.c |3 +
 fs/xfs/linux-2.6/xfs_vnode.h |5 --
 fs/xfs/xfs_inode.c   |   82 +--
 fs/xfs/xfs_inode.h   |8 +++-
 fs/xfs/xfs_trans_buf.c   |3 +
 fs/xfs/xfs_vnodeops.c|   55 ++--
 6 files changed, 79 insertions(+), 77 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c   2007-11-22 10:33:43.014729931 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c        2007-11-22 10:33:51.037704348 +1100
@@ -183,12 +183,20 @@ xfs_imap_to_bp(
int ni;
xfs_buf_t   *bp;
 
+   if (buf_flags == 0)
+   buf_flags = XFS_BUF_LOCK;
+
error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
-  (int)imap->im_len, XFS_BUF_LOCK, &bp);
+  (int)imap->im_len, buf_flags, &bp);
if (error) {
-   cmn_err(CE_WARN, "xfs_imap_to_bp: xfs_trans_read_buf()returned "
+   if (error != EAGAIN) {
+   cmn_err(CE_WARN,
+   "xfs_imap_to_bp: xfs_trans_read_buf()returned "
"an error %d on %s.  Returning error.",
error, mp->m_fsname);
+   } else {
+   ASSERT(buf_flags & XFS_BUF_TRYLOCK);
+   }
return error;
}
 
@@ -306,14 +314,15 @@ xfs_inotobp(
  * 0 for the disk block address.
  */
 int
-xfs_itobp(
+xfs_itobp_flags(
xfs_mount_t *mp,
xfs_trans_t *tp,
xfs_inode_t *ip,
xfs_dinode_t**dipp,
xfs_buf_t   **bpp,
xfs_daddr_t bno,
-   uintimap_flags)
+   uintimap_flags,
+   uintbuf_flags)
 {
xfs_imap_t  imap;
xfs_buf_t   *bp;
@@ -344,10 +353,17 @@ xfs_itobp(
}
ASSERT(bno == 0 || bno == imap.im_blkno);
 
-   error = xfs_imap_to_bp(mp, tp, &imap, &bp, XFS_BUF_LOCK, imap_flags);
+   error = xfs_imap_to_bp(mp, tp, &imap, &bp, buf_flags, imap_flags);
if (error)
return error;
 
+   if (!bp) {
+   ASSERT(buf_flags & XFS_BUF_TRYLOCK);
+   ASSERT(tp == NULL);
+   *bpp = NULL;
+   return EAGAIN;
+   }
+
*dipp = (xfs_dinode_t *)xfs_buf_offset(bp, imap.im_boffset);
*bpp = bp;
return 0;
@@ -3023,6 +3039,7 @@ xfs_iflush(
int bufwasdelwri;
struct hlist_node   *entry;
enum { INT_DELWRI = (1 << 0), INT_ASYNC = (1 << 1) };
+   int noblock = (flags == XFS_IFLUSH_ASYNC_NOBLOCK);
 
XFS_STATS_INC(xs_iflush_count);
 
@@ -3047,11 +3064,22 @@ xfs_iflush(
}
 
/*
-* We can't flush the inode until it is unpinned, so
-* wait for it.  We know noone new can pin it, because
-* we are holding the inode lock shared and you need
-* to hold it exclusively to pin the inode.
+* We can't flush the inode until it is unpinned, so wait for it if we
+* are allowed to block.  We know noone new can pin it, because we are
+* holding the inode lock shared and you need to hold it exclusively to
+* pin the inode.
+*
+* If we are not allowed to block, force the log out asynchronously so
+* that when we come back the inode will be unpinned. If other inodes
+* in the same cluster are dirty, they will probably write the inode
+* out for us if they occur after the log force completes.
 */
+
+   if (noblock && xfs_ipincount(ip)) {
+   xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE);
+   xfs_ifunlock(ip);
+   return EAGAIN;
+   }
xfs_iunpin_wait(ip);
 
/*
@@ -3068,15 +3096,6 @@ xfs_iflush(
}
 
/*
-* Get the buffer containing the on-disk inode.
-*/
-   

[UPDATED PATCH] Support for Toshiba TMIO multifunction devices

2007-11-21 Thread ian
On Wed, 2007-11-21 at 12:05 +0800, eric miao wrote:
> On Nov 21, 2007 11:54 AM, ian <[EMAIL PROTECTED]> wrote:
> > On Wed, 2007-11-21 at 10:23 +0800, eric miao wrote:
> > > Roughly went through the patch, looks good, here comes the remind, though 
> > > :-)
> > >
> > > 1. is it possible to use some name other than "soc_core", maybe
> > > "tmio_core" so that other multifunction chips sharing a core base
> > > will live easier.
> >
> > It's (soc-core) not tmio MFD specific - it's already used by other MFD
> > chips (although obviously not ones in mainline yet!)

I've renamed soc-core to mfd-core in the patches attached to this
message.

> > > 2. those C++ style comments "//" are not so pleasant...
> >
> > Should I clean them up and resubmit?
> 
> Will be nice then, anyway, could you inline them so others can comment?

All done.

> Well, I briefly went through the git history, looks like Russell is the proper
> one you could sent them to (probably not) :-)

I've added RMK to the CC.

I've omitted the platform support for e-series - I'll send that to RMK
once this is merged.

Patches follow:

>From 9c4ffb764ae2366368a0038a6fbdd9a19ce430c4 Mon Sep 17 00:00:00 2001
From: Ian Molton <[EMAIL PROTECTED]>
Date: Wed, 21 Nov 2007 23:32:37 +
Subject: [PATCH] Reusable MFD core code suitable for multifunction chips with
 built in IRQ multiplexing and local RAM.

---
 drivers/mfd/Kconfig|   25 
 drivers/mfd/Makefile   |3 +
 drivers/mfd/mfd-core.c |  102

 drivers/mfd/mfd-core.h |   26 
 include/linux/ioport.h |3 +
 5 files changed, 159 insertions(+), 0 deletions(-)
 create mode 100644 drivers/mfd/mfd-core.c
 create mode 100644 drivers/mfd/mfd-core.h

diff --git a/drivers/mfd/Kconfig b/drivers/mfd/Kconfig
index 2571619..38edfdc 100644
--- a/drivers/mfd/Kconfig
+++ b/drivers/mfd/Kconfig
@@ -15,6 +15,31 @@ config MFD_SM501
  interface. The device may be connected by PCI or local bus with
  varying functions enabled.
 
+config MFD_T7L66XB
+   bool "Toshiba T7L66XB SoC support"
+   ---help---
+ This driver supports the T7L66XB, which incorporates SD/MMC, and
+ USB host functionality. associated subdevices are:
+ tmio_mmc
+ tmio_ohci
+
+config MFD_TC6387XB
+   bool "Toshiba TC6387XB SoC support"
+   ---help---
+ This driver supports the TC6387XB, which incorporates SD/MMC, NAND,
+ Video, and USB host functionality. associated subdevices are:
+ tmio_mmc
+
+config MFD_TC6393XB
+   bool "Toshiba TC6393XB SoC support"
+   ---help---
+ This driver supports the TC6393XB, which incorporates SD/MMC, NAND,
+ Video, and USB host functionality. associated subdevices are:
+ tmio_mmc
+ tmio_nand
+ tmio_fb
+ tmio_ohci
+
 endmenu
 
 menu "Multimedia Capabilities Port drivers"
diff --git a/drivers/mfd/Makefile b/drivers/mfd/Makefile
index 5143209..5ae3877 100644
--- a/drivers/mfd/Makefile
+++ b/drivers/mfd/Makefile
@@ -3,6 +3,9 @@
 #
 
 obj-$(CONFIG_MFD_SM501)+= sm501.o
+obj-$(CONFIG_MFD_T7L66XB)   += t7l66xb.o  mfd-core.o
+obj-$(CONFIG_MFD_TC6387XB)  += tc6387xb.o mfd-core.o
+obj-$(CONFIG_MFD_TC6393XB)  += tc6393xb.o mfd-core.o
 
 obj-$(CONFIG_MCP)  += mcp-core.o
 obj-$(CONFIG_MCP_SA11X0)   += mcp-sa11x0.o
diff --git a/drivers/mfd/mfd-core.c b/drivers/mfd/mfd-core.c
new file mode 100644
index 000..e668c92
--- /dev/null
+++ b/drivers/mfd/mfd-core.c
@@ -0,0 +1,102 @@
+/*
+ * drivers/mfd/mfd-core.c
+ *
+ * core MFD support
+ * Copyright (c) 2006 Ian Molton
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include "mfd-core.h"
+
+void mfd_free_devices(struct platform_device *devices, int nr_devs)
+{
+   struct platform_device *dev = devices;
+   int i;
+
+   for (i = 0; i < nr_devs; i++) {
+   struct resource *res = dev->resource;
+   platform_device_unregister(dev++);
+   kfree(res);
+   }
+   kfree(devices);
+}
+EXPORT_SYMBOL_GPL(mfd_free_devices);
+
+#define SIGNED_SHIFT(val, shift) ((shift) >= 0 ? ((val) << (shift)) : ((val) >> -(shift)))
+
+struct platform_device *mfd_add_devices(struct platform_device *dev,
+   struct mfd_device_data *mfd, int nr_devs,
+   struct resource *mem,
+   int relative_addr_shift, int irq_base)
+{
+   struct platform_device *devices;
+   int i, r, base;
+
+   devices = kzalloc(nr_devs * sizeof(struct platform_device), GFP_KERNEL);
+   if (!devices)
+   return NULL;
+
+   for (i = 0; i < nr_devs; i++) {
+   

[PATCH 3/9] Use _META bio I/O types for metadata I/O

2007-11-21 Thread David Chinner
Improve metadata I/O merging in the elevator

Change all async metadata buffers to use [READ|WRITE]_META I/O types
so that the I/O doesn't get issued immediately. This allows merging
of adjacent metadata requests but still prioritises them over bulk
data. This shows a 10-15% improvement in sequential create speed of
small files.

Don't include the log buffers in this classification - leave them
as sync types so they are issued immediately.
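
The resulting decision order (barrier first, then log buffers as sync, then other async metadata as _META, then plain I/O) can be modelled in a few lines of userspace C. The flag and enum names below are invented for the sketch and do not match the real XBF_* or bio values.

```c
#include <assert.h>

enum buf_flags {                 /* hypothetical bit values */
    BF_READ       = 1 << 0,
    BF_WRITE      = 1 << 1,
    BF_ORDERED    = 1 << 2,
    BF_LOG_BUFFER = 1 << 3,
    BF_RUN_QUEUES = 1 << 4,
    BF_READ_AHEAD = 1 << 5,
};

enum io_type {
    IO_READ, IO_READA, IO_WRITE,
    IO_READ_SYNC, IO_WRITE_SYNC,
    IO_READ_META, IO_WRITE_META,
    IO_WRITE_BARRIER,
};

/* Pick the I/O type in the same priority order as the patched code. */
static enum io_type pick_io_type(unsigned int flags)
{
    if (flags & BF_ORDERED) {
        assert(!(flags & BF_READ));   /* ordered buffers are writes */
        return IO_WRITE_BARRIER;
    }
    if (flags & BF_LOG_BUFFER)        /* log I/O stays synchronous */
        return (flags & BF_WRITE) ? IO_WRITE_SYNC : IO_READ_SYNC;
    if (flags & BF_RUN_QUEUES)        /* other async metadata: mergeable */
        return (flags & BF_WRITE) ? IO_WRITE_META : IO_READ_META;
    if (flags & BF_WRITE)
        return IO_WRITE;
    return (flags & BF_READ_AHEAD) ? IO_READA : IO_READ;
}
```

Note that the log-buffer test has to come before the run-queues test, otherwise log writes would be demoted to the _META type as well.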

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_buf.c |6 +-
 include/linux/fs.h |1 +
 2 files changed, 6 insertions(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c   2007-11-22 10:53:11.556186722 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c        2007-11-22 10:53:43.748024392 +1100
@@ -1175,10 +1175,14 @@ _xfs_buf_ioapply(
if (bp->b_flags & XBF_ORDERED) {
ASSERT(!(bp->b_flags & XBF_READ));
rw = WRITE_BARRIER;
-   } else if (bp->b_flags & _XBF_RUN_QUEUES) {
+   } else if (bp->b_flags & XBF_LOG_BUFFER) {
ASSERT(!(bp->b_flags & XBF_READ_AHEAD));
bp->b_flags &= ~_XBF_RUN_QUEUES;
rw = (bp->b_flags & XBF_WRITE) ? WRITE_SYNC : READ_SYNC;
+   } else if (bp->b_flags & _XBF_RUN_QUEUES) {
+   ASSERT(!(bp->b_flags & XBF_READ_AHEAD));
+   bp->b_flags &= ~_XBF_RUN_QUEUES;
+   rw = (bp->b_flags & XBF_WRITE) ? WRITE_META : READ_META;
} else {
rw = (bp->b_flags & XBF_WRITE) ? WRITE :
 (bp->b_flags & XBF_READ_AHEAD) ? READA : READ;
Index: 2.6.x-xfs-new/include/linux/fs.h
===
--- 2.6.x-xfs-new.orig/include/linux/fs.h   2007-11-22 10:47:21.965392742 +1100
+++ 2.6.x-xfs-new/include/linux/fs.h        2007-11-22 10:53:43.748024392 +1100
@@ -83,6 +83,7 @@ extern int dir_notify_enable;
 #define READ_SYNC  (READ | (1 << BIO_RW_SYNC))
 #define READ_META  (READ | (1 << BIO_RW_META))
 #define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNC))
+#define WRITE_META (WRITE | (1 << BIO_RW_META))
 #define WRITE_BARRIER  ((1 << BIO_RW) | (1 << BIO_RW_BARRIER))
 
 #define SEL_IN 1


[PATCH 4/9] Factor common inode cluster buffer lookup code

2007-11-21 Thread David Chinner
Factor xfs_itobp() and xfs_inotobp().

The only difference between the functions is that one takes an
inode for the lookup and the other takes an inode number.
However, they do not perform the same validity checking or set
the same state on the returned buffer, although they should.

Factor the functions into a common implementation.
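
One of the checks that ends up shared is the bounds test on the inode map. A minimal userspace sketch of that test is below; the struct and function names are made up for illustration, and both values are assumed to be in the same units (basic blocks).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct imap {
    unsigned long long blkno;   /* start of the inode cluster buffer */
    unsigned long long len;     /* length of the buffer */
};

/* Reject a mapping that extends past the end of the filesystem. */
static int imap_check_bounds(const struct imap *m,
                             unsigned long long fs_blocks)
{
    assert(m != NULL);
    if (m->blkno + m->len > fs_blocks)
        return EINVAL;
    return 0;
}
```

Doing this once in the common path means both the by-inode and by-number lookups fail the same way before any read is issued.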

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/xfs_inode.c |  283 -
 1 file changed, 129 insertions(+), 154 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c   2007-11-22 10:31:44.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c        2007-11-22 10:33:43.014729931 +1100
@@ -124,6 +124,126 @@ xfs_inobp_check(
 #endif
 
 /*
+ * Simple wrapper for calling xfs_imap() that includes error
+ * and bounds checking
+ */
+STATIC int
+xfs_ino_to_imap(
+   xfs_mount_t *mp,
+   xfs_trans_t *tp,
+   xfs_ino_t   ino,
+   xfs_imap_t  *imap,
+   uintimap_flags)
+{
+   int error;
+
+   error = xfs_imap(mp, tp, ino, imap, imap_flags);
+   if (error) {
+   cmn_err(CE_WARN, "xfs_ino_to_imap: xfs_imap()  returned an "
+   "error %d on %s.  Returning error.",
+   error, mp->m_fsname);
+   return error;
+   }
+
+   /*
+* If the inode number maps to a block outside the bounds
+* of the file system then return NULL rather than calling
+* read_buf and panicing when we get an error from the
+* driver.
+*/
+   if ((imap->im_blkno + imap->im_len) >
+   XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) {
+   xfs_fs_cmn_err(CE_ALERT, mp, "xfs_ino_to_imap: "
+   "(imap->im_blkno (0x%llx) + imap->im_len (0x%llx)) > "
+   " XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) (0x%llx)",
+   (unsigned long long) imap->im_blkno,
+   (unsigned long long) imap->im_len,
+   XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks));
+   return XFS_ERROR(EINVAL);
+   }
+   return 0;
+}
+
+/*
+ * Find the buffer associated with the given inode map
+ * We do basic validation checks on the buffer once it has been
+ * retrieved from disk.
+ */
+STATIC int
+xfs_imap_to_bp(
+   xfs_mount_t *mp,
+   xfs_trans_t *tp,
+   xfs_imap_t  *imap,
+   xfs_buf_t   **bpp,
+   uintbuf_flags,
+   uintimap_flags)
+{
+   int error;
+   int i;
+   int ni;
+   xfs_buf_t   *bp;
+
+   error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
+  (int)imap->im_len, XFS_BUF_LOCK, &bp);
+   if (error) {
+   cmn_err(CE_WARN, "xfs_imap_to_bp: xfs_trans_read_buf()returned "
+   "an error %d on %s.  Returning error.",
+   error, mp->m_fsname);
+   return error;
+   }
+
+   /*
+* Validate the magic number and version of every inode in the buffer
+* (if DEBUG kernel) or the first inode in the buffer, otherwise.
+*/
+#ifdef DEBUG
+   ni = BBTOB(imap->im_len) >> mp->m_sb.sb_inodelog;
+#else  /* usual case */
+   ni = 1;
+#endif
+
+   for (i = 0; i < ni; i++) {
+   int di_ok;
+   xfs_dinode_t*dip;
+
+   dip = (xfs_dinode_t *)xfs_buf_offset(bp,
+   (i << mp->m_sb.sb_inodelog));
+   di_ok = be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC &&
+   XFS_DINODE_GOOD_VERSION(dip->di_core.di_version);
+   if (unlikely(XFS_TEST_ERROR(!di_ok, mp,
+   XFS_ERRTAG_ITOBP_INOTOBP,
+   XFS_RANDOM_ITOBP_INOTOBP))) {
+   if (imap_flags & XFS_IMAP_BULKSTAT) {
+   xfs_trans_brelse(tp, bp);
+   return XFS_ERROR(EINVAL);
+   }
+   XFS_CORRUPTION_ERROR("xfs_imap_to_bp",
+   XFS_ERRLEVEL_HIGH, mp, dip);
+#ifdef DEBUG
+   cmn_err(CE_PANIC,
+   "Device %s - bad inode magic/vsn "
+   "daddr %lld #%d (magic=%x)",
+   XFS_BUFTARG_NAME(mp->m_ddev_targp),
+   (unsigned long long)imap->im_blkno, i,
+   be16_to_cpu(dip->di_core.di_magic));
+#endif
+   xfs_trans_brelse(tp, bp);
+   return XFS_ERROR(EFSCORRUPTED);
+   }
+

tun device supplementary group ownership

2007-11-21 Thread Mike Mohr
Hi,

It seems to me that supplementary groups should be taken into account
when checking for permissions on a tun device.  Can someone comment on
my patch below; is it a reasonable approach?  If so, I'd like to
submit it for inclusion in the kernel under the GPL.
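
To spell out the intended rule, here is a small userspace model of the check with supplementary groups taken into account. Everything in it (struct cred, struct tun, in_groups(), tun_may_attach()) is a hypothetical stand-in for the kernel structures, and the net_admin field stands in for capable(CAP_NET_ADMIN).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ID_UNSET (-1)      /* models an unset tun owner or group */

struct cred {
    int euid, egid;
    const int *groups;     /* supplementary groups */
    size_t ngroups;
    bool net_admin;        /* stands in for capable(CAP_NET_ADMIN) */
};

struct tun { int owner, group; };

static bool in_groups(const struct cred *c, int gid)
{
    for (size_t i = 0; i < c->ngroups; i++)
        if (c->groups[i] == gid)
            return true;
    return false;
}

/* Returns true when attaching to the device should be allowed. */
static bool tun_may_attach(const struct tun *t, const struct cred *c)
{
    assert(t != NULL && c != NULL);

    bool owner_mismatch = t->owner != ID_UNSET && c->euid != t->owner;
    bool group_mismatch = t->group != ID_UNSET &&
                          c->egid != t->group &&
                          !in_groups(c, t->group);

    return !((owner_mismatch || group_mismatch) && !c->net_admin);
}
```

A process whose effective gid differs from the device group is still allowed in if the group appears in its supplementary list, which is the behaviour the patch is after.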

Please forward any responses to me directly in addition to the lkml.

Mike

--- tun.c   2007-11-16 10:14:27.0 -0800
+++ tun.c.new   2007-11-21 16:12:15.0 -0800
@@ -471,7 +471,8 @@
   if (((tun->owner != -1 &&
 current->euid != tun->owner) ||
(tun->group != -1 &&
- current->egid != tun->group)) &&
+ (current->egid != tun->group &&
+  !groups_search(current->group_info, tun->group)))) &&
!capable(CAP_NET_ADMIN))
   return -EPERM;
   }


[PATCH 2/9]: Reduce Log I/O latency

2007-11-21 Thread David Chinner
Reduce log I/O latency

To ensure that log I/O is issued as the highest priority I/O, set
the I/O priority of the log I/O to the highest possible. This will
ensure that log I/O is not held up behind bulk data or other
metadata I/O as delaying log I/O can pause the entire transaction
subsystem. Introduce a new buffer flag to allow us to tag the log
buffers so we can discriminate when issuing the I/O.
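
For reference, the priority value packs a scheduling class and a per-class level into one integer. The sketch below re-declares the encoding locally in userspace, as I understand it from linux/ioprio.h, rather than including the kernel header; the helper function names are made up.

```c
#include <assert.h>

/* Local copies of the kernel's ioprio encoding, for illustration. */
#define IOPRIO_CLASS_SHIFT 13

enum {
    IOPRIO_CLASS_NONE,
    IOPRIO_CLASS_RT,
    IOPRIO_CLASS_BE,
    IOPRIO_CLASS_IDLE,
};

/* Combine a class and a per-class level into one priority value. */
static unsigned int ioprio_value(unsigned int ioclass, unsigned int data)
{
    assert(data < (1u << IOPRIO_CLASS_SHIFT));
    return (ioclass << IOPRIO_CLASS_SHIFT) | data;
}

/* Recover the class from a packed priority value. */
static unsigned int ioprio_class_of(unsigned int value)
{
    return value >> IOPRIO_CLASS_SHIFT;
}
```

The patch asks for the realtime class at level 0, which is the most urgent setting an I/O scheduler that honours priorities can be given.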

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_buf.c |3 +++
 fs/xfs/linux-2.6/xfs_buf.h |5 -
 fs/xfs/xfs_log.c   |2 ++
 3 files changed, 9 insertions(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c   2007-11-22 10:47:21.937396362 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c        2007-11-22 10:53:11.556186722 +1100
@@ -1255,6 +1255,9 @@ next_chunk:
 
 submit_io:
if (likely(bio->bi_size)) {
+   /* log I/O should not be delayed by anything. */
+   if (bp->b_flags & XBF_LOG_BUFFER)
+   bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
submit_bio(rw, bio);
if (size)
goto next_chunk;
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.h
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.h   2007-11-22 10:47:21.945395328 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.h        2007-11-22 10:53:11.556186722 +1100
@@ -53,7 +53,8 @@ typedef enum {
XBF_DELWRI = (1 << 6),  /* buffer has dirty pages  */
XBF_STALE = (1 << 7),   /* buffer has been staled, do not find it  */
XBF_FS_MANAGED = (1 << 8),  /* filesystem controls freeing memory  */
-   XBF_ORDERED = (1 << 11),/* use ordered writes  */
+   XBF_LOG_BUFFER = (1 << 9),  /* Buffer issued by the log*/
+   XBF_ORDERED = (1 << 11),/* use ordered writes  */
XBF_READ_AHEAD = (1 << 12), /* asynchronous read-ahead */
 
/* flags used only as arguments to access routines */
@@ -340,6 +341,8 @@ extern void xfs_buf_trace(xfs_buf_t *, c
 #define XFS_BUF_TARGET(bp) ((bp)->b_target)
 #define XFS_BUFTARG_NAME(target)   xfs_buf_target_name(target)
 
+#define XFS_BUF_SET_LOGBUF(bp) ((bp)->b_flags |= XBF_LOG_BUFFER)
+
 static inline int xfs_bawrite(void *mp, xfs_buf_t *bp)
 {
bp->b_fspriv3 = mp;
Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2007-11-22 10:47:21.945395328 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_log.c  2007-11-22 10:53:11.556186722 +1100
@@ -1443,6 +1443,8 @@ xlog_sync(xlog_t  *log,
XFS_BUF_ZEROFLAGS(bp);
XFS_BUF_BUSY(bp);
XFS_BUF_ASYNC(bp);
+   XFS_BUF_SET_LOGBUF(bp);
+
/*
 * Do an ordered write for the log block.
 * Its unnecessary to flush the first split block in the log wrap case.


[PATCH 1/9]: introduce radix_tree_gang_lookup_range

2007-11-21 Thread David Chinner

Introduce radix_tree_gang_lookup_range()

The inode clustering in XFS requires a gang lookup on the radix tree to
find all the inodes in the cluster.  The gang lookup has to set the
maximum items to that of a fully populated cluster so we get all the
inodes in the cluster, but we only populate the radix tree sparsely (on
demand).

As a result, the gang lookup can search way, way past the index of the
end of the cluster because it is looking for a fixed number of entries
to return.

We know we want to terminate the search at either a specific index or a
maximum number of items, so we need to add a "last_index" parameter to
the lookup.

Furthermore, the existing radix_tree_gang_lookup() can use this same
function if we define a RADIX_TREE_MAX_INDEX value so the search is not
limited by the last_index.
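
As a rough illustration of the termination rule described above, here is a
userspace sketch. It is not the kernel implementation: the sparse tree is
modelled as a sorted array of populated indices, and every name
(struct sparse_tree, toy_gang_lookup_range, toy_gang_lookup) is hypothetical.
It only shows the two stop conditions (max_items reached, or index past
last_index) and how the unbounded lookup can reuse the ranged one.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Toy stand-in for a sparsely populated radix tree: a sorted array of
 * the indices that are present. Hypothetical, for illustration only. */
struct sparse_tree {
	const unsigned long *indices;	/* sorted ascending */
	size_t nr;			/* number of populated indices */
};

/* Collect up to max_items populated indices in [first_index, last_index].
 * Returns the number of entries written to results. */
static unsigned int
toy_gang_lookup_range(const struct sparse_tree *tree, unsigned long *results,
		      unsigned long first_index, unsigned long last_index,
		      unsigned int max_items)
{
	unsigned int nr_found = 0;
	size_t i;

	assert(tree != NULL && results != NULL);

	for (i = 0; i < tree->nr && nr_found < max_items; i++) {
		unsigned long index = tree->indices[i];

		if (index < first_index)
			continue;
		if (index > last_index)
			break;	/* past the end of the range: stop searching */
		results[nr_found++] = index;
	}
	return nr_found;
}

/* The unbounded lookup is the ranged lookup with the largest possible key,
 * so only max_items limits it. */
static unsigned int
toy_gang_lookup(const struct sparse_tree *tree, unsigned long *results,
		unsigned long first_index, unsigned int max_items)
{
	return toy_gang_lookup_range(tree, results, first_index, ULONG_MAX,
				     max_items);
}
```

With populated indices {3, 5, 40, 41, 900} and a cluster covering indices
0..31, the ranged lookup returns only the two entries inside the cluster,
while the unbounded one keeps walking until it has max_items entries.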

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 include/linux/radix-tree.h |7 -
 lib/radix-tree.c   |   55 -
 2 files changed, 51 insertions(+), 11 deletions(-)

Index: 2.6.x-xfs-new/include/linux/radix-tree.h
===
--- 2.6.x-xfs-new.orig/include/linux/radix-tree.h	2007-11-22 10:25:23.834502553 +1100
+++ 2.6.x-xfs-new/include/linux/radix-tree.h	2007-11-22 10:31:46.689597763 +1100
@@ -98,10 +98,11 @@ do { \
  * radix_tree_lookup
  * radix_tree_tag_get
  * radix_tree_gang_lookup
+ * radix_tree_gang_lookup_range
  * radix_tree_gang_lookup_tag
  * radix_tree_tagged
  *
- * The first 4 functions are able to be called locklessly, using RCU. The
+ * The first 5 functions are able to be called locklessly, using RCU. The
  * caller must ensure calls to these functions are made within rcu_read_lock()
  * regions. Other readers (lock-free or otherwise) and modifications may be
  * running concurrently.
@@ -155,6 +156,10 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_range(struct radix_tree_root *root, void **results,
+   unsigned long first_index, unsigned long last_index,
+   unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
Index: 2.6.x-xfs-new/lib/radix-tree.c
===
--- 2.6.x-xfs-new.orig/lib/radix-tree.c 2007-11-22 10:31:24.564425190 +1100
+++ 2.6.x-xfs-new/lib/radix-tree.c  2007-11-22 10:31:46.693597252 +1100
@@ -62,6 +62,8 @@ struct radix_tree_path {
 #define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
 #define RADIX_TREE_MAX_PATH (RADIX_TREE_INDEX_BITS/RADIX_TREE_MAP_SHIFT + 2)
 
+#define RADIX_TREE_MAX_KEY ~0UL
+
 static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH] __read_mostly;
 
 /*
@@ -599,7 +601,8 @@ EXPORT_SYMBOL(radix_tree_tag_get);
 
 static unsigned int
 __lookup(struct radix_tree_node *slot, void **results, unsigned long index,
-   unsigned int max_items, unsigned long *next_index)
+   unsigned long last_index, unsigned int max_items,
+   unsigned long *next_index)
 {
unsigned int nr_found = 0;
unsigned int shift, height;
@@ -640,6 +643,8 @@ __lookup(struct radix_tree_node *slot, v
if (nr_found == max_items)
goto out;
}
+   if (index > last_index)
+   goto out;
}
 out:
*next_index = index;
@@ -647,27 +652,29 @@ out:
 }
 
 /**
- * radix_tree_gang_lookup - perform multiple lookup on a radix tree
+ * radix_tree_gang_lookup_range - perform multiple lookup on a radix tree
  * @root:  radix tree root
  * @results:   where the results of the lookup are placed
  * @first_index:   start the lookup from this key
+ * @last_index:end the lookup at this key
  * @max_items: place up to this many items at *results
  *
- * Performs an index-ascending scan of the tree for present items.  Places
- * them at *@results and returns the number of items which were placed at
- * *@results.
+ * Performs an index-ascending scan of the tree for present items up to
+ * @last_index in the tree.  Places them at *@results and returns the
+ * number of items which were placed at *@results.
  *
  * The implementation is naive.
  *
- * Like radix_tree_lookup, radix_tree_gang_lookup may be called under
+ * Like radix_tree_lookup, radix_tree_gang_lookup_range may be called under
  * rcu_read_lock. In this case, rather than the returned results being
  * an atomic snapshot of the tree at a single point in time, the semantics
  * of an 

[PATCH 0/9]: Various XFS inode clustering improvements

2007-11-21 Thread David Chinner

Normally I wouldn't bother cc'ing lkml on XFS changes, however a
couple of these patches touch generic code. The changes to generic
code are introducing a WRITE_META bio type and
radix_tree_gang_lookup_range() and hence the wider distribution.
This patch set is against the current xfs-dev tree so bits of
it may not apply to current mainline.

Overall, the patch set is focussed on improving the XFS inode
cache and clustering code. It reduces memory usage of the
cache by 5-10% and improves performance on some workloads
by 10-15%.

Comments welcome.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC/PATCH] SO_NO_CHECK for IPv6

2007-11-21 Thread Jeff Garzik

YOSHIFUJI Hideaki / 吉藤英明 wrote:
> In article <[EMAIL PROTECTED]> (at Wed, 21 Nov 2007 07:45:32 -0500), Jeff Garzik <[EMAIL PROTECTED]> says:
>
>> SO_NO_CHECK support for IPv6 appeared to be missing. This is presented,
>> based on a reading of net/ipv4/udp.c.
>
> Disagree. UDP checksum is mandatory in IPv6.

Ah, you mean that I need to turn off UDP checksum on receive end as well 
in IPv6...  true.


For those interested, I am dealing with a UDP app that already does very 
strong checksumming and encryption, so additional software checksumming 
at the lower layers is quite simply a waste of CPU cycles.  Hardware 
checksumming is fine, as long as it's "free."


Jeff





Re: radeonfb i2c regression post-2.6.18.

2007-11-21 Thread Benjamin Herrenschmidt

On Wed, 2007-11-21 at 23:56 +, Roger Leigh wrote:
> Fantastic, thanks!  I've copied this to Debian bugs 433236 and 426124
> which were about this problem.
> 
> BTW, the framebuffer penguin logo looked a little weird (low number of
> colours, odd colours), though on my powerpc it has always looked odd
> (wrong colours).  Could there be some endianness bug in the fblogo
> code?  I'll check it with other video options when I next have a few
> minutes.

Strange, it's always been working fine for me.

Ben.




Re: radeonfb i2c regression post-2.6.18.

2007-11-21 Thread Roger Leigh
Benjamin Herrenschmidt <[EMAIL PROTECTED]> writes:

>> > Can you try the patch from Jean that I pasted below and let us know if
>> > it helps ? It looks like the releasing of the i2c lines may have been
>> > done backward.
>> 
>> This patch fixes the problem.  The monitor stays powered on during the
>> switch to the framebuffer.
>
> Excellent ! That saves me having to test myself :-)
>
> As far as I'm concerned, that's an Ack for the patch.

Fantastic, thanks!  I've copied this to Debian bugs 433236 and 426124
which were about this problem.

BTW, the framebuffer penguin logo looked a little weird (low number of
colours, odd colours), though on my powerpc it has always looked odd
(wrong colours).  Could there be some endianness bug in the fblogo
code?  I'll check it with other video options when I next have a few
minutes.


Thanks again,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Re: network driver usage count

2007-11-21 Thread Francois Romieu
Wagner Ferenc <[EMAIL PROTECTED]> :
[...]
> So why can I remove a driver serving live network traffic?

Why not? It is quite common to physically remove a network/storage
device.

-- 
Ueimor


Re: mmap dirty limits on 32 bit kernels (Was: [BUG] New Kernel Bugs)

2007-11-21 Thread Bron Gondwana
On Thu, Nov 15, 2007 at 08:32:22AM -0800, Linus Torvalds wrote:
> On Thu, 15 Nov 2007, Bron Gondwana wrote:
> > 
> > I guess we'll be doing the one-liner kernel mod and testing
> > that then.
> 
> The thing to look at is "get_dirty_limits()" in mm/page-writeback.c, and 
> in this particular case it's the
> 
>   unsigned long available_memory = determine_dirtyable_memory();
> 
> that's going to bite you. In particular, note the
> 
>   x -= highmem_dirtyable_memory(x);
> 
> that we do in determine_dirtyable_memory().
> 
> So in this case, if you basically remove that line, it will allow all of 
> memory to be dirtied (including highmem), and then the background_ratio 
> will work on the whole 6GB.
> 
> HOWEVER! It's worth noting that we also have some other old legacy cruft 
> there that may interfere with your code. In particular, if you look at the 
> top of "get_dirty_limits()", it *also* does a
> 
> unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
> global_page_state(NR_ANON_PAGES)) * 100) /
> available_memory;
> 
> dirty_ratio = vm_dirty_ratio;
> if (dirty_ratio > unmapped_ratio / 2)
> dirty_ratio = unmapped_ratio / 2;
> 
> and that whole "unmapped_ratio" comparison is probably bogus these days, 
> since we now take the mapped dirty pages into account. That code harks 
> back to the days before we did that, and dirty ratios only affected 
> non-mapped pages.
> 
> And in particular, now that I look at it, I wonder if it can even go 
> negative (because "available_memory" may be *smaller* than the 
> NR_FILE_MAPPED|ANON_PAGES sum!).
> 
> We'll fix up a negative value anyway (because of the clamping of 
> dirty_ratio to no less than 5), but the point is that the whole 
> "unmapped_ratio" thing probably doesn't make sense any more, and may well 
> make the dirty_ratio not work for you, because you may have a very small 
> unmapped_ratio that effectively makes all dirty limits always clamp to a 
> very small value.
> 
> So regardless, I think you may want to try the appended patch *first*.
> 
> If this patch makes a difference, please holler. I think it's the correct 
> thing to do, but I'm not going to actually commit it without somebody 
> saying that it makes a difference (and preferably Peter Zijlstra and 
> Andrew acking it too).
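
To make the clamping described in the quoted text concrete, here is a small
userspace model of the arithmetic. It is only a sketch: old_dirty_ratio and
new_dirty_ratio are hypothetical names, the page counts are plain parameters
rather than global_page_state() values, and signed long arithmetic is used so
that the negative case is well defined.

```c
#include <assert.h>

/* Model of the logic before the patch: dirty_ratio is capped at half the
 * percentage of unmapped memory, then floored at 5. mapped_pages stands
 * for NR_FILE_MAPPED + NR_ANON_PAGES. */
static int old_dirty_ratio(long mapped_pages, long available_pages,
			   int vm_dirty_ratio)
{
	int unmapped_ratio;
	int dirty_ratio = vm_dirty_ratio;

	assert(available_pages > 0);

	/* Goes negative when mapped_pages exceeds available_pages. */
	unmapped_ratio = (int)(100 - (mapped_pages * 100) / available_pages);

	if (dirty_ratio > unmapped_ratio / 2)
		dirty_ratio = unmapped_ratio / 2;
	if (dirty_ratio < 5)
		dirty_ratio = 5;
	return dirty_ratio;
}

/* Model of the logic after the patch: only the floor of 5 remains. */
static int new_dirty_ratio(int vm_dirty_ratio)
{
	return vm_dirty_ratio < 5 ? 5 : vm_dirty_ratio;
}
```

For example, with 300 mapped pages but only 200 counted as available
(highmem excluded), unmapped_ratio is -50 and the old code pins dirty_ratio
at the floor of 5 no matter what vm_dirty_ratio was set to.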

mmap: mmap call failed: errno: 12 errmsg: Cannot allocate memory

Yep, that's "fixed" the problem alright!  No way this puppy is
dirtying 2Gb of memory any more.

http://linux.brong.fastmail.fm/2007-11-22/bmtest.pl

That said, pushing the size down to 1700 rather than 2000 in that
file makes it run, and the behaviour matches the 2000 Mb case on
2.6.16.55 rather than 2.6.20.20 or 2.6.23.1 (my other test case
kernels that happened to be pre-built on that machine)

[EMAIL PROTECTED] ~]$ free
             total       used       free     shared    buffers     cached
Mem:       4149836    2073056    2076780          0      22036    1846096
-/+ buffers/cache:     204924    3944912
Swap:      2096472          0    2096472

That's after running the 1700Mb version.  You can see this machine is our
one remaining 4Gb machine (it's not running any production services unlike
the 6Gb machine, so it's better for testing)

Anyway - looks like this may be a "good enough" solution for out1 if
it can manage an ~2Gb file with 6Gb of memory available.  I'll test
that later today - but I should drag myself into the office now...

Bron.

(patch left attached below for reference)

> Only *after* testing this change is it probably a good idea to test the 
> real hack of then removing the highmem_dirtyable_memory() thing. 
> 
> Peter? Andrew?
> 
>   Linus
> 
> ---
>  mm/page-writeback.c |8 
>  1 files changed, 0 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 81a91e6..d55cfca 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -297,20 +297,12 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
>  {
>   int background_ratio;   /* Percentages */
>   int dirty_ratio;
> - int unmapped_ratio;
>   long background;
>   long dirty;
>   unsigned long available_memory = determine_dirtyable_memory();
>   struct task_struct *tsk;
>  
> - unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
> - global_page_state(NR_ANON_PAGES)) * 100) /
> - available_memory;
> -
>   dirty_ratio = vm_dirty_ratio;
> - if (dirty_ratio > unmapped_ratio / 2)
> - dirty_ratio = unmapped_ratio / 2;
> -
>   if (dirty_ratio < 5)
>   dirty_ratio = 5;
>  

Re: [PATCH] [2.6.24-rc3-mm1] loop cleanup in fs/namespace.c - repost

2007-11-21 Thread Dmitri Vorobiev
Andrew Morton wrote:
> On Thu, 22 Nov 2007 01:49:19 +0300
> Dmitri Vorobiev <[EMAIL PROTECTED]> wrote:
> 
>> Zach Brown wrote:
>>>>> This doesn't look fine.  Did you test this?
>>>> Oops, my fault. Of course, I tested the patch, but kernel modules are
>>>> disabled in my test setup, so I missed the error.
>>> :)
>>>
>>>> Enclosed to this message is a new patch, which replaces the goto-loop by
>>>> the while-based one, but leaves the EXPORT_SYMBOL macro intact.
>>> It certainly looks OK to me now, for whatever that's worth. 
>> Zach, thank you for the code review and suggestions.
>>
>>> You probably want to wait 'till the next merge window to get it in,
>>> though.  It's just a cleanup and so shouldn't go in this late in the -rc
>>> line.
>>>
>>> Maybe Andrew will be willing to queue it until that time in -mm.
>> I am enclosing the patch against current -mm tree and adding Andrew to the 
>> Cc: list.
>>
>> Thanks,
>>
>> Dmitri
>>
>>> - z
>>>
>>
> [loop-cleanup-fs-namespace-mm.diff  text/x-patch (742B)]
> Signed-off-by: Dmitri Vorobiev <[EMAIL PROTECTED]>
> ---
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 79883fe..b098b63 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -606,19 +606,17 @@ static inline void __mntput(struct vfsmo
>  
>  void mntput_no_expire(struct vfsmount *mnt)
>  {
> -repeat:
> -	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
> +	while (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
>  		if (likely(!mnt->mnt_pinned)) {
>  			spin_unlock(&vfsmount_lock);
>  			__mntput(mnt);
> -			return;
> +			break;
>  		}
>  		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
>  		mnt->mnt_pinned = 0;
>  		spin_unlock(&vfsmount_lock);
>  		acct_auto_close_mnt(mnt);
>  		security_sb_umount_close(mnt);
> -		goto repeat;
>  	}
>  }
>  
> This patch has no changelog which I can use.
> 
> 

Andrew, thanks for the quick reply. I believe that a couple of sentences is 
enough for the changelog entry, so here it goes...

From: Dmitri Vorobiev <[EMAIL PROTECTED]>

The mntput_no_expire() routine implements a simple loop using the goto-based 
construct. Replace this with an equivalent while-based loop, which looks much 
cleaner in C code.
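
A quick way to convince oneself that the two loop forms behave the same is a
userspace toy like the one below. This is only a sketch: struct toy_mnt,
toy_dec_and_test, put_goto and put_while are hypothetical, the spinlock is
left out, and the acct_auto_close_mnt()/security_sb_umount_close() calls are
omitted, so it models the control flow and the counter arithmetic only.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct vfsmount: just the fields the loop touches. */
struct toy_mnt {
	int count;	/* models mnt_count */
	int pinned;	/* models mnt_pinned */
	bool freed;	/* set where __mntput() would be called */
};

/* Models atomic_dec_and_lock(): true when the count drops to zero. */
static bool toy_dec_and_test(struct toy_mnt *m)
{
	assert(m->count > 0);
	return --m->count == 0;
}

/* Original shape: "if" plus a backwards goto. */
static void put_goto(struct toy_mnt *m)
{
repeat:
	if (toy_dec_and_test(m)) {
		if (!m->pinned) {
			m->freed = true;
			return;
		}
		m->count += m->pinned + 1;
		m->pinned = 0;
		goto repeat;
	}
}

/* Patched shape: the same steps written as a while loop. */
static void put_while(struct toy_mnt *m)
{
	while (toy_dec_and_test(m)) {
		if (!m->pinned) {
			m->freed = true;
			break;
		}
		m->count += m->pinned + 1;
		m->pinned = 0;
	}
}
```

Because nothing follows the loop in the function body, "break" in the while
form leaves the function exactly where "return" did in the goto form.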


Signed-off-by: Dmitri Vorobiev <[EMAIL PROTECTED]>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 79883fe..b098b63 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -606,19 +606,17 @@ static inline void __mntput(struct vfsmo
 
 void mntput_no_expire(struct vfsmount *mnt)
 {
-repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
+	while (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
 		if (likely(!mnt->mnt_pinned)) {
 			spin_unlock(&vfsmount_lock);
 			__mntput(mnt);
-			return;
+			break;
 		}
 		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
 		mnt->mnt_pinned = 0;
 		spin_unlock(&vfsmount_lock);
 		acct_auto_close_mnt(mnt);
 		security_sb_umount_close(mnt);
-		goto repeat;
 	}
 }
 


Re: [PATCH] [2.6.24-rc3-mm1] loop cleanup in fs/namespace.c - repost

2007-11-21 Thread Andrew Morton
On Thu, 22 Nov 2007 01:49:19 +0300
Dmitri Vorobiev <[EMAIL PROTECTED]> wrote:

> Zach Brown wrote:
> >>> This doesn't look fine.  Did you test this?
> >> Oops, my fault. Of course, I tested the patch, but kernel modules are
> >> disabled in my test setup, so I missed the error.
> > 
> > :)
> > 
> >> Enclosed to this message is a new patch, which replaces the goto-loop by
> >> the while-based one, but leaves the EXPORT_SYMBOL macro intact.
> > 
> > It certainly looks OK to me now, for whatever that's worth. 
> 
> Zach, thank you for the code review and suggestions.
> 
> > 
> > You probably want to wait 'till the next merge window to get it in,
> > though.  It's just a cleanup and so shouldn't go in this late in the -rc
> > line.
> > 
> > Maybe Andrew will be willing to queue it until that time in -mm.
> 
> I am enclosing the patch against current -mm tree and adding Andrew to the 
> Cc: list.
> 
> Thanks,
> 
> Dmitri
> 
> > 
> > - z
> > 
> 
> 
[loop-cleanup-fs-namespace-mm.diff  text/x-patch (742B)]
Signed-off-by: Dmitri Vorobiev <[EMAIL PROTECTED]>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 79883fe..b098b63 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -606,19 +606,17 @@ static inline void __mntput(struct vfsmo
 
 void mntput_no_expire(struct vfsmount *mnt)
 {
-repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
+	while (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
 		if (likely(!mnt->mnt_pinned)) {
 			spin_unlock(&vfsmount_lock);
 			__mntput(mnt);
-			return;
+			break;
 		}
 		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
 		mnt->mnt_pinned = 0;
 		spin_unlock(&vfsmount_lock);
 		acct_auto_close_mnt(mnt);
 		security_sb_umount_close(mnt);
-		goto repeat;
 	}
 }
 
This patch has no changelog which I can use.



Re: [PATCH 3/3] PNP cleanups - Version 2 - Pass struct pnp_dev to pnp_clean_resource_table for cleanup reasons

2007-11-21 Thread Thomas Renninger
On Wed, 2007-11-21 at 18:13 +, Alan Cox wrote:
> > > in the pnp_dev. That is, the resources are tied to the device, with
> > > struct pnp_resource_table being no more than a handy container to
> > > group them under a single name.
> > Putting the count into struct resource does not make sense.
> 
> Can you explain that claim ?
The additional variable would only make sense for the pnp layer, or only
for the pnp resource table in the pnp layer, but struct resource is used
in many more places...
It is meant for System Memory and IO port resources in general, so why
waste bytes and an additional name everywhere it is used, just for
the pnp resource table?

> > The idea is to not rely on the exact pnp resource table structure and
> > abstracting this to macros. If krealloc approach works,
> > dev->res.port_resource[i].start would even still work, if not, it's
> > easier to alter the pnp resource table and the macros internally.
> 
> Externally in drivers yes. Internally in code no - it makes the code
> harder to work with.
> 
> > > Yes, I don't know how he intends to deal with this (nor, in fact, just how 
> > > dynamic things are supposed to end up to begin with) so over to Thomas.
> > Krealloc should only get used at early pnp init time, when the BIOS
> > structures are parsed. The devices shouldn't be active then...
> > A bit of a problem, as said, could be the sysfs interfaces, there it
> > must be ensured krealloc is not used anymore.
> 
> I don't think it's that simple but that can be dealt with once the changes
> are in place if the objects are sensibly laid out.

I hope it is; stay tuned, something will come soon...
If it's not that easy, another structure would be needed and every
dev->res.port_resource[i].start and friends need to be touched (I don't
see how this could still be resolved in a simple array then...).
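
The macro abstraction being argued for can be sketched in userspace as
follows. Everything here is hypothetical (the toy_ structures, the fixed
array size, and the macro names); the point is only that callers which go
through accessors do not need to change if the table layout behind
dev->res.port_resource[i] later changes.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PORT 8	/* hypothetical fixed table size */

struct toy_resource {
	unsigned long start;
	unsigned long end;
};

struct toy_resource_table {
	struct toy_resource port_resource[TOY_MAX_PORT];
};

struct toy_dev {
	struct toy_resource_table res;
};

/* Accessors: the only place that knows how the table is laid out. */
#define toy_port_res(dev, i)	(&(dev)->res.port_resource[(i)])
#define toy_port_start(dev, i)	(toy_port_res((dev), (i))->start)
#define toy_port_end(dev, i)	(toy_port_res((dev), (i))->end)
#define toy_port_len(dev, i) \
	(toy_port_end((dev), (i)) - toy_port_start((dev), (i)) + 1)

/* Example caller that never spells out the table layout itself. */
static void toy_set_port(struct toy_dev *dev, size_t i,
			 unsigned long start, unsigned long len)
{
	assert(dev != NULL && i < TOY_MAX_PORT && len > 0);
	toy_port_start(dev, i) = start;
	toy_port_end(dev, i) = start + len - 1;
}
```

If the array were later replaced by a dynamically sized table, only
toy_port_res() would need to be rewritten in this sketch.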

Thomas



Re: 2.6.24-rc3-mm1- powerpc link failure

2007-11-21 Thread Stephen Rothwell
On Wed, 21 Nov 2007 13:36:30 +0530 Kamalesh Babulal <[EMAIL PROTECTED]> wrote:
>
> The kernel build fails on powerpc while linking,

Only for allyesconfig (or maybe some other config that builds a lot of
stuff in).

>   AS  .tmp_kallsyms3.o
>   LD  vmlinux.o
> ld: TOC section size exceeds 64k
> make: *** [vmlinux.o] Error 1
> 
> The patch posted at http://lkml.org/lkml/2007/11/13/414, solves this failure.

However, that patch needs more testing especially to figure out what
performance effects it has.  i.e. not for merging, yet.

-- 
Cheers,
Stephen Rothwell[EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/




Re: [PATCH] [2.6.24-rc3-mm1] loop cleanup in fs/namespace.c - repost

2007-11-21 Thread Dmitri Vorobiev
Zach Brown wrote:
>>> This doesn't look fine.  Did you test this?
>> Oops, my fault. Of course, I tested the patch, but kernel modules are
>> disabled in my test setup, so I missed the error.
> 
> :)
> 
>> Enclosed to this message is a new patch, which replaces the goto-loop by
>> the while-based one, but leaves the EXPORT_SYMBOL macro intact.
> 
> It certainly looks OK to me now, for whatever that's worth. 

Zach, thank you for the code review and suggestions.

> 
> You probably want to wait 'till the next merge window to get it in,
> though.  It's just a cleanup and so shouldn't go in this late in the -rc
> line.
> 
> Maybe Andrew will be willing to queue it until that time in -mm.

I am enclosing the patch against current -mm tree and adding Andrew to the Cc: 
list.

Thanks,

Dmitri

> 
> - z
> 

Signed-off-by: Dmitri Vorobiev <[EMAIL PROTECTED]>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 79883fe..b098b63 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -606,19 +606,17 @@ static inline void __mntput(struct vfsmo
 
 void mntput_no_expire(struct vfsmount *mnt)
 {
-repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
+	while (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
 		if (likely(!mnt->mnt_pinned)) {
 			spin_unlock(&vfsmount_lock);
 			__mntput(mnt);
-			return;
+			break;
 		}
 		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
 		mnt->mnt_pinned = 0;
 		spin_unlock(&vfsmount_lock);
 		acct_auto_close_mnt(mnt);
 		security_sb_umount_close(mnt);
-		goto repeat;
 	}
 }
 


Error returns not handled correctly by sysfs.c:subsys_attr_store()

2007-11-21 Thread Andrew Patterson
The buf in fs/sysfs.c:subsys_attr_store() does not seem to be reset
correctly when a negative value (indicating that an error condition has
occurred) is returned.  After such a return, the buf passed to the next
call to subsys_attr_store holds the previous call's contents with the new
data appended.  Example: I have modified sd.c:sd_store_allow_restart to
always print out the contents of buf and return an error, using the
following patch:

--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -183,6 +183,9 @@ static ssize_t sd_store_allow_restart(struct class_device *c
struct scsi_disk *sdkp = to_scsi_disk(cdev);
struct scsi_device *sdp = sdkp->device;
 
+   printk(KERN_ERR "buf_ptr = 0x%p, buf = %s, count = %u\n", buf, buf, coun
+   return -EINVAL;
+
if (!capable(CAP_SYS_ADMIN))
return -EACCES;

I get the following output when writing invalid values to the
allow_restart sysfs file:

# echo x > /sys/class/scsi_disk/4:0:0:0/allow_restart
bash: echo: write error: Invalid argument
# echo y > /sys/class/scsi_disk/4:0:0:0/allow_restart
bash: echo: write error: Invalid argument
# echo z > /sys/class/scsi_disk/4:0:0:0/allow_restart
bash: echo: write error: Invalid argument

And the console output shows:

buf_ptr = 0xe1004bdb, buf = x
, count = 2
buf_ptr = 0xe1004bdb, buf = x
, count = 2
buf_ptr = 0xe1004bdb, buf = x
y
, count = 4
buf_ptr = 0xe1004bdb, buf = x
y
, count = 4
buf_ptr = 0xe1004caf, buf = x
y
z
, count = 6
buf_ptr = 0xe1004caf, buf = x
y
z
, count = 6

The same append problem occurs when writing to a different sysfs file:

# echo xyzzy > /sys/class/scsi_disk/4:0:1:0/allow_restart
bash: echo: write error: Invalid argument

buf_ptr = 0xe1004caf, buf = x
y
z
xyzzy
, count = 12

I found this problem in 2.6.24-rc3 and an earlier version of 2.6.23.

This seems to work correctly on 2.6.18 (at least with the RHEL5 kernel I
did some testing with), i.e. the contents of buf from the previous
failed call are thrown away/overwritten. I looked through sysfs.c to
see if I could find anything obvious but could not see anything.
Perhaps this is handled at a higher level. 

-- 
Andrew Patterson
Hewlett-Packard



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-21 Thread Andrew Morton
On Wed, 21 Nov 2007 22:45:22 +0100
Laurent Riffard <[EMAIL PROTECTED]> wrote:

> Le 21.11.2007 05:45, Andrew Morton a écrit :
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> 
> Hello, 
> 
> My system hangs shortly after I log in to the Gnome desktop. SysRq-W shows
> that a bunch of tasks are blocked in "D" state, they seem to wait for
> some I/O completion. I can try to hand-copy some data if requested.
> 
> I found these messages in dmesg:
> 
> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
> EXT3-fs: mounted filesystem with ordered data mode.
> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sda, sector 16460
> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> ReiserFS: sda7: using ordered data mode
> --
> ReiserFS: sda7: Using r5 hash to sort names
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 19632
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 40037363
> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 extents:1 across:1048568k
> lp0: using parport0 (interrupt-driven).
> 
> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
> 
> Maybe something is broken in pata_via driver ?
> 

Could be - 
libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
touch pata_via.c.


[replacement PATCH 5/6] x86: tls32 moved

2007-11-21 Thread Roland McGrath

This renames arch/x86/ia32/tls32.c to arch/x86/kernel/tls.c, which does
nothing now but paves the way to consolidate this code for 32-bit too.

Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
---
 arch/x86/ia32/Makefile  |2 +-
 arch/x86/ia32/tls32.c   |  158 ---
 arch/x86/kernel/Makefile_64 |2 +
 arch/x86/kernel/tls.c   |  158 +++
 4 files changed, 161 insertions(+), 159 deletions(-)

diff --git a/arch/x86/ia32/Makefile b/arch/x86/ia32/Makefile
index e2edda2..3c8b746 100644
--- a/arch/x86/ia32/Makefile
+++ b/arch/x86/ia32/Makefile
@@ -2,7 +2,7 @@
 # Makefile for the ia32 kernel emulation subsystem.
 #
 
-obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_signal.o tls32.o \
+obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_signal.o \
ia32_binfmt.o fpu32.o ptrace32.o syscall32.o syscall32_syscall.o \
mmap32.o
 
diff --git a/arch/x86/ia32/tls32.c b/arch/x86/ia32/tls32.c
deleted file mode 100644
index 5291596..000
--- a/arch/x86/ia32/tls32.c
+++ /dev/null
@@ -1,158 +0,0 @@
-#include 
-#include 
-#include 
-#include 
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-/*
- * sys_alloc_thread_area: get a yet unused TLS descriptor index.
- */
-static int get_free_idx(void)
-{
-   struct thread_struct *t = &current->thread;
-   int idx;
-
-   for (idx = 0; idx < GDT_ENTRY_TLS_ENTRIES; idx++)
-   if (desc_empty((struct n_desc_struct *)(t->tls_array) + idx))
-   return idx + GDT_ENTRY_TLS_MIN;
-   return -ESRCH;
-}
-
-/*
- * Set a given TLS descriptor:
- * When you want addresses > 32bit use arch_prctl()
- */
-int do_set_thread_area(struct thread_struct *t, struct user_desc __user *u_info)
-{
-   struct user_desc info;
-   struct n_desc_struct *desc;
-   int cpu, idx;
-
-   if (copy_from_user(&info, u_info, sizeof(info)))
-   return -EFAULT;
-
-   idx = info.entry_number;
-
-   /*
-* index -1 means the kernel should try to find and
-* allocate an empty descriptor:
-*/
-   if (idx == -1) {
-   idx = get_free_idx();
-   if (idx < 0)
-   return idx;
-   if (put_user(idx, &u_info->entry_number))
-   return -EFAULT;
-   }
-
-   if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
-   return -EINVAL;
-
-   desc = ((struct n_desc_struct *)t->tls_array) + idx - GDT_ENTRY_TLS_MIN;
-
-   /*
-* We must not get preempted while modifying the TLS.
-*/
-   cpu = get_cpu();
-
-   if (LDT_empty(&info)) {
-   desc->a = 0;
-   desc->b = 0;
-   } else {
-   desc->a = LDT_entry_a(&info);
-   desc->b = LDT_entry_b(&info);
-   }
-   if (t == &current->thread)
-   load_TLS(t, cpu);
-
-   put_cpu();
-   return 0;
-}
-
-asmlinkage long sys32_set_thread_area(struct user_desc __user *u_info)
-{
-   return do_set_thread_area(&current->thread, u_info);
-}
-
-
-/*
- * Get the current Thread-Local Storage area:
- */
-
-#define GET_LIMIT(desc) ( \
-   ((desc)->a & 0x0ffff) | \
-((desc)->b & 0xf0000) )
-
-#define GET_32BIT(desc)(((desc)->b >> 22) & 1)
-#define GET_CONTENTS(desc) (((desc)->b >> 10) & 3)
-#define GET_WRITABLE(desc) (((desc)->b >>  9) & 1)
-#define GET_LIMIT_PAGES(desc)  (((desc)->b >> 23) & 1)
-#define GET_PRESENT(desc)  (((desc)->b >> 15) & 1)
-#define GET_USEABLE(desc)  (((desc)->b >> 20) & 1)
-#define GET_LONGMODE(desc) (((desc)->b >> 21) & 1)
-
-int do_get_thread_area(struct thread_struct *t, struct user_desc __user *u_info)
-{
-   struct user_desc info;
-   struct n_desc_struct *desc;
-   int idx;
-
-   if (get_user(idx, &u_info->entry_number))
-   return -EFAULT;
-   if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
-   return -EINVAL;
-
-   desc = ((struct n_desc_struct *)t->tls_array) + idx - GDT_ENTRY_TLS_MIN;
-
-   memset(&info, 0, sizeof(struct user_desc));
-   info.entry_number = idx;
-   info.base_addr = get_desc_base(desc);
-   info.limit = GET_LIMIT(desc);
-   info.seg_32bit = GET_32BIT(desc);
-   info.contents = GET_CONTENTS(desc);
-   info.read_exec_only = !GET_WRITABLE(desc);
-   info.limit_in_pages = GET_LIMIT_PAGES(desc);
-   info.seg_not_present = !GET_PRESENT(desc);
-   info.useable = GET_USEABLE(desc);
-   info.lm = GET_LONGMODE(desc);
-
-   if (copy_to_user(u_info, &info, sizeof(info)))
-   return -EFAULT;
-   return 0;
-}
-
-asmlinkage long sys32_get_thread_area(struct user_desc __user *u_info)
-{
-   return do_get_thread_area(&current->thread, u_info);
-}
-
-
-int ia32_child_tls(struct task_struct *p, struct pt_regs *childregs)
-{
-   struct n_desc_struct *desc;
-   struct user_desc info;
-   struct user_desc __user 

[replacement PATCH 6/6] x86: TLS cleanup

2007-11-21 Thread Roland McGrath

This consolidates the four different places that implemented the same
encoding magic for the GDT-slot 32-bit TLS support.  The old tls32.c was
renamed and is now only slightly modified to be the shared implementation.
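
As a rough illustration of the "encoding magic" in question, the sketch
below decodes the two 32-bit words of a GDT descriptor into user_desc-style
fields. It is a stand-alone userspace model, not the kernel code: struct
desc_words, struct tls_fields and decode_tls_desc() are hypothetical names
invented here, and only the bit positions follow the GET_* macros of the
old tls32.c and the x86 segment descriptor layout.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's two-word descriptor. */
struct desc_words {
	uint32_t a;	/* low word: limit 15:0, base 15:0 */
	uint32_t b;	/* high word: base 23:16, flags, limit 19:16, base 31:24 */
};

/* Hypothetical subset of the fields reported through struct user_desc. */
struct tls_fields {
	uint32_t base_addr;
	uint32_t limit;
	unsigned seg_32bit;
	unsigned contents;
	unsigned read_exec_only;
	unsigned limit_in_pages;
	unsigned seg_not_present;
	unsigned useable;
};

/* Unpack a descriptor using the same shifts as the GET_* macros. */
static struct tls_fields decode_tls_desc(struct desc_words d)
{
	struct tls_fields f;

	/* The base is scattered over three places in the descriptor. */
	f.base_addr = (d.a >> 16) | ((d.b & 0xffu) << 16) | (d.b & 0xff000000u);
	/* The 20-bit limit is split between the two words. */
	f.limit = (d.a & 0x0ffffu) | (d.b & 0xf0000u);
	f.seg_32bit = (d.b >> 22) & 1;
	f.contents = (d.b >> 10) & 3;
	f.read_exec_only = !((d.b >> 9) & 1);	/* inverse of "writable" */
	f.limit_in_pages = (d.b >> 23) & 1;
	f.seg_not_present = !((d.b >> 15) & 1);	/* inverse of "present" */
	f.useable = (d.b >> 20) & 1;
	return f;
}
```

Having one such helper is the point of the consolidation: every caller
agrees on where each bit lives.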

Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
---
 arch/x86/ia32/ia32entry.S|4 +-
 arch/x86/kernel/Makefile_32  |1 +
 arch/x86/kernel/process_32.c |  143 ++
 arch/x86/kernel/process_64.c |3 +-
 arch/x86/kernel/ptrace_32.c  |   91 +++
 arch/x86/kernel/ptrace_64.c  |   26 +++-
 arch/x86/kernel/tls.c|   97 +++-
 include/asm-x86/ia32.h   |6 --
 include/asm-x86/ptrace.h |   11 +++
 9 files changed, 80 insertions(+), 302 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index df588f0..0a71fa9 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -644,8 +644,8 @@ ia32_sys_call_table:
.quad compat_sys_futex  /* 240 */
.quad compat_sys_sched_setaffinity
.quad compat_sys_sched_getaffinity
-   .quad sys32_set_thread_area
-   .quad sys32_get_thread_area
+   .quad sys_set_thread_area
+   .quad sys_get_thread_area
.quad compat_sys_io_setup   /* 245 */
.quad sys_io_destroy
.quad compat_sys_io_getevents
diff --git a/arch/x86/kernel/Makefile_32 b/arch/x86/kernel/Makefile_32
index a7bc93c..e660584 100644
--- a/arch/x86/kernel/Makefile_32
+++ b/arch/x86/kernel/Makefile_32
@@ -10,6 +10,7 @@ obj-y := process_32.o signal_32.o entry_32.o traps_32.o irq_32.o \
pci-dma_32.o i386_ksyms_32.o i387_32.o bootflag.o e820_32.o\
quirks.o i8237.o topology.o alternative.o i8253.o tsc_32.o
 
+obj-y  += tls.o
 obj-$(CONFIG_STACKTRACE)   += stacktrace.o
 obj-y  += cpu/
 obj-y  += acpi/
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 7b89958..ebbbfc5 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -480,33 +480,16 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long esp,
set_tsk_thread_flag(p, TIF_IO_BITMAP);
}
 
+   err = 0;
+
/*
 * Set a new TLS for the child thread?
 */
-   if (clone_flags & CLONE_SETTLS) {
-   struct desc_struct *desc;
-   struct user_desc info;
-   int idx;
-
-   err = -EFAULT;
-   if (copy_from_user(&info, (void __user *)childregs->esi, sizeof(info)))
-   goto out;
-   err = -EINVAL;
-   if (LDT_empty(&info))
-   goto out;
-
-   idx = info.entry_number;
-   if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
-   goto out;
-
-   desc = p->thread.tls_array + idx - GDT_ENTRY_TLS_MIN;
-   desc->a = LDT_entry_a(&info);
-   desc->b = LDT_entry_b(&info);
-   }
+   if (clone_flags & CLONE_SETTLS)
+   err = do_set_thread_area(p, -1,
+   (struct user_desc __user *)childregs->esi, 0);
 
-   err = 0;
- out:
-   if (err && p->thread.io_bitmap_ptr) {
+   if (err && p->thread.io_bitmap_ptr) {
kfree(p->thread.io_bitmap_ptr);
p->thread.io_bitmap_max = 0;
}
@@ -851,120 +834,6 @@ unsigned long get_wchan(struct task_struct *p)
return 0;
 }
 
-/*
- * sys_alloc_thread_area: get a yet unused TLS descriptor index.
- */
-static int get_free_idx(void)
-{
-   struct thread_struct *t = &current->thread;
-   int idx;
-
-   for (idx = 0; idx < GDT_ENTRY_TLS_ENTRIES; idx++)
-   if (desc_empty(t->tls_array + idx))
-   return idx + GDT_ENTRY_TLS_MIN;
-   return -ESRCH;
-}
-
-/*
- * Set a given TLS descriptor:
- */
-asmlinkage int sys_set_thread_area(struct user_desc __user *u_info)
-{
-   struct thread_struct *t = &current->thread;
-   struct user_desc info;
-   struct desc_struct *desc;
-   int cpu, idx;
-
-   if (copy_from_user(&info, u_info, sizeof(info)))
-   return -EFAULT;
-   idx = info.entry_number;
-
-   /*
-* index -1 means the kernel should try to find and
-* allocate an empty descriptor:
-*/
-   if (idx == -1) {
-   idx = get_free_idx();
-   if (idx < 0)
-   return idx;
-   if (put_user(idx, &u_info->entry_number))
-   return -EFAULT;
-   }
-
-   if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
-   return -EINVAL;
-
-   desc = t->tls_array + idx - GDT_ENTRY_TLS_MIN;
-
-   /*
-* We must not get preempted while modifying the TLS.
-*/
-   cpu = get_cpu();
-
-   if (LDT_empty(&info)) {
-   desc->a = 0;
-   desc->b 

Re: [PATCH 5/5] x86: TLS cleanup

2007-11-21 Thread Roland McGrath
> I had a bit of trouble verifying correctness here because of much
> brownian motion.  Any possibility of a pure movement / fixup separation
> to make it easier on the eyes?

Yeah, sorry about that.  It was late and the whole TLS thing was a sudden
afterthought while I was in the middle of doing something else, so I didn't
feel like slicing up the patch any more.  And in my tree, GIT didn't even
do so great a job with noticing this rename.  I'll send a replacement.

Thanks,
Roland
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc3-mm1

2007-11-21 Thread Andrew Morton
On Wed, 21 Nov 2007 20:35:13 +0200
"Kirill A. Shutemov" <[EMAIL PROTECTED]> wrote:

> Symbol init_level4_pgt is needed by the nvidia module. Is it really
> necessary to unexport it?

It's our clever way of reducing the tester base so we don't get so many
bug reports.


Re: 2.6.24-rc3-mm1: usb mouse doesn't work

2007-11-21 Thread Andrew Morton
On Wed, 21 Nov 2007 20:23:46 +0200
"Kirill A. Shutemov" <[EMAIL PROTECTED]> wrote:

> USB mouse(Logitech M-BT58) doesn't work. TouchPad works.
> dmesg after rmmod usbcore && modprobe uhci_hcd:
> 
> usbcore: registered new interface driver usbfs
> usbcore: registered new interface driver hub
> usbcore: registered new device driver usb
> USB Universal Host Controller Interface driver v3.0
> ACPI: PCI Interrupt 0000:00:1d.0[A] -> Link [LNKE] -> GSI 10 (level, low)
> -> IRQ 10
> PCI: Setting latency timer of device 0000:00:1d.0 to 64
> uhci_hcd 0000:00:1d.0: UHCI Host Controller
> uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1
> uhci_hcd 0000:00:1d.0: irq 10, io base 0x0000bf80
> usb usb1: configuration #1 chosen from 1 choice
> hub 1-0:1.0: USB hub found
> hub 1-0:1.0: 2 ports detected
> usb usb1: new device found, idVendor=0000, idProduct=0000
> usb usb1: new device strings: Mfr=3, Product=2, SerialNumber=1
> usb usb1: Product: UHCI Host Controller
> usb usb1: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd
> usb usb1: SerialNumber: 0000:00:1d.0
> ACPI: PCI Interrupt 0000:00:1d.1[B] -> Link [LNKF] -> GSI 11 (level, low)
> -> IRQ 11
> PCI: Setting latency timer of device 0000:00:1d.1 to 64
> uhci_hcd 0000:00:1d.1: UHCI Host Controller
> uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 2
> uhci_hcd 0000:00:1d.1: irq 11, io base 0x0000bf60
> usb usb2: configuration #1 chosen from 1 choice
> hub 2-0:1.0: USB hub found
> hub 2-0:1.0: 2 ports detected
> usb usb2: new device found, idVendor=0000, idProduct=0000
> usb usb2: new device strings: Mfr=3, Product=2, SerialNumber=1
> usb usb2: Product: UHCI Host Controller
> usb usb2: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd
> usb usb2: SerialNumber: 0000:00:1d.1
> ACPI: PCI Interrupt 0000:00:1d.2[C] -> Link [LNKG] -> GSI 9 (level, low)
> -> IRQ 9
> PCI: Setting latency timer of device 0000:00:1d.2 to 64
> uhci_hcd 0000:00:1d.2: UHCI Host Controller
> uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 3
> uhci_hcd 0000:00:1d.2: irq 9, io base 0x0000bf40
> usb usb3: configuration #1 chosen from 1 choice
> hub 3-0:1.0: USB hub found
> hub 3-0:1.0: 2 ports detected
> usb usb3: new device found, idVendor=0000, idProduct=0000
> usb usb3: new device strings: Mfr=3, Product=2, SerialNumber=1
> usb usb3: Product: UHCI Host Controller
> usb usb3: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd
> usb usb3: SerialNumber: 0000:00:1d.2
> ACPI: PCI Interrupt 0000:00:1d.3[D] -> Link [LNKH] -> GSI 7 (level, low)
> -> IRQ 7
> PCI: Setting latency timer of device 0000:00:1d.3 to 64
> uhci_hcd 0000:00:1d.3: UHCI Host Controller
> uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 4
> uhci_hcd 0000:00:1d.3: irq 7, io base 0x0000bf20
> usb usb4: configuration #1 chosen from 1 choice
> hub 4-0:1.0: USB hub found
> hub 4-0:1.0: 2 ports detected
> usb usb4: new device found, idVendor=0000, idProduct=0000
> usb usb4: new device strings: Mfr=3, Product=2, SerialNumber=1
> usb usb4: Product: UHCI Host Controller
> usb usb4: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd
> usb usb4: SerialNumber: 0000:00:1d.3
> uhci_hcd 0000:00:1d.3: FGR not stopped yet!
> 

I've had some strangenesses with USB lately.  Sometimes running `lsusb'
makes the USB system notice a newly attached device.

Is that "FGR not stopped yet!" message new behaviour?


Re: [linux-usb-devel] USB deadlock after resume

2007-11-21 Thread Markus Rechberger
On 11/21/07, Laurent Pinchart <[EMAIL PROTECTED]> wrote:
> On Wednesday 21 November 2007, Markus Rechberger wrote:
> > On 11/21/07, Alan Stern <[EMAIL PROTECTED]> wrote:
> > > On Wed, 21 Nov 2007, Markus Rechberger wrote:
> > > > > > it's not just usb_set_interface that hangs actually.
> > > > > > It seems to hang at
> > > > > >
> > > > > > wait_event(usb_kill_urb_queue, atomic_read(&urb->use_count) == 0);
> > > > > >
> > > > > > in drivers/usb/core/urb.c after resuming. I disabled access to the
> > > > > > usb subsystem in the uvc driver, although connecting any other usb
> > > > > > storage fails too, just at the same point.
> > > > >
> > > > > Which URB is usb_kill_urb() called for?
> > > >
> > > > it's the usb_control_message which calls usb_kill_urb if I haven't got
> > > > it wrong. (if you're looking for some other information please let me
> > > > know)
> > > > Although, I got a bit further with it. The error seems to happen
> > > > earlier already.
> > > > If I load the driver, and do not access the device after suspending
> > > > all usb_control commands fail with -71 eproto.
> > >
> > > That's very strange.  Getting -71 errors is understandable; it
> > > indicates that the device can't handle being suspended.  But the
> > > wait_event() line still shouldn't hang.  If it does, it indicates that
> > > there's something wrong with the USB host controller, not just the
> > > device.
> > >
> > > Can you try testing this on a different sort of computer?
> >
> > Not really, suspending doesn't work at all on my other notebook it
> > just freezes..
> > I'm basically trying to get that driver work on my eee PC [1], it's
> > cheap and tiny so I don't expect anything special in there..
> > The system is preloaded with Xandros (it's debian etch with a few
> > custom applications) and linux 2.6.21.4.
>
> If I'm not mistaken, the EeePC ACPI bios plays tricks with the USB ports
> during suspend/resume. You should really test suspend/resume with the same
> camera chipset on a proper computer. If the camera still crashes, we have a
> buggy chipset which needs a reset quirk. If it doesn't, the EeePC ACPI bios
> is probably at fault. Adding quirks and hacks to the Linux kernel (either in
> the USB stack or the uvcvideo driver) is pretty pointless if the bios tries
> to make the system crash. The ACPI code should be fixed in that case.
>

With ACPI it seems to be possible to disconnect the uvc device.
I tested the suspend/resume functions by adding a proc interface to
it, and it worked properly.
However, the eee PC also suspends the underlying bus that the USB
controller is connected to (PCI or PCIe).

> > The system still locks up, although only if I leave the video
> > application running during suspending. I don't have to reload the
> > driver anymore after resuming if the video node doesn't get accessed
> > (I'm looking for races in the uvc driver at the moment).
>

What I have found so far is that after suspend, if the video node
isn't used, it's not necessary to reconnect the device or to reload
the driver, provided that reset is implemented.
That eee PC comes with 2.6.21.3, which has no such reset quirk feature
in usbcore (that's what I initially meant, actually).
If a video application accesses the nodes during suspend, the notebook
won't come back again.
I also think the hardware is faulty in that case, but I'm still
looking for a solution for it. My other Intel notebook (Acer Travelmate
660) doesn't even wake up from suspend to RAM, and for some reason
suspend to disk didn't work as expected either.

thanks for the feedback,
Markus


Re: network driver usage count

2007-11-21 Thread Wagner Ferenc
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> On Wed, 21 Nov 2007 20:45:11 +0100
> Ferenc Wagner <[EMAIL PROTECTED]> wrote:
>
>> Under 2.6.23.1, my lsmod output shows this:
>> 
>> $ lsmod | grep tg3
>> tg3   100580  0 
>> 
>> The usage count is zero, even though it drives my two physical
>> interfaces:
>> 
>> $ ls -l /sys/class/net/eth-gb?/device/driver
>> lrwxrwxrwx 1 root root 0 2007-11-21 19:58 
>> /sys/class/net/eth-gb1/device/driver -> ../../../bus/pci/drivers/tg3
>> lrwxrwxrwx 1 root root 0 2007-11-21 19:58 
>> /sys/class/net/eth-gb2/device/driver -> ../../../bus/pci/drivers/tg3
>> 
>> These interfaces are up and bonded together, but that doesn't seem to
>> matter at all.  I also checked other machines, the network driver
>> (tg3, e1000) usage counts are always zero under various recent 2.6
>> kernels, but nonzero under 2.4.21 for example.
>> 
>> And really, the module could be removed, cutting my ssh session. :)
>> 
>> Was this made possible intentionally?  If yes, why?
>
> Yes, so devices can be removed at anytime.

Hmm, that would warrant nuking all the reference counts on every
driver.  I must be missing something, since I really feel it goes
against common sense.  Can you point me to some discussion of this
change?  I mean, I couldn't remove the driver of a mounted filesystem.
So why can I remove a driver serving live network traffic?
-- 
Thanks,
Feri.


Re: Possible bug from kernel 2.6.22 and above

2007-11-21 Thread Eric Dumazet

Jie Chen wrote:

Hi, there:

We have a simple pthread program that measures the synchronization
overheads of various synchronization mechanisms such as spin locks,
barriers (the barrier is implemented using a queue-based barrier
algorithm) and so on. We have clusters of dual quad-core AMD Opteron
(Barcelona) machines currently running kernel 2.6.23.8 with the Fedora
Core 7 distribution. Before we moved to this kernel, we ran kernel 2.6.21.
The two kernels are configured identically and compiled with the same
gcc 4.1.2 compiler. Under the old kernel, we observed that these
overheads increase as the number of threads increases from 2 to 8.
The following are the values of total time and overhead for all threads
acquiring a pthread spin lock and all threads executing a barrier
synchronization call.


Could you post the source of your test program?

Spinlocks are ... spinning and should not call the Linux scheduler, so I have
no idea why a kernel change could modify your results.


Also I suspect you'll have better results with Fedora Core 8 (since glibc was
updated to use private futexes in v 2.7), at least for the barrier ops.





Kernel 2.6.21
Number of Threads                   2           4           6          8
SpinLock (Time micro second)    10.5618     10.58538    10.5915    10.643
         (Overhead)              0.073       0.05746     0.102805   0.154563
Barrier  (Time micro second)    11.020410   11.678125   11.9889    12.38002
         (Overhead)              0.531660    1.1502      1.500112   1.891617

Each thread is bound to a particular core using pthread_setaffinity_np.

Kernel 2.6.23.8
Number of Threads                   2           4           6          8
SpinLock (Time micro second)    14.849915   17.117603   14.4496    10.5990
         (Overhead)              4.345417    6.617207    3.949435   0.110985
Barrier  (Time micro second)    19.462255   20.285117   16.19395   12.37662
         (Overhead)              8.957755    9.784722    5.699590   1.869518

Clearly, the synchronization overhead increases as the number
of threads increases under kernel 2.6.21. But under kernel 2.6.23.8 the
synchronization overhead actually decreases as the number of threads
increases (we observed the same behavior on kernel 2.6.22 as
well). This certainly is not correct behavior. The kernels are
configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
configuration file is attached to this e-mail.


From what we have read, a new scheduler (CFS) was introduced in
2.6.22. We are not sure whether the above behavior is caused by the new
scheduler.


Finally, our machine cpu information is listed in the following:

processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 16
model   : 2
model name  : Quad-Core AMD Opteron(tm) Processor 2347
stepping: 10
cpu MHz : 1909.801
cache size  : 512 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
 pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
 rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm
 cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
 3dnowprefetch osvw
bogomips: 3822.95
TLB size: 1024 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

In addition, we have schedstat and sched_debug files in the /proc 
directory.


Thank you for all your help to solve this puzzle. If you need more 
information, please let us know.



P.S. I would like to be cc'ed on the discussions related to this problem.


###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###





Re: [linux-usb-devel] USB deadlock after resume

2007-11-21 Thread Alan Stern
On Wed, 21 Nov 2007, Laurent Pinchart wrote:

> > > When you suspend, you cut off vbus (afaik, correct me if I'm wrong),
> > > which means your device will get disconnected. One way to avoid this is
> > > enabling CONFIG_USB_PERSIST and trying with that on.
> >
> > Suspend may or may not cut off power.
> 
> I've always been confused by this.
> 
> If I'm not mistaken, there are three kind of suspend modes: autosuspend, 

You mean runtime (AKA dynamic) suspend -- autosuspend is merely one
type of runtime suspend.

> suspend to RAM and suspend to disk.

The nomenclature du jour is just plain "suspend" for suspend-to-RAM and
"hibernation" for suspend-to-disk.

> In the first case I expect the USB hub 
> (either root hub or external hub) to make the bus idle but not power it down. 

Correct.

> In the last case I suspect the USB bus to be powered down.

Usually, but not always!  Some Macs have been known to keep USB suspend 
current available during hibernation.

> What controls the USB bus power on suspended ports ? Is it handled by the 
> system (BIOS, ...) ? Is it allowed to power down the ports or keep them 
> powered as it chooses ? What are the rules set in stone ?

There are no rules set in stone.  :-)

Systems are _supposed_ to keep the ports powered during suspend, but
some may fail to do so.  It depends on the firmware (i.e., BIOS for
PCs) and the motherboard design.

> > If it does cut off power, resume() will never be called, instead either
> > disconnect() or reset_resume(). 
> 
> What is reset_resume() for ? Which one will be called on resume after a bus 
> power down ?

This is explained in Documentation/usb/power-management.txt.  If the
USB Persist facility has been enabled for a device then reset_resume
will be called, to indicate that the device had to be reset as part of
the resume procedure.  If USB Persist isn't enabled then the disconnect
method will be called and the device will be re-enumerated, exactly as
though it had been unplugged and then plugged back in.
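
The choice of callback described in this thread can be condensed into a
small decision function. This is only a stand-alone model of the rules as
stated above, not usbcore code; the enum and the function
wakeup_callback() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical labels for the three driver methods discussed above. */
enum usb_wakeup_cb {
	CB_RESUME,		/* resume(): the device kept its state */
	CB_RESET_RESUME,	/* reset_resume(): the device was reset during resume */
	CB_DISCONNECT		/* disconnect(), followed by re-enumeration */
};

/*
 * power_lost:      the port lost power while the system was suspended
 * persist_enabled: the USB Persist facility is enabled for this device
 */
static enum usb_wakeup_cb wakeup_callback(bool power_lost, bool persist_enabled)
{
	if (!power_lost)
		return CB_RESUME;
	return persist_enabled ? CB_RESET_RESUME : CB_DISCONNECT;
}
```

A driver that implements only resume() therefore still has to cope with
disconnect() on machines whose firmware powers the ports down.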

Alan Stern



Re: [patch 1/1] selinux: do not clear f_op when removing entries

2007-11-21 Thread James Morris
On Wed, 21 Nov 2007, Stephen Smalley wrote:

> Do not clear f_op when removing entries since it isn't safe to do.
> 
> Signed-off-by:  Stephen Smalley <[EMAIL PROTECTED]>

Applied to
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6.git#for-akpm


-- 
James Morris
<[EMAIL PROTECTED]>


Possible bug from kernel 2.6.22 and above

2007-11-21 Thread Jie Chen

Hi, there:

We have a simple pthread program that measures the synchronization
overheads of various synchronization mechanisms such as spin locks,
barriers (the barrier is implemented using a queue-based barrier
algorithm) and so on. We have clusters of dual quad-core AMD Opteron
(Barcelona) machines currently running kernel 2.6.23.8 with the Fedora
Core 7 distribution. Before we moved to this kernel, we ran kernel 2.6.21.
The two kernels are configured identically and compiled with the same
gcc 4.1.2 compiler. Under the old kernel, we observed that these
overheads increase as the number of threads increases from 2 to 8.
The following are the values of total time and overhead for all threads
acquiring a pthread spin lock and all threads executing a barrier
synchronization call.


Kernel 2.6.21
Number of Threads                   2           4           6          8
SpinLock (Time micro second)    10.5618     10.58538    10.5915    10.643
         (Overhead)              0.073       0.05746     0.102805   0.154563
Barrier  (Time micro second)    11.020410   11.678125   11.9889    12.38002
         (Overhead)              0.531660    1.1502      1.500112   1.891617

Each thread is bound to a particular core using pthread_setaffinity_np.

Kernel 2.6.23.8
Number of Threads                   2           4           6          8
SpinLock (Time micro second)    14.849915   17.117603   14.4496    10.5990
         (Overhead)              4.345417    6.617207    3.949435   0.110985
Barrier  (Time micro second)    19.462255   20.285117   16.19395   12.37662
         (Overhead)              8.957755    9.784722    5.699590   1.869518

Clearly, the synchronization overhead increases as the number
of threads increases under kernel 2.6.21. But under kernel 2.6.23.8 the
synchronization overhead actually decreases as the number of threads
increases (we observed the same behavior on kernel 2.6.22 as
well). This certainly is not correct behavior. The kernels are
configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
configuration file is attached to this e-mail.


From what we have read, a new scheduler (CFS) was introduced in
2.6.22. We are not sure whether the above behavior is caused by the new
scheduler.


Finally, our machine cpu information is listed in the following:

processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 16
model   : 2
model name  : Quad-Core AMD Opteron(tm) Processor 2347
stepping: 10
cpu MHz : 1909.801
cache size  : 512 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
 pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
 rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm
 cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse
 3dnowprefetch osvw
bogomips: 3822.95
TLB size: 1024 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

In addition, we have schedstat and sched_debug files in the /proc 
directory.


Thank you for all your help to solve this puzzle. If you need more 
information, please let us know.



P.S. I would like to be cc'ed on the discussions related to this problem.


###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###

CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y
CONFIG_CPUSETS=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_SYSCTL=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y

Modules: Fold percpu_modcopy into module.c and get rid of the macro from hell

2007-11-21 Thread Christoph Lameter
percpu_modcopy is defined multiple times in arch files. However, the only
use is in module.c. Put a static definition into module.c and remove
the definitions from the arch files.
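
For illustration only, here is a userspace model of what the macro does and
of its shape as an ordinary static function. NR_CPUS_SKETCH and
sketch_per_cpu_offset[] are hypothetical stand-ins for the kernel's
for_each_possible_cpu() iteration and per-cpu offset table; this is not the
kernel/module.c hunk itself.

```c
#include <stddef.h>
#include <string.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for the model */

/* Hypothetical per-cpu offsets; the kernel uses __per_cpu_offset. */
static size_t sketch_per_cpu_offset[NR_CPUS_SKETCH];

/* Copy the same initial data into every CPU's instance of a per-cpu area. */
static void percpu_modcopy(void *pcpudst, const void *src, size_t size)
{
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		memcpy((char *)pcpudst + sketch_per_cpu_offset[cpu], src, size);
}
```

As a function, the arguments are evaluated once and type-checked, which the
do { } while (0) macro could not guarantee.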

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 arch/ia64/kernel/module.c|   10 --
 include/asm-generic/percpu.h |8 
 include/asm-ia64/percpu.h|5 -
 include/asm-powerpc/percpu.h |9 -
 include/asm-s390/percpu.h|9 -
 include/asm-sparc64/percpu.h |8 
 include/asm-x86/percpu_32.h  |9 -
 include/asm-x86/percpu_64.h  |9 -
 kernel/module.c  |8 
 9 files changed, 8 insertions(+), 67 deletions(-)

Index: linux-2.6/include/asm-generic/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-generic/percpu.h 2007-11-21 13:11:18.430858642 -0800
+++ linux-2.6/include/asm-generic/percpu.h  2007-11-21 13:11:42.871108294 -0800
@@ -26,14 +26,6 @@ extern unsigned long __per_cpu_offset[NR
 #define __get_cpu_var(var) per_cpu(var, smp_processor_id())
 #define __raw_get_cpu_var(var) per_cpu(var, raw_smp_processor_id())
 
-/* A macro to avoid #include hell... */
-#define percpu_modcopy(pcpudst, src, size) \
-do {   \
-   unsigned int __i;   \
-   for_each_possible_cpu(__i)  \
-   memcpy((pcpudst)+__per_cpu_offset[__i], \
-  (src), (size));  \
-} while (0)
 #else /* ! SMP */
 
 #define DEFINE_PER_CPU(type, name) \
Index: linux-2.6/arch/ia64/kernel/module.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/module.c 2007-11-21 13:13:06.587858751 -0800
+++ linux-2.6/arch/ia64/kernel/module.c 2007-11-21 13:13:19.527309025 -0800
@@ -941,13 +941,3 @@ module_arch_cleanup (struct module *mod)
unw_remove_unwind_table(mod->arch.core_unw_table);
 }
 
-#ifdef CONFIG_SMP
-void
-percpu_modcopy (void *pcpudst, const void *src, unsigned long size)
-{
-   unsigned int i;
-   for_each_possible_cpu(i) {
-   memcpy(pcpudst + __per_cpu_offset[i], src, size);
-   }
-}
-#endif /* CONFIG_SMP */
Index: linux-2.6/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/percpu.h 2007-11-21 13:12:37.140358213 -0800
+++ linux-2.6/include/asm-ia64/percpu.h 2007-11-21 13:12:55.271731039 -0800
@@ -39,10 +39,6 @@
DEFINE_PER_CPU(type, name)
 #endif
 
-/*
- * Pretty much a literal copy of asm-generic/percpu.h, except that percpu_modcopy() is an
- * external routine, to avoid include-hell.
- */
 #ifdef CONFIG_SMP
 
 extern unsigned long __per_cpu_offset[NR_CPUS];
@@ -55,7 +51,6 @@ DECLARE_PER_CPU(unsigned long, local_per
 #define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
 #define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
 
-extern void percpu_modcopy(void *pcpudst, const void *src, unsigned long size);
 extern void setup_per_cpu_areas (void);
 extern void *per_cpu_init(void);
 
Index: linux-2.6/include/asm-powerpc/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/percpu.h 2007-11-21 13:14:21.754859049 -0800
+++ linux-2.6/include/asm-powerpc/percpu.h  2007-11-21 13:14:33.651108379 -0800
@@ -30,15 +30,6 @@
 #define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __my_cpu_offset()))
 #define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, local_paca->data_offset))
 
-/* A macro to avoid #include hell... */
-#define percpu_modcopy(pcpudst, src, size) \
-do {   \
-   unsigned int __i;   \
-   for_each_possible_cpu(__i)  \
-   memcpy((pcpudst)+__per_cpu_offset(__i), \
-  (src), (size));  \
-} while (0)
-
 extern void setup_per_cpu_areas(void);
 
 #else /* ! SMP */
Index: linux-2.6/include/asm-s390/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-s390/percpu.h 2007-11-21 13:14:39.835108493 -0800
+++ linux-2.6/include/asm-s390/percpu.h 2007-11-21 13:14:48.590858137 -0800
@@ -51,15 +51,6 @@ extern unsigned long __per_cpu_offset[NR
 #define per_cpu(var,cpu) __reloc_hide(var,__per_cpu_offset[cpu])
 #define per_cpu_offset(x) (__per_cpu_offset[x])
 
-/* A macro to avoid #include hell... */
-#define percpu_modcopy(pcpudst, src, size) \
-do {   \
-   unsigned int __i;   \
-   

Re: mmap dirty limits on 32 bit kernels (Was: [BUG] New Kernel Bugs)

2007-11-21 Thread Jan Engelhardt

On Thu, 15 Nov 2007 13:47:54 -0800 (PST) Linus Torvalds wrote:
>
>But quite frankly, I refuse to even care about anything past that. If 
>you have 12G (or heaven forbid, even more) in your machine, and you 
>can't be bothered to just upgrade to a 64-bit CPU, then quite frankly, 
>*I* personally can't be bothered to care.
>
>If they have that much RAM (and bought it a few years ago when a 64-bit 
>CPU wasn't an option), they can't be poor.
>
>So the _only_ explanation today for 12GB on a 32-bit machine is
> (a) insanity
>or 
> (b) being so lazy as to not bother to upgrade
>

Just around the corner...

$ ftp ftp
Connected to ftp.gwdg.de.
220-
220-Gesellschaft fuer wissenschaftliche Datenverarbeitung mbH Goettingen
220-
220-This is a Linux PC (Dell PE-2650, 2 CPUs P4/2800, 12 GB RAM)
220-running SuSE-Linux-8.2 with SuSE kernel 2.4.20-64GB-SMP.

There is no reason to upgrade the hardware - if it works, hey good then.
And I am pretty sure that a few 2 GB sticks are cheaper than a big 
opteron (if you only go by that). It sure is now - and probably even 
back then.


Re: __rcu_process_callbacks() in Linux 2.6

2007-11-21 Thread Paul E. McKenney
On Wed, Nov 21, 2007 at 11:57:29AM -0800, James Huang wrote:
> Paul,
> 
> I am not sure I understand your answer about using test_and_set_bit()
> in tasklet_schedule() as a memory barrier in this case.
>
> Yes, tasklet_schedule() includes a
> test_and_set_bit(TASKLET_STATE_SCHED, &t->state) on a tasklet, but
> in this case the tasklet is a per CPU tasklet.

Memory barriers are memory barriers, regardless of what type of data
is being processed.

> According to Documentation/atomic_ops.txt, an atomic op that returns a
> value has the semantics of
> "explicit memory barriers performed before and after the operation".

And test_and_set_bit() does return a value, namely the value of the
affected bit before the operation.  Therefore, any correct implementation
for a CONFIG_SMP build must include memory barriers before and after.
Extracting the relevant passage from Documentation/atomic_ops.txt
between the pair of dashed lines:

------------------------------------------------------------------------

	int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

Like the above, except that these routines return a boolean which
indicates whether the changed bit was set _BEFORE_ the atomic bit
operation.

WARNING! It is incredibly important that the value be a boolean,
ie. "0" or "1".  Do not try to be fancy and save a few instructions by
declaring the above to return "long" and just returning something like
"old_val & mask" because that will not work.

For one thing, this return value gets truncated to int in many code
paths using these interfaces, so on 64-bit if the bit is set in the
upper 32-bits then testers will never see that.

One great example of where this problem crops up are the thread_info
flag operations.  Routines such as test_and_set_ti_thread_flag() chop
the return value into an int.  There are other places where things
like this occur as well.

These routines, like the atomic_t counter operations returning values,
require explicit memory barrier semantics around their execution.  All
memory operations before the atomic bit operation call must be made
visible globally before the atomic bit operation is made visible.
Likewise, the atomic bit operation must be visible globally before any
subsequent memory operation is made visible.  For example:

	obj->dead = 1;
	if (test_and_set_bit(0, &obj->flags))
		/* ... */;
	obj->killed = 1;

The implementation of test_and_set_bit() must guarantee that
"obj->dead = 1;" is visible to cpus before the atomic memory operation
done by test_and_set_bit() becomes visible.  Likewise, the atomic
memory operation done by test_and_set_bit() must become visible before
"obj->killed = 1;" is visible.

------------------------------------------------------------------------

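As a rough userspace illustration of the boolean-return warning in that
passage (this is not the kernel implementation), the sketch below uses
C11 atomic_fetch_or() in place of an architecture's atomic bit operation.
The names sketch_test_and_set_bit() and broken_test_and_set_bit() are made
up for this example, and a 64-bit word is assumed so that the truncation
is visible.  The default memory_order_seq_cst ordering is roughly the
closest C11 analogue to the barrier semantics described there.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for test_and_set_bit().
 * atomic_fetch_or() defaults to memory_order_seq_cst, so the
 * read-modify-write is ordered against earlier and later accesses.
 */
static int sketch_test_and_set_bit(unsigned int nr, _Atomic uint64_t *addr)
{
	assert(nr < 64);

	uint64_t mask = (uint64_t)1 << nr;
	uint64_t old = atomic_fetch_or(addr, mask);

	return (old & mask) != 0;	/* normalise to 0 or 1 */
}

/* The mistake the documentation warns about: return the masked word. */
static int broken_test_and_set_bit(unsigned int nr, _Atomic uint64_t *addr)
{
	assert(nr < 64);

	uint64_t mask = (uint64_t)1 << nr;
	uint64_t old = atomic_fetch_or(addr, mask);

	return (int)(old & mask);	/* bits 32 and up are lost here */
}
```

With bit 40 already set, sketch_test_and_set_bit(40, &flags) returns 1,
while the broken variant returns 0 on the usual ABIs where int is 32 bits,
because the masked value does not survive the conversion to int.
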
> If I understand it correctly, this means that, for example,
> 
>    atomic_t aa = ATOMIC_INIT(0);
>    int X = 1;
>    int Y = 2;
>  CPU 0:
>   update X=100;
>   atomic_inc_return(&aa);
>   update Y=200;

But, yes, atomic_inc_return() does indeed force ordering.

> Then CPU 1 will always see X=100 before it sees the new value of aa (1), and
> CPU 1 will always see the new value of aa (1) before it sees Y=200.

Yep.  And CPU 1 will also see any preceding unrelated assignment
prior to the new value of aa as well.  And it is not just preceding
stores.  See the sentence from Documentation/atomic_ops.txt:

All memory operations before the atomic bit operation call must
be made visible globally before the atomic bit operation is
made visible.

Both stores -and- loads.

> This ordering semantics does not apply to the scenario in our discussion.
> For one thing, the rcu tasklet is a per CPU tasklet.  So obviously no other 
> CPU's will even read its t->state.
> 
> Am I still missing something?

Yep -- the test_and_set_bit() operation has no clue who else might or
might not be reading t->state.  Besides, tasklets need not be allocated
on a per-CPU basis, and therefore tasklet_schedule() must be prepared
to deal with other CPUs concurrently manipulating t->state, for example,
via the tasklet_disable() interface.

Another thing that might help is to fill in the RCU read-side critical
section that CPU 2 would have to execute (after all the stuff you currently
have it executing), along with the RCU update that would need to precede
CPU 2's call_rcu() call.  I have done this in your example code below.

Note that in order for a failure to occur, CPU 1 must reach /* A */
before CPU 2 reaches /* B */.

One key point is that tasklet_schedule()'s memory ordering affects this
preceding code for CPU 2.  The second key point is that acquiring and
releasing a lock acts as a barrier as well (though a limited one).

The 

Re: Identifying a specific affected file on Ext3 on a top of raid0 of raid1s

2007-11-21 Thread Lennart Sorensen
On Wed, Nov 21, 2007 at 03:57:53PM -0500, [EMAIL PROTECTED] wrote:
> I have a rather nasty situation developing on one of the big 24x7 
> production database servers. It seems that a batch of drives in one of the
> servers started to fail. 
> 
> The file servers are ext3fs on a top of raid-0 over a pair of raid-1 mirrors, 
> with each of raid-1 mirrors having two drives. The issue is that all four 
> drives are developing errors. 
> 
> Is there a way to determine what files are affected if I know the LBA# of the
> errors on individual drives?
> 
> It is running under 2.6.20.x series of kernels.

Most likely debugfs can tell you what filesystem block is used by a file
and which file is using a given filesystem block, although you would
still have to translate the disk block number through the raid layers
and partitions to find where it is relative to the beginning of the
partition the filesystem is on.
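
The translation step could be sketched roughly as below.  This is only
the arithmetic, under assumptions that are not stated in the thread: the
raid-0 members are the same size (so there is a single striping zone),
the sector on the failing drive has already been made relative to the
start of the mirror's data area (for a mirror, both drives hold the same
data at the same offset), and sectors are 512 bytes.  The function and
parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a sector on one raid-0 member back to a sector of the striped
 * array.  Assumes equal-sized members and that `sector` is relative to
 * the start of that member's data area.
 */
static uint64_t raid0_member_to_array(uint64_t sector, unsigned int member,
				      unsigned int nr_members,
				      uint64_t chunk_sectors)
{
	assert(chunk_sectors > 0 && member < nr_members);

	uint64_t stripe = sector / chunk_sectors;	/* row of chunks */
	uint64_t offset = sector % chunk_sectors;	/* inside the chunk */

	return (stripe * nr_members + member) * chunk_sectors + offset;
}

/* Convert an array sector (512-byte units) to a filesystem block. */
static uint64_t sector_to_fs_block(uint64_t array_sector,
				   uint64_t fs_start_sector,
				   unsigned int fs_block_size)
{
	assert(array_sector >= fs_start_sector && fs_block_size >= 512);

	return (array_sector - fs_start_sector) / (fs_block_size / 512);
}
```

The resulting block number is what debugfs's "icheck" command takes to
report the owning inode, and "ncheck" then turns that inode number into
a path name.  The chunk size and member order have to be read from the
actual array configuration rather than guessed.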

--
Len Sorensen


Modules: Include sections.h to avoid defining linker variables explicitly

2007-11-21 Thread Christoph Lameter
Module.c should not define linker variables on its own. We have an include
file for that.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 kernel/module.c |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

Index: linux-2.6/kernel/module.c
===
--- linux-2.6.orig/kernel/module.c  2007-11-21 13:02:39.415358527 -0800
+++ linux-2.6/kernel/module.c   2007-11-21 13:03:16.534858271 -0800
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 
 extern int module_sysfs_initialized;
 
@@ -338,9 +339,6 @@ static inline unsigned int block_size(in
return val;
 }
 
-/* Created by linker magic */
-extern char __per_cpu_start[], __per_cpu_end[];
-
 static void *percpu_modalloc(unsigned long size, unsigned long align,
 const char *name)
 {


Re: network driver usage count

2007-11-21 Thread Stephen Hemminger
On Wed, 21 Nov 2007 20:45:11 +0100
Ferenc Wagner <[EMAIL PROTECTED]> wrote:

> Hi!
> 
> Under 2.6.23.1, my lsmod output shows this:
> 
> $ lsmod | grep tg3
> tg3   100580  0 
> 
> The usage count is zero, even though it drives my two physical
> interfaces:
> 
> $ ls -l /sys/class/net/eth-gb?/device/driver
> lrwxrwxrwx 1 root root 0 2007-11-21 19:58 
> /sys/class/net/eth-gb1/device/driver -> ../../../bus/pci/drivers/tg3
> lrwxrwxrwx 1 root root 0 2007-11-21 19:58 
> /sys/class/net/eth-gb2/device/driver -> ../../../bus/pci/drivers/tg3
> 
> These interfaces are up and bonded together, but that doesn't seem to
> matter at all.  I also checked other machines, the network driver
> (tg3, e1000) usage counts are always zero under various recent 2.6
> kernels, but nonzero under 2.4.21 for example.
> 
> And really, the module could be removed, cutting my ssh session. :)
> 
> Was this made possible intentionally?  If yes, why?

Yes, so devices can be removed at any time.

-- 
Stephen Hemminger <[EMAIL PROTECTED]>



Re: Modules: Handle symbols that have a zero value

2007-11-21 Thread Christoph Lameter
On Wed, 21 Nov 2007, Mathieu Desnoyers wrote:

> return -ENOENT;
> 
> directly ?
> 
> (ERR_PTR() in linux/err.h is a simple cast from long to void*).

Right and there is also IS_ERR_VALUE. Thanks for the feedback. New 
version:


Modules: Handle symbols that have a zero value

The module subsystem cannot handle symbols that are zero. If symbols are
present that have a zero value then the module resolver prints out
a message that these symbols are unresolved.

Use ERR_PTR to return an error code instead of 0. This is a bit awkward
since the addresses are handled as unsigned longs. So we need to convert
them everywhere.
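
To illustrate why an error code in the top of the address range can be
told apart from a legitimate zero value, here is a small userspace sketch.
The IS_ERR_VALUE definition mirrors the idea of the kernel macro in
linux/err.h (minus the unlikely() hint); the symbol table and lookup()
are hypothetical and only stand in for __find_symbol().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_ERRNO	4095
/* True for the last MAX_ERRNO values of the unsigned long range. */
#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

struct sym {
	const char *name;
	unsigned long value;
};

/* Hypothetical lookup: the symbol's value, or -ENOENT if it is absent. */
static unsigned long lookup(const struct sym *tab, unsigned int n,
			    const char *name)
{
	assert(tab && name);

	for (unsigned int i = 0; i < n; i++)
		if (strcmp(tab[i].name, name) == 0)
			return tab[i].value;	/* may legitimately be 0 */

	return -ENOENT;		/* converted to a huge unsigned value */
}
```

A caller that tests "!lookup(...)" cannot distinguish a zero-valued symbol
from a missing one, whereas IS_ERR_VALUE(lookup(...)) is false for the
former and true for the latter.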

Cc: Mathieu Desnoyers <[EMAIL PROTECTED]>
Cc: Kay Sievers <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 kernel/module.c |   17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/module.c
===
--- linux-2.6.orig/kernel/module.c  2007-11-21 12:58:33.095608448 -0800
+++ linux-2.6/kernel/module.c   2007-11-21 13:00:30.199108674 -0800
@@ -285,7 +285,7 @@ static unsigned long __find_symbol(const
}
}
DEBUGP("Failed to find symbol %s\n", name);
-   return 0;
+   return -ENOENT;
 }
 
 /* Search for module by name: must hold module_mutex. */
@@ -756,7 +756,7 @@ void __symbol_put(const char *symbol)
const unsigned long *crc;
 
preempt_disable();
-   if (!__find_symbol(symbol, &owner, &crc, 1))
+   if (IS_ERR_VALUE(__find_symbol(symbol, &owner, &crc, 1)))
BUG();
module_put(owner);
preempt_enable();
@@ -902,7 +902,8 @@ static inline int check_modstruct_versio
const unsigned long *crc;
struct module *owner;
 
-   if (!__find_symbol("struct_module", &owner, &crc, 1))
+   if (IS_ERR_VALUE(__find_symbol("struct_module",
+   &owner, &crc, 1)))
BUG();
return check_version(sechdrs, versindex, "struct_module", mod,
 crc);
@@ -955,7 +956,7 @@ static unsigned long resolve_symbol(Elf_
/* use_module can fail due to OOM, or module unloading */
if (!check_version(sechdrs, versindex, name, mod, crc) ||
!use_module(mod, owner))
-   ret = 0;
+   ret = -EINVAL;
}
return ret;
 }
@@ -1348,14 +1349,16 @@ static int verify_export_symbols(struct 
const unsigned long *crc;
 
for (i = 0; i < mod->num_syms; i++)
-   if (__find_symbol(mod->syms[i].name, &owner, &crc, 1)) {
+   if (!IS_ERR_VALUE(__find_symbol(mod->syms[i].name,
+   &owner, &crc, 1))) {
name = mod->syms[i].name;
ret = -ENOEXEC;
goto dup;
}
 
for (i = 0; i < mod->num_gpl_syms; i++)
-   if (__find_symbol(mod->gpl_syms[i].name, &owner, &crc, 1)) {
+   if (!IS_ERR_VALUE(__find_symbol(mod->gpl_syms[i].name,
+   &owner, &crc, 1))) {
name = mod->gpl_syms[i].name;
ret = -ENOEXEC;
goto dup;
@@ -1405,7 +1408,7 @@ static int simplify_symbols(Elf_Shdr *se
   strtab + sym[i].st_name, mod);
 
/* Ok if resolved.  */
-   if (sym[i].st_value != 0)
+   if (!IS_ERR_VALUE(sym[i].st_value))
break;
/* Ok if weak.  */
if (ELF_ST_BIND(sym[i].st_info) == STB_WEAK)


Identifying a specific affected file on Ext3 on a top of raid0 of raid1s

2007-11-21 Thread alex-lists-linux-kernel
Hi,

I have a rather nasty situation developing on one of the big 24x7 
production database servers. It seems that a batch of drives in one of the
servers started to fail. 

The file servers are ext3fs on a top of raid-0 over a pair of raid-1 mirrors, 
with each of raid-1 mirrors having two drives. The issue is that all four 
drives are developing errors. 

Is there a way to determine what files are affected if I know the LBA# of the
errors on individual drives?

It is running under 2.6.20.x series of kernels.

Alex


