Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Linus Torvalds


On Mon, 19 Mar 2007, Linus Torvalds wrote:
 
 So *please* don't believe that you can make it as cheap to have some 
 automatic fixup of two sequences, one inlined and one as a call.  It may 
 look so when you look at the single instruction generated, but you're 
 ignoring all the instructions *around* the site.

Side note, you can certainly fix things like this at least in theory, but 
it requires:

 - the call instruction that is used instead of the inlining should 
   basically have no callee-clobbers. Any paravirt_ops called this way
   should have a special calling sequence where they simple save and 
   restore all the registers they use.

   This is usually not that bad. Just create a per-architecture wrapper 
   function that saves/restores anything that the C calling convention on 
   that architecture says is clobbered by calls.

 - if the function has arguments, and the inlined sequence can take the 
   arguments in arbitrary registers, you are going to penalize the inlined 
   sequence anyway (by forcing some fixed arbitrary register allocation 
   policy).This thing is largely unfixable without some really extreme 
   compiler games (like post-processing the assembler output and having 
   different entry-points depending on where the arguments are..)

.. it will obviously depend on how thngs are done whether these things are 
useful or not. But it does mean that it's always a good idea to just have 
a config option of turn off all the paravirt crap, because it *does* add 
overhead, and replacing instructions on the fly doesn't make that 
overhead go away.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Jeremy Fitzhardinge
Eric W. Biederman wrote:
 Rusty Russell [EMAIL PROTECTED] writes:

   
 On Sun, 2007-03-18 at 13:08 +0100, Andi Kleen wrote:
 
 The idea is _NOT_ that you go look for references to the paravirt_ops
 members structure, that would be stupid and you wouldn't be able to
 use the most efficient addressing mode on a given cpu, you'd be
 patching up indirect calls and crap like that.  Just say no...
 
 That wouldn't handle inlines though. At least some of the current
 paravirtops like cli/sti are critical enough to require inlining.
   
 Well, we'd patch the inline over the call if we have room.

 Magic patching would be neat, but the downsides are that (1) we can't
 expand the patching room and (2) there's no way of attaching clobber
 info to the call site (doing register liveness analysis is not
 appealing).
 

 True.  You can use all of the call clobbered registers.
   

The trouble is that there are places where we want to intercept things
like sti/cli in entry.S in a context where all the registers are in use,
so we can't generally make the assumption that caller-saved regs are
clobberable.  For any call-site in compiler-generated code we can make
that assumption, of course.

 Now, this may not be fatal.  5 bytes is enough for all the native ops to
 be patched inline.   For lguest this covers popf and pushf, but not cli
 and sti (10 bytes): they'd have to be calls.

 As for clobber info, it turns out that almost all of the calls can
 clobber %eax, which is probably enough.  We just need to mark the
 handful of asm ones where this isn't true.
 

 I guess if the code is larger than a function call I'm failing to see
 the disadvantage in making it a direct function call.  Any modern
 processor ought to be able to predict it perfectly, and processors
 like the P4 may even optimize the call out of their L1 instruction
 cache.
   

For current processors, I think i-cache pollution is really the only
downside to a call.  But maybe you could put multiple pv-op functions
into a cacheline if they're used in close proximity
(cli/sti/save_fl/restore_fl would all fit into a single cacheline).

 If what David is suggesting works, making all of these direct calls
 looks easy and very maintainable.   At which point patching
 instructions inline is quite possibly overkill.
   

I spent some time looking at what it would take to get this going.  It
would probably look something like:

   1. make CONFIG_PARAVIRT select CONFIG_RELOCATABLE
   2. generate vmlinux
   3. extract relocs, and process the references to paravirt symbols
  into a table
   4. compile and link that table into the kernel again (like the ksyms
  dance)
   5. at boot time, use that table to redirect calls/references into the
  paravirt backend to the appropriate one

I think if we store the reloc references as section-relative, and the
reloc table itself is linked into its own section, then there's no need
for multipass linking to combine the reloc table into the kernel image,
since adding that table won't affect any existing reloc.

All this is doable; I'd probably end up hacking boot/compressed/relocs.c
to generate the appropriate reloc table.  My main concern is hacking the
kernel build process itself; I'm unsure of what it would actually take
to implement all this.

 Is it truly critical to inline any of these instructions?
   

Possibly not, but I'd like to be able to say with confidence that
running a PARAVIRT kernel on bare hardware has no performance loss
compared to running a !PARAVIRT kernel.  There's the case of small
instruction sequences which have been replaced with calls (such as
sti/cli/push;popf/etc), and also the case of hooks which are noops on
native (like make_pte/pte_val, etc).  In those cases it would be nice to
patch in a replacement, either a real instruction or just nops.


J
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] [RFC] AF_RXRPC socket family implementation [try #2]

2007-03-19 Thread David Miller
From: David Howells [EMAIL PROTECTED]
Date: Mon, 19 Mar 2007 15:41:38 +

 Alan Cox [EMAIL PROTECTED] wrote:
 
  So use SOCK_DGRAM, its clearly near enough.
 
 No, it's not.  SOCK_DGRAM is an unreliable, unidirectional datagram passing
 service.

David we're not looking for a precise match, so please stop
discussing this in those terms.  We're looking for something
close enough.

The more and more I read Alan's arguments and your resistence to
his logic, the more I side with Alan.  He's definitely right on
all the basic counts here.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Jeremy Fitzhardinge
Linus Torvalds wrote:
 On Mon, 19 Mar 2007, Eric W. Biederman wrote:
   
 True.  You can use all of the call clobbered registers.
 

 Quite often, the biggest single win of inlining is not so much the code 
 size (although if done right, that will be smaller too), but the fact that 
 inlining DOES NOT CLOBBER AS MANY REGISTERS!
   

Yes, that's something that we've been conscious of when designing the
pv_ops patching mechanism.  As it stands, each time you emit a call to a
pv-ops operation, it not only stores the start and end of the patchable
area, but also which registers are available for clobbering at that
callsite by the patched-in code.  For things like sti/cli, where the
surrounding code expects nothing to be clobbered, we make sure that the
regs are preserved on the pv-ops side, even though there's a call to a
normal C function in the middle.  That gives in the pvops backend the
flexibility to patch over that site with either some inline code or a
call out to some function which doesn't necessarily clobber the full
caller-save set (or even any registers).

 In short: people here seem to think that inlining is about avoiding the 
 call/ret sequence. Not so. The real advantages of inlining are elsewhere.
   

Yes.  Probably the biggest win of inlining is constant propagation into
the inline function, but register-use flexibility is good small-scale
win (vs the micro-scale call/ret elimination).

 Side note, you can certainly fix things like this at least in theory, but 
 it requires:

  - the call instruction that is used instead of the inlining should 
basically have no callee-clobbers. Any paravirt_ops called this way
should have a special calling sequence where they simple save and 
restore all the registers they use.

This is usually not that bad. Just create a per-architecture wrapper 
function that saves/restores anything that the C calling convention on 
that architecture says is clobbered by calls.
   

Most of the places we intercept are normal C calls anyway, so this isn't
a big issue.  Its mainly the places where we replace single instructions
that this will have a big effect.

The trouble with this is that we're back to having to have special
wrappers around each call to hide the normal C calling convention from
the compiler, so that it knows that it has more registers to play with. 
This was the main complaint about the original version of the patch,
because it all looks very ugly.

  - if the function has arguments, and the inlined sequence can take the 
arguments in arbitrary registers, you are going to penalize the inlined 
sequence anyway (by forcing some fixed arbitrary register allocation 
policy).This thing is largely unfixable without some really extreme 
compiler games (like post-processing the assembler output and having 
different entry-points depending on where the arguments are..)
   

Yeah, that doesn't sound like much fun.  I think using the normal
regparm calling convention will be OK.  Aside from some slightly longer
instruction encodings, all the registers are more or less
interchangeable anyway.

 .. it will obviously depend on how thngs are done whether these things are 
 useful or not. But it does mean that it's always a good idea to just have 
 a config option of turn off all the paravirt crap, because it *does* add 
 overhead, and replacing instructions on the fly doesn't make that 
 overhead go away.

Yes, there's a big switch to turn all this off.  It would be nice if we
could get things to the point that it isn't necessary to have (ie,
running on bare hardware really is indistinguishable either way), but
we're a fair way from being able to prove that.  In the meantime, the
goal is to try to keep the source-level changes a local as possible so
that maintaining the CONFIG_PARAVIRT vs non-PARAVIRT distinction is
straightforward.


J
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread David Miller
From: Jeremy Fitzhardinge [EMAIL PROTECTED]
Date: Mon, 19 Mar 2007 12:10:08 -0700

 All this is doable; I'd probably end up hacking boot/compressed/relocs.c
 to generate the appropriate reloc table.  My main concern is hacking the
 kernel build process itself; I'm unsure of what it would actually take
 to implement all this.

32-bit Sparc's btfixup might be usable as a guide.

Another point worth making is that for function calls you
can fix things up lazily if you want.

So you link, build the reloc tables, then link in a *.o file that
does provide the functions in the form of stubs.  The stubs intercept
the call, and patch the callsite, then revector to the real handler.

I don't like this idea actually because it essentially means you
either:

1) Only allow one setting of the operations

OR

2) Need to have code which walks the whole reloc table anyways
   to handle settings after the first so you can revector
   everyone back to the stubs and lazy reloc properly again

In fact forget I mentioned this idea :)

As another note, I do agree with Linus about the register usage
arguments.  It is important.  I think it's been mentioned but what you
could do is save nothing (so that sti and cli are just that and
cost nothing), but the more complicated versions save and restore
enough registers to operate.

It all depends upon what you're trying to do.  For example, it's
easy to use patching to make different PTE layouts be supportable
in the same kernel image.  We do this on sparc64 since sun4v
has a different PTE layout than sun4u, you can see the code in
asm-sparc64/pgtable.h for details (search for sun4v_*_patch)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Jeremy Fitzhardinge
David Miller wrote:
 Another point worth making is that for function calls you
 can fix things up lazily if you want.
 [...]
 In fact forget I mentioned this idea :)
   

OK :)  I think we'll only ever want to bind to a hypervisor once, since
the underlying hypervisor can't change on the fly (well, in principle it
might if you migrate, but you'll have more problems than just dealing
with the hook points).  And lazy binding doesn't buy much; I think its
much better to get it all out of the way at once, and then throw the
reloc info away.

 As another note, I do agree with Linus about the register usage
 arguments.  It is important.  I think it's been mentioned but what you
 could do is save nothing (so that sti and cli are just that and
 cost nothing), but the more complicated versions save and restore
 enough registers to operate.
   

Right, that's pretty much what we do now.

 It all depends upon what you're trying to do.  For example, it's
 easy to use patching to make different PTE layouts be supportable
 in the same kernel image.  We do this on sparc64 since sun4v
 has a different PTE layout than sun4u, you can see the code in
 asm-sparc64/pgtable.h for details (search for sun4v_*_patch)
   

I see.  We want to do something conceptually like this, but we need to
handle more than just adjusting which of two constants to use as a
mask.  For example, Xen needs to run pfns through a pfn-mfn (machine
frame number) conversion and back when making/unpacking pagetable entries.

J
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 24/31] IrDA: irttp_dup spin_lock initialisation

2007-03-19 Thread Greg KH
-stable review patch.  If anyone has any objections, please let us know.

--

From: Samuel Ortiz [EMAIL PROTECTED]

Without this initialization one gets

kernel BUG at kernel/rtmutex_common.h:80!

This patch should also be included in the -stable kernel.

Signed-off-by: G. Liakhovetski [EMAIL PROTECTED]
Signed-off-by: Samuel Ortiz [EMAIL PROTECTED]
Cc: David Miller [EMAIL PROTECTED]
Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]

---
 net/irda/irttp.c |1 +
 1 file changed, 1 insertion(+)

--- a/net/irda/irttp.c
+++ b/net/irda/irttp.c
@@ -1455,6 +1455,7 @@ struct tsap_cb *irttp_dup(struct tsap_cb
 
/* Not everything should be copied */
new-notify.instance = instance;
+   spin_lock_init(new-lock);
init_timer(new-todo_timer);
 
skb_queue_head_init(new-rx_queue);

-- 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND] NET: Add packet sock option to return orig_dev to userspace when bonded

2007-03-19 Thread David Miller

I got it the first time, it's just very low priority and very
deep in my backlog and I'm also about to be away for 3 days.

I have it so you don't need to keep resending it.

Thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Andi Kleen
 Possibly not, but I'd like to be able to say with confidence that
 running a PARAVIRT kernel on bare hardware has no performance loss
 compared to running a !PARAVIRT kernel.  There's the case of small
 instruction sequences which have been replaced with calls (such as
 sti/cli/push;popf/etc), 

My guess is that most critical pushf/popf are in spin_lock_irqsave(). It would 
be possible to special case that one -- inline it -- and use out of line
versions for all the others.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Rusty Russell
On Mon, 2007-03-19 at 11:38 -0700, Linus Torvalds wrote:
 
 On Mon, 19 Mar 2007, Eric W. Biederman wrote:
  
  True.  You can use all of the call clobbered registers.
 
 Quite often, the biggest single win of inlining is not so much the code 
 size (although if done right, that will be smaller too), but the fact that 
 inlining DOES NOT CLOBBER AS MANY REGISTERS!

Thanks Linus.

*This* was the reason that the current hand-coded calls only clobber %
eax.  It was a compromise between native (no clobbers) and others (might
need a reg).

Now, since we decided to allow paravirt_ops operations to be normal C
(ie. the patching is optional and done late), we actually push and pop %
ecx and %edx.  This makes the call site 10 bytes long, which is a nice
size for patching anyway (enough for a movl $0, addr, a-la lguest's
cli, or movw $0, %gs:addr if we supported SMP).

The current 6 paravirt ops which are patched cover the vast majority of
calls (until the Xen patches, then we need ~4 more?).  Jeremy chose to
expand patching to cover *all* paravirt ops, rather than just the new
hot ones, and that's where we tipped over the ugliness threshold.

Cheers,
Rusty.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6 patch] x25_forward_call(): fix NULL dereferences

2007-03-19 Thread David Miller
From: Adrian Bunk [EMAIL PROTECTED]
Date: Mon, 19 Mar 2007 10:24:03 +0100

 This patch fixes two NULL dereferences spotted by the Coverity checker.
 
 For a better understanding, the diff -uwp output (that ignores the 
 indentation changes) is:

I'll apply this, thanks Adrian.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Zachary Amsden

Jeremy Fitzhardinge wrote:

For example, say we wanted to put a general call for sti into entry.S,
where its expected it won't touch any registers.  In that case, we'd
have a sequence like:

push %eax
push %ecx
push %edx
call paravirt_cli
pop %edx
pop %ecx
pop %eax
  

If we parse the relocs, then we'd find the reference to paravirt_cli. 
If we look at the byte before and see 0xe8, then we can see if its a

call.  If we then work out in each direction and see matched push/pops,
then we know what registers can be trashed in the call.  This also
allows us to determine the callsite size, and therefore how much space
we need for inlining.
  


No, that is a very dangerous suggestion.  You absolutely *cannot* do 
this safely without explicitly marking the start EIP of this code.  You 
*must* use metadata to do that.  It is never safe to disassemble 
backwards or rewind EIP for x86 code.


Zach
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
 Jeremy Fitzhardinge wrote:
  If we then work out in each direction and see matched push/pops,
 then we know what registers can be trashed in the call.  This also
 allows us to determine the callsite size, and therefore how much space
 we need for inlining.
   

 No, that is a very dangerous suggestion.  You absolutely *cannot* do
 this safely without explicitly marking the start EIP of this code. 
 You *must* use metadata to do that.  It is never safe to disassemble
 backwards or rewind EIP for x86 code. 

What do you mean the instruction before is mov $0x52515000,%eax?

Yeah, you're right.  Oh well.

J
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Zachary Amsden

Rusty Russell wrote:

On Mon, 2007-03-19 at 11:38 -0700, Linus Torvalds wrote:
  

On Mon, 19 Mar 2007, Eric W. Biederman wrote:


True.  You can use all of the call clobbered registers.
  
Quite often, the biggest single win of inlining is not so much the code 
size (although if done right, that will be smaller too), but the fact that 
inlining DOES NOT CLOBBER AS MANY REGISTERS!



For VMI, the default clobber was cc, and you need a way to allow at 
least that, because saving and restoring flags is too expensive on x86.



Thanks Linus.

*This* was the reason that the current hand-coded calls only clobber %
eax.  It was a compromise between native (no clobbers) and others (might
need a reg).
  


I still don't think this was a good trade.  The primary motivation for 
clobbering %eax was that Xen wanted a free register to use for computing 
the offset into the shared data in the case of SMP preemptible kernels.  
Xen no longer needs such a register, they can use the PDA offset 
instead.  And it does hurt native performance by unconditionally 
stealing a register in the four most commonly invoked paravirt-ops code 
sequences.



Now, since we decided to allow paravirt_ops operations to be normal C
(ie. the patching is optional and done late), we actually push and pop %
ecx and %edx.  This makes the call site 10 bytes long, which is a nice
size for patching anyway (enough for a movl $0, addr, a-la lguest's
cli, or movw $0, %gs:addr if we supported SMP).
  


You can do it in 11 bytes with no clobbers and normal C semantics by 
linking to a direct address instead of calling to an indirect, but then 
you need some gross fixup technology in paravirt_patch:


if (call_addr == (void*)native_sti) {
 ...
}

I think we should probably try to do it in 12 bytes.  Freeing eax to the 
inline caller is likely to make up the 2 bytes of space more we have to nop.


One thing I always tried to get in VMI was to encapsulate the actual 
code which went through the business of computing arguments that were 
not even used in the native case.  Unfortunately, that seems impossible 
in the current design, but I don't think it is an issue because I don't 
think there is actually a way to express:


SWITCHABLE_CODE_BLOCK_BEGIN {
  /* arbitrary C code for native */
} SWITCHABLE_CODE_BLOCK_ALTERNATIVE {
  /* arbitrary C code for something else */
}

Dave's linker suggestion is probably the best for things like that.

Zach
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [6/6] 2.6.21-rc4: known regressions

2007-03-19 Thread David Miller
From: Adrian Bunk [EMAIL PROTECTED]
Date: Sun, 18 Mar 2007 19:49:38 +0100

 Subject: ipv6 crash
 References : http://lkml.org/lkml/2007/3/10/2
 Submitter  : Len Brown [EMAIL PROTECTED]
 Status : unknown

This is caused by some problem in the router round-robin code in
net/ipv6/route.c:rt6_select()

Somehow it NULLs out fn-leaf, and then fib6_add_1() crashes
dererencing that NULL pointer as is seen in the report.

Deleting the router round-robin list mangling code in rt6_select()
makes the crash go away, but such a change causes regressions in the
ipv6 conformance tests.

Thomas Graf discovered this bug some time ago, but we still
haven't come up with a fix suitable for upstream :-/

This bug has been there for a very long time and is not a regression
of 2.6.21

I'll see if I can come up with something to fix this properly.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.22 3/3] Add LED trigger to libata core

2007-03-19 Thread Tejun Heo

Tony Vroon wrote:

The first user of ata_ac_issue_prot_with_ledtrigger, the ServerWorks Frodo/
Apple K2 driver. Used by the IDE LED trigger on G5 towers.
Respin of an earlier patch, based on comments by Tejun Heo  Alan Cox.


Just two comments.

1. IMHO, ata_qc_issue_prot_ledtrigger() without 'with' is good enough. 
This is just my personal preference.  Feel free to ignore it.


2. Patch #1 and #2 should be merged.  They're one logical change of 
adding ata_qc_issue_prot_with_ledtrigger().  Patch #3 is a logically 
separate change of using it, but unless it's a wide conversion, 
implementing something and using something can be merged.  So, please 
merge #1 and #2 and possibly #3.


Thanks.

--
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PATCH] USB fixes for 2.6.21-rc4

2007-03-19 Thread Greg KH
Here are some USB fixes and new device ids against 2.6.21-rc4.

These patches contain a fix for a regression that is on Adrian's list
with regards to usb-serial drivers oopsing, and they also add a number
of different device ids and other minor updates.

All of these have been in the -mm releases.

Please pull from:
master.kernel.org:/pub/scm/linux/kernel/git/gregkh/usb-2.6.git/

The full patches will be sent to the linux-usb-devel mailing list, if
anyone wants to see them.

thanks,

greg k-h

 drivers/usb/class/usblp.c  |6 +-
 drivers/usb/misc/berry_charge.c|2 +-
 drivers/usb/net/dm9601.c   |8 
 drivers/usb/serial/airprime.c  |4 
 drivers/usb/serial/mos7720.c   |1 +
 drivers/usb/serial/option.c|   15 +--
 drivers/usb/serial/usb-serial.c|   11 +--
 drivers/usb/storage/unusual_devs.h |   10 ++
 8 files changed, 43 insertions(+), 14 deletions(-)

---

Alan Stern (1):
  usblp: quirk flag and device entry for Seiko Epson M129C printer

Greg Kroah-Hartman (1):
  USB: new Novatel device ids for option driver

Jim Radford (1):
  USB: fix usb-serial regression

Jon Dowland (1):
  USB: two more device ids for dm9601 usbnet driver

Ken L Johnson (1):
  USB: berry_charge: correct dbg string for second magic command

Mark Glines (1):
  airprime: USB ID for Novatel EV620 mini PCI-E card

Oliver Neukum (1):
  USB: necessary update for mos7720 driver

Pete Zaitcev (1):
  USB: RAZR v3i unusual_devs

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 03/31] Fix user copy length in ipv6_sockglue.c

2007-03-19 Thread Greg KH
On Mon, Mar 19, 2007 at 03:01:25PM -0700, Chris Wright wrote:
 * Greg KH ([EMAIL PROTECTED]) wrote:
  From: Chris Wright [EMAIL PROTECTED]
  
  [IPV6] fix ipv6_getsockopt_sticky copy_to_user leak
  
  User supplied len  0 can cause leak of kernel memory.
  Use unsigned compare instead.
 
 You can drop this one.  It's dependent on a patch
 that is not in 2.6.20.

Ok, thanks for letting me know, it is now dropped.

greg k-h
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [stable] [patch 29/31] Input: i8042 - fix AUX IRQ delivery check

2007-03-19 Thread Greg KH
On Mon, Mar 19, 2007 at 05:48:55PM -0400, Dmitry Torokhov wrote:
 On 3/19/07, Greg KH [EMAIL PROTECTED] wrote:
  -stable review patch.  If anyone has any objections, please let us know.
 
  --
 
  From: Dmitry Torokhov [EMAIL PROTECTED]
 
  Input: i8042 - fix AUX IRQ delivery check
 
  On boxes that do not implement AUX LOOP command we can not
  verify AUX IRQ delivery and must assume that it is wired
  properly.
 
 
 Greg,
 
 There is another piece missing in AUX delivery test, commit
 
 3ca5de6dd4ec5a139b2b8f00dce3e4726ca91af1
 
 Unfortunately I can't send you a patch at the moment but if you could
 get it from the mainline that would be great.

Thanks for letting me know, I've added it to the queue now.

greg k-h
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sysctl: vfs_cache_divisor

2007-03-19 Thread H. Peter Anvin

Randy Dunlap wrote:


The we duplicate all the relevant /proc knobs:

cat /proc/sys/vm/dirty_ratio
30
cat /proc/sys/vm/hires-dirty_ratio/
30

Or we do something else ;)


Sounds better.  I wasn't very keen on the userspace interface that this
exposed.  Will look at those.



Okay... may be I could throw a spanner in the machinery, and suggest 
another option: perhaps we should add a way to do sysctl which can 
handle fractional (fixed-point) values... more coherent/detailed message 
tomorrow.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Greg KH
On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
 Arjan van de Ven [EMAIL PROTECTED] writes:
  
  well we can do the handshake to take ownership like we do much later in
  boot, but that requires PCI to be there and fully discovered, which we
  don't have this early.
 
 That's not true - we do early pci discovery. Doing USB handsoff
 there would be quite possible.

What, we don't do USB handoff early enough in the boot process?  It's
happening at PCI quirk time now, which I think should be early enough
for everyone (and too early for some who rely on USB keyboards and
initramfs shells...)

thanks,

greg k-h
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Paul Mackerras
Christoph Lameter writes:

 +static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void 
 *))
 +{

...

 + p = (void *)__get_free_page(flags | __GFP_ZERO);

This will cause problems on 64-bit powerpc, at least with 4k pages,
since the pmd and pgd levels only use 1/4 of a page.

Paul.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2.6.21-rc4-mm1

2007-03-19 Thread Andrew Morton

Temporarily at

  http://userweb.kernel.org/~akpm/2.6.21-rc4-mm1/

Will appear later at

  
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc4/2.6.21-rc4-mm1/



- Restored the RSDL CPU scheduler (a new version thereof)



Boilerplate:

- See the `hot-fixes' directory for any important updates to this patchset.

- To fetch an -mm tree using git, use (for example)

  git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git 
tag v2.6.16-rc2-mm1
  git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the
  mm-commits mailing list.

echo subscribe mm-commits | mail [EMAIL PROTECTED]

- If you hit a bug in -mm and it is not obvious which patch caused it, it is
  most valuable if you can perform a bisection search to identify which patch
  introduced the bug.  Instructions for this process are at

http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt

  But beware that this process takes some time (around ten rebuilds and
  reboots), so consider reporting the bug first and if we cannot immediately
  identify the faulty patch, then perform the bisection search.

- When reporting bugs, please try to Cc: the relevant maintainer and mailing
  list on any email.

- When reporting bugs in this kernel via email, please also rewrite the
  email Subject: in some manner to reflect the nature of the bug.  Some
  developers filter by Subject: when looking for messages to read.

- Occasional snapshots of the -mm lineup are uploaded to
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are announced on
  the mm-commits list.


Changes since 2.6.21-rc3-mm1:


 origin.patch
 git-acpi.patch
 git-alsa.patch
 git-arm-master.patch
 git-arm.patch
 git-avr32.patch
 git-cifs.patch
 git-cpufreq.patch
 git-powerpc.patch
 git-drm.patch
 git-dvb.patch
 git-gfs2-nmw.patch
 git-hid.patch
 git-ia64.patch
 git-ieee1394.patch
 git-infiniband.patch
 git-input.patch
 git-kbuild.patch
 git-kvm.patch
 git-leds.patch
 git-libata-all.patch
 git-md-accel.patch
 git-mmc.patch
 git-mtd.patch
 git-ubi.patch
 git-netdev-all.patch
 git-ioat.patch
 git-ocfs2.patch
 git-parisc.patch
 git-selinux.patch
 git-pciseg.patch
 git-s390.patch
 git-sh.patch
 git-scsi-misc.patch
 git-scsi-rc-fixes.patch
 git-unionfs.patch
 git-wireless.patch
 git-ipwireless_cs.patch
 git-gccbug.patch

 git trees

-uml-hostfs-fix-double-free.patch
-uml-hostfs-make-hostfs=-option-work-as-a-jail-as-intended.patch
-uml-fix-a-memory-leak-in-the-multicast-driver.patch
-uml-remove-dead-code-about-os_usr1_signal-and-os_usr1_process.patch
-uml-mark-both-consoles-as-con_anytime.patch
-uml-fix-confusion-irq-early-reenabling.patch
-uml-activate_fd-return-enomem-only-when-appropriate.patch
-uml-fix-errno-usage.patch
-x86_64-fix-2618-regression-ptrace_oldsetoptions-should-be-accepted.patch
-bluetooth-fix-socket-locking-in-hci_sock_dev_event.patch
-add-epoll-compat_-code-to-fs-compatc.patch
-check_partition-fix-error-check.patch
-uml-arch_prctl-should-set-thread-fs.patch
-connector-bugfix-for-cn_call_callback.patch
-26-altix-console-fix-for-config_debug_shirq-usage.patch
-ecryptfs-nested-locking-annotation.patch
-swsusp-disable-nonboot-cpus-before-entering-platform-suspend.patch
-paravirt-build-fixes.patch
-acpi-disabled-due-to-dmi-failure-or-blacklisted-year-should-be-noted-as-is-done-with-other-acpi-blacklisting.patch
-git-alsa-oops-fix.patch
-avr32-dma-mappingh.patch
-gregkh-driver-device-symlink.patch
-gregkh-driver-platform-reorder-platform_device_del.patch
-gregkh-driver-remove-devfs-from-maintainers.patch
-gregkh-driver-driver-core-export-device_rename.patch
-gregkh-driver-uio-irq.patch
-scheduled-removal-of-sa_xxx-interrupt-flags-fixups-4.patch
-make-drivers-char-drm-drm_vmcdrm_io_prot-static.patch
-fix-saa7146_clipping_mem-size.patch
-drivers-media-video-cpia_ppc-dont-use-_work_nar.patch
-dvb-core-fix-several-locking-related-problems.patch
-saa7134-fix-modules=n-compilation.patch
-ivtv-warning-fix.patch
-jdelvare-i2c-i2c-03-use-i2c_adapterdevparent-for-messages.patch
-jdelvare-i2c-i2c-i801-restore-initial-state.patch
-jdelvare-i2c-ds1374-check-for-workqueue-creation.patch
-crash-on-evdev-disconnect.patch
-expose-set_mode-method-so-it-can-be-wrapped.patch
-ata_piix-remove-ugly-layering-violation.patch
-pata_cmd640-multiple-updates.patch
-ide-cmd64x-fix-recovery-time-calculation-take2.patch
-mtd-maps-ck804xromc-pci_module_init-to-pci_register_driver.patch
-mtd-chips-oops-in-cfi_amdstd_sync.patch
-mtd-esb2-check-for-closed-rom-window.patch
-dilnetpc-fix-warning.patch
-mtd-correct-misspelled-preprocessor-variable.patch
-git-netdev-all-ipw2200-fix.patch
-mv643xx-ethernet-driver-irq-registration-fix.patch
-via-rhine-set-avoid_d3-for-broken-bioses.patch
-netxen-fix-warnings.patch
-e1000-fix-be-ready-for-incoming-irq-at-pci_request_irq.patch
-e1000-fix-firmware-handover-bits.patch
-e1000-fix-stop-raw-interrupts-disabled-nag-from-rt.patch

Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded

2007-03-19 Thread Andrew Morton
On Tue, 20 Mar 2007 13:47:53 +1100 Nick Piggin [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  On Mon, 19 Mar 2007 17:58:52 -0800 Andrew Morton [EMAIL PROTECTED] wrote:
  
  
 The kernel without Nick's patchset but with the assert runs OK too.  Under
 the principle of mm-has-been-too-flakey-lately, I'll drop the patches:
 
 mm-debug-check-for-the-fault-vs-invalidate-race.patch
 mm-simplify-filemap_nopage.patch
 mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
 mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
 mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-tidy.patch
 mm-merge-nopfn-into-fault.patch
 mm-merge-nopfn-into-fault-fix.patch
 mm-remove-legacy-cruft.patch
  
  
  ug, too many rejects.  I'll leave them in, minus
  mm-debug-check-for-the-fault-vs-invalidate-race.patch
  
 
 Hang on a sec... I'll try fixing the thing before you next make a
 release.
 

Too late.  hot-fixes/ awaits thee.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Lee Revell

On 3/16/07, Thomas Gleixner [EMAIL PROTECTED] wrote:

Yes, this is probably caused by SMM code trying to emulate a PS/2
keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
have no way to disable this BIOS misfeature in the early boot process.


https://mail.rtai.org/pipermail/rtai/2003-March/002949.html

http://www.embeddedrelated.com/usenet/embedded/show/50333-1.php

I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
not to mention people trying to spec out hardware for RT
applications...

Lee
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/31] 2.6.20-stable review

2007-03-19 Thread Gene Heskett
On Monday 19 March 2007, Greg KH wrote:
This is the start of the stable review cycle for the 2.6.20.4 release.
There are 31 patches in this series, all will be posted as a response
to this one.  If anyone has any issues with these being applied, please
let us know.  If anyone is a maintainer of the proper subsystem, and
wants to add a Signed-off-by: line to the patch, please respond with it.

These patches are sent out with a number of different people on the
Cc: line.  If you wish to be a reviewer, please email [EMAIL PROTECTED]
to add your name to the list.  If you want to be off the reviewer list,
also email us.

Responses should be made by Thursday March, 22, 15:00:00 UTC.
Anything received after that time might be too late.

BINGO!  One of these 31 patches may be the guilty party that's playing
tricks with tar's mind.  I'm running 2.6.20.4-rc1 on an older athlon 
xp2800 with a gig of ram.

Amanda has gotten through the estimate phase and is now doing the backup.  
It will fail, out of tape.  Here is an amstatus output as its running 
right now.

coyote:/GenesAmandaHelper-0.5 3 planner: [dumps way too big, 350850 KB, must 
skip incremental dumps]
coyote:/GenesAmandaHelper-0.6 1 planner: [dumps way too big, 184977 KB, must 
skip incremental dumps]
coyote:/bin   1 planner: [dumps way too big, 1110 KB, must skip 
incremental dumps]
coyote:/boot  13m wait for dumping
coyote:/dev   1 planner: [dumps way too big, 290 KB, must skip 
incremental dumps]
coyote:/etc   1 planner: [dumps way too big, 18291 KB, must 
skip incremental dumps]
coyote:/home  0 1018m wait for dumping
coyote:/lib   3 planner: [dumps way too big, 11705 KB, must 
skip incremental dumps]
coyote:/opt   15m wait for dumping
coyote:/root  3 planner: [dumps way too big, 785963 KB, must 
skip incremental dumps]
coyote:/sbin  1 planner: [dumps way too big, 10 KB, must skip 
incremental dumps]
coyote:/tmp   4   32m wait for dumping
coyote:/usr/X11R6 12m wait for dumping
coyote:/usr/bin   1 planner: [dumps way too big, 339170 KB, must 
skip incremental dumps]
coyote:/usr/dlds  1 planner: [dumps way too big, 2140 KB, must skip 
incremental dumps]
coyote:/usr/dlds-misc 30m wait for dumping
coyote:/usr/dlds-rpms 1 planner: [dumps way too big, 3130 KB, must skip 
incremental dumps]
coyote:/usr/dlds-tgzs 1 planner: [dumps way too big, 10 KB, must skip 
incremental dumps]
coyote:/usr/games 00m wait for dumping
coyote:/usr/include   1 planner: [dumps way too big, 10557 KB, must 
skip incremental dumps]
coyote:/usr/kerberos  10m wait for dumping
coyote:/usr/lib   1 planner: [dumps way too big, 474409 KB, must 
skip incremental dumps]
coyote:/usr/libexec   2 planner: [dumps way too big, 11285 KB, must 
skip incremental dumps]
coyote:/usr/local 2  279m wait for dumping
coyote:/usr/man   10m wait for dumping
coyote:/usr/movies2 7271m dumping 5485m ( 75.44%) (0:12:47)
coyote:/usr/music 1 planner: [dumps way too big, 2448290 KB, must 
skip incremental dumps]
coyote:/usr/pix   2   17m wait for dumping
coyote:/usr/sbin  1 planner: [dumps way too big, 3254 KB, must skip 
incremental dumps]
coyote:/usr/share 3 planner: [dumps way too big, 40514 KB, must 
skip incremental dumps]
coyote:/usr/src   3 6822m wait for dumping
coyote:/var   1  366m wait for dumping

SUMMARY  part  real  estimated
   size   size
partition   :  32
estimated   :  3231973m
flush   :   0 0m
failed  :  1816155m   ( 50.53%)
wait for dumping:  13 8547m   ( 26.73%)
dumping to tape :   00m   (  0.00%)
dumping :   1  5485m  7271m ( 75.44%) ( 17.16%)
dumped  :   0 0m 0m (  0.00%) (  0.00%)
wait for writing:   0 0m 0m (  0.00%) (  0.00%)
wait to flush   :   0 0m 0m (100.00%) (  0.00%)
writing to tape :   0 0m 0m (  0.00%) (  0.00%)
failed to tape  :   0 0m 0m (  0.00%) (  0.00%)
taped   :   0 0m 0m (  0.00%) (  0.00%)
  tape 1:   0 0m 0m (  0.00%) Dailys-19
8 dumpers idle  : not-idle
taper idle
network free kps:  6800
holding space   : 71118m (100.00%)
 dumper0 busy   :  0:00:00  (  0.00%)
 0 dumpers busy :  0:00:00  (  0.00%)
 1 dumper busy  :  0:00:00  (  0.00%)

The directory shown on line one of this report actually has:
[EMAIL PROTECTED] /]# du -h /GenesAmandaHelper-0.5/
1.6G

Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded

2007-03-19 Thread Nick Piggin

Andrew Morton wrote:

On Tue, 20 Mar 2007 13:47:53 +1100 Nick Piggin [EMAIL PROTECTED] wrote:



Andrew Morton wrote:




Hang on a sec... I'll try fixing the thing before you next make a
release.




Too late.  hot-fixes/ awaits thee.


Awww... well thanks very much Michal for reporting the bug, I reproduced
it easily and it turns out to be a typo.

In my testing I never had a lot of writeout going on, so most of the pages
will have been truncated in the first loop...

--
SUSE Labs, Novell Inc.
Fix typo in do_no_page vs invalidate race fix patch.

Index: linux-2.6/mm/truncate.c
===
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -235,7 +235,7 @@ void truncate_inode_pages_range(struct a
wait_on_page_writeback(page);
if (page_mapped(page)) {
unmap_mapping_range(mapping,
- (loff_t)page_indexPAGE_CACHE_SHIFT,
+ (loff_t)page-indexPAGE_CACHE_SHIFT,
  PAGE_CACHE_SIZE, 0);
}
if (page-index  next)


Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded

2007-03-19 Thread Nick Piggin

Nick Piggin wrote:

Andrew Morton wrote:

On Tue, 20 Mar 2007 13:47:53 +1100 Nick Piggin 
[EMAIL PROTECTED] wrote:




Andrew Morton wrote:




Hang on a sec... I'll try fixing the thing before you next make a
release.




Too late.  hot-fixes/ awaits thee.



Awww... well thanks very much Michal for reporting the bug, I reproduced
it easily and it turns out to be a typo.

In my testing I never had a lot of writeout going on, so most of the pages
will have been truncated in the first loop...


Also, noticed another problem in the same general area. Andrew you were
indeed right to question the removal of that unmap_mapping_range call,
but I think even it alone it wasn't enough...

--
SUSE Labs, Novell Inc.
The nopage vs invalidate race fix patch did not take care of truncating
private COW pages. Mind you, I'm pretty sure this was previously racy
even for regular truncate, not to mention vmtruncate_range.

Anyway, fix that omission.

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1905,7 +1905,18 @@ int vmtruncate(struct inode * inode, lof
if (IS_SWAPFILE(inode))
goto out_busy;
i_size_write(inode, offset);
+
+   /*
+* unmap_mapping_range is called twice, first simply for efficiency
+* so that truncate_inode_pages does fewer single-page unmaps. However
+* after this first call, and before truncate_inode_pages finishes,
+* it is possible for private pages to be COWed, which remain after
+* truncate_inode_pages finishes, hence the second unmap_mapping_range
+* call must be made for correctness.
+*/
+   unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
truncate_inode_pages(mapping, offset);
+   unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
goto out_truncate;
 
 do_expand:
@@ -1943,7 +1954,9 @@ int vmtruncate_range(struct inode *inode
 
mutex_lock(inode-i_mutex);
down_write(inode-i_alloc_sem);
+   unmap_mapping_range(mapping, offset, (end - offset), 1);
truncate_inode_pages_range(mapping, offset, end);
+   unmap_mapping_range(mapping, offset, (end - offset), 1);
inode-i_op-truncate_range(inode, offset, end);
up_write(inode-i_alloc_sem);
mutex_unlock(inode-i_mutex);


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Eric St-Laurent
On Tue, 2007-20-03 at 01:04 -0400, Lee Revell wrote:

 I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
 not to mention people trying to spec out hardware for RT
 applications...

There is a SMI disabling module in RTAI, check the smi-module.c in this:

https://www.rtai.org/RTAI/rtai-3.5.tar.bz2

More infos:

http://www.captain.at/rtai-smi-high-latency.php
http://www.captain.at/xenomai-smi-high-latency.php

It might make sense to merge this code, at least in the -rt tree.


- Eric


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)

2007-03-19 Thread Nick Piggin
On Mon, Mar 19, 2007 at 09:44:28PM +0100, Blaisorblade wrote:
 On Sunday 18 March 2007 03:50, Nick Piggin wrote:
   
Yes, I believe that is the case, however I wonder if that is going to
be a problem for you to distinguish between write faults for clean
writable ptes, and write faults for readonly ptes?
  
   I wouldn't be able to distinguish them, but am I going to get write
   faults for clean ptes when vma_wants_writenotify() is false (as seems to
   be for tmpfs)? I guess not.
  
   For tmpfs pages, clean writable PTEs are mapped as writable so they won't
   give any problem, since vma_wants_writenotify() is false for tmpfs.
   Correct?
 
  Yes, that should be the case. So would this mean that nonlinear protections
  don't work on regular files?
 
 They still work in most cases (including for UML), but if the initial mmap() 
 specified PROT_WRITE, that is ignored, for pages which are not remapped via 
 remap_file_pages(). UML uses PROT_NONE for the initial mmap, so that's no 
 problem.

But how are you going to distinguish a write fault on a readonly pte for
dirty page accounting vs a read-only nonlinear protection?

You can't store any more data in a present pte AFAIK, so you'd have to
have some out of band data. At which point, you may as well just forget
about vma_wants_writenotify vmas, considering that everybody is using
shmem/ramfs.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] split file and anonymous page queues #2

2007-03-19 Thread Nick Piggin

Rik van Riel wrote:

Split the anonymous and file backed pages out onto their own pageout
queues.  This we do not unnecessarily churn through lots of anonymous
pages when we do not want to swap them out anyway.

This should (with additional tuning) be a great step forward in
scalability, allowing Linux to run well on very large systems where
scanning through the anonymous memory (on our way to the page cache
memory we do want to evict) is slowing systems down significantly.

This patch has been stress tested and seems to work, but has not
been fine tuned or benchmarked yet.  For now the swappiness parameter
can be used to tweak swap aggressiveness up and down as desired, but
in the long run we may want to simply measure IO cost of page cache
and anonymous memory and auto-adjust.

We apply pressure to each of sets of the pageout queues based on:
- the size of each queue
- the fraction of recently referenced pages in each queue,
   not counting used-once file pages
- swappiness (file IO is more efficient than swap IO)

Please take this patch for a spin and let me know what goes well
and what goes wrong.


This ignores whether a file page is mapped, doesn't it?

Even so, it could be a good approach anyway.

There are a couple of little nice improvements you have there, such as
treating shmem pages in the same class as anon pages. We found that we
needed something similar, so some of those things should go upstream
on their own.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RSDL v0.31

2007-03-19 Thread Willy Tarreau
On Mon, Mar 19, 2007 at 08:11:55PM -0700, Linus Torvalds wrote:
 Quite frankly, I was *planning* on merging RSDL very early after 2.6.21, 
 but there is one thing that has turned me completely off the whole thing:
 
  - the people involved seem to be totally unwilling to even admit there 
might be a problem.
 
 This is like alcoholism. If you cannot admit that you might have a 
 problem, you'll never get anywhere. And quite frankly, the RSDL proponents 
 seem to be in denial (we're always better, it's your problem if the old 
 scheduler works better, just one report of old scheduler being better).
 
 And the thing is, if people aren't even _willing_ to admit that there may 
 be issues, there's *no*way*in*hell* I will merge it even for testing. 
 Because the whole and only point of merging RSDL was to see if it could 
 replace the old scheduler, and the most important feature in that case is 
 not whether it is perfect, BUT WHETHER ANYBODY IS INTERESTED IN TRYING TO 
 FIX THE INEVITABLE PROBLEMS!

Linus, you're unfair with Con. He initially was on this position, and lately
worked with Mike by proposing changes to try to improve his X responsiveness.
But he's ill right now and cannot touch the keyboard, so only his supporters
speak for him, and as you know, speech is not code and does not fix problems.

Leave him a week or so to relieve and let's see what he can propose. Hopefully
a week away from the keyboard will help him think with a more general approach.
Also, Mike has already modified the code a bit to get better experience.

Also, while I don't agree with starting to renice X to get something usable,
it seems real that there's something funny on Mike's system which makes it
behave particularly strangely when combined with RSDL, because other people
in comparable tests (including me) have found X perfectly smooth even with
loads in the tens or even hundreds. I really suspect that we will find a bug
in RSDL which triggers the problem and that this fix will help discover
another problem on Mike's hardware which was not triggered by mainline.

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 20:05 -0500, Matt Mackall wrote:
 On Tue, Mar 20, 2007 at 01:42:46AM +0100, Thomas Gleixner wrote:
  On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
   This is exactly the same problem as booting on a desktop PC. But
   somehow LILO manages. My first Linux box had a hell of a lot less disk
   than the platform I bootstrapped (and wrote NAND drivers for) last
   month had in NAND.
  
  No, it is not. You get the absolute sector address of your second stage
  and this is a complete nobrainer. The translation is done in the DISK
  device.
 
 LILO and friends manage to boot systems that use software RAID and
 LVM. There are multiple methods. Some use block lists, some use tiny
 boot partitions, etc. All of them are applicable to controllerless NAND.

Yes, by using fixed addresses, which is not what I want.

  You simply ignore the fact, that inside each disk, USB Stick, CF-CARD,
  whatever - there is a more or less intellegent controller device, which
  does the mapping to the physical storage location. There is _NO_ such
  thing on a bare FLASH chip.
 
 How many times do I have to tell you that I wrote a driver for
 controllerless NAND just last month?

Wow. I'm impressed because I'm pulling my opinion out of thin air.

  How exactly does device mapper:
  
  A) across device wear levelling ?
 
 The same way UBI does, but encapsulated in a device mapper layer.

Does the device mapper do that ?

  B) dynamic partitioning for FLASH aware file systems ?

 See above.

Does the device mapper do that ?

  C) across device wear levelling for FLASH aware file systems ?
 
 See above.

Look at your own drawing. 

  D) background bit-flip corrections (copying affected blocks and recylce
  the old one) ?
 
 See above.

Repeating patterns do not impress me. Your drawing tells otherwise

  E) allow position independent placement of the second stage bootloader ?
 
 See way above to my LILO response.

Neither LILO nor GRUB have search capabilities for randomly located
second stage loaders.

You need to implement a clever journalling block device
emulator in order to keep the data alive and the FLASH not weared out
within no time. You need the wear levelling, otherwise you can throw
away your FLASH in no time.
   
   And that's why it's in my picture.
  
  Yes, it is in your picture, but:
  
  1) it excludes FLASH aware file systems and UBI does not.
  2) your picture does still not explain how it does achive the above A),
  B), C), D) and E)
  
  Your extra path for partitioning(4) and JFFS2 is just a weird hack,
  which makes your proposal completely absurd.
 
 No, it's just there to show the flexibility of device mapper. But I have
 the sneaking suspicion you have no idea how device mapper works.

Sigh. Layering violation == flexibility.

 In brief: device mapper takes one or more devices, applies a mapping
 to them, and returns a new device. For example, take various spans of
 /dev/hda1 and /dev/sda3 and present them as new-device1. Take
 new-device1 and transform it with dm-crypt to get new-device2. The
 kernel doesn't decide how to do this, any more than it decides where
 to mount your filesystems. Userspace does.

I know how it works. But your blurb does not answer any of my questions.

 5. We don't reimplement higher pieces of the stack (dm-crypt,
snapshot, etc.).

Why should we reimplement that ?
   
   So that you can get encryption and snapshot, etc.?
  
  1. On top of a clever block device.
  
  2. UBI can do snapshots by design.
 
 Oh, so you HAVE reimplemented it.

No, it already works

  3. Encryption should be done on the VFS layer and not below the
  filesystem layer. Doing it inside the block layer or the device mapper
  is broken by design.
 
 That's highly debatable and not a topic for this thread.

I see, you define, what has to be discussed.

tglx


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 21:27 -0700, Greg KH wrote:
 On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
  Arjan van de Ven [EMAIL PROTECTED] writes:
   
   well we can do the handshake to take ownership like we do much later in
   boot, but that requires PCI to be there and fully discovered, which we
   don't have this early.
  
  That's not true - we do early pci discovery. Doing USB handsoff
  there would be quite possible.
 
 What, we don't do USB handoff early enough in the boot process?  It's
 happening at PCI quirk time now, which I think should be early enough
 for everyone (and too early for some who rely on USB keyboards and
 initramfs shells...)

It happens way after the CPUs are brought up. At this point both the
delay loop calibration and the local APIC calibration are already done.

tglx


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Sanitize filesystem NLS handling

2007-03-19 Thread Alexander E. Patrakov

OGAWA Hirofumi wrote:

Alexander E. Patrakov [EMAIL PROTECTED] writes:

But, anyway, this is a separate issue that my patch doesn't attempt to 
correct. The conclusion so far is that we disagree, and that there are 
situations where using utf8 iocharset is the least of all evils, so the 
warning is not justified enough. Reproducible testcase:


Again, I don't care about read at all. And why don't you use utf8
option, instead of iocharset=utf8. iocharset=utf8 is warned until
it is fixed. The utf8 also doesn't work correctly in some case though.


Would it be OK for you if I add the mount-time check for iocharset=utf8 to 
the fat filesystem and silently replace this with the utf8 option, instead 
of overly actively warning the users? This way, the sysfs option and the 
nls_base.iocharset module parameter will still work as I want.



I'm talking about two filesystems on a system here, not two encoding
on one filesystem.
I am also talking about this. Mounting two filesystems with different 
iocharsets is insane, because this will result in one of the following outcomes:


1) ls will show wrong characters in filenames on one of the filesystems
2) one of the two filesystems will contain wrong on-disk data for filenames, 
that, when misinterpreted by mounting with wrong iocharset, results in 
seemingly-correct output, but is misunderstood by the properly set up 
reference implementation (that's what is likely to happen with jfs in your 
example).


Because you didn't change the locale. And it is your policy, right?


Yes. This is because I have some files with non-ASCII names in my home 
directory. Changing the locale would make these filenames look wrong until I 
change it back.


--
Alexander E. Patrakov
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vt: fix a potential race in the VT_WAITACTIVE handler

2007-03-19 Thread Andrew Morton
On Thu, 15 Mar 2007 15:10:23 +0100 Michal Januszewski [EMAIL PROTECTED] wrote:

 On a multiprocessor machine the VT_WAITACTIVE ioctl call may return 0
 if fg_console has already been updated in redraw_screen(), but the 
 console switch itself hasn't been completed. Fix this by checking
 fg_console in vt_waitactive() with the console sem held.
 
 Signed-off-by: Michal Januszewski [EMAIL PROTECTED]
 
 ---
 diff --git a/drivers/char/vt_ioctl.c b/drivers/char/vt_ioctl.c
 index 3a5d301..00b5b34 100644
 --- a/drivers/char/vt_ioctl.c
 +++ b/drivers/char/vt_ioctl.c
 @@ -1041,8 +1041,12 @@ int vt_waitactive(int vt)
   for (;;) {
   set_current_state(TASK_INTERRUPTIBLE);
   retval = 0;
 - if (vt == fg_console)
 + acquire_console_sem();
 + if (vt == fg_console) {
 + release_console_sem();
   break;
 + }
 + release_console_sem();
   retval = -EINTR;
   if (signal_pending(current))
   break;
 

OK.  I think.  It's hard to tell.  I assume that the acquire_console_sem()
in here is to synchronise against some other function which also takes
acquire_console_sem(), but it is not clear which.

So could you please redo this with a comment which tells the reader exactly
what's being protected against what, and why?

Also, I always feel a bit worried by:

set_current_state(TASK_INTERRUPTIBLE);
down(...);

because if it hits contention, the down() will undo the
set_curremt_state().  Now that's normally OK because we loop, and because
the semaphore won't normally be 100% contended all the time.  Unless
someone reimplements down() so it happens to return in state TASK_RUNNING
all the time, which they could legitimately do (although this would
probably break stuff such as the above).


But still, it is nicer to do

down(...);
set_current_state(TASK_INTERRUPTIBLE);

if possible, and I think it is possible here.

Thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS internal error xfs_da_do_buf(2) at line 2087 of file fs/xfs/xfs_da_btree.c. Caller 0xc01b00bd

2007-03-19 Thread David Chinner
On Mon, Mar 19, 2007 at 11:32:27AM +0100, Marco Berizzi wrote:
 Marco Berizzi wrote:
  David Chinner wrote:
 
  Ok, so an ipsec change. And I see from the history below it
  really has nothing to do with this problem. it seems the problem
  has something to do with changes between 2.6.19.1 and 2.6.19.2.
 
  indeed. Yesterday at 13:00 I have switched from 2.6.19.1 to 2.6.19.2
  (without the ipsec fix) and at about 17:30 linux has crashed again.
  I have recompiled 2.6.19.2 with all kernel debugging options enabled
  and rebooted. Now I'm waiting for the crash...
 
 Linux has not been crashed. However here is dmesg output
 with all debugging option enabled: (search for 'INFO:
 possible recursive locking detected'). Is that normal?

.
 =
 [ INFO: possible recursive locking detected ]
 2.6.19.2 #1
 -
 rm/470 is trying to acquire lock:
  ((ip-i_lock)-mr_lock){}, at: [c01cd64a] xfs_ilock+0x5b/0xa1
 
 but task is already holding lock:
  ((ip-i_lock)-mr_lock){}, at: [c01cd64a] xfs_ilock+0x5b/0xa1
 
 other info that might help us debug this:
 3 locks held by rm/470:
  #0:  (inode-i_mutex/1){--..}, at: [c016e5a7] do_unlinkat+0x70/0x115
  #1:  (inode-i_mutex){--..}, at: [c030be35] mutex_lock+0x1c/0x1f
  #2:  ((ip-i_lock)-mr_lock){}, at: [c01cd64a]
 xfs_ilock+0x5b/0xa1
 
 stack backtrace:
  [c0103bc0] dump_trace+0x215/0x21a
  [c0103c68] show_trace_log_lvl+0x1a/0x30
  [c0103c90] show_trace+0x12/0x14
  [c0103d8d] dump_stack+0x19/0x1b
  [c01357e7] print_deadlock_bug+0xc0/0xcf
  [c0135860] check_deadlock+0x6a/0x79
  [c01372e1] __lock_acquire+0x350/0x970
  [c0137fd1] lock_acquire+0x75/0x97
  [c01331ab] down_write+0x3a/0x54
  [c01cd64a] xfs_ilock+0x5b/0xa1
  [c01eda0e] xfs_lock_dir_and_entry+0x105/0x11b
  [c01edcc5] xfs_remove+0x180/0x47f
  [c01f8a9e] xfs_vn_unlink+0x22/0x4f
  [c016e533] vfs_unlink+0x9e/0xa2
  [c016e5df] do_unlinkat+0xa8/0x115
  [c016e68b] sys_unlink+0x10/0x12
  [c0102cdb] syscall_call+0x7/0xb
  [b7efaa7d] 0xb7efaa7d
  ===

That's no problem - lockdep just doesn't know that we can nest i_lock
(we've got to get the annotations for this sorted out).

 Here is the relevant results:
 
 Phase 2 - found root inode chunk
 Phase 3 - ...
 agno = 0
 ...
 agno = 12
 LEAFN node level is 1 inode 1610612918 bno = 8388608

Hmmm - single bit error in the bno - that reminds of this:

http://oss.sgi.com/projects/xfs/faq.html#dir2

So I'd definitely make sure that is repaired

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


PNPACPI probes serial twice, messes up serial console

2007-03-19 Thread Keith Owens
Dell SC1425 x86_64 running in i386 mode (the problem also occurs in
x86_64 mode).  Kernel 2.6.21-rc4, gcc 4.1.0.  Config extract at end.

Booting with 'console=tty console=ttyS0,9600'.  The serial console on
ttyS0 (0x3f8, irq 4) is probed twice, once from serial8250_init() and
again from serial_pnp_probe().  The serial console output is correct
until the second probe (from PNP) gets to these lines in
serial8250_config_port()

if (flags  UART_CONFIG_TYPE)
autoconfig(up, probeflags);

After the call to autoconfig(), the serial console starts printing NUL
characters instead of the console output.  The number of NUL characters
corresponds closely with the number of characters written to the VT
console, IOW it outputs each serial character as NUL instead of the
correct character.  When the kernel boots /sbin/init, the console
resets to printing normal characters.

AFAICT, the second probe of the UART is doing something nasty to the
hardware.  This is not a recent problem, I can reproduce the problem on
2.6.16.  Booting with pnpacpi=off removes the problem, but that
supresses all the PNPACPI code, not just the second probe of the serial
devices.

Should pnpacpi probe and setup the serial devices even when thay have
already been setup?  Or this is something strange about the UART in
this particular box?

FWIW, the serial console is plugged into a serial to USB converter
(pl2303), my laptop has no serial ports.  That should not make a
difference, but just in case it does ...

Config extract:

X86_32=y
GENERIC_TIME=y
CLOCKSOURCE_WATCHDOG=y
GENERIC_CLOCKEVENTS=y
GENERIC_CLOCKEVENTS_BROADCAST=y
LOCKDEP_SUPPORT=y
STACKTRACE_SUPPORT=y
SEMAPHORE_SLEEPERS=y
X86=y
MMU=y
ZONE_DMA=y
GENERIC_ISA_DMA=y
GENERIC_IOMAP=y
GENERIC_BUG=y
GENERIC_HWEIGHT=y
ARCH_MAY_HAVE_PC_FDC=y
DMI=y
DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config
EXPERIMENTAL=y
LOCK_KERNEL=y
INIT_ENV_ARG_LIMIT=32
LOCALVERSION=-i386-kaos
LOCALVERSION_AUTO=y
SWAP=y
SYSVIPC=y
SYSVIPC_SYSCTL=y
POSIX_MQUEUE=y
IKCONFIG=y
IKCONFIG_PROC=y
SYSFS_DEPRECATED=y
CC_OPTIMIZE_FOR_SIZE=y
SYSCTL=y
EMBEDDED=y
SYSCTL_SYSCALL=y
KALLSYMS=y
KALLSYMS_ALL=y
HOTPLUG=y
PRINTK=y
BUG=y
ELF_CORE=y
BASE_FULL=y
FUTEX=y
EPOLL=y
SHMEM=y
SLAB=y
VM_EVENT_COUNTERS=y
RT_MUTEXES=y
BASE_SMALL=0
MODULES=y
MODULE_UNLOAD=y
KMOD=y
STOP_MACHINE=y
BLOCK=y
LBD=y
LSF=y
IOSCHED_NOOP=y
IOSCHED_AS=y
IOSCHED_DEADLINE=y
IOSCHED_CFQ=y
DEFAULT_DEADLINE=y
DEFAULT_IOSCHED=deadline
TICK_ONESHOT=y
HIGH_RES_TIMERS=y
SMP=y
X86_PC=y
MPENTIUM4=y
X86_CMPXCHG=y
X86_L1_CACHE_SHIFT=7
RWSEM_XCHGADD_ALGORITHM=y
GENERIC_CALIBRATE_DELAY=y
X86_WP_WORKS_OK=y
X86_INVLPG=y
X86_BSWAP=y
X86_POPAD_OK=y
X86_CMPXCHG64=y
X86_GOOD_APIC=y
X86_INTEL_USERCOPY=y
X86_USE_PPRO_CHECKSUM=y
X86_TSC=y
HPET_TIMER=y
HPET_EMULATE_RTC=y
NR_CPUS=8
SCHED_SMT=y
PREEMPT_NONE=y
X86_LOCAL_APIC=y
X86_IO_APIC=y
X86_MCE=y
X86_MCE_NONFATAL=y
X86_MCE_P4THERMAL=y
MICROCODE=m
MICROCODE_OLD_INTERFACE=y
X86_MSR=m
X86_CPUID=m
HIGHMEM4G=y
VMSPLIT_3G=y
PAGE_OFFSET=0xC000
HIGHMEM=y
ARCH_FLATMEM_ENABLE=y
ARCH_SPARSEMEM_ENABLE=y
ARCH_SELECT_MEMORY_MODEL=y
ARCH_POPULATES_NODE_MAP=y
SELECT_MEMORY_MODEL=y
FLATMEM_MANUAL=y
FLATMEM=y
FLAT_NODE_MEM_MAP=y
SPARSEMEM_STATIC=y
SPLIT_PTLOCK_CPUS=4
ZONE_DMA_FLAG=1
MTRR=y
IRQBALANCE=y
HZ_250=y
HZ=250
PHYSICAL_START=0x10
PHYSICAL_ALIGN=0x10
COMPAT_VDSO=y
ARCH_ENABLE_MEMORY_HOTPLUG=y
PM=y
ACPI=y
ACPI_PROCFS=y
ACPI_BUTTON=m
ACPI_FAN=m
ACPI_PROCESSOR=m
ACPI_BLACKLIST_YEAR=0
ACPI_EC=y
ACPI_POWER=y
ACPI_SYSTEM=y
PCI=y
PCI_GOANY=y
PCI_BIOS=y
PCI_DIRECT=y
PCI_MMCONFIG=y
PCIEPORTBUS=y
PCIEAER=y
PCI_MSI=y
HT_IRQ=y
ISA_DMA_API=y
BINFMT_ELF=y
BINFMT_MISC=m
NET=y
PACKET=y
PACKET_MMAP=y
UNIX=y
XFRM=y
INET=y
IP_MULTICAST=y
IP_ADVANCED_ROUTER=y
ASK_IP_FIB_HASH=y
IP_FIB_HASH=y
IP_ROUTE_MULTIPATH=y
IP_ROUTE_VERBOSE=y
SYN_COOKIES=y
INET_XFRM_MODE_BEET=y
INET_DIAG=y
INET_TCP_DIAG=y
TCP_CONG_CUBIC=y
DEFAULT_TCP_CONG=cubic
NETFILTER=y
NETFILTER_NETLINK=m
NETFILTER_NETLINK_LOG=m
NETFILTER_XTABLES=y
NETFILTER_XT_TARGET_CLASSIFY=m
NETFILTER_XT_TARGET_MARK=m
NETFILTER_XT_MATCH_COMMENT=m
NETFILTER_XT_MATCH_DCCP=m
NETFILTER_XT_MATCH_ESP=m
NETFILTER_XT_MATCH_LENGTH=m
NETFILTER_XT_MATCH_LIMIT=m
NETFILTER_XT_MATCH_MAC=m
NETFILTER_XT_MATCH_MARK=m
NETFILTER_XT_MATCH_MULTIPORT=m
NETFILTER_XT_MATCH_PKTTYPE=m
NETFILTER_XT_MATCH_QUOTA=m
NETFILTER_XT_MATCH_REALM=m
NETFILTER_XT_MATCH_SCTP=m
NETFILTER_XT_MATCH_STATISTIC=m
NETFILTER_XT_MATCH_TCPMSS=m
IP_NF_IPTABLES=y
IP_NF_MATCH_IPRANGE=m
IP_NF_MATCH_TOS=m
IP_NF_MATCH_RECENT=m
IP_NF_MATCH_ECN=m
IP_NF_MATCH_AH=m
IP_NF_MATCH_TTL=m
IP_NF_MATCH_OWNER=m
IP_NF_MATCH_ADDRTYPE=m
IP_NF_FILTER=y
IP_NF_TARGET_REJECT=y
IP_NF_TARGET_ULOG=y
VLAN_8021Q=y
NET_CLS_ROUTE=y
STANDALONE=y
PREVENT_FIRMWARE_BUILD=y
FW_LOADER=m
CONNECTOR=m
PNP=y
PNP_DEBUG=y
PNPACPI=y
BLK_DEV_FD=m
BLK_DEV_LOOP=m
IDE=m
IDE_MAX_HWIFS=4
BLK_DEV_IDE=m
BLK_DEV_IDEDISK=m
IDEDISK_MULTI_MODE=y
BLK_DEV_IDECD=m
IDE_TASK_IOCTL=y
BLK_DEV_IDEPCI=y
IDEPCI_SHARE_IRQ=y
BLK_DEV_IDEDMA_PCI=y
IDEDMA_PCI_AUTO=y
BLK_DEV_PIIX=m
BLK_DEV_IDEDMA=y

Re: [PATCH] powerpc minor pagefault optimization with kprobes enabled

2007-03-19 Thread Anton Blanchard

 I've attached a patch below the optimizes this code path for powerpc,
 but the scheme applies to all architectures aswell.  It just rips out all
 the callachin madness, and does as good as it gets in the pagefault
 handler:

NAK, patch on the way to get rid of all the debugger() crap by using
this very hook.

Anton
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-03-19 Thread Nick Piggin

Christoph Hellwig wrote:

On Mon, Mar 19, 2007 at 09:11:31PM +1100, Nick Piggin wrote:


I've got the patches in -mm now. I hope they will get merged when the
the next window opens.

I didn't submit the -page_mkwrite conversion yet, because I didn't
have any callers to look at. It is is slightly less trivial than for
nopage and nopfn, so having David's block_page_mkwrite is helpful.



Yes.  I was just wondering whether it makes more sense to do this
functionality directly ontop of -fault instead of converting i over
real soon.


I would personally prefer that, but I don't want to block David's
patch from being merged if the -fault patches do not get in next
cycle. If the fault patches do make it in first, then yes we should
do the page_mkwrite conversion before merging David's patch.

I'll keep an eye on it, and try to do the right thing.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Rusty Russell
On Mon, 2007-03-19 at 18:00 -0800, Zachary Amsden wrote:
 Rusty Russell wrote:
  *This* was the reason that the current hand-coded calls only clobber %
  eax.  It was a compromise between native (no clobbers) and others (might
  need a reg).
 
 I still don't think this was a good trade.
...
 Xen no longer needs such a register

Hmm, well, if VMI is happy, Xen is happy, and lguest is happy, then
perhaps we're better off with a cc-only clobber rule?  Certainly makes
life simpler.

  Now, since we decided to allow paravirt_ops operations to be normal C
  (ie. the patching is optional and done late), we actually push and pop %
  ecx and %edx.  This makes the call site 10 bytes long, which is a nice
  size for patching anyway (enough for a movl $0, addr, a-la lguest's
  cli, or movw $0, %gs:addr if we supported SMP).
 
 You can do it in 11 bytes with no clobbers and normal C semantics by 
 linking to a direct address instead of calling to an indirect, but then 
 you need some gross fixup technology in paravirt_patch:
 
 if (call_addr == (void*)native_sti) {
   ...
 }

Well, I don't think we need such hacks: since we have to use handcoded
asm and mark the callsites anyway, marking what they're calling is
trivial.

The other idea from btfixup is that we can do the patching *much*
earlier, so we don't need the initial code to be valid at all if we
wanted to: we just need room to patch in a call insn.  We could then
generate trampolines which do the necessary pushes  pops automatically
for backends which want to use C calling conventions.

Perhaps it's time for code and benchmarks?

Rusty.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
 For VMI, the default clobber was cc, and you need a way to allow at
 least that, because saving and restoring flags is too expensive on x86.

According to lore (Andi, I think), asm() always clobbers cc. 

 I still don't think this was a good trade.  The primary motivation for
 clobbering %eax was that Xen wanted a free register to use for
 computing the offset into the shared data in the case of SMP
 preemptible kernels.  Xen no longer needs such a register, they can
 use the PDA offset instead.  And it does hurt native performance by
 unconditionally stealing a register in the four most commonly invoked
 paravirt-ops code sequences.

Actually, it still does need a temp register.  The sequence for cli is:

mov %fs:xen_vcpu, %eax
movb $1,1(%eax)

At some point I hope to move the vcpu structure directly into the
pda/percpu variables, at which point it will need no temps.

J
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Andi Kleen
On Monday 19 March 2007 00:46, Jeremy Fitzhardinge wrote:
 Andi Kleen wrote:
  Yes. All inline assembly tells gcc what registers are clobbered
  and it fills in the tables. Hand clobbering in inline assembly cannot
  be expressed with the current toolchain, so we moved all those
  out of line.
 
  But again I'm not sure it will work anyways. For once you would
  need large padding around the calls anyways for inline replacement --
  how would you generate that? I expect you would need to put the calls
  into asm() again and with that a custom annotiation format looks
  reasonable.

 Inlining is most important for very small code: sti, cli, pushf;pop eax,
 etc (in many cases, no-ops).  We'd have at least 5 bytes to work in, and
 maybe more if there are surrounding push/pops to be consumed.

 For example, say we wanted to put a general call for sti into entry.S,
 where its expected it won't touch any registers.  In that case, we'd
 have a sequence like:

 push %eax
 push %ecx
 push %edx
 call paravirt_cli
 pop %edx
 pop %ecx
 pop %eax

This cannot right now be expressed as inline assembly in the unwinder at all 
because there is no way to inject the push/pops into the compiler generated
ehframe tables.

[BTW I plan to resubmit the unwinder with some changes]



 If we parse the relocs, then we'd find the reference to paravirt_cli.
 If we look at the byte before and see 0xe8, then we can see if its a
 call.  If we then work out in each direction and see matched push/pops,
 then we know what registers can be trashed in the call.  This also
 allows us to determine the callsite size, and therefore how much space
 we need for inlining.

gcc normally doesn't generate push/pops around directly around the
call site, but somewhere else due to the way its register allocator works.
It can be anywhere in the function or even not there at all if the register
didn't contain anything useful. And they're not necessarily push/pops of 
course.

So you would need to write it as inline assembly. I'm not sure it would
be significantly cleaner than just having tables then.

 So in this case, we see that there are 5 bytes for the call and a
 further 6 bytes of push/pops available for inlining.

 Of course this is hand-written code anyway, so there's no particular
 burden to having some extra metadata stashed away in another section.
 For compiler-generated code, we know that it's already expecting
 standard C ABI calling conventions.  The downside, of course, is that
 only the 5 byte call space is available for inline patching.

It's unlikely you can do much useful in 5 bytes I guess.

Regarding cli/sti: i've been actually thinking about changing it in the
non paravirt kernel. IIRC most save_flags/restore_flags are inside
spin_lock_irqsave/restore() and that is a separate function anyways
so a little larger special case code is ok as long as it is not slower. 
There is some evidence that at least on P4 a software cli/sti flag without 
pushf/popf would be faster.

-Andi

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Jeremy Fitzhardinge
Andi Kleen wrote:

 push %eax
 push %ecx
 push %edx
 call paravirt_cli
 pop %edx
 pop %ecx
 pop %eax
 

 This cannot right now be expressed as inline assembly in the unwinder at all 
 because there is no way to inject the push/pops into the compiler generated
 ehframe tables.
   

I was thinking more of entry.S rather than inline asms.  The only ones I
can think of which would be relevant are the interrupt operations in
spinlocks, and even then we get a bit more leeway.

 gcc normally doesn't generate push/pops around directly around the
 call site, but somewhere else due to the way its register allocator works.
 It can be anywhere in the function or even not there at all if the register
 didn't contain anything useful. And they're not necessarily push/pops of 
 course.
   

Right, I'm not making any assumptions about what gcc generates for
calls, other than assuming it uses a standard call X direct call
instruction.

 So you would need to write it as inline assembly. I'm not sure it would
 be significantly cleaner than just having tables then.
   

No, my intention is that the vast majority of pv_ops uses, which are
calls from C code, would simply be normal C calls, syntatically and
semantically.

 It's unlikely you can do much useful in 5 bytes I guess.
   

I think the main value is retaining non-PARAVIRT native performance
levels.  In the native case, the inlined instructions amount to 1 or 2
bytes, and having small patch sites is an advantage because there's less
space wasted on nops.  My understanding is that a direct call has almost
zero overhead on modern processors, because both the call and the return
can be completely predicted and prefetched, so the threshhold at which
you make inline vs call tradeoff is pretty small.

 Regarding cli/sti: i've been actually thinking about changing it in the
 non paravirt kernel. IIRC most save_flags/restore_flags are inside
 spin_lock_irqsave/restore() and that is a separate function anyways
 so a little larger special case code is ok as long as it is not slower. 
 There is some evidence that at least on P4 a software cli/sti flag without 
 pushf/popf would be faster.
   

There are also the ones in entry.S.  I suppose spinlocks do get used
more than syscalls and interrupts.


J
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Mon, 19 Mar 2007 11:57:28 +0100

 On Monday 19 March 2007 00:46, Jeremy Fitzhardinge wrote:
  Andi Kleen wrote:
  For example, say we wanted to put a general call for sti into entry.S,
  where its expected it won't touch any registers.  In that case, we'd
  have a sequence like:
 
  push %eax
  push %ecx
  push %edx
  call paravirt_cli
  pop %edx
  pop %ecx
  pop %eax
 
 This cannot right now be expressed as inline assembly in the unwinder at all 
 because there is no way to inject the push/pops into the compiler generated
 ehframe tables.
 
 [BTW I plan to resubmit the unwinder with some changes]

It's inability to handle sequences like the above sounds to me like
a very good argument to _not_ merge the unwinder back into the tree.

To me, that unwinder is nothing but trouble, it severly limits what
cases you can use special calling conventions via inline asm (and we
have done that on several occaisions) and even ignoring that the
unwinder only works half the time.

Please don't subject us to another couple months of hair-pulling only
to have Linus yank the thing out again, there are certainly more
useful things to spend time on :-)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Andi Kleen

 It's inability to handle sequences like the above sounds to me like
 a very good argument to _not_ merge the unwinder back into the tree.

The unwinder can handle it fine, it is just that gcc right now cannot
be taught to generate correct unwind tables for it. If paravirt ops
is widely used i guess it would be possible to find some workaround
for that though.
 
There is one case it truly cannot handle (it would be possible by switching 
to dwarf3), but that was relatively easily eliminated and wasn't a problem
imho.

 To me, that unwinder is nothing but trouble, it severly limits what
 cases you can use special calling conventions via inline asm (and we
 have done that on several occaisions) and even ignoring that the
 unwinder only works half the time.

Initially we had some bugs that accounted for near all failures, but they 
were all fixed in the latest version.
 
 Please don't subject us to another couple months of hair-pulling only
 to have Linus yank the thing out again, there are certainly more
 useful things to spend time on :-)

I think better backtraces are a good use of my time. Sorry if you
disagree on that.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread David Miller
From: Linus Torvalds [EMAIL PROTECTED]
Date: Mon, 19 Mar 2007 20:18:14 -0700 (PDT)

   Please don't subject us to another couple months of hair-pulling only
   to have Linus yank the thing out again, there are certainly more
   useful things to spend time on :-)
 
 Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
 out, I simply won't merge it. It was more than just totally buggy code, it 
 was an inability of the people to understand that even bugfree code 
 isn't enough - you have to be able to also handle buggy data.

Thank you.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Eric W. Biederman
David Miller [EMAIL PROTECTED] writes:

 From: Linus Torvalds [EMAIL PROTECTED]
 Date: Mon, 19 Mar 2007 20:18:14 -0700 (PDT)

   Please don't subject us to another couple months of hair-pulling only
   to have Linus yank the thing out again, there are certainly more
   useful things to spend time on :-)
 
 Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
 out, I simply won't merge it. It was more than just totally buggy code, it 
 was an inability of the people to understand that even bugfree code 
 isn't enough - you have to be able to also handle buggy data.

 Thank you.

Hmm..

I know the feeling I have had a similar rant about the kexec on panic
code path.   The code is still no where near as paranoid about normal
kernel things not working as it could be, but by ranting about it
periodically the people doing the work are gradually making it better.

I'm conflicted about the dwarf unwinder.  I was off doing other things
at the time so I missed the pain, but I do have a distinct recollection of
the back traces on x86_64 being distinctly worse the on i386.  Lately
I haven't seen that so it may be I was misinterpreting what I was
seeing, and the compiler optimizations were what gave me such weird
back traces.  

But if the quality of our backtraces has gone down and dwarf unwinder
could give us better back traces it is likely worth pursuing.  Of
course it would need to start with the assumption that it's tables
may be borked (the kernel is busted after all) and be much more
careful than Andi's last attempt.

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


new warning in ata_sg_clean

2007-03-19 Thread Meelis Roos
In todays 2.6.21-rc4+git the following news warning has appeared on my 
ppc computer:

  CC [M]  drivers/ata/libata-core.o
drivers/ata/libata-core.c: In function 'ata_sg_clean':
drivers/ata/libata-core.c:3558: warning: unused variable 'dir'

-- 
Meelis Roos ([EMAIL PROTECTED])
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2.6.21-rc4: known regressions with patches available

2007-03-19 Thread Adrian Bunk
Let's not bore people running -rc kernels with regressions that have 
patches available - let's get this patches into the tree for giving 
them the pure exciting experience of the many unfixed regressions.


This email lists some known regressions in Linus' tree compared to 2.6.20
with patches available.

If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affectected subsystem or driver, a patch
of you caused a breakage or I'm considering you in any other way
possibly involved with one or more of these issues.

Due to the huge amount of recipients, please trim the Cc when answering.


Subject: acpi_serialize locks system during boot
References : http://bugzilla.kernel.org/show_bug.cgi?id=8171
Submitter  : Colchao [EMAIL PROTECTED]
Handled-By : Len Brown [EMAIL PROTECTED]
Patch  : http://bugzilla.kernel.org/show_bug.cgi?id=8171
Status : patch available


Subject: laptop immediately resumes after suspend
References : http://lkml.org/lkml/2007/3/8/469
Submitter  : Ray Lee [EMAIL PROTECTED]
Caused-By  : Alexey Starikovskiy [EMAIL PROTECTED]
 commit ed41dab90eb40ac4911e60406bc653661f0e4ce1
Handled-By : Len Brown [EMAIL PROTECTED]
Patch  : http://lkml.org/lkml/2007/3/12/228
Status : patch available


Subject: suspend/resume hangs (until keypress)  (CONFIG_NO_HZ)
References : http://bugzilla.kernel.org/show_bug.cgi?id=8181
 http://lkml.org/lkml/2007/3/11/112
Submitter  : Tomas Janousek [EMAIL PROTECTED]
 Thomas Meyer [EMAIL PROTECTED]
 Milan Broz [EMAIL PROTECTED]
Handled-By : Thomas Gleixner [EMAIL PROTECTED]
Patch  : http://lkml.org/lkml/2007/3/16/406
Status : patch available


Subject: snd-intel8x0: no 3d surround sound
References : http://lkml.org/lkml/2007/3/5/164
Submitter  : Michal Piotrowski [EMAIL PROTECTED]
Caused-By  : Randy Cushman [EMAIL PROTECTED]
 commit 831466f4ad2b5fe23dff77edbe6a7c244435e973
Handled-By : Takashi Iwai [EMAIL PROTECTED]
Status : patch available


Subject: USB: Oops when connecting USB 1.1 docks
References : http://lkml.org/lkml/2007/3/4/266
Submitter  : Mark Lord [EMAIL PROTECTED]
Caused-By  : Jim Radford [EMAIL PROTECTED]
 commit d9a7ecacac5f8274d2afce09aadcf37bdb42b93a
Handled-By : Oliver Neukum [EMAIL PROTECTED]
 Jim Radford [EMAIL PROTECTED]
Patch  : http://lkml.org/lkml/2007/3/13/217
Status : patch available

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [linux-pm] [2/6] 2.6.21-rc2: known regressions

2007-03-19 Thread Bill Davidsen

Jim Gettys wrote:

On Sun, 2007-03-18 at 17:07 +0100, Ingo Molnar wrote:

* Pavel Machek [EMAIL PROTECTED] wrote:

Some day we may have modesetting support in the kernel for some 
graphics hw, right now it's pretty damn spotty.

Yep, that's the way to go.

hey, i wildly supported this approach ever since 1996, when GGI came up
:-/



So wildly you wrote tons of code ;-).

More seriously, at the time, XFree86 would have spat in your face for
any such thing.  Thankfully, times are changing.

Also more seriously, a somewhat hybrid approach is in order for mode
setting: simple mode setting isn't much code and is required for sane
behavior on crash (it is nice to get oopses onto a screen); but the full
blown mode setting/configuration problem is so large that on some
hardware, it is likely left best left to a helper process (not the X
server).

Also key to get sane behavior out of the scheduler is to get the X
server to yield (sleep in the kernel) rather than busy waiting when the
GPU is busy; a standardized interface for this for both fbdev and dri is
in order.  Right now, X is a misbehaving compute bound process rather
than the properly interactive process it can/should/will be, releasing
the CPU whenever the hardware is busy. Needless to say, this wastes
cycles and hurts interactivity with just about any scheduler you can
devise. It isn't as if this is hard; on UNIX systems we did it in 1984
or thereabouts.


What you say sounds good, assuming that the cost of a sleep is less than 
the cost of the busy wait. But this may be hardware, the waits may be 
very small and frequent, and if it's hitting a small hardware window 
like retrace, delays in response will cause the time period to be missed 
completely. This probably less critical with very smart cards, many of 
us don't run them.


Of course, in 1996, XFree86 would have ignored any such interfaces, in
its insane quest for operating system independent user space drivers
requiring no standard kernel interfaces (it is the second part of
this where the true insanity lay).
  - Jim




--
Bill Davidsen [EMAIL PROTECTED]
  We have more to fear from the bungling of the incompetent than from
the machinations of the wicked.  - from Slashdot

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [linux-pm] [2/6] 2.6.21-rc2: known regressions

2007-03-19 Thread Jim Gettys
On Mon, 2007-03-19 at 16:33 -0400, Bill Davidsen wrote:

 
 What you say sounds good, assuming that the cost of a sleep is less than 
 the cost of the busy wait. But this may be hardware, the waits may be 
 very small and frequent, and if it's hitting a small hardware window 
 like retrace, delays in response will cause the time period to be missed 
 completely. This probably less critical with very smart cards, many of 
 us don't run them.
  

Actually, various strategies involving short busy waiting, or looking at
DMA address registers before sleeping were commonplace.  But a
syscall/sleep/wakeup is/was pretty fast.  If you have an operation
blitting the screen (e.g. scrolling), it takes a bit of time for the GPU
to execute the command.  I see this right now on OLPC, where a wonderful
music application needs to scroll (most of) the screen left),
periodically, and we're losing samples sometimes at those operation.

Remember also, that being nice to everyone else by sleeping, there are
more cycles to go around, and the scheduler can nicely boost the X
server's priority as it will for interactive processes that are being
cooperative.
- Jim

-- 
Jim Gettys
One Laptop Per Child


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

 Well I don't think the loopback device is currently but as soon
 as we get network namespace support we will have multiple loopback
 devices and they will get unregistered when we remove the network
 namespace.

There is no logical difference. At the moment when namespace is gone
there is nobody who can hold unrevokable references to this loopback.

Alexey
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

 Does this look sane (untested)?

It does not, unfortunately.

Instead of regular crash in infiniband you will get numerous
random NULL pointer dereferences both due to dst-neighbour
and due to dst-dev.

Alexey
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Michael S. Tsirkin
 Quoting Alexey Kuznetsov [EMAIL PROTECTED]:
 Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
 
  Does this look sane (untested)?
 
 It does not, unfortunately.
 
 Instead of regular crash in infiniband you will get numerous
 random NULL pointer dereferences both due to dst-neighbour
 and due to dst-dev.

Right, I saw this clearly in the morning. Thanks.

-- 
MST
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

 I think the thing to do is to just leave the loopback references
 in place, try to unregister the per-namespace loopback device,
 and that will safely wait for all the references to go away.

Yes, it is exactly how it works in openvz. All the sockets are killed,
queues are cleared, nobody holds references and virtual loopback
can be unregistered just like another device.

Alexey
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Michael S. Tsirkin
  Any simpler ideas?
 
 Well, if inifiniband destructor really needs to take that lock... no.
 Right now I do not see.

OK, this is actually not hard to fix - for infiniband, we can just look at
neighbour-dev-type or compare neighbour-dev and
neighbour-parms-dev - if they are different, device is being unregistered,
so we do not need to do anything in the destructor.

I'll send a patch to openfabrics, shortly.

However, after implementing this fix, I hit what could be use after
free at module unloading. Dave, Alexey, Roland, could you take a look at
the following please?

Works fine for me (survived a couple of hours of crazy device
loading/unloading/up/down/hotplug + link data and state activity)
and seems to fix the issue.

-

If a device driver sets neigh_destructor in neigh_params, this could
get called after the device has been unregistered and the driver module
removed.

This is an old bug, but apparently, started to get triggered more infiniband
after recent multicast and connected mode changes.

Fix this by delaying dev_put until the neigh_params object is removed.

Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3183142..cb34f1a 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1313,8 +1313,6 @@ void neigh_parms_release(struct neigh_table *tbl, struct 
neigh_parms *parms)
*p = parms-next;
parms-dead = 1;
write_unlock_bh(tbl-lock);
-   if (parms-dev)
-   dev_put(parms-dev);
call_rcu(parms-rcu_head, neigh_rcu_free_parms);
return;
}
@@ -1325,6 +1323,8 @@ void neigh_parms_release(struct neigh_table *tbl, struct 
neigh_parms *parms)
 
 void neigh_parms_destroy(struct neigh_parms *parms)
 {
+   if (parms-dev)
+   dev_put(parms-dev);
kfree(parms);
 }
 

-- 
MST
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Michael S. Tsirkin
 Quoting Michael S. Tsirkin [EMAIL PROTECTED]:
 Subject: Re: dst_ifdown breaks infiniband?
 
   Any simpler ideas?
  
  Well, if inifiniband destructor really needs to take that lock... no.
  Right now I do not see.
 
 OK, this is actually not hard to fix - for infiniband, we can just look at
 neighbour-dev-type or compare neighbour-dev and
 neighbour-parms-dev - if they are different, device is being unregistered,
 so we do not need to do anything in the destructor.
 
 I'll send a patch to openfabrics, shortly.
 
 However, after implementing this fix, I hit what could be use after
 free at module unloading. Dave, Alexey, Roland, could you take a look at
 the following please?
 
 Works fine for me (survived a couple of hours of crazy device
 loading/unloading/up/down/hotplug + link data and state activity)
 and seems to fix the issue.
 
 -
 
 If a device driver sets neigh_destructor in neigh_params, this could
 get called after the device has been unregistered and the driver module
 removed.
 
 This is an old bug, but apparently, started to get triggered more infiniband
 after recent multicast and connected mode changes.
 
 Fix this by delaying dev_put until the neigh_params object is removed.
 
 Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]

The problem seems real enough but the fix seems no good -
device unregister gets blocked with
unregister_netdevice: waiting for ib0 to become free. Usage count = 1

It seems the parms object can survive indefinitely after device is removed.

How about creating a new parms object in dst_ifdown, and pointing neighbour
to this?

Would that work?
The advantage of this approach is that neigh-parms is already protected
by RCU.

-- 
MST
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

 If a device driver sets neigh_destructor in neigh_params, this could
 get called after the device has been unregistered and the driver module
 removed.

It is the same problem: if dst-neighbour holds neighbour, it should
not hold device. parms-dev is not supposed to be used after
neigh_parms_release(). F.e. set parms-dev to NULL to catch bad references.


Do you search for a way to find real inifiniband device in
ipoib_neigh_destructor()? I guess you will not be able.
The problem is logical: if destructor needs device, neighbour entry
_somehow_ have to hold reference to the device (via neigh-dev, neigh-parms,
whatever). Hence, if we hold neighbour entry, unregister cannot be completed.
Therefore, destructor cannot refer to device. Q.E.D. :-)

Seems, releasing dst-neighbour is inevitable.

Alexey
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Michael S. Tsirkin
 Quoting Alexey Kuznetsov [EMAIL PROTECTED]:
 Subject: Re: dst_ifdown breaks infiniband?
 
 Hello!
 
  If a device driver sets neigh_destructor in neigh_params, this could
  get called after the device has been unregistered and the driver module
  removed.
 
 It is the same problem: if dst-neighbour holds neighbour, it should
 not hold device. parms-dev is not supposed to be used after
 neigh_parms_release(). F.e. set parms-dev to NULL to catch bad references.

Yes. I fixed that - simply checking that neighbour-dev is a
loopback device is sufficient to detect the fact that
the device is being unregistered.

 Do you search for a way to find real inifiniband device in
 ipoib_neigh_destructor()?

No, not anymore.

 I guess you will not be able.

I agree it's not possible.

 The problem is logical: if destructor needs device, neighbour entry
 _somehow_ have to hold reference to the device (via neigh-dev, neigh-parms,
 whatever). Hence, if we hold neighbour entry, unregister cannot be completed.
 Therefore, destructor cannot refer to device. Q.E.D. :-)
 
 Seems, releasing dst-neighbour is inevitable.

infiniband sets parm-neigh_destructor, and I search for a way to prevent
this destructor from being called after the module has been unloaded.
Ideas?

-- 
MST
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Michael S. Tsirkin
 Quoting Alexey Kuznetsov [EMAIL PROTECTED]:
 Subject: Re: dst_ifdown breaks infiniband?
 
 Hello!
 
  If a device driver sets neigh_destructor in neigh_params, this could
  get called after the device has been unregistered and the driver module
  removed.
 
 It is the same problem: if dst-neighbour holds neighbour, it should
 not hold device. parms-dev is not supposed to be used after
 neigh_parms_release(). F.e. set parms-dev to NULL to catch bad references.

Yes. I fixed that - simply checking that neighbour-dev is a
loopback device is sufficient to detect the fact that
the device is being unregistered.

 Do you search for a way to find real inifiniband device in
 ipoib_neigh_destructor()?

No, not anymore.

 I guess you will not be able.

I agree it's not possible.

 The problem is logical: if destructor needs device, neighbour entry
 _somehow_ have to hold reference to the device (via neigh-dev, neigh-parms,
 whatever). Hence, if we hold neighbour entry, unregister cannot be completed.
 Therefore, destructor cannot refer to device. Q.E.D. :-)
 
 Seems, releasing dst-neighbour is inevitable.

infiniband sets parm-neigh_destructor, and I search for a way to prevent
this destructor from being called after the module has been unloaded.
Ideas?

-- 
MST
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

 infiniband sets parm-neigh_destructor, and I search for a way to prevent
 this destructor from being called after the module has been unloaded.
 Ideas?

It must be called in any case to update/release internal ipoib structures.

The idea is to move call of parm-neigh_destructor from neighbour destructor
to the moment when it is unhashed, right after n-dead is set.

infiniband is the only user (atm clip uses it too, but that use is obviously
dummy), so that nobody will be harmed.

But ipoib will have to check for validity of skb-dst-neighbour before
attempt to reinitialize private data on dead (n-dead != 0) neighbour.

Alexey
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Eric W. Biederman
David Miller [EMAIL PROTECTED] writes:


 I think the thing to do is to just leave the loopback references
 in place, try to unregister the per-namespace loopback device,
 and that will safely wait for all the references to go away.

Right.  The only thing I have found that needs to be changed so
far in this area is specifying which loopback device I want
to replace it with.

 If you do it that way, you should need absolutely no changes to
 the other code in this area.

 As per Herbert, I think he works on Xen rather than vserver :-)
 Perhaps you're thinking of Alexey Kuznetsov or another one of the
 vserver guys.

I think you are thinking of a different Herbert.  I was thinking of
Herbert Poetzl the vserver maintainer.  Alexey works on OpenVZ.

Until we get the basic architecture merged they are rival projects.

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Michael S. Tsirkin
 Quoting Alexey Kuznetsov [EMAIL PROTECTED]:
 Subject: Re: dst_ifdown breaks infiniband?
 
 Hello!
 
  infiniband sets parm-neigh_destructor, and I search for a way to prevent
  this destructor from being called after the module has been unloaded.
  Ideas?
 
 It must be called in any case to update/release internal ipoib structures.

I don't think there's a problem.
All we do in destructor is release the ipoib_neigh resource.
And on device unregister we release all resources anyway.

 The idea is to move call of parm-neigh_destructor from neighbour destructor
 to the moment when it is unhashed, right after n-dead is set.

 infiniband is the only user (atm clip uses it too, but that use is obviously
 dummy), so that nobody will be harmed.

This might work. Could you post a patch to better show what you mean to do?

 But ipoib will have to check for validity of skb-dst-neighbour before
 attempt to reinitialize private data on dead (n-dead != 0) neighbour.

We set a flag before unregister_netdev and test it in start_xmit so
that's covered I think.

-- 
MST
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

 This might work. Could you post a patch to better show what you mean to do?

Here it is.

-neigh_destructor() is killed (not used), replaced with -neigh_cleanup(),
which is called when neighbor entry goes to dead state. At this point
everything is still valid: neigh-dev, neigh-parms etc.

The device should guarantee that dead neighbor entries (neigh-dead != 0)
do not get private part initialized, otherwise nobody will cleanup it.

I think this is enough for ipoib which is the only user of this thing.
Initialization private part of neighbor entries happens in ipib
start_xmit routine, which is not reached when device is down.
But it would be better to add explicit test for neigh-dead
in any case.


diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index f9dbc6f..2b5c297 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct 
queue_work(ipoib_workqueue, priv-restart_task);
 }
 
-static void ipoib_neigh_destructor(struct neighbour *n)
+static void ipoib_neigh_cleanup(struct neighbour *n)
 {
struct ipoib_neigh *neigh;
struct ipoib_dev_priv *priv = netdev_priv(n-dev);
@@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struc
struct ipoib_ah *ah = NULL;
 
ipoib_dbg(priv,
- neigh_destructor for %06x  IPOIB_GID_FMT \n,
+ neigh_cleanup for %06x  IPOIB_GID_FMT \n,
  IPOIB_QPN(n-ha),
  IPOIB_GID_RAW_ARG(n-ha + 4));
 
@@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device 
 
 static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms 
*parms)
 {
-   parms-neigh_destructor = ipoib_neigh_destructor;
+   parms-neigh_cleanup = ipoib_neigh_cleanup;
 
return 0;
 }
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 3725b93..ad7fe11 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -36,7 +36,7 @@ struct neigh_parms
struct net_device *dev;
struct neigh_parms *next;
int (*neigh_setup)(struct neighbour *);
-   void(*neigh_destructor)(struct neighbour *);
+   void(*neigh_cleanup)(struct neighbour *);
struct neigh_table *tbl;
 
void*sysctl_table;
diff --git a/net/atm/clip.c b/net/atm/clip.c
index ebb5d0c..8c38258 100644
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -261,14 +261,6 @@ static void clip_pop(struct atm_vcc *vcc
spin_unlock_irqrestore(PRIV(dev)-xoff_lock, flags);
 }
 
-static void clip_neigh_destroy(struct neighbour *neigh)
-{
-   DPRINTK(clip_neigh_destroy (neigh %p)\n, neigh);
-   if (NEIGH2ENTRY(neigh)-vccs)
-   printk(KERN_CRIT clip_neigh_destroy: vccs != NULL !!!\n);
-   NEIGH2ENTRY(neigh)-vccs = (void *) NEIGHBOR_DEAD;
-}
-
 static void clip_neigh_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
DPRINTK(clip_neigh_solicit (neigh %p, skb %p)\n, neigh, skb);
@@ -342,7 +334,6 @@ static struct neigh_table clip_tbl = {
/* parameters are copied from ARP ... */
.parms = {
.tbl= clip_tbl,
-   .neigh_destructor   = clip_neigh_destroy,
.base_reachable_time= 30 * HZ,
.retrans_time   = 1 * HZ,
.gc_staletime   = 60 * HZ,
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3183142..cfc6001 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -140,6 +140,8 @@ static int neigh_forced_gc(struct neigh_
n-dead = 1;
shrunk  = 1;
write_unlock(n-lock);
+   if (n-parms-neigh_cleanup)
+   n-parms-neigh_cleanup(n);
neigh_release(n);
continue;
}
@@ -211,6 +213,8 @@ static void neigh_flush_dev(struct neigh
NEIGH_PRINTK2(neigh %p is stray.\n, n);
}
write_unlock(n-lock);
+   if (n-parms-neigh_cleanup)
+   n-parms-neigh_cleanup(n);
neigh_release(n);
}
}
@@ -582,9 +586,6 @@ void neigh_destroy(struct neighbour *nei
kfree(hh);
}
 
-   if (neigh-parms-neigh_destructor)
-   (neigh-parms-neigh_destructor)(neigh);
-
skb_queue_purge(neigh-arp_queue);
 
dev_put(neigh-dev);
@@ -675,6 +676,8 @@ static void neigh_periodic_timer(unsigne
*np = n-next;
n-dead = 1;
write_unlock(n-lock);
+   if (n-parms-neigh_cleanup)
+   n-parms-neigh_cleanup(n);
  

<    5   6   7   8   9   10