Re: wibbling over the cpuset shed domain connnection

2007-10-02 Thread Nick Piggin
On Wednesday 03 October 2007 15:21, Paul Jackson wrote:
> > In the meantime, that patch should be merged though, shouldn't it?
>
> Which patch do you refer to:
>  1) the year old patch to disconnect cpusets and sched domains:
>   cpuset-remove-sched-domain-hooks-from-cpusets.patch
>  2) my patch of a few days ago to add a 'sched_load_balance' flag:
>   cpuset and sched domains: sched_load_balance flag

The one quoted, of course.


> I can't push one without the other, because some real time folks are
> depending on the sched domain hooks that (1) would remove, so need some
> alternative, such as in (2).  Even though (1) is rather broken, as you
> note, it still provides a way that the real time folks can disable load
> balancing at runtime on selected CPUs, so is essential to their work.

OK.


> I can't delay any more resolving this, because the cgroup (aka
> container) code is tangled up with (1), and Andrew needs a clear path
> to send cgroups to Linus real soon now.

If code isn't ready to go, it doesn't need to rush, it can just be untangled
or fixed properly etc.


> In my last message to you, a couple of days ago, I asked what I thought
> were a couple of key and simple questions -- can sched domains overlap,
> and what does it mean for user space if they overlap?  A further
> question comes to mind now -- if sched domains can overlap, does this
> provide some capability to user space that is important to provide?
>
> Could you take a minute, Nick, to consider these questions?  Thanks.

Yeah, it arrived after I had a 24 hour flight. I just see it now.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-02 Thread Crispin Cowan
Linus Torvalds wrote:
> Security, on the other hand, very much does depend on the circumstances 
> and the wishes of the users (or policy-makers). And if we had one module 
> that everybody would be happy with, I'd not make it pluggable either. But 
> as it is, we _know_ that's not the case. 
>   
And you claim you are not a security expert :-)

Crispin

-- 
Crispin Cowan, Ph.D.   http://crispincowan.com/~crispin/
   Itanium. Vista. GPLv3. Complexity at work

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpuset and sched domains: sched_load_balance flag

2007-10-02 Thread Nick Piggin
On Monday 01 October 2007 13:42, Paul Jackson wrote:
> Nick wrote:
> > Moreover, sched_load_balance doesn't really sound like a good name
> > for asking for a partition.
>
> Yup - it's not a good name for asking for a partition.
>
> That's because it isn't asking for a partition.
>
> It's asking for load balancing over the CPUs in the cpuset so marked.

Yeah yeah OK, you turn it off in the parent cpuset of the child cpusets
which you want the partitioning to occur in, and ensure there are no
other overlapping cpusets with that flag turned on in order to create a
hard partition. I don't think this makes the API anynicer.


> > It's more like you're just asking to have better
> > load balancing over that set,
>
> Yup - it's asking for load balancing over that set.  That is why it is
> called that.  There's no idea here of better or worse load balancing,
> that's an internal kernel scheduler subtlety -- it's just a request that
> load balancing be done.

OK, if it prohibits balancing when sched_load_balance is 0, then it is
slightly more useful.


> That is what is visible to user space: whether or not tasks get moved
> from overloaded CPUs to underloaded, though still allowed, CPUs.
>
> This is visible to user space in two ways:
>   1) as task movemement, which may or may not be what is desired, and
>   2) as kernel CPU cycles spent, because load balancing costs CPU cycles
>  that increase more than linearly with the number of CPUs being
>  balanced.
>
> The user doesn't give a hoot what a 'sched domain' is.  They care to
> manage (1) whether their tasks might move under a load imbalance, and
> (2) how many CPU cycles the kernel spends providing this service.

Yeah, but the interface is not very nice. As an interface for hard
partitioning, it doesn't work nicely because it is hierarchical.


> > You would do this by creating partitioning cpusets which carve up the
> > root cpuset (basically -- have multiple roots).
>
> You would do this with the current, single rooted cpuset (and now
> cgroup) mechanism by having multiple immediate child cpusets of the
> root cpuset, which partition the system CPUs.  There is no need to
> invent some bastardized multiple root structure.

What do you mean by bastardized? What's wrong with having a real
(and sane) representation of the requested hard-partitions in the system?


> > You can't (easily) do this now because you have so many tasks in the
> > root cpuset that it is impossible to know whether or not you
> > actually want to load balance them.
>
> I don't know what proposal you are reacting to here.  Clearly not this
> patch that I have proposed, as it is trivially easy to indicate whether
> you want to load balance the root cpuset - by setting or clearing the
> 'sched_load_balance' flag in the root cpuset.

Not your proposal, just the idea to have enough information to be able
to work out a more optimal set of sched-domains automatically. Actually
we can do most of it already automatically, but not hard partitioning.

[snip]

As I said, neither is really semantically more powerful than the other. So
yeah those things are possible to do with your API, but I don't like the API.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


More ext3 panics on 2.6.22 [Fwd: ext3_ordered_writepage panic on shiny new 2.6.22]

2007-10-02 Thread Mitch
Yes i know my kernel is tainted with vmblock and nvidia, but i'm not 
convinced it is related. I'm putting the panics here if anyone interested.


Oct  2 20:08:39 home kernel: Assertion failure in journal_unmap_buffer() 
at fs/jbd/transaction.c:1886: "!buffer_jbddirty(bh)"

Oct  2 20:08:39 home kernel: [ cut here ]
Oct  2 20:08:39 home kernel: kernel BUG at fs/jbd/transaction.c:1886!
Oct  2 20:08:39 home kernel: invalid opcode:  [#1]
Oct  2 20:08:39 home kernel: PREEMPT SMP
Oct  2 20:08:39 home kernel: Modules linked in: iuu_phoenix cdc_acm 
nvidia(P) vmnet(P) parport_pc parport vmblock(P) vmmon(P) udf isofs 
nls_iso8859_1 nls_cp43
7 vfat fat appletalk psnap llc nfsd exportfs lockd sunrpc ftdi_sio 
usbserial uhci_hcd ohci_hcd i2c_nforce2 forcedeth usblp snd_hda_intel 
snd_seq_oss snd_seq_m
idi_event snd_seq snd_seq_device snd_pcm_oss snd_pcm snd_timer 
snd_page_alloc snd_mixer_oss snd usb_storage it87 hwmon_vid i2c_isa 
i2c_dev i2c_core

Oct  2 20:08:39 home kernel: CPU:0
Oct  2 20:08:39 home kernel: EIP: 
0060:[journal_invalidatepage+505/1088]Tainted: P   VLI

Oct  2 20:08:39 home kernel: EFLAGS: 00210296   (2.6.22 #3)
Oct  2 20:08:39 home kernel: EIP is at journal_invalidatepage+0x1f9/0x440
Oct  2 20:08:39 home kernel: eax: 0064   ebx: d53fc770   ecx: 
c03e8a5c   edx: 0022
Oct  2 20:08:39 home kernel: esi: 6f72   edi: f6e14a00   ebp: 
d53fc770   esp: dfe63e4c
Oct  2 20:08:39 home kernel: ds: 007b   es: 007b   fs: 00d8  gs: 0033 
ss: 0068
Oct  2 20:08:39 home kernel: Process rm (pid: 3813, ti=dfe62000 
task=c672da90 task.ti=dfe62000)
Oct  2 20:08:39 home kernel: Stack: c03a44c4 c035850a c03a2ef0 075e 
c03a2ff3 f6e14adc f6e14a14 
Oct  2 20:08:39 home kernel:c2092360  0001  
0001 000a f1ac0964 c01a8b80
Oct  2 20:08:39 home kernel:6f72 000b  c014e116 
c2092360 c014e425 c2092360 c014e514

Oct  2 20:08:39 home kernel: Call Trace:
Oct  2 20:08:39 home kernel:  [ext3_invalidatepage+0/48] 
ext3_invalidatepage+0x0/0x30
Oct  2 20:08:39 home kernel:  [do_invalidatepage+22/32] 
do_invalidatepage+0x16/0x20
Oct  2 20:08:39 home kernel:  [truncate_complete_page+69/80] 
truncate_complete_page+0x45/0x50
Oct  2 20:08:39 home kernel:  [truncate_inode_pages_range+228/720] 
truncate_inode_pages_range+0xe4/0x2d0
Oct  2 20:08:39 home kernel:  [truncate_inode_pages+23/32] 
truncate_inode_pages+0x17/0x20
Oct  2 20:08:39 home kernel:  [ext3_delete_inode+19/208] 
ext3_delete_inode+0x13/0xd0
Oct  2 20:08:39 home kernel:  [ext3_delete_inode+0/208] 
ext3_delete_inode+0x0/0xd0
Oct  2 20:08:39 home kernel:  [generic_delete_inode+94/208] 
generic_delete_inode+0x5e/0xd0

Oct  2 20:08:39 home kernel:  [iput+92/112] iput+0x5c/0x70
Oct  2 20:08:39 home kernel:  [do_unlinkat+239/336] do_unlinkat+0xef/0x150
Oct  2 20:08:39 home kernel:  [irq_exit+91/144] irq_exit+0x5b/0x90
Oct  2 20:08:39 home kernel:  [smp_apic_timer_interrupt+87/144] 
smp_apic_timer_interrupt+0x57/0x90
Oct  2 20:08:39 home kernel:  [sysenter_past_esp+95/133] 
sysenter_past_esp+0x5f/0x85

Oct  2 20:08:39 home kernel:  ===
Oct  2 20:08:39 home kernel: Code: 44 24 10 f3 2f 3a c0 c7 44 24 0c 5e 
07 00 00 c7 44 24 08 f0 2e 3a c0 c7 44 24 04 0a 85 35 c0 c7 04 24 c4 44 
3a c0 e8 b7 a9

f6 ff <0f> 0b eb fe 8b 74 24 1c 85 f6 0f 85 22 fe ff ff 8b 5c 24 28 85
Oct  2 20:08:39 home kernel: EIP: [journal_invalidatepage+505/1088] 
journal_invalidatepage+0x1f9/0x440 SS:ESP 0068:dfe63e4c


And later...


Oct  3 03:01:51 home kernel: BUG: unable to handle kernel NULL pointer 
dereference at virtual address 

Oct  3 03:01:51 home kernel:  printing eip:
Oct  3 03:01:51 home kernel: c01bbbf0
Oct  3 03:01:51 home kernel: *pdpt = 345fd001
Oct  3 03:01:51 home kernel: *pde = 
Oct  3 03:01:51 home kernel: Oops: 0002 [#2]
Oct  3 03:01:51 home kernel: PREEMPT SMP
Oct  3 03:01:51 home kernel: Modules linked in: iuu_phoenix cdc_acm 
nvidia(P) vmnet(P) parport_pc parport vmblock(P) vmmon(P) udf isofs 
nls_iso8859_1 nls_cp43
7 vfat fat appletalk psnap llc nfsd exportfs lockd sunrpc ftdi_sio 
usbserial uhci_hcd ohci_hcd i2c_nforce2 forcedeth usblp snd_hda_intel 
snd_seq_oss snd_seq_m
idi_event snd_seq snd_seq_device snd_pcm_oss snd_pcm snd_timer 
snd_page_alloc snd_mixer_oss snd usb_storage it87 hwmon_vid i2c_isa 
i2c_dev i2c_core

Oct  3 03:01:51 home kernel: CPU:0
Oct  3 03:01:51 home kernel: EIP: 
0060:[journal_grab_journal_head+16/144]Tainted: P   VLI

Oct  3 03:01:51 home kernel: EFLAGS: 00010202   (2.6.22 #3)
Oct  3 03:01:51 home kernel: EIP is at journal_grab_journal_head+0x10/0x90
Oct  3 03:01:51 home kernel: eax: f7c7e000   ebx:    ecx: 
f6f6c8dc   edx: c3492360
Oct  3 03:01:51 home kernel: esi:    edi:    ebp: 
d862a534   esp: f7c7fe48
Oct  3 03:01:51 home kernel: ds: 007b   es: 007b   fs: 00d8  gs:  
ss: 0068
Oct  3 03:01:51 home kernel: Process kswapd0 (pid: 202, ti=f7c7e000 

Re: wibbling over the cpuset shed domain connnection

2007-10-02 Thread Paul Jackson
> In the meantime, that patch should be merged though, shouldn't it?

Which patch do you refer to:
 1) the year old patch to disconnect cpusets and sched domains:
cpuset-remove-sched-domain-hooks-from-cpusets.patch
 2) my patch of a few days ago to add a 'sched_load_balance' flag:
cpuset and sched domains: sched_load_balance flag

I can't push one without the other, because some real time folks are
depending on the sched domain hooks that (1) would remove, so need some
alternative, such as in (2).  Even though (1) is rather broken, as you
note, it still provides a way that the real time folks can disable load
balancing at runtime on selected CPUs, so is essential to their work.

I can't delay any more resolving this, because the cgroup (aka
container) code is tangled up with (1), and Andrew needs a clear path
to send cgroups to Linus real soon now.

In my last message to you, a couple of days ago, I asked what I thought
were a couple of key and simple questions -- can sched domains overlap,
and what does it mean for user space if they overlap?  A further
question comes to mind now -- if sched domains can overlap, does this
provide some capability to user space that is important to provide?

Could you take a minute, Nick, to consider these questions?  Thanks.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: File corruption when using kernels 2.6.18+

2007-10-02 Thread Pekka Enberg
Hi Neil,

On 10/3/07, Neil Romig <[EMAIL PROTECTED]> wrote:
> Thanks for your help on this. I have narrowed it down to commit
> "c22ce143d15eb288543fe9873e1c5ac1c01b69a1 x86: cache pollution aware
> __copy_from_user_ll()". This fits with the errors I'm getting, so now I need
> to find out if I can safely ignore this patch, or does it have to be modified?
> This is my first Linux bug in many years of simply using it, so I'm a little
> nervous!

Just to make sure, if you disable CONFIG_X86_INTEL_USERCOPY, the
corruption goes away?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Version 4 (2.6.23-rc8-mm2) Smack: Simplified Mandatory Access Control Kernel

2007-10-02 Thread Al Viro
On Tue, Oct 02, 2007 at 09:45:42PM -0700, Casey Schaufler wrote:
> 
> From: Casey Schaufler <[EMAIL PROTECTED]>
> 
> Smack is the Simplified Mandatory Access Control Kernel.
> 
> Smack implements mandatory access control (MAC) using labels
> attached to tasks and data containers, including files, SVIPC,
> and other tasks. Smack is a kernel based scheme that requires
> an absolute minimum of application support and a very small
> amount of configuration data.

I _really_ don't like what you are doing with these symlinks.
For one thing, you have no exclusion between reading the list
entries and modifying them.  For another...  WTF is filesystem
making assumptions about the locations where the things are
mounted?  Hell, even if you override your tmp symlink, what
happens if we want it in two chroot jails with different layouts?

I really don't get it; why not simply have something like
/smack/tmp.link resolve to tmp/ and have userland bind or mount
whatever you bloody like on /smack/tmp?  No problems with absolute
paths, can be used in chroot jails with whatever layouts, ditto for
namespaces, etc. and both symlink and directory get created at
the same time (by one name).  Hell, if you keep a reference
to dentry of directory in the data associated with symlink,
you can simply switch nd->dentry to that, drop the old one
and grab the reference to page containing label and return
it via nd_set_link().  No need to play with allocations, strcat,
yadda, yadda.  readlink() can stuff the ->d_name of the same
dentry plus / plus label directly into user buffer; again, no
allocations needed and works fine anywhere.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: wibbling over the cpuset shed domain connnection

2007-10-02 Thread Nick Piggin
On Tuesday 02 October 2007 07:34, Paul Jackson wrote:
> In -mm merge plans for 2.6.24, Andrew wrote:
> > cpuset-remove-sched-domain-hooks-from-cpusets.patch
> >
> >   Paul continues to wibble over this.  Hold, I guess.
>
> Oh dear ... after looking at the following to figure out what
> a wibble is, I wonder which one Andrew had in mind:
>
>   http://www.urbandictionary.com/define.php?term=wibble
>
> The insanity, the rubbish, being overwhelmed, ... ?
>
> 
>
> If one of Nick or I can knock some sense into the others head,
> then this saga should come to a close soon.

In the meantime, that patch should be merged though, shouldn't it?
cpusets is currently telling the scheduler to do the wrong thing WRT
the user interface definition of cpusets, right?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sky2: jumbo frame regression fix

2007-10-02 Thread Stephen Hemminger
On Wed, 03 Oct 2007 03:34:34 +0200
Ian Kumlien <[EMAIL PROTECTED]> wrote:

> On tis, 2007-10-02 at 18:02 -0700, Stephen Hemminger wrote:
> > Remove unneeded check that caused problems with jumbo frame sizes.
> > The check was recently added and is wrong.
> > When using jumbo frames the sky2 driver does fragmentation, so
> > rx_data_size is less than mtu.
> 
> Confirmed working.
> 
> Now running with 9k mtu with no errors, =)
> 
> It also seems that the FIFO bug was the one that affected me before,
> damn odd race that one.

Does the workaround (forced reset work). Ian, you are the first person to
report triggering it.  I haven't found a way to make it happen.
What combination of flow control and speeds are you using?


-- 
Stephen Hemminger <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sky2: jumbo frame regression fix

2007-10-02 Thread Jeff Garzik

Stephen Hemminger wrote:

On Tue, 02 Oct 2007 21:07:22 -0400
Jeff Garzik <[EMAIL PROTECTED]> wrote:


Stephen Hemminger wrote:

Remove unneeded check that caused problems with jumbo frame sizes.
The check was recently added and is wrong.
When using jumbo frames the sky2 driver does fragmentation, so
rx_data_size is less than mtu.

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

--- a/drivers/net/sky2.c2007-10-02 17:56:31.0 -0700
+++ b/drivers/net/sky2.c2007-10-02 17:58:56.0 -0700
@@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru
sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending;
prefetch(sky2->rx_ring + sky2->rx_next);
 
-	if (length < ETH_ZLEN || length > sky2->rx_data_size)

-   goto len_error;
-

2.6.23?  2.6.24?  enquiring minds want to know...


2.6.23, since it is a regression


You can have regressions in behavior in net-2.6.24.git, too.  _Please_ 
be specific about where you want your patches to go.  Thanks.


Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Question] How to represent SYSTEM_RAM in kerenel/resouce.c

2007-10-02 Thread KAMEZAWA Hiroyuki
On Tue, 2 Oct 2007 19:52:42 -0600
Matthew Wilcox <[EMAIL PROTECTED]> wrote:

> On Wed, Oct 03, 2007 at 10:31:36AM +0900, KAMEZAWA Hiroyuki wrote:
> > i386 and x86_64 registers System RAM as IORESOUCE_MEM | IORESOUCE_BUSY.
> > ia64 registers System RAM as IORESOURCE_MEM.
> > 
> > Which is better ?
> 
> Should probably be BUSY.  Non-BUSY regions can have io resources
> requested underneath them, but you wouldn't want a PCI device to be
> assigned an address which overlaps with physical memory.

Thank you.
It seems that I'll have to try modifing ia64 and memory hotplug in
the next -mm. 

Regards,
-Kame

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sky2: jumbo frame regression fix

2007-10-02 Thread Stephen Hemminger
On Tue, 02 Oct 2007 21:07:22 -0400
Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger wrote:
> > Remove unneeded check that caused problems with jumbo frame sizes.
> > The check was recently added and is wrong.
> > When using jumbo frames the sky2 driver does fragmentation, so
> > rx_data_size is less than mtu.
> > 
> > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
> > 
> > --- a/drivers/net/sky2.c2007-10-02 17:56:31.0 -0700
> > +++ b/drivers/net/sky2.c2007-10-02 17:58:56.0 -0700
> > @@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru
> > sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending;
> > prefetch(sky2->rx_ring + sky2->rx_next);
> >  
> > -   if (length < ETH_ZLEN || length > sky2->rx_data_size)
> > -   goto len_error;
> > -
> 
> 2.6.23?  2.6.24?  enquiring minds want to know...

2.6.23, since it is a regression

-- 
Stephen Hemminger <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Document x86-64 iommu kernel parameters

2007-10-02 Thread Randy Dunlap

Jeff Garzik wrote:

Randy Dunlap wrote:

On Tue, 2 Oct 2007 21:34:13 -0400 Jeff Garzik wrote:


Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
---
After having to go figure out what some of these means, I figured I
would save others the trouble.

Some of these are "best guess" based on a quick scan of the code, so it
certainly needs a sanity review before going upstream.


"iommu" is listed in Documentation/x86_64/boot-options.txt
along with more x86_64-specific boot options.
A few other arches do something similar...


Ah!  Well, seeing as how we already have a provision for arch-specific 
options in kernel-parameters.txt, and some less-obscure arch-specific 
options can be found there, I think an argument can be made for my patch :)


Nonethless, if the maintainer disagrees, they can drop this patch I 
suppose.


or maybe during the x86 merge, we can merge the docs also...

--
~Randy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-02 Thread Linus Torvalds


On Tue, 2 Oct 2007, Bill Davidsen wrote:
> 
> Unfortunately not so, I've been looking at schedulers since MULTICS, and
> desktops since the 70s (MP/M), and networked servers since I was the ARPAnet
> technical administrator at GE's Corporate R Center. And on desktops response
> is (and should be king), while on a server, like nntp or mail, I will happily
> go from 1ms to 10sec for a message to pass through the system if only I can
> pass 30% more messages per hour, because in virtually all cases transit time
> in that range is not an issue. Same thing for DNS, LDAP, etc, only smaller
> time range. If my goal is <10ms, I will not sacrifice capacity to do it.

Bill, that's a *tuning* issue, not a scheduler logic issue.

You can do that today. The scheduler has always had (well, *almost* 
always: I think the really really original one didn't) had tuning knobs.

It in no way excuses any "pluggable scheduler", because IT DOES NOT CHANGE 
THE PROBLEM.

[ Side note: not only doesn't it change the problem, but a good scheduler 
  tunes itself rather naturally for most things. In particular, for things 
  that really are CPU-limited, the scheduler should be able to notice 
  that, and will not aim for latency to the same degree.

  In fact, what is really important is that the scheduler notice that some 
  programs are latency-critical AT THE SAME TIME as other programs sharing 
  that CPU are not, which very much implies that you absolutely MUST NOT 
  have a scheduler that done one or the other: it needs to know about 
  *both* behaviors at the same time.

  IOW, it is very much *not* about multiple different "pluggable modules", 
  because the scheduler must be able to work *across* these kinds of 
  barriers. ]

So for example, with the current scheduler, you can actually set things 
like scheduler latency. Exactly so you can tune things. However, I 
actually would argue that you generally shouldn't need to, and if you 
really do need to, and it's a huge deal for a real load (and not just a 
few percent for a benchmark), we should consider that a scheduler problem.

So your "argument" is nonsense. You're arguing for something else than 
what you _claim_ to be arguing for. What you state that you want actually 
has nothing what-so-ever to do with pluggable schedulers, quite the 
reverse!

It's also totally incorrect to state that this is somehow intrisicly a 
feature of a "server load". Many server loads have very real latency 
constraints. No, not the traditional UNIX loads of SMPT and NNTP, but in 
many loads the latency guarantees are a rather important part of it, and 
you'll have benchmarks that literally test how high the load can be until 
latency reaches some intolerable value - ie latency ends up being the 
critical part.

There's also a meta-development issue here: I can state with total 
conviction that historically, if we had had a "server scheduler" and a 
"desktop scheduler", we'd have been in much worse shape than we are now. 

Not only are a lot of the loads the same or at least similar (and aiming 
for _one_ scheduler - especially one that auto-tunes itself at least to to 
some degree - gets you lots of good testing), but the hardware situation 
changes.

For example, even just five years ago, there would have been people who 
thought that multiprocessing is a server load - and they'd have been 
largely right at the time. Would you have wanted a "server" (SMP, screw 
latency) scheduler, a "workstation" (SMP but low-latency) scheduler and a 
"desktop" (UP) scheduler for the different cases?

Because yes, SMP does impact the scheduler a lot... The locking, the 
migration between CPU's, the CPU affinity.. Things that gamers five years 
ago would have felt was just totally screwing them over and making the 
scheduler slower and more complex "for no gain".

See? Pluggable things are generally a *bad* thing. You should generally 
aim for *never* being pluggable if you can at all avoid it, because it not 
only fragments the developer base over totally different code bases, it 
generates unmaintainable decisions as the problem space evolves.

To get back to security: I didn't want pluggable security because I 
thought that was a technically good solution. No, the reason Linux has LSM 
(and yes, I was the one who pushed hard for the whole thing, even if I 
didn't actually write any of it) was because the problem wasn't technical 
to begin with.

It was social/political and administrative.

See? Another fundamental difference between schedulers and security 
modules. 

> > I don't know who came up with it, or why people continue to feed the insane
> > ideas. Why do people think that servers don't care about latency?   
> 
> Because people who run servers for a living, and have to live with limited
> hardware capacity realize that latency isn't the only issue to be addressed,
> and that the policy for degradation of latency vs. throughput may be very
> different on one server than another or a desktop.


Re: [PATCH] mark read_crX() asm code as volatile

2007-10-02 Thread Nick Piggin
On Wednesday 03 October 2007 04:27, Chuck Ebbert wrote:
> On 10/02/2007 11:28 AM, Arjan van de Ven wrote:
> > On Tue, 02 Oct 2007 18:08:32 +0400
> >
> > Kirill Korotaev <[EMAIL PROTECTED]> wrote:
> >> Some gcc versions (I checked at least 4.1.1 from RHEL5 & 4.1.2 from
> >> gentoo) can generate incorrect code with read_crX()/write_crX()
> >> functions mix up, due to cached results of read_crX().
> >
> > I'm not so sure volatile is the right answer, as compared to giving the
> > asm more strict contraints
> >
> > asm volatile tends to mean something else than "the result has
> > changed"
>
> It means "don't eliminate this code if it's reachable" which should be
> just enough for this case. But it could still be reordered in some cases
> that could break, I think.
>
> This should work because the result gets used before reading again:
>
> read_cr3(a);
> write_cr3(a | 1);
> read_cr3(a);
>
> But this might be reordered so that b gets read before the write:
>
> read_cr3(a);
> write_cr3(a | 1);
> read_cr3(b);
>
> ?

I don't see how, as write_cr3 clobbers memory.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-02 Thread Nick Piggin
On Tuesday 02 October 2007 21:40, Peter Zijlstra wrote:
> On Tue, 2007-10-02 at 13:21 +0200, Kay Sievers wrote:

> > How about adding this information to the tree then, instead of
> > creating a new top-level hack, just because something that you think
> > you need doesn't exist.
>
> So you suggest adding all the various network filesystems in there
> (where?), and adding the concept of a BDI, and ensuring all are properly
> linked together - somehow. Feel free to do so.

Would something fit better under /sys/fs/? At least filesystems are
already an existing concept to userspace.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] fix bogus reporting of signals by audit

2007-10-02 Thread Al Viro
Async signals should not be reported as sent by current in
audit log.  As it is, we call audit_signal_info() too early in
check_kill_permission().  Note that check_kill_permission() has that
test already - it needs to know if it should apply current-based
permission checks.  So the solution is to move the call of audit_signal_info()
between those.
Bogosity in question is easily reproduced - add a rule watching for
e.g. kill(2) from specific process (so that audit_signal_info() would not
short-circuit to nothing), say load_policy, watch the bogus OBJ_PID entry
in audit logs claiming that write(2) on selinuxfs file issued by load_policy(8)
had somehow managed to send a signal to syslogd...

Signed-off-by: Al Viro <[EMAIL PROTECTED]>
Acked-by: Steve Grubb <[EMAIL PROTECTED]>
Acked-by: Eric Paris <[EMAIL PROTECTED]>
---
diff --git a/kernel/signal.c b/kernel/signal.c
index 9fb91a3..7929523 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -531,18 +531,18 @@ static int check_kill_permission(int sig, struct siginfo 
*info,
if (!valid_signal(sig))
return error;
 
-   error = audit_signal_info(sig, t); /* Let audit system see the signal */
-   if (error)
-   return error;
-
-   error = -EPERM;
-   if ((info == SEND_SIG_NOINFO || (!is_si_special(info) && 
SI_FROMUSER(info)))
-   && ((sig != SIGCONT) ||
-   (process_session(current) != process_session(t)))
-   && (current->euid ^ t->suid) && (current->euid ^ t->uid)
-   && (current->uid ^ t->suid) && (current->uid ^ t->uid)
-   && !capable(CAP_KILL))
+   if (info == SEND_SIG_NOINFO || (!is_si_special(info) && 
SI_FROMUSER(info))) {
+   error = audit_signal_info(sig, t); /* Let audit system see the 
signal */
+   if (error)
+   return error;
+   error = -EPERM;
+   if (((sig != SIGCONT) ||
+   (process_session(current) != process_session(t)))
+   && (current->euid ^ t->suid) && (current->euid ^ t->uid)
+   && (current->uid ^ t->suid) && (current->uid ^ t->uid)
+   && !capable(CAP_KILL))
return error;
+   }
 
return security_task_kill(t, info, sig, 0);
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


My kernel fails with kexec

2007-10-02 Thread Jun Koi
Greetings,

I have a kernel (which is not Linux kernel), and want to have it
worked with kexec. That means I want to get kexec boot my kernel.
Fortunately, kexec crashes when booting it. (with kexec -e command)

My suspect is that my kernel is not written to "support" kexec. So
could anybody tell me what is the requirement of a kernel, so it works
with kexec?

"kexec -e" spits out the below message in QEMU when booting my kernel. Any
hint where the problem lies, and on how to debug the problem?

Thanks,
Jun


(qemu) qemu: fatal: triple fault
EAX=1500 EBX=001001f1 ECX= EDX=4081
ESI=002ae8c0 EDI=002ac8c0 EBP=e098 ESP=8ffe
EIP=0d05 EFL=0002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018   00cf9300
CS =4020 00040200  008f9f00
SS =4000 0004  9300
DS =0018   00cf9300
FS =0018   00cf9300
GS =0018   00cf9300
LDT=   8000
TR =0080 c1fd7000 2073 c10089fd
GDT= 0040a938 0017
IDT= c000 
CR0=0010 CR2=b7ed6200 CR3= CR4=
CCS=8000 CCD=4000 CCO=SARL
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
FPR0=  FPR1= 
FPR2=  FPR3= 
FPR4=  FPR5= 
FPR6=f424 4012 FPR7= 
XMM00= XMM01=
XMM02= XMM03=
XMM04= XMM05=
XMM06= XMM07=
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kdump info request

2007-10-02 Thread Vivek Goyal
On Mon, Oct 01, 2007 at 09:31:45AM -0700, Randy Dunlap wrote:
> On Mon, 1 Oct 2007 09:35:04 -0600 Mukker, Atul wrote:
> 
> > Thanks for the information and the effort.
> > 
> > We need to support all currently shipping products with kdump support
> > available (read Red Hat and SuSE) so sooner it makes into to the
> > upstream kernel the better it is.
> > 
> > So, today there is no alternative way to find out if the driver is being
> > loaded under kdump restrictive environment?
> > 
> > Thanks
> > -Atul
> 
> I'm not the right person to answer that, but going forward, it would
> be nice to have that information and it would be good to do correctly.
> 
> I think that scanning  is not actually the good/right
> way to do this.  It should just be a flag somewhere, and the flag should
> be available in all (future) kernels (and likely easily backportable
> as well, if that matters), meaning those built without kdump support.
> 
> But it's still up to the kexec developers...
> 
> 

Hi Atul/Randy,

[CCing to LKML]

Thinking more about it, looks like scanning command line is not a very good
idea.

I think we should instead look for if total available RAM in the system and
then let driver make a decision about how much memory to allocate. Pavel
already suggested it on LKML and I like the idea.

This is more generic and can be applied to kdump, kexec based hibernation and
all the future users of kexec on panic infrstrcuture which will boot a 
second kernel in restricted amount of RAM.

I am not sure what's the best way to determine the total RAM in the system
in arch independent manner. Some VM guys can tell it better. One of the ways
could be to parse /proc/iomem, the way kexec-tools does. Balbir mentioned
that traverse through the nodes and sum up present_pages. 

Any suggestions, what's the best way to determine total amount of RAM
in the system in kernel?

Right now this seems to be one odd case so this code can be inside module.
If there are more users of it then we can probably create a flag inside
kernel and export it.

Thanks
Vivek
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Document x86-64 iommu kernel parameters

2007-10-02 Thread H. Peter Anvin

Randy Dunlap wrote:


Maybe we can/should merge the doc files along with the x86 arch merge.



Well, the x86 merge is pretty much mechanical.  It should be followed up 
with a lot of manual merging.


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23-rc7-mm1 -- powerpc rtas panic

2007-10-02 Thread Michael Ellerman
On Wed, 2007-10-03 at 11:19 +1000, Tony Breeds wrote:
> On Wed, Oct 03, 2007 at 10:30:16AM +1000, Michael Ellerman wrote:
>  
> > I realise it'll make the patch bigger, but this doesn't seem like a
> > particularly good name for the variable anymore.
> 
> Sure, what about?

Better .. but  ..   :D

> diff --git a/arch/powerpc/platforms/pseries/rtasd.c 
> b/arch/powerpc/platforms/pseries/rtasd.c
> index 30925d2..73401c8 100644
> --- a/arch/powerpc/platforms/pseries/rtasd.c
> +++ b/arch/powerpc/platforms/pseries/rtasd.c
> @@ -54,8 +54,9 @@ static unsigned int rtas_event_scan_rate;
>  static int full_rtas_msgs = 0;
>  
>  /* Stop logging to nvram after first fatal error */
> -static int no_more_logging;
> -
> +static int logging_enabled; /* Until we initialize everything,
> + * make sure we don't try logging
> + * anything */

Until we initialise what exactly?

>  static int error_log_cnt;
>  
>  /*
> @@ -217,7 +218,7 @@ void pSeries_log_error(char *buf, unsigned int err_type, 
> int fatal)
>   }
>  
>   /* Write error to NVRAM */
> - if (!no_more_logging && !(err_type & ERR_FLAG_BOOT))
> + if (logging_enabled && !(err_type & ERR_FLAG_BOOT))
>   nvram_write_error_log(buf, len, err_type, error_log_cnt);
>  
>   /*
> @@ -229,8 +230,8 @@ void pSeries_log_error(char *buf, unsigned int err_type, 
> int fatal)
>   printk_log_rtas(buf, len);
>  
>   /* Check to see if we need to or have stopped logging */
> - if (fatal || no_more_logging) {
> - no_more_logging = 1;
> + if (fatal || !logging_enabled) {
> + logging_enabled = 0;
>   spin_unlock_irqrestore(_log_lock, s);
>   return;
>   }

Hmmm, this routine has 4 separate lock-dropping exit paths ..

> @@ -302,7 +303,7 @@ static ssize_t rtas_log_read(struct file * file, char 
> __user * buf,
>  
>   spin_lock_irqsave(_log_lock, s);
>   /* if it's 0, then we know we got the last one (the one in NVRAM) */
> - if (rtas_log_size == 0 && !no_more_logging)
> + if (rtas_log_size == 0 && logging_enabled)
>   nvram_clear_error_log();
>   spin_unlock_irqrestore(_log_lock, s);
>  
> @@ -414,6 +415,8 @@ static int rtasd(void *unused)
>   memset(logdata, 0, rtas_error_log_max);
>   rc = nvram_read_error_log(logdata, rtas_error_log_max,
> _type, _log_cnt);
> + /* We can use rtas_log_buf now */
> + logging_enabled = 1;
>  
>   if (!rc) {
>   if (err_type != ERR_FLAG_ALREADY_LOGGED) {

What exactly happens that allows us to do logging? I don't see any
ordering between anything else and the setting of the flag, and AFAICT
we're not inside a spinlock or anything here.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person


signature.asc
Description: This is a digitally signed message part


Re: [git] CFS-devel, latest code

2007-10-02 Thread Srivatsa Vaddagiri
On Tue, Oct 02, 2007 at 09:59:04PM +0200, Dmitry Adamushko wrote:
> The following patch (sched: disable sleeper_fairness on SCHED_BATCH)
> seems to break GROUP_SCHED. Although, it may be
> 'oops'-less due to the possibility of 'p' being always a valid
> address.

Thanks for catching it!  Patch below looks good to me. 

Acked-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>

> Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>
> 
> ---
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 8727d17..a379456 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -473,9 +473,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity 
> *se, int initial)
>   vruntime += sched_vslice_add(cfs_rq, se);
> 
>   if (!initial) {
> - struct task_struct *p = container_of(se, struct task_struct, 
> se);
> -
> - if (sched_feat(NEW_FAIR_SLEEPERS) && p->policy != SCHED_BATCH)
> + if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se) &&
> + task_of(se)->policy != SCHED_BATCH)
>   vruntime -= sysctl_sched_latency;
> 
>   vruntime = max_t(s64, vruntime, se->vruntime);
> 
> ---
> 

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Document x86-64 iommu kernel parameters

2007-10-02 Thread Randy Dunlap
On Tue, 02 Oct 2007 22:30:31 -0400 Jeff Garzik wrote:

> Randy Dunlap wrote:
> > On Tue, 2 Oct 2007 21:34:13 -0400 Jeff Garzik wrote:
> > 
> >> Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
> >> ---
> >> After having to go figure out what some of these means, I figured I
> >> would save others the trouble.
> >>
> >> Some of these are "best guess" based on a quick scan of the code, so it
> >> certainly needs a sanity review before going upstream.
> > 
> > "iommu" is listed in Documentation/x86_64/boot-options.txt
> > along with more x86_64-specific boot options.
> > A few other arches do something similar...
> 
> Ah!  Well, seeing as how we already have a provision for arch-specific 
> options in kernel-parameters.txt, and some less-obscure arch-specific 
> options can be found there, I think an argument can be made for my patch :)
> 
> Nonethless, if the maintainer disagrees, they can drop this patch I suppose.

[sorry if there be duplicates; I thought I sent this but can't find it
anywhere]


Maybe we can/should merge the doc files along with the x86 arch merge.

---
~Randy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH] Add sysfs control to modify a user's cpu share

2007-10-02 Thread Srivatsa Vaddagiri
On Tue, Oct 02, 2007 at 06:12:39PM -0400, Eric St-Laurent wrote:
> While a sysfs interface is OK and somewhat orthogonal to the interface
> proposed the containers patches, I think maybe a new syscall should be
> considered.

We had discussed syscall vs filesystem based interface for resource
management [1] and there was a heavy bias favoring filesystem based interface,
based on which the container (now "cgroup") filesystem evolved.

Where we already have one interface defined, I would be against adding 
an equivalent syscall interface.

Note that this "fair-user" scheduling can in theory be accomplished
using the same cgroup based interface, but requires some extra setup in
userspace (either to run a daemon which moves tasks to appropriate
control groups/containers upon their uid change OR to modify initrd to mount 
cgroup filesystem at early bootup time). I expect most distros to enable
CONFIG_FAIR_CGROUP_SCHED (control group based fair group scheduler) and not 
CONFIG_FAIR_USER_SHCED (user id based fair group scheduler). The only
reason why we are providing CONFIG_FAIR_USER_SCHED and the associated
sysfs interface is to help test group scheduler w/o requiring knowledge
of cgroup filesystem.

Reference:

1. http://marc.info/?l=linux-kernel=116231242201300=2

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..

2007-10-02 Thread Eric St-Laurent

On Tue, 2007-10-02 at 11:17 +0200, Thomas Gleixner wrote:

[...]

> I have uploaded an update of the arch/x86 tree based on -rc9 to
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86.git x86
> 

[...]

> If there is anything we can help with the transition, please do not
> hesitate to ask.
> 
> Thanks,
> 
>   Thomas, Ingo

Hi Thomas,

This latest x86 branch build and boot without problem with my usual
x86_64 config.

If you remember our conversation one month ago, I was unable to build
your tree.

I've upgraded my Ubuntu distribution from 7.04 to 7.10 beta this week,
maybe this fixed it.

But I still had to do some manual fixes to get the packaging steps
working:

mkdir arch/x86_64/boot/
ln -s ../../../arch/x86/boot/bzImage arch/x86_64/boot/bzImage


Best regards,

- Eric


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-02 Thread Bill Davidsen

Linus Torvalds wrote:

On Tue, 2 Oct 2007, Bill Davidsen wrote:
  

And yet you can make the exact same case for schedulers as security, you can
quantify the behavior, but if your only choice is A it doesn't help to know
that B is better.



You snipped a key part of the argument. Namely:

  Another difference is that when it comes to schedulers, I feel like I
  actually can make an informed decision. Which means that I'm perfectly
  happy to just make that decision, and take the flak that I get for it. And
  I do (both decide, and get flak). That's my job.

which you seem to not have read or understood (neither did apparently 
anybody on slashdot).
  


Actually I had quoted that, made a reply, and decided that my reply was 
too close to a flame and deleted the quote and the nasty reply, because 
I couldn't find a nice way to say what I wanted. Oh well, I tried to 
keep to a higher level, but... on this topic you seem to be off on an 
ego trip. You are not the decider, George Bush is the decider, and the 
only time he's not wrong he didn't understand the question. I checked 
the schedule, it's not you week to be God.


There are sensible people you respect on other topics, who have the 
opinion that there is room for behaviors other than CFS, and who have 
created a pluggable scheduler framework which they are trying to hand 
you on a platter. And you won't even consider that they might be right, 
because you believe there can be one scheduler which is close to optimal 
for all loads.

You say "performance" as if it had universal meaning.



Blah. Bogus and pointless argument removed.

When it comes to schedulers, "performance" *is* pretty damn well-defined, 
and has effectively universal meaning.


The arguments that "servers" have a different profile than "desktop" is 
pure and utter garbage, and is perpetuated by people who don't know what 
they are talking about. The whole notion of "server" and "desktop" 
scheduling being different is nothing but crap. 
  


Unfortunately not so, I've been looking at schedulers since MULTICS, and 
desktops since the 70s (MP/M), and networked servers since I was the 
ARPAnet technical administrator at GE's Corporate R Center. And on 
desktops response is (and should be king), while on a server, like nntp 
or mail, I will happily go from 1ms to 10sec for a message to pass 
through the system if only I can pass 30% more messages per hour, 
because in virtually all cases transit time in that range is not an 
issue. Same thing for DNS, LDAP, etc, only smaller time range. If my 
goal is <10ms, I will not sacrifice capacity to do it.
I don't know who came up with it, or why people continue to feed the 
insane ideas. Why do people think that servers don't care about latency? 
  


Because people who run servers for a living, and have to live with 
limited hardware capacity realize that latency isn't the only issue to 
be addressed, and that the policy for degradation of latency vs. 
throughput may be very different on one server than another or a desktop.
Why do people believe that desktop doesn't have multiple processors or 
through-put intensive loads? Why are people continuing this *idiotic* 
scheduler discussion?
  


Because people can't get you to understand that one size doesn't fit all 
(and I doubt I've broken through).
Really - not only is the whole "desktop scheduler" argument totally bogus 
to begin with (and only brought up by people who either don't know 
anything about it, or who just want to argue, regardless of whether the 
argumen is valid or not), quite frankly, when you say that it's the "same 
issue" as with security models, you're simply FULL OF SH*T.
  


The real issue is that you can't imagine that people who don't share 
your opinion are not only wrong but don't understand the problem. You 
may be right, but when you say anyone who disagrees is wrong by 
definition, then you have lost sight of productive technical 
differences. When your arguments drop to personal attacks and rants it's 
time to look at your technical values.
The issue with LSM is that security people simply cannot even agree on the 
model. It has nothing to do with performance. It's about management, and 
it's about totally different models. Have you even *looked* at the 
differences between AppArmor and SELinux? Did you look at SMACK? They are 
all done by people who are interested in security, but have totally 
different notions of what "security" even *IS*ALL*ABOUT.
  


Exactly, and I'm not the only one who doubts that more than one model 
would be useful. I'm sorry you can't see that about CPU schedulers as well.
In contrast, anybody who claims that the CPU scheduler doesn't know what 
it's all about is just tripping. And anybody who claims that desktop 
workloads are so radically different from server workloads (or that the 
hardware is so different) is just totally out to lunch.


So next time, think five minutes before you start your argument.
  


I don't 

Re: [PATCH 4/5] infiniband: add "dmabarrier" argument to ib_umem_get()

2007-10-02 Thread David Miller
From: [EMAIL PROTECTED]
Date: Tue, 2 Oct 2007 19:49:06 -0700

> 
> Pass a "dmabarrier" argument to ib_umem_get() and use the new 
> argument to control setting the DMA_BARRIER_ATTR attribute on 
> the memory that ib_umem_get() maps for DMA.
> 
> Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

Acked-by: David S. Miller <[EMAIL PROTECTED]>

However I'm a little unhappy with how IA64 achieves this.

The last argument for dma_map_foo() is an enum not an int,
every platform other than IA64 properly defines the last
argument as "enum dma_data_direction".  It can take one
of several distinct values, it is not a mask.

This hijacking of the DMA direction argument is hokey at
best, and at worst is type bypassing which is going to
explode subtly for someone in the future and result in
a long painful debugging session.

Adding another argument could be painful to do this cleanly, but at
least with inline functions and macros it could just evaluate to
nothing on platforms that don't need it.

Either that, or we should turn the thing into an integer "flags"
across the board and audit every DMA mapping implementation so that it
can handle multiple bits being set.  But that's really ugly and
invites mistakes as I detailed above.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] ibmebus: Move to of_device and of_platform_driver, match eHCA and eHEA drivers

2007-10-02 Thread Paul Mackerras
Joachim Fenkes writes:

> Replace struct ibmebus_dev and struct ibmebus_driver with struct of_device
> and struct of_platform_driver, respectively. Match the external ibmebus
> interface and drivers using it.
> 
> Signed-off-by: Joachim Fenkes <[EMAIL PROTECTED]>
> ---
>  drivers/infiniband/hw/ehca/ehca_classes.h |2 +-
>  drivers/net/ehea/ehea.h   |2 +-
>  include/asm-powerpc/ibmebus.h |   38 +++
>  arch/powerpc/kernel/ibmebus.c |   28 ++-
>  drivers/infiniband/hw/ehca/ehca_eq.c  |6 +-
>  drivers/infiniband/hw/ehca/ehca_main.c|   32 ++--
>  drivers/net/ehea/ehea_main.c  |   72 ++--

This is somewhat difficult as this patch touches files that are the
responsibility of three different maintainers.  Is it possible to
split the patch into three, one for each maintainer (possibly by
keeping both old and new interfaces around for a little while)?

If not, then you need to get an Acked-by and an agreement that this
change can go via the powerpc.git tree from Roland Dreier and Jeff
Garzik.

Paul.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/5] dma: document dma_flags_set_attr()

2007-10-02 Thread David Miller
From: [EMAIL PROTECTED]
Date: Tue, 2 Oct 2007 19:47:52 -0700

> 
> Document dma_flags_set_attr().
> 
> Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

Acked-by: David S. Miller <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] dma: add dma_flags_set_attr() to dma interface

2007-10-02 Thread David Miller
From: [EMAIL PROTECTED]
Date: Tue, 2 Oct 2007 19:44:57 -0700

> 
> Introduce the dma_flags_set_attr() interface and give it a default 
> no-op implementation.
> 
> Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

Acked-by: David S. Miller <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/5] dma: redefine dma_flags_set_attr() for sn-ia64

2007-10-02 Thread David Miller
From: [EMAIL PROTECTED]
Date: Tue, 2 Oct 2007 19:46:41 -0700

> 
> define dma_flags_set_attr() for sn-ia64 - it "borrows" bits from 
> the direction argument (renamed "flags") to the dma_map_* routines 
> to pass an additional attributes.  Also define routines to retrieve 
> the original direction and attribute from "flags".
> 
> Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

Acked-by: David S. Miller <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Bugme-new] [Bug 8378] New: Averatec 3156X laptop doesn't reboot with kernels > 2.6.13.5 (responsible commit found)

2007-10-02 Thread Truxton Fulton
Andrew Morton wrote (at Sat, 12 May 2007 18:02:40 -0700) :
> 
> OK, thanks.
> 
> So that are we doing here?  We try the pre-Truxton code and if that didn't
> work we try the post-Truxton code?  Hard to see how that could go wrong.
> 
> Truxton, can you please test it for us?

Hi,

Hiroto Shibuya wrote to tell me that he has a VIA EPIA-EK1
which suffers from the reboot problem when no keyboard is attached.
My first patch works for him :

  
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=59f4e7d572980a521b7bdba74ab71b21f5995538

But the latest patch does not work for him :

  
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8b93789808756bcc1e5c90c99f1b1ef52f839a51

We found that it was necessary to also set the "disable keyboard"
flag in the command byte, as the first patch was doing.  The
second patch tries to minimally modify the command byte, but
it is not enough.

Please consider this simple one-line patch to help people with
low end VIA motherboards reboot when no keyboard is attached.
Hiroto Shibuya has verified that this works for him (as I
no longer have an afflicted machine) :

This patch is against 
linux-2.6.23-rc9/include/asm-i386/mach-default/mach_reboot.h

--- mach_reboot.h   Mon Oct  1 20:24:52 2007
+++ mach_reboot.h.new   Tue Oct  2 19:22:13 2007
@@ -49,7 +49,7 @@
udelay(50);
kb_wait();
udelay(50);
-   outb(cmd | 0x04, 0x60); /* set "System flag" */
+   outb(cmd | 0x14, 0x60); /* set "System flag" and "Keyboard 
Disabled" */
udelay(50);
kb_wait();
udelay(50);


Thanks,

-Truxton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/5] mthca: allow setting "dmabarrier" on user-allocated memory

2007-10-02 Thread Heikki Orsila
On Tue, Oct 02, 2007 at 07:50:07PM -0700, [EMAIL PROTECTED] wrote:
> +struct mthca_reg_mr {
> + __u32 mr_attrs;
> +#define MTHCA_MR_DMAFLUSH 0x1/* flush in-flight DMA on a write to 
> +  * memory region */
> + __u32 reserved;
> +};

Seems like a very odd place to #define something new..

-- 
Heikki Orsila   Barbie's law:
[EMAIL PROTECTED]   "Math is hard, let's go shopping!"
http://www.iki.fi/shd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/5] mthca: allow setting "dmabarrier" on user-allocated memory

2007-10-02 Thread akepner

Allow setting a "dmabarrier" when the mthca driver registers user-
allocated memory.

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

---

 mthca_provider.c |7 ++-
 mthca_user.h |   10 +-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c 
b/drivers/infiniband/hw/mthca/mthca_provider.c
index 17486a4..c818708 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1017,17 +1017,22 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd 
*pd, u64 start, u64 length,
struct mthca_dev *dev = to_mdev(pd->device);
struct ib_umem_chunk *chunk;
struct mthca_mr *mr;
+   struct mthca_reg_mr ucmd;
u64 *pages;
int shift, n, len;
int i, j, k;
int err = 0;
int write_mtt_size;
 
+   if (ib_copy_from_udata(, udata, sizeof ucmd)) 
+   return ERR_PTR(-EFAULT);
+
mr = kmalloc(sizeof *mr, GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
 
-   mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+   mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 
+  ucmd.mr_attrs & MTHCA_MR_DMAFLUSH);
if (IS_ERR(mr->umem)) {
err = PTR_ERR(mr->umem);
goto err;
diff --git a/drivers/infiniband/hw/mthca/mthca_user.h 
b/drivers/infiniband/hw/mthca/mthca_user.h
index 02cc0a7..5662aea 100644
--- a/drivers/infiniband/hw/mthca/mthca_user.h
+++ b/drivers/infiniband/hw/mthca/mthca_user.h
@@ -41,7 +41,7 @@
  * Increment this value if any changes that break userspace ABI
  * compatibility are made.
  */
-#define MTHCA_UVERBS_ABI_VERSION   1
+#define MTHCA_UVERBS_ABI_VERSION   2
 
 /*
  * Make sure that all structs defined in this file remain laid out so
@@ -61,6 +61,14 @@ struct mthca_alloc_pd_resp {
__u32 reserved;
 };
 
+struct mthca_reg_mr {
+   __u32 mr_attrs;
+#define MTHCA_MR_DMAFLUSH 0x1  /* flush in-flight DMA on a write to 
+* memory region */
+   __u32 reserved;
+};
+
+
 struct mthca_create_cq {
__u32 lkey;
__u32 pdn;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/5] dma: document dma_flags_set_attr()

2007-10-02 Thread akepner

Document dma_flags_set_attr().

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

---

 DMA-API.txt |   27 +++
 1 files changed, 27 insertions(+)

diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index cc7a8c3..16e15c0 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -544,3 +544,30 @@ size is the size (and should be a page-sized multiple).
 The return value will be either a pointer to the processor virtual
 address of the memory, or an error (via PTR_ERR()) if any part of the
 region is occupied.
+
+int 
+dma_flags_set_attr(u32 attr, enum dma_data_direction dir)
+
+Amend dir with a platform-specific "dma attribute".
+
+The only attribute currently defined is DMA_BARRIER_ATTR, which causes 
+in-flight DMA to be flushed when the associated memory region is written 
+to (see example below).  Setting DMA_BARRIER_ATTR provides a mechanism 
+to enforce ordering of DMA on platforms that permit DMA to be reordered 
+between device and host memory (within a NUMA interconnect).  On other 
+platforms this is a nop.
+
+DMA_BARRIER_ATTR would be set when the memory region is mapped for DMA, 
+e.g.:
+
+   int count;
+   int flags = dma_flags_set_attr(DMA_BARRIER_ATTR, DMA_BIDIRECTIONAL);
+   
+   count = dma_map_sg(dev, sglist, nents, flags);
+
+As an example of a situation where this would be useful, suppose that 
+the device does a DMA write to indicate that data is ready and 
+available in memory.  The DMA of the "completion indication" could 
+race with data DMA.  Using DMA_BARRIER_ATTR on the memory used for 
+completion indications would prevent the race.
+
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/5] infiniband: add "dmabarrier" argument to ib_umem_get()

2007-10-02 Thread akepner

Pass a "dmabarrier" argument to ib_umem_get() and use the new 
argument to control setting the DMA_BARRIER_ATTR attribute on 
the memory that ib_umem_get() maps for DMA.

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

---

 drivers/infiniband/core/umem.c   |8 ++--
 drivers/infiniband/hw/amso1100/c2_provider.c |2 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |2 +-
 drivers/infiniband/hw/ehca/ehca_mrmw.c   |2 +-
 drivers/infiniband/hw/ipath/ipath_mr.c   |2 +-
 drivers/infiniband/hw/mlx4/cq.c  |2 +-
 drivers/infiniband/hw/mlx4/doorbell.c|2 +-
 drivers/infiniband/hw/mlx4/mr.c  |3 ++-
 drivers/infiniband/hw/mlx4/qp.c  |2 +-
 drivers/infiniband/hw/mlx4/srq.c |2 +-
 drivers/infiniband/hw/mthca/mthca_provider.c |2 +-
 include/rdma/ib_umem.h   |4 ++--
 12 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 664d2fa..093b58d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -69,9 +69,10 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
  * @addr: userspace virtual address to start at
  * @size: length of region to pin
  * @access: IB_ACCESS_xxx flags for memory being pinned
+ * @dmabarrier: set "dmabarrier" attribute on this memory
  */
 struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
-   size_t size, int access)
+   size_t size, int access, int dmabarrier)
 {
struct ib_umem *umem;
struct page **page_list;
@@ -83,6 +84,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
int ret;
int off;
int i;
+   int flags = dmabarrier ? dma_flags_set_attr(DMA_BARRIER_ATTR, 
+   DMA_BIDIRECTIONAL) : 
+DMA_BIDIRECTIONAL;
 
if (!can_do_mlock())
return ERR_PTR(-EPERM);
@@ -160,7 +164,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
chunk->nmap = ib_dma_map_sg(context->device,
>page_list[0],
chunk->nents,
-   DMA_BIDIRECTIONAL);
+   flags);
if (chunk->nmap <= 0) {
for (i = 0; i < chunk->nents; ++i)
put_page(chunk->page_list[i].page);
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c 
b/drivers/infiniband/hw/amso1100/c2_provider.c
index 997cf15..17243b7 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -449,7 +449,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
return ERR_PTR(-ENOMEM);
c2mr->pd = c2pd;
 
-   c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc);
+   c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
if (IS_ERR(c2mr->umem)) {
err = PTR_ERR(c2mr->umem);
kfree(c2mr);
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index f0c7775..d0a514c 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -601,7 +601,7 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
if (!mhp)
return ERR_PTR(-ENOMEM);
 
-   mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc);
+   mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
if (IS_ERR(mhp->umem)) {
err = PTR_ERR(mhp->umem);
kfree(mhp);
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c 
b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index d97eda3..c13c11c 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -329,7 +329,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, 
u64 length,
}
 
e_mr->umem = ib_umem_get(pd->uobject->context, start, length,
-mr_access_flags);
+mr_access_flags, 0);
if (IS_ERR(e_mr->umem)) {
ib_mr = (void *)e_mr->umem;
goto reg_user_mr_exit1;
diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c 
b/drivers/infiniband/hw/ipath/ipath_mr.c
index e442470..e351222 100644
--- a/drivers/infiniband/hw/ipath/ipath_mr.c
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c
@@ -195,7 +195,7 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,

Re: 2.6.23-rc9 boot failure (megaraid?)

2007-10-02 Thread FUJITA Tomonori
On Tue, 02 Oct 2007 15:38:13 -0500
James Bottomley <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-10-02 at 20:15 +0200, Adrian Bunk wrote:
> > Cc's added, the complete bug report is at
> >   http://lkml.org/lkml/2007/10/2/243
> > 
> > On Tue, Oct 02, 2007 at 12:48:26PM -0400, Burton Windle wrote:
> > > 2.6.23-rc9 fails to boot for me; 2.6.22.9 works fine.
> > >
> > > System is a Dell Poweredge with PERC 2/DC with RAID1 volume.
> > >...
> > 
> > Thanks for your report.
> > 
> > Diff'ing the dmesg's shows:
> > 
> > <--  snip  -->
> > 
> >  scsi0: scanning scsi channel 4 [P0] for physical devices.
> >  scsi0: scanning scsi channel 5 [P1] for physical devices.
> >  st: Version 20070203, fixed bufsize 32768, s/g segs 256
> > -sd 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB)
> > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> >  sd 0:0:0:0: [sda] Write Protect is off
> >  sd 0:0:0:0: [sda] Asking for cache data failed
> >  sd 0:0:0:0: [sda] Assuming drive cache: write through
> > -sd 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB)
> > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> >  sd 0:0:0:0: [sda] Write Protect is off
> >  sd 0:0:0:0: [sda] Asking for cache data failed
> >  sd 0:0:0:0: [sda] Assuming drive cache: write through
> >   sda: sda1
> > + sda: p1 exceeds device capacity
> > 
> > <--  snip  -->
> > 
> > -   case MEGA_BULK_DATA:
> > -   if (scb->cmd->use_sg == 0)
> > -   length = scb->cmd->request_bufflen;
> > -   else {
> > -   struct scatterlist *sgl =
> > -   (struct scatterlist *)scb->cmd->request_buffer;
> > -   length = sgl->length;
> > -   }
> > -   pci_unmap_page(adapter->dev, scb->dma_h_bulkdata,
> > -  length, scb->dma_direction);
> > -   break;
> > -
> 
> This is the problem piece I think.  We've reintroduced a very old bug:
> 
> commit 51c928c34fa7cff38df584ad01de988805877dba
> Author: James Bottomley <[EMAIL PROTECTED]>
> Date:   Sat Oct 1 09:38:05 2005 -0500
> 
> [SCSI] Legacy MegaRAID: Fix READ CAPACITY
> 
> Some Legacy megaraid cards can't actually cope with the scatter/gather
> version of the READ CAPACITY command (which is what we now send them
> since altering all SCSI internal I/O to go via the block layer).  Fix
> this (and a few other broken megaraid driver assumptions) by sending
> the non-sg version of the command if the sg list only has a single
> element.
> 
> Signed-off-by: James Bottomley <[EMAIL PROTECTED]>
> 
> So what we have to do is put back the check for use_sg == 1 and send
> that as a bulk transfer command.

Sorry about this. Can this fix the problem?

Thanks,


diff --git a/drivers/scsi/megaraid.c b/drivers/scsi/megaraid.c
index 3907f67..da56163 100644
--- a/drivers/scsi/megaraid.c
+++ b/drivers/scsi/megaraid.c
@@ -1753,6 +1753,14 @@ mega_build_sglist(adapter_t *adapter, scb_t *scb, u32 
*buf, u32 *len)
 
*len = 0;
 
+   if (scsi_sg_count(cmd) == 1 && !adapter->has_64bit_addr) {
+   sg = scsi_sglist(cmd);
+   scb->dma_h_bulkdata = sg_dma_address(sg);
+   *buf = (u32)scb->dma_h_bulkdata;
+   *len = sg_dma_len(sg);
+   return 0;
+   }
+
scsi_for_each_sg(cmd, sg, sgcnt, idx) {
if (adapter->has_64bit_addr) {
scb->sgl64[idx].address = sg_dma_address(sg);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/5] dma: redefine dma_flags_set_attr() for sn-ia64

2007-10-02 Thread akepner

define dma_flags_set_attr() for sn-ia64 - it "borrows" bits from 
the direction argument (renamed "flags") to the dma_map_* routines 
to pass an additional attributes.  Also define routines to retrieve 
the original direction and attribute from "flags".

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

---

 arch/ia64/sn/pci/pci_dma.c |   37 +++--
 include/asm-ia64/sn/io.h   |   23 +++
 2 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d79ddac..84b6227 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -153,7 +153,7 @@ EXPORT_SYMBOL(sn_dma_free_coherent);
  * @dev: device to map for
  * @cpu_addr: kernel virtual address of the region to map
  * @size: size of the region
- * @direction: DMA direction
+ * @flags: DMA direction, and arch-specific attributes
  *
  * Map the region pointed to by @cpu_addr for DMA and return the
  * DMA address.
@@ -167,17 +167,23 @@ EXPORT_SYMBOL(sn_dma_free_coherent);
  *   figure out how to save dmamap handle so can use two step.
  */
 dma_addr_t sn_dma_map_single(struct device *dev, void *cpu_addr, size_t size,
-int direction)
+int flags)
 {
dma_addr_t dma_addr;
unsigned long phys_addr;
struct pci_dev *pdev = to_pci_dev(dev);
struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
+   int dmabarrier = dma_flags_get_attr(flags) & DMA_BARRIER_ATTR;
 
BUG_ON(dev->bus != _bus_type);
 
phys_addr = __pa(cpu_addr);
-   dma_addr = provider->dma_map(pdev, phys_addr, size, SN_DMA_ADDR_PHYS);
+   if (dmabarrier)
+   dma_addr = provider->dma_map_consistent(pdev, phys_addr, size, 
+   SN_DMA_ADDR_PHYS);
+   else
+   dma_addr = provider->dma_map(pdev, phys_addr, size, 
+SN_DMA_ADDR_PHYS);
if (!dma_addr) {
printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__);
return 0;
@@ -240,18 +246,20 @@ EXPORT_SYMBOL(sn_dma_unmap_sg);
  * @dev: device to map for
  * @sg: scatterlist to map
  * @nhwentries: number of entries
- * @direction: direction of the DMA transaction
+ * @flags: direction of the DMA transaction, and arch-specific attributes
  *
  * Maps each entry of @sg for DMA.
  */
 int sn_dma_map_sg(struct device *dev, struct scatterlist *sg, int nhwentries,
- int direction)
+ int flags)
 {
unsigned long phys_addr;
struct scatterlist *saved_sg = sg;
struct pci_dev *pdev = to_pci_dev(dev);
struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
int i;
+   int dmabarrier = dma_flags_get_attr(flags) & DMA_BARRIER_ATTR;
+   int dir = dma_flags_get_dir(flags);
 
BUG_ON(dev->bus != _bus_type);
 
@@ -259,19 +267,28 @@ int sn_dma_map_sg(struct device *dev, struct scatterlist 
*sg, int nhwentries,
 * Setup a DMA address for each entry in the scatterlist.
 */
for (i = 0; i < nhwentries; i++, sg++) {
+   dma_addr_t dma_addr;
phys_addr = SG_ENT_PHYS_ADDRESS(sg);
-   sg->dma_address = provider->dma_map(pdev,
-   phys_addr, sg->length,
-   SN_DMA_ADDR_PHYS);
 
-   if (!sg->dma_address) {
+   if (dmabarrier) {
+   dma_addr = provider->dma_map_consistent(pdev,
+   phys_addr,
+   sg->length,
+   
SN_DMA_ADDR_PHYS);
+   } else {
+   dma_addr = provider->dma_map(pdev,
+phys_addr, sg->length,
+SN_DMA_ADDR_PHYS);
+   }
+
+   if (!(sg->dma_address = dma_addr)) {
printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__);
 
/*
 * Free any successfully allocated entries.
 */
if (i > 0)
-   sn_dma_unmap_sg(dev, saved_sg, i, direction);
+   sn_dma_unmap_sg(dev, saved_sg, i, dir);
return 0;
}
 
diff --git a/include/asm-ia64/sn/io.h b/include/asm-ia64/sn/io.h
index 41c73a7..d2b94ce 100644
--- a/include/asm-ia64/sn/io.h
+++ b/include/asm-ia64/sn/io.h
@@ -271,4 +271,27 @@ sn_pci_set_vchan(struct pci_dev *pci_dev, unsigned long 
*addr, int vchan)
return 0;
 }
 
+#define ARCH_USES_DMA_ATTRS
+/* we pass additional dma attributes in the 

[PATCH 1/5] dma: add dma_flags_set_attr() to dma interface

2007-10-02 Thread akepner

Introduce the dma_flags_set_attr() interface and give it a default 
no-op implementation.

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

---

 dma-mapping.h |8 
 1 files changed, 8 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2dc21cb..4990aaf 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -99,4 +99,12 @@ static inline void dmam_release_declared_memory(struct 
device *dev)
 }
 #endif /* ARCH_HAS_DMA_DECLARE_COHERENT_MEMORY */
 
+#define DMA_BARRIER_ATTR   0x1
+#ifndef ARCH_USES_DMA_ATTRS
+static inline int dma_flags_set_attr(u32 attr, enum dma_data_direction dir) 
+{
+   return dir;
+}
+#endif /* ARCH_USES_DMA_ATTRS */
+
 #endif
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/5] allow drivers to flush in-flight DMA v3

2007-10-02 Thread akepner

On Altix, DMA may be reordered between a device and host memory. 
This reordering can happen in the NUMA interconnect, and it usually 
results in correct operation and improved performance. In some 
situations it may be necessary to explicitly synchronize DMA from 
the device.

This patchset allows a memory region to be mapped with a "dmabarrier". 
Writes to the memory region will cause in-flight DMA to be flushed, 
providing a mechanism to order DMA from a device.

There are 5 patches in this patchset:

  [1/5] dma: add dma_flags_set_attr() to dma interface
  [2/5] dma: redefine dma_flags_set_attr() for sn-ia64
  [3/5] dma: document dma_flags_set_attr()
  [4/5] infiniband: add "dmabarrier" argument to ib_umem_get()
  [5/5] mthca: allow setting "dmabarrier" on user-allocated memory

-- 
Arthur

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Traffic Controller Performance in Kernel 2.4 vs 2.6

2007-10-02 Thread Sonny
Hello
This is a repost, there seems to have a misunderstanding before.

I hope this is the right place to ask this. Does any know if there is a
substantial difference in the performance of the traffic controller
between kernel 2.4 and 2.6. We tested it using 1 iperf server and use
250 and 500 clients, altering the burst.

This is the set-up:
iperf client -  router (w/ traffic controller) - iperf server

We use the top command inside the router to check the idle time of our
router to see this. The results we got from the 2.4 kernel shows
around 65-70% idle time while the 2.6 shows
60-65% idle time. We tried to use MRTG and we're not getting any
results either. We want to know if we could improve the bandwidth by
upgrading the kernel, else we would have to get a new bandwidth
manager.  Have anyone performed a similar test or can suggest a better
way to do this. Thanks in advance.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-02 Thread David Chinner
On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote:
> On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > >   wbc.pages_skipped = 0;
> > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > >   min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > >   if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > >   /* Wrote less than expected */
> > > - congestion_wait(WRITE, HZ/10);
> > > - if (!wbc.encountered_congestion)
> > > + if (wbc.encountered_congestion || wbc.more_io)
> > > + congestion_wait(WRITE, HZ/10);
> > > + else
> > >   break;
> > >   }
> > 
> > Why do you call congestion_wait() if there is more I/O to issue?  If
> > we have a fast filesystem, this might cause the device queues to
> > fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> > will have trouble keeping the queues full, right?
> 
> You mean slow writers and fast RAID? That would be exactly the case
> these patches try to improve.

I mean any writers and a fast block device (raid or otherwise).

> This patchset makes kupdate/background writeback more responsible,
> so that if (avg-write-speed < device-capabilities), the dirty data are
> synced timely, and we don't have to go for balance_dirty_pages().

Sure, but I'm asking about the effect of the patches on the
(avg-write-speed == device-capabilities) case. I agree that
they are necessary for timely syncing of data but I'm trying
to understand what effect they have on the normal write case
(i.e. keeping the disk at full write throughput).

> So for your question of queue depth, the answer is: the queue length
> will not build up in the first place. 

Which queue are you talking about here? The block deivce queue?

> Also the name of congestion_wait() could be misleading:
> - when not congested, congestion_wait() will wakeup on write
>   completions;
> - when congested, congestion_wait() could also wakeup on write
>   completions on other non-congested devices.
> So congestion_wait(100ms) normally only takes 0.1-10ms.

True, but if we know we are not congested and have more work
to do, why sleep at all?

> For the more_io case, congestion_wait() serves more like 'to take a
> breath'. Tests show that the system could go mad without it.

I'm interested to know what tests show that pushing more I/O when
you don't have block device congestion make the system go mad (and
what mad means).  It sounds to me like it's hiding (yet another)
bug in the writeback code..

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Document x86-64 iommu kernel parameters

2007-10-02 Thread Jeff Garzik

Randy Dunlap wrote:

On Tue, 2 Oct 2007 21:34:13 -0400 Jeff Garzik wrote:


Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
---
After having to go figure out what some of these means, I figured I
would save others the trouble.

Some of these are "best guess" based on a quick scan of the code, so it
certainly needs a sanity review before going upstream.


"iommu" is listed in Documentation/x86_64/boot-options.txt
along with more x86_64-specific boot options.
A few other arches do something similar...


Ah!  Well, seeing as how we already have a provision for arch-specific 
options in kernel-parameters.txt, and some less-obscure arch-specific 
options can be found there, I think an argument can be made for my patch :)


Nonethless, if the maintainer disagrees, they can drop this patch I suppose.

Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Document x86-64 iommu kernel parameters

2007-10-02 Thread Randy Dunlap
On Tue, 2 Oct 2007 21:34:13 -0400 Jeff Garzik wrote:

> 
> Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
> ---
> After having to go figure out what some of these means, I figured I
> would save others the trouble.
> 
> Some of these are "best guess" based on a quick scan of the code, so it
> certainly needs a sanity review before going upstream.

"iommu" is listed in Documentation/x86_64/boot-options.txt
along with more x86_64-specific boot options.
A few other arches do something similar...


> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 4d175c7..8afea9b 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -763,6 +763,30 @@ and is between 256 and 4096 characters. It is defined in 
> the file
>  
>   inttest=[IA64]
>  
> + iommu=option[,option..] [X86-64]
> + off Disable IOMMU.
> + force   Unconditionally enable IOMMU.
> + noforce Disable IOMMU and IOMMU merging, by default.
> + biomergeUnconditionally enable IOMMU, IOMMU merging,
> + and set BIO IOMMU vmerge boundary to 4096.
> + panic   Panic on IOMMU overflow.
> + nopanic Do not panic on IOMMU overflow.
> + merge   Unconditionally enable IOMMU, IOMMU merging.
> + nomerge Disable IOMMU merging.
> + forcesacForce single address cycle (SAC, 32-bit).
> + allowdacPermit dual address cycle (DAC, 64-bit).
> + nodac   Forbid dual address cycle (DAC, 64-bit).
> + softEnable swiotlb.
> + calgary Use Calgary IOMMU.
> +
> + (GART-only options follow...)
> +Specify size of remapping area.
> + fullflush   Disable optimizing flushing strategy.
> + nofullflush Enable optimizing flushing strategy.
> + noagp   Use entire aperture, AGP isn't using it.
> + noaperture  Disable aperture fixups / hole init.
> + memaper= malloc an aperture of order N.
> +
>   io7=[HW] IO7 for Marvel based alpha systems
>   See comment before marvel_specify_io7 in
>   arch/alpha/kernel/core_marvel.c.
> -


---
~Randy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] writeback: remove pages_skipped accounting in __block_write_full_page()

2007-10-02 Thread David Chinner
On Wed, Oct 03, 2007 at 09:43:33AM +0800, Fengguang Wu wrote:
> On Wed, Oct 03, 2007 at 07:55:18AM +1000, David Chinner wrote:
> > > 
> > > do not quite agree with each other. The page writeback should be skipped 
> > > for
> > > 'locked buffer', but here it is 'clean buffer'!
> > 
> > Ok, so that means we need an equivalent fix in xfs_start_page_writeback()
> > as it will skip pages with clean buffers just like this. Something like
> > this (untested)?
> 
> Sure OK - as long as it is 'no write because of clean buffer'.

Yes, that's the case here.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RTC wakealarm write-only, still has 644 permissions

2007-10-02 Thread David Brownell
> > > > [EMAIL PROTECTED]:/sys/class/rtc/rtc0# cat wakealarm 
> > > > [EMAIL PROTECTED]:/sys/class/rtc/rtc0# echo 132719 > wakealarm 
> > 
> > At which point I'd expect
> > 
> > # echo $?
> > 
> > would indicate the write failed.  That's a LONG time in the
> > past (January 2, 1970), so that setting would be rejected.
>
> echo $? says 0 here :-(.

I stand corrected.  What it should do -- and does -- in that case
involves disabling the alarm, then succeeding.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Question] How to represent SYSTEM_RAM in kerenel/resouce.c

2007-10-02 Thread Matthew Wilcox
On Wed, Oct 03, 2007 at 10:31:36AM +0900, KAMEZAWA Hiroyuki wrote:
> i386 and x86_64 registers System RAM as IORESOUCE_MEM | IORESOUCE_BUSY.
> ia64 registers System RAM as IORESOURCE_MEM.
> 
> Which is better ?

Should probably be BUSY.  Non-BUSY regions can have io resources
requested underneath them, but you wouldn't want a PCI device to be
assigned an address which overlaps with physical memory.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] writeback: remove pages_skipped accounting in __block_write_full_page()

2007-10-02 Thread Fengguang Wu
On Wed, Oct 03, 2007 at 07:55:18AM +1000, David Chinner wrote:
> > 
> > do not quite agree with each other. The page writeback should be skipped for
> > 'locked buffer', but here it is 'clean buffer'!
> 
> Ok, so that means we need an equivalent fix in xfs_start_page_writeback()
> as it will skip pages with clean buffers just like this. Something like
> this (untested)?

Sure OK - as long as it is 'no write because of clean buffer'.
The only user of pages_skipped is obviously using that semantics.

Andrew, here is the expanded patch:
---
writeback: remove pages_skipped accounting in __block_write_full_page()

Miklos Szeredi <[EMAIL PROTECTED]> and me identified a writeback bug:

> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.

It can be produced by the following test scheme:

# cat bin/test-writeback.sh 
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done

# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023


% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x

% dmesg|grep 847824  # generated by a debug printk
[  529.263184] redirtied inode 847824 line 548
[  564.250872] redirtied inode 847824 line 548
[  594.272797] redirtied inode 847824 line 548
[  629.231330] redirtied inode 847824 line 548
[  659.224674] redirtied inode 847824 line 548
[  689.219890] redirtied inode 847824 line 548
[  724.226655] redirtied inode 847824 line 548
[  759.198568] redirtied inode 847824 line 548

# line 548 in fs/fs-writeback.c:
543 if (wbc->pages_skipped != pages_skipped) {
544 /*
545  * writeback is not making progress due to locked
546  * buffers.  Skip this inode for now.
547  */
548 redirty_tail(inode);
549 }

More debug efforts show that __block_write_full_page()
never has the chance to call submit_bh() for that big dirty file:
the buffer head is *clean*. So basicly no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.

Also the comment in generic_sync_sb_inodes():

544 /*
545  * writeback is not making progress due to locked
546  * buffers.  Skip this inode for now.
547  */

and the comment in __block_write_full_page():

1713 /*
1714  * The page was marked dirty, but the buffers were
1715  * clean.  Someone wrote them back by hand with
1716  * ll_rw_block/submit_bh.  A rare case.
1717  */

do not quite agree with each other. The page writeback should be skipped for
'locked buffer', but here it is 'clean buffer'!

This patch fixes this bug. Though I'm not sure why __block_write_full_page()
is called only to do nothing and who actually issued the writeback for us.

This is the two possible new behaviors after the patch:

1) pretty nice: wait 30s and write ALL:)
2) not so good:
- during the dd: ~16M 
- after 30s:  ~4M
- after 5s:   ~4M
- after 5s: ~176M

The next patch will fix case (2).

Cc: Ken Chen <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
Signed-off-by: David Chinner <[EMAIL PROTECTED]>
---
 fs/buffer.c |1 -
 fs/xfs/linux-2.6/xfs_aops.c |5 ++---
 2 files changed, 2 insertions(+), 4 deletions(-)

--- linux-2.6.23-rc8-mm2.orig/fs/buffer.c
+++ linux-2.6.23-rc8-mm2/fs/buffer.c
@@ -1737,7 +1737,6 @@ done:
 

Re: linux cache routines for Write-back cache policy on MIPS24KE

2007-10-02 Thread veerasena reddy
hi Geert,

here i mean 'flush' is 'write-back only'

Regards,
Veerasena.
--- Geert Uytterhoeven <[EMAIL PROTECTED]> wrote:

> On Mon, 1 Oct 2007, veerasena reddy wrote:
> > In linux-2.6.18 (for MIPS24KE processor):
> > suppose if i want to do flush only then which API
> i
> > should use?
> 
> `flush' is fuzzy terminology: some people mean
> invalidate, others mean
> write back, others mean both.
> 
> > Similarly, if i want to do invalidation only which
> API
> > i should use?
> 
> dma_cache_inv().
> 
> > --- Geert Uytterhoeven <[EMAIL PROTECTED]>
> wrote:
> > 
> > > On Mon, 1 Oct 2007, veerasena reddy wrote:
> > > > I have ported Linux-2.6.18 kernel on MIPS24KE
> > > > processor. I am using write back cache policy.
> > > > 
> > > > Could you please guide me under what cases the
> > > below
> > > > cache API's are being used:
> > > > - dma_cache_wback_inv() : Could you explain 
> what
> > > > exactly this function does
> > > 
> > > It does both write back and invalidate.
> > > 
> > > > - dma_cache_wback() : This function write back
> the
> > > > cache data to memory
> > > > - dma_cache_inv  : This function invalidate
> the
> > > cache
> > > > tags. so subsequent access will fetch from
> memory.
> > > 
> > > Note that 2.6.18 is old. The above functions are
> > > intended to be removed.
> 
> Gr{oetje,eeting}s,
> 
>   Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond
> ia32 -- [EMAIL PROTECTED]
> 
> In personal conversations with technical people, I
> call myself a hacker. But
> when I'm talking to journalists I just say
> "programmer" or something like that.
>   -- Linus Torvalds
> 
> 



  Forgot the famous last words? Access your message archive online at 
http://in.messenger.yahoo.com/webmessengerpromo.php

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kswapd min order, slub max order [was Re: -mm merge plans for 2.6.24]

2007-10-02 Thread Nick Piggin
On Wednesday 03 October 2007 02:06, Hugh Dickins wrote:
> On Mon, 1 Oct 2007, Andrew Morton wrote:
> > #
> > # slub && antifrag
> > #
> > have-kswapd-keep-a-minimum-order-free-other-than-order-0.patch
> > only-check-absolute-watermarks-for-alloc_high-and-alloc_harder-allocation
> >s.patch slub-exploit-page-mobility-to-increase-allocation-order.patch
> > slub-reduce-antifrag-max-order.patch
> >
> >   I think this stuff is in the "mm stuff we don't want to merge"
> > category. If so, I really should have dropped it ages ago.
>
> I agree.  I spent a while last week bisecting down to see why my heavily
> swapping loads take 30%-60% longer with -mm than mainline, and it was
> here that they went bad.  Trying to keep higher orders free is costly.

Yeah, no there's no way we'd merge that.


> On the other hand, hasn't SLUB efficiency been built on the expectation
> that higher orders can be used?  And it would be a twisted shame for
> high performance to be held back by some idiot's swapping load.

IMO it's a bad idea to create all these dependencies like this.

If SLUB can get _more_ performance out of using higher order allocations,
then fine. If it is starting off at a disadvantage at the same order, then it
that should be fixed first, right?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] libata: fix for sata_mv >64KB DMA segments

2007-10-02 Thread Olof Johansson
Fix bug in sata_mv for cases where the IOMMU layer has merged SG entries
to larger than 64KB. They need to be split up before being sent to
the driver.

Just for simplicity's sake, split up at 64K boundary instead of 64K size,
since that's what the common code does anyway.

Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>


---

On Tue, Oct 02, 2007 at 07:23:10PM -0400, Jeff Garzik wrote:
> Olof Johansson wrote:
>> On Mon, Oct 01, 2007 at 06:37:44PM -0400, Jeff Garzik wrote:
>> Looks like it's caused by enabling vmerge (which tends to be on for the
>> common PPC defconfigs). If I disable it, things look OK.
>> Perhaps the Marvell controller doesn't like requests larger than 64K,
>> or wrapping some boundary. Do you have access to erratas/docs?
>> I have verified it on a powermac now as well (had a quick scare that it
>> might have been some problem with the PA Semi IOMMU, but no).
>
> FWIW:  I just tried the 6042 on my AMD64 platform with iommu=force, and was 
> unable to reproduce any trouble.
>
> You could try changing MV_DMA_BOUNDARY to 0xU and see what happens.

As per discussion on IRC, it was really caused by mv_fill_sg() not
handing SG entries larger than 64K properly. Below patch fixes it to
behave like ata_fill_sg() instead.

Works OK here after some light testing (restoring my corrupted root
filesystem and running a few fscks on it, among other things).


Thanks,

Olof


diff --git a/drivers/ata/sata_mv.c b/drivers/ata/sata_mv.c
index 11bf6c7..1a82e22 100644
--- a/drivers/ata/sata_mv.c
+++ b/drivers/ata/sata_mv.c
@@ -1139,15 +1139,27 @@ static unsigned int mv_fill_sg(struct ata_queued_cmd 
*qc)
dma_addr_t addr = sg_dma_address(sg);
u32 sg_len = sg_dma_len(sg);
 
-   mv_sg->addr = cpu_to_le32(addr & 0x);
-   mv_sg->addr_hi = cpu_to_le32((addr >> 16) >> 16);
-   mv_sg->flags_size = cpu_to_le32(sg_len & 0x);
+   while (sg_len) {
+   u32 offset = addr & 0x;
+   u32 len = sg_len;
 
-   if (ata_sg_is_last(sg, qc))
-   mv_sg->flags_size |= cpu_to_le32(EPRD_FLAG_END_OF_TBL);
+   if ((offset + sg_len > 0x1))
+   len = 0x1 - offset;
+
+   mv_sg->addr = cpu_to_le32(addr & 0x);
+   mv_sg->addr_hi = cpu_to_le32((addr >> 16) >> 16);
+   mv_sg->flags_size = cpu_to_le32(len);
+
+   sg_len -= len;
+   addr += len;
+
+   if (!sg_len && ata_sg_is_last(sg, qc))
+   mv_sg->flags_size |= 
cpu_to_le32(EPRD_FLAG_END_OF_TBL);
+
+   mv_sg++;
+   n_sg++;
+   }
 
-   mv_sg++;
-   n_sg++;
}
 
return n_sg;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-02 Thread Fengguang Wu
On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > wbc.pages_skipped = 0;
> > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > /* Wrote less than expected */
> > -   congestion_wait(WRITE, HZ/10);
> > -   if (!wbc.encountered_congestion)
> > +   if (wbc.encountered_congestion || wbc.more_io)
> > +   congestion_wait(WRITE, HZ/10);
> > +   else
> > break;
> > }
> 
> Why do you call congestion_wait() if there is more I/O to issue?  If
> we have a fast filesystem, this might cause the device queues to
> fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> will have trouble keeping the queues full, right?

You mean slow writers and fast RAID? That would be exactly the case
these patches try to improve.

The old writeback behaviors are sluggish when there is
- single big dirty file;
- single congested device
the queues may well build up slowly, hit background_limit, and
continue to build up, until hit dirty_limit. That means:
- kupdate writeback could leave behind many expired dirty data
- background writeback used to return prematurely
- eventually it relies on balance_dirty_pages() to do the job,
  which means
  - writers get throttled unnecessarily
  - dirty_limit pages are pinned unnecessarily

This patchset makes kupdate/background writeback more responsible,
so that if (avg-write-speed < device-capabilities), the dirty data are
synced timely, and we don't have to go for balance_dirty_pages().

So for your question of queue depth, the answer is: the queue length
will not build up in the first place. 

Also the name of congestion_wait() could be misleading:
- when not congested, congestion_wait() will wakeup on write
  completions;
- when congested, congestion_wait() could also wakeup on write
  completions on other non-congested devices.
So congestion_wait(100ms) normally only takes 0.1-10ms.

For the more_io case, congestion_wait() serves more like 'to take a
breath'. Tests show that the system could go mad without it.

Regards,
Fengguang

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sky2: jumbo frame regression fix

2007-10-02 Thread Ian Kumlien
On tis, 2007-10-02 at 18:02 -0700, Stephen Hemminger wrote:
> Remove unneeded check that caused problems with jumbo frame sizes.
> The check was recently added and is wrong.
> When using jumbo frames the sky2 driver does fragmentation, so
> rx_data_size is less than mtu.

Confirmed working.

Now running with 9k mtu with no errors, =)

It also seems that the FIFO bug was the one that affected me before,
damn odd race that one.

> Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
Tested-by: Ian Kumlien <[EMAIL PROTECTED]>

(if that tag exists now)

Btw, Sorry but all mail directly to you will be blocked. I have yet to
fix the relaying properly with isp:s blocking port 25 etc so for some of
you this mail will only show up on the ML.

> --- a/drivers/net/sky2.c  2007-10-02 17:56:31.0 -0700
> +++ b/drivers/net/sky2.c  2007-10-02 17:58:56.0 -0700
> @@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru
>   sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending;
>   prefetch(sky2->rx_ring + sky2->rx_next);
>  
> - if (length < ETH_ZLEN || length > sky2->rx_data_size)
> - goto len_error;
> -
>   /* This chip has hardware problems that generates bogus status.
>* So do only marginal checking and expect higher level protocols
>* to handle crap frames.
-- 
Ian Kumlien  -- http://pomac.netswarm.net


signature.asc
Description: This is a digitally signed message part


[PATCH] Document x86-64 iommu kernel parameters

2007-10-02 Thread Jeff Garzik

Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
---
After having to go figure out what some of these means, I figured I
would save others the trouble.

Some of these are "best guess" based on a quick scan of the code, so it
certainly needs a sanity review before going upstream.

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 4d175c7..8afea9b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -763,6 +763,30 @@ and is between 256 and 4096 characters. It is defined in 
the file
 
inttest=[IA64]
 
+   iommu=option[,option..] [X86-64]
+   off Disable IOMMU.
+   force   Unconditionally enable IOMMU.
+   noforce Disable IOMMU and IOMMU merging, by default.
+   biomergeUnconditionally enable IOMMU, IOMMU merging,
+   and set BIO IOMMU vmerge boundary to 4096.
+   panic   Panic on IOMMU overflow.
+   nopanic Do not panic on IOMMU overflow.
+   merge   Unconditionally enable IOMMU, IOMMU merging.
+   nomerge Disable IOMMU merging.
+   forcesacForce single address cycle (SAC, 32-bit).
+   allowdacPermit dual address cycle (DAC, 64-bit).
+   nodac   Forbid dual address cycle (DAC, 64-bit).
+   softEnable swiotlb.
+   calgary Use Calgary IOMMU.
+
+   (GART-only options follow...)
+  Specify size of remapping area.
+   fullflush   Disable optimizing flushing strategy.
+   nofullflush Enable optimizing flushing strategy.
+   noagp   Use entire aperture, AGP isn't using it.
+   noaperture  Disable aperture fixups / hole init.
+   memaper= malloc an aperture of order N.
+
io7=[HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Question] How to represent SYSTEM_RAM in kerenel/resouce.c

2007-10-02 Thread KAMEZAWA Hiroyuki
Hi,

Now, SYSTEM_RAM is registerd to resouce list and a user can see memory map
from /proc/iomem, like following.
==
[EMAIL PROTECTED] linux-2.6.23-rc8-mm2]$ grep RAM /proc/iomem
-0009 : System RAM
0010-03ff : System RAM
0400-04f1bfff : System RAM
04f1c000-6b4b9fff : System RAM
6b4ba000-6b797fff : System RAM
6b798000-6b799fff : System RAM
6b79a000-6b79dfff : System RAM
6b79e000-6b79efff : System RAM
6b79f000-6b7fbfff : System RAM
6b7fc000-6c629fff : System RAM
6c62a000-6c800fff : System RAM
6c801000-6c843fff : System RAM
6c844000-6c847fff : System RAM
6c848000-6c849fff : System RAM
6c84a000-6c85dfff : System RAM
6c85e000-6c85efff : System RAM
6c85f000-6cbfbfff : System RAM
6cbfc000-6d349fff : System RAM
6d34a000-6d3fbfff : System RAM
6d3fc000-6d455fff : System RAM
6d4fc000-6d773fff : System RAM
1-7 : System RAM
408000-40 : System RAM
1400400-147 : System RAM
==

But there is a confusion.

i386 and x86_64 registers System RAM as IORESOUCE_MEM | IORESOUCE_BUSY.
ia64 registers System RAM as IORESOURCE_MEM.

Which is better ?

I ask this because current memory hotplug treat memory as IORESOUCE_MEM.
When memory hot-add occurs on x86_64, new memory is added as IORESOUCE_MEM.
memory hot-remove (now) can remove only IORESOUCE_MEM.

If ia64 should treat System RAM as IORESOUCE_MEM | IORESOUCE_BUSY, I'll write
a fix.

Thanks,
-Kame










-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23-rc7-mm1 -- powerpc rtas panic

2007-10-02 Thread Tony Breeds
On Wed, Oct 03, 2007 at 10:30:16AM +1000, Michael Ellerman wrote:
 
> I realise it'll make the patch bigger, but this doesn't seem like a
> particularly good name for the variable anymore.

Sure, what about?

Clarify when RTAS logging is enabled.

Signed-off-by: Tony Breeds <[EMAIL PROTECTED]>

---
 arch/powerpc/platforms/pseries/rtasd.c |   15 +--
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/rtasd.c 
b/arch/powerpc/platforms/pseries/rtasd.c
index 30925d2..73401c8 100644
--- a/arch/powerpc/platforms/pseries/rtasd.c
+++ b/arch/powerpc/platforms/pseries/rtasd.c
@@ -54,8 +54,9 @@ static unsigned int rtas_event_scan_rate;
 static int full_rtas_msgs = 0;
 
 /* Stop logging to nvram after first fatal error */
-static int no_more_logging;
-
+static int logging_enabled; /* Until we initialize everything,
+ * make sure we don't try logging
+ * anything */
 static int error_log_cnt;
 
 /*
@@ -217,7 +218,7 @@ void pSeries_log_error(char *buf, unsigned int err_type, 
int fatal)
}
 
/* Write error to NVRAM */
-   if (!no_more_logging && !(err_type & ERR_FLAG_BOOT))
+   if (logging_enabled && !(err_type & ERR_FLAG_BOOT))
nvram_write_error_log(buf, len, err_type, error_log_cnt);
 
/*
@@ -229,8 +230,8 @@ void pSeries_log_error(char *buf, unsigned int err_type, 
int fatal)
printk_log_rtas(buf, len);
 
/* Check to see if we need to or have stopped logging */
-   if (fatal || no_more_logging) {
-   no_more_logging = 1;
+   if (fatal || !logging_enabled) {
+   logging_enabled = 0;
spin_unlock_irqrestore(_log_lock, s);
return;
}
@@ -302,7 +303,7 @@ static ssize_t rtas_log_read(struct file * file, char 
__user * buf,
 
spin_lock_irqsave(_log_lock, s);
/* if it's 0, then we know we got the last one (the one in NVRAM) */
-   if (rtas_log_size == 0 && !no_more_logging)
+   if (rtas_log_size == 0 && logging_enabled)
nvram_clear_error_log();
spin_unlock_irqrestore(_log_lock, s);
 
@@ -414,6 +415,8 @@ static int rtasd(void *unused)
memset(logdata, 0, rtas_error_log_max);
rc = nvram_read_error_log(logdata, rtas_error_log_max,
  _type, _log_cnt);
+   /* We can use rtas_log_buf now */
+   logging_enabled = 1;
 
if (!rc) {
if (err_type != ERR_FLAG_ALREADY_LOGGED) {

Yours Tony

  linux.conf.auhttp://linux.conf.au/ || http://lca2008.linux.org.au/
  Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 05/12] mm: trylock_page

2007-10-02 Thread Nick Piggin
On Sunday 30 September 2007 01:01, Peter Zijlstra wrote:
> On Fri, 2007-09-28 at 13:11 +1000, Nick Piggin wrote:
> > On Friday 28 September 2007 17:42, Peter Zijlstra wrote:
> > > Replace raw TestSetPageLocked() usage with trylock_page()
> >
> > I have such a thing queued too, for the lock bitops patches for when
> > 2.6.24 opens, Andrew promises me :).
> >
> > I guess they should be identical, except I don't like doing trylock_page
> > in place of SetPageLocked, for memory ordering performance and aesthetic
> > reasons... I've got an init_page_locked (or set_page_locked... I can't
> > remember, the patch is at home).
>
> Sure, that might work, or we could just make it so that add_to_*_cache
> is never passed an unlocked page. But sure...

It does kind of make sense to have it there (because you want the page
to be locked iff it gets inserted into pagecache). And wherever you lock
the page, we'll still want an init_page_locked to be able to lock it while we
are the only owner of it, for the same performance reason.


> > Fine idea to lockdep the page lock, anyway. Does it show up any of the
> > buffered write deadlock possibilities? :)
>
> Not yet, it might just be that the concessions done to annotate this
> type of lock were too severe.
>
> What I basically did was treat all the page locks as a single recursive
> lock.

Hmm, OK. There are a couple of page lock deadlocks there that wouldn't
be picked up, but the page lock vs mmap_sem one probably should be.


> > buffer lock is another notable bit-mutex that might be converted (I have
> > the patch to do the similar nice !tas->trylock conversion for that too).
> > I think it is used widely enough by tricky code that it would be useful
> > to annotate as well.
>
> Not at all familiar with that lock, but yeah, we could have a look at
> doing that too.

Should be pretty well identical to the page lock. I'll cc you on the patch
to do this equivalent API conversion, if you'd like.


> > Unfortunately we can't convert bit_spinlock.h easily, I guess?
>
> Yeah, the space constraints make that rather hard. Each of these locks
> needs some form of external meta-data.

Yeah.

> For the page lock I used one lock instance per file system type.

Seems OK.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Add ability to dump mapped pages with /proc/sys/vm/drop_caches

2007-10-02 Thread Nick Piggin
On Monday 01 October 2007 04:03, Soeren Sandmann wrote:
> This patch adds the ability to drop mapped pages with
> /proc/sys/vm/drop_caches. This is useful to get repeatable
> measurements of startup time for applications.
>
> Without it, pages that are mapped in already-running applications will
> not get dropped, so the time measured will not be a true cold-cache
> time.

invalidate_inode_pages2_range is a nasty function... only to be used
if you really know you need it (and even in that case, it's probably
wrong!).

I have on my todo list a spring clean of mm/truncate.c to attempt to
figure out how to fix this thing properly but until then, it's a bad idea
to put in this kind of function.

Please just unmap_mapping_range before the existing invalidate call.
Also, I don't know if there is any use in making an extra mode for this --
presumably if someone wants to drop all pagecache, they really want
to drop all pagecache.

Aside from that, I think the idea is fine and indeed should make this
more useful (I never realised it didn't throw out mapped pages, which
we really should do to have reliable testing). So, thanks.



> Rik pointed out that "be_atomic" is a bit pointless since there is a
> race on SMP anyway where pages can be added. However, it is there in
> the existing code, so I added it for the new code as well. Does anyone
> know why it's there?
>
>
> Soren
>
> Signed-off-by: Soren Sandmann <[EMAIL PROTECTED]>
>
> ---
>  fs/drop_caches.c   |   13 -
>  include/linux/fs.h |3 +++
>  include/linux/mm.h |2 +-
>  mm/truncate.c  |   36 ++--
>  4 files changed, 34 insertions(+), 20 deletions(-)
>
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index 59375ef..9f3741d 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -12,7 +12,7 @@
>  /* A global variable is a bit ugly, but it keeps the code simple */
>  int sysctl_drop_caches;
>
> -static void drop_pagecache_sb(struct super_block *sb)
> +static void drop_pagecache_sb(struct super_block *sb, int unmap)
>  {
>   struct inode *inode;
>
> @@ -20,12 +20,15 @@ static void drop_pagecache_sb(struct super_block *sb)
>   list_for_each_entry(inode, >s_inodes, i_sb_list) {
>   if (inode->i_state & (I_FREEING|I_WILL_FREE))
>   continue;
> - __invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
> + if (unmap)
> + __invalidate_inode_pages2_range(inode->i_mapping, 0, 
> -1, true);
> + else
> + __invalidate_mapping_pages(inode->i_mapping, 0, -1, 
> true);
>   }
>   spin_unlock(_lock);
>  }
>
> -void drop_pagecache(void)
> +void drop_pagecache(int unmap)
>  {
>   struct super_block *sb;
>
> @@ -36,7 +39,7 @@ restart:
>   spin_unlock(_lock);
>   down_read(>s_umount);
>   if (sb->s_root)
> - drop_pagecache_sb(sb);
> + drop_pagecache_sb(sb, unmap);
>   up_read(>s_umount);
>   spin_lock(_lock);
>   if (__put_super_and_need_restart(sb))
> @@ -60,7 +63,7 @@ int drop_caches_sysctl_handler(ctl_table *table, int
> write, proc_dointvec_minmax(table, write, file, buffer, length, ppos);
>   if (write) {
>   if (sysctl_drop_caches & 1)
> - drop_pagecache();
> + drop_pagecache(sysctl_drop_caches & 4);
>   if (sysctl_drop_caches & 2)
>   drop_slab();
>   }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 16421f6..c112aa3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1536,6 +1536,9 @@ static inline void invalidate_remote_inode(struct
> inode *inode) invalidate_mapping_pages(inode->i_mapping, 0, -1);
>  }
>  extern int invalidate_inode_pages2(struct address_space *mapping);
> +extern int __invalidate_inode_pages2_range(struct address_space *mapping,
> +pgoff_t start, pgoff_t end,
> +int be_atomic);
>  extern int invalidate_inode_pages2_range(struct address_space *mapping,
>pgoff_t start, pgoff_t end);
>  extern int write_inode_now(struct inode *, int);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1692dd6..6719c41 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1207,7 +1207,7 @@ int drop_caches_sysctl_handler(struct ctl_table *,
> int, struct file *, void __user *, size_t *, loff_t *);
>  unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
>   unsigned long lru_pages);
> -void drop_pagecache(void);
> +void drop_pagecache(int unmap);
>  void drop_slab(void);
>
>  #ifndef CONFIG_MMU
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 5cdfbc1..568ac77 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -371,19 +371,9 @@ static int do_launder_page(struct 

Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-02 Thread Nick Piggin
On Tuesday 02 October 2007 06:50, Christoph Lameter wrote:
> On Fri, 28 Sep 2007, Nick Piggin wrote:
> > I thought it was slower. Have you fixed the performance regression?
> > (OK, I read further down that you are still working on it but not
> > confirmed yet...)
>
> The problem is with the weird way of Intel testing and communication.
> Every 3-6 month or so they will tell you the system is X% up or down on
> arch Y (and they wont give you details because its somehow secret). And
> then there are conflicting statements by the two or so performance test
> departments. One of them repeatedly assured me that they do not see any
> regressions.

Just so long as there aren't known regressions that would require higher
order allocations to fix them.


> > OK, so long as it isn't going to depend on using higher order pages,
> > that's fine. (if they help even further as an optional thing, that's fine
> > too. You can turn them on your huge systems and not even bother about
> > adding this vmap fallback -- you won't have me to nag you about these
> > purely theoretical issues).
>
> Well the vmap fallback is generally useful AFAICT. Higher order
> allocations are common on some of our platforms. Order 1 failures even
> affect essential things like stacks that have nothing to do with SLUB and
> the LBS patchset.

I don't know if it is worth the trouble, though. The best thing to do is to
ensure that contiguous memory is not wasted on frivolous things... a few
order-1 or 2 allocations aren't too much of a problem.

The only high order allocation failure I've seen from fragmentation for a
long time IIRC are the order-3 failures coming from e1000. And obviously
they cannot use vmap.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-02 Thread Nick Piggin
On Tuesday 02 October 2007 07:01, Christoph Lameter wrote:
> On Sat, 29 Sep 2007, Peter Zijlstra wrote:
> > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> > > Really? That means we can no longer even allocate stacks for forking.
> >
> > I think I'm running with 4k stacks...
>
> 4k stacks will never fly on an SGI x86_64 NUMA configuration given the
> additional data that may be kept on the stack. We are currently
> considering to go from 8k to 16k (or even 32k) to make things work. So
> having the ability to put the stacks in vmalloc space may be something to
> look at.

i386 and x86-64 already used 8K stacks for years and they have never
really been much problem before.

They only started failing when contiguous memory is getting used up
by other things, _even with_ those anti-frag patches in there.

Bottom line is that you do not use higher order allocations when you do
not need them.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sky2: jumbo frame regression fix

2007-10-02 Thread Jeff Garzik

Stephen Hemminger wrote:

Remove unneeded check that caused problems with jumbo frame sizes.
The check was recently added and is wrong.
When using jumbo frames the sky2 driver does fragmentation, so
rx_data_size is less than mtu.

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

--- a/drivers/net/sky2.c2007-10-02 17:56:31.0 -0700
+++ b/drivers/net/sky2.c2007-10-02 17:58:56.0 -0700
@@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru
sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending;
prefetch(sky2->rx_ring + sky2->rx_next);
 
-	if (length < ETH_ZLEN || length > sky2->rx_data_size)

-   goto len_error;
-


2.6.23?  2.6.24?  enquiring minds want to know...


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/4] allow drivers to flush in-flight DMA v2

2007-10-02 Thread akepner
On Fri, Sep 28, 2007 at 03:43:39PM -0700, David Miller wrote:
> 
> My only beef with this patch set is that it seems
> a bit much to create a totally new function name every
> time we want to set some kind of new attribute on some
> DMA object.  Why not add a "dma_set_flags()" or similar
> that can be used later on to set other kinds of aspects
> we'd like to change?
> 
> You can make the arguments "u32 flags" and "int dir".
> Actually you should probably use the dma direction
> enumaration instead of 'int'.

OK, this will be in the next version along with the 
coding style changes you mentioned.

-- 
Arthur

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4/4] mthca: allow setting "dmabarrier" on user-allocated memory

2007-10-02 Thread akepner
On Fri, Sep 28, 2007 at 12:50:00PM -0700, Roland Dreier wrote:

> Sorry for not mentioning this earlier, but this patch should really be
> two (or more) patches: one to add dmabarrier support to the core user
> memory stuff in drivers/infiniband, and a second one to add support to
> mthca (and more patches to add support to mlx4, cxgb3, etc, etc).

Makes sense. 

> 
>  > + * @dmabarrier: set "dmabarrier" attribute on this memory, if necessary 
> 
> Nit: just delete the "if necessary" since I don't think it makes
> things clearer (and actually doesn't make much sense in this context)
>

OK.
 
> Other than that this look fine to me, and I'm ready to merge it once
> the necessary core DMA stuff is settled.
> 

Great. A new version of the patchset is on the way.

-- 
Arthur

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] Boot protocol changes

2007-10-02 Thread H. Peter Anvin

Jeremy Fitzhardinge wrote:

H. Peter Anvin wrote:

I'm proposing that the existing bzImage format be retained, but that
the payload of the decompressor (already a gzip file) simply be
vmlinux.gz -- i.e. a gzip compressed ELF file, notes and all.  A
pointer in the header will point to the offset of the payload (this is
new, obviously.)

The decompression stub is adjusted to expect an ELF image, instead of
a raw binary.


It could, or just treat it as a raw binary at 1M+offset to skip the headers.


It would be cleaner to actually parse the ELF; it's only a handful of 
lines of code (we don't have to support arbitrary placement of sections, 
obviously, which makes the problem simpler.)



Existing bootloaders (16- or 32-bit) simply load the bzImage the way
they do now; new bootloaders have the option of accessing the
vmlinux.gz directly if they either want to load it themselves or want
to examine the notes. 


OK, but that has the same problem as making the payload an ELF file:
32-bit bootloaders which simply jump to 1M will be jumping into data
rather than code - and I got the impression from taking to Eric at KS
that there are such bootloaders.  


Uhm, no it doesn't.  Those bootloaders jump to the decompressor, not to 
the payload.  The decompressor interface hasn't changed.



If that's not an issue, then I still think the payload should be a plain
ELF file (possibly self-decompressing, or just a plain uncompressed
vmlinux, if that's what's desired).  Still think making a protected-mode
bootloader do the decompression is the wrong way to go about this; ELF
is enough.


It doesn't have to if it doesn't want to; it only needs to do so if it 
wants to access the kernel as an ELF.  Again, it has the advantage that 
the ELF is the real vmlinux, no funnies.


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sky2: jumbo frame regression fix

2007-10-02 Thread Stephen Hemminger
Remove unneeded check that caused problems with jumbo frame sizes.
The check was recently added and is wrong.
When using jumbo frames the sky2 driver does fragmentation, so
rx_data_size is less than mtu.

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

--- a/drivers/net/sky2.c2007-10-02 17:56:31.0 -0700
+++ b/drivers/net/sky2.c2007-10-02 17:58:56.0 -0700
@@ -2163,9 +2163,6 @@ static struct sk_buff *sky2_receive(stru
sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending;
prefetch(sky2->rx_ring + sky2->rx_next);
 
-   if (length < ETH_ZLEN || length > sky2->rx_data_size)
-   goto len_error;
-
/* This chip has hardware problems that generates bogus status.
 * So do only marginal checking and expect higher level protocols
 * to handle crap frames.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] Boot protocol changes

2007-10-02 Thread Jeremy Fitzhardinge
H. Peter Anvin wrote:
> I'm proposing that the existing bzImage format be retained, but that
> the payload of the decompressor (already a gzip file) simply be
> vmlinux.gz -- i.e. a gzip compressed ELF file, notes and all.  A
> pointer in the header will point to the offset of the payload (this is
> new, obviously.)
>
> The decompression stub is adjusted to expect an ELF image, instead of
> a raw binary.

It could, or just treat it as a raw binary at 1M+offset to skip the headers.

> Existing bootloaders (16- or 32-bit) simply load the bzImage the way
> they do now; new bootloaders have the option of accessing the
> vmlinux.gz directly if they either want to load it themselves or want
> to examine the notes. 

OK, but that has the same problem as making the payload an ELF file:
32-bit bootloaders which simply jump to 1M will be jumping into data
rather than code - and I got the impression from taking to Eric at KS
that there are such bootloaders.  

If that's not an issue, then I still think the payload should be a plain
ELF file (possibly self-decompressing, or just a plain uncompressed
vmlinux, if that's what's desired).  Still think making a protected-mode
bootloader do the decompression is the wrong way to go about this; ELF
is enough.

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] Boot protocol changes

2007-10-02 Thread H. Peter Anvin

H. Peter Anvin wrote:
 
No, not at all.


I'm proposing that the existing bzImage format be retained, but that the 
payload of the decompressor (already a gzip file) simply be vmlinux.gz 
-- i.e. a gzip compressed ELF file, notes and all.  A pointer in the 
header will point to the offset of the payload (this is new, obviously.)


The decompression stub is adjusted to expect an ELF image, instead of a 
raw binary.


Existing bootloaders (16- or 32-bit) simply load the bzImage the way 
they do now; new bootloaders have the option of accessing the vmlinux.gz 
directly if they either want to load it themselves or want to examine 
the notes.




Slight correction: it does, of course, break loaders which root through 
the bzImage for a gzip header and decode that themselves and place in 
memory.  These loaders are pretty broken, though; they can't deal with 
the fact that the physical address of the kernel is configurable, for 
one thing.


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[BUG] sky2 errors in 2.6.23-rc9-git1

2007-10-02 Thread Ian Kumlien
Hi, 

Sorry about this but the latest sky2 seems damned odd.
I have been running with jumbo frames at home for quite some time but
with this kernel that doesn't work, i instead get loads of:
sky2 eth0: rx length error: status 0x5e60500 length 1510
sky2 eth0: rx length error: status 0x5e60500 length 1510
sky2 eth0: rx length error: status 0x5ea0500 length 1514
sky2 eth0: rx length error: status 0x5ea0500 length 1514

Where length can be just about anything from 800 -> MTU

That is not enough though, i also, for some reason, got several hangs:
sky2 eth0: hung mac 0:68 fifo 143 (133:76)
sky2 eth0: receiver hang detected
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
sky2 eth0: hung mac 0:125 fifo 195 (93:88)
sky2 eth0: receiver hang detected
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
sky2 eth0: hung mac 0:124 fifo 98 (10:108)
sky2 eth0: receiver hang detected
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
sky2 eth0: hung mac 0:41 fifo 30 (187:17)
sky2 eth0: receiver hang detected
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
...

All during about 2 minutes.

Could this be related to [sky2: sky2 FE+ receive status workaround]:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blobdiff;f=drivers/net/sky2.c;h=a3de0b6127ebb537b87a1849e207909fcc333ee4;hp=0792031a5cf959a1543f32f4e0f2ab4ccb7b0ec2;hb=3b12e0141f7a97c3b84731b5f935ed738bb6f960;hpb=ff0ce6845bc18292e80ea40d11c3d3a539a3fc5e

The chips being used are:
sky2 :02:00.0: v1.18 addr 0xdbffc000 irq 17 Yukon-EC (0xb6) rev 2
sky2 :02:00.0: v1.18 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1

The receiver hang only happes on the REV 2 chip, which also reports:
sky2 :02:00.0: No interrupt generated using MSI, switching to INTx mode.

Ifconfig reports:
REV 2 chip:
  RX packets:30492 errors:0 dropped:646 overruns:0 frame:646
  TX packets:29229 errors:0 dropped:0 overruns:0 carrier:0

REV 1 chip:
  RX packets:19795 errors:0 dropped:131 overruns:0 frame:131
  TX packets:18588 errors:0 dropped:0 overruns:0 carrier:0

Let me know when jumbo frames work again, just mail me patches =)
(to tired to look in to it closer atm)

-- 
Ian Kumlien  -- http://pomac.netswarm.net


signature.asc
Description: This is a digitally signed message part


Re: [PATCH 0/5] Boot protocol changes

2007-10-02 Thread H. Peter Anvin

Jeremy Fitzhardinge wrote:

H. Peter Anvin wrote:

This series looks like a good start for Xen, but we still need to work
out where to stash the metadata which normally lives in ELF notes. 
Using ELF is convenient for Xen because it lets a large chunk of domain

builder code be reused; on the other hand, loading a plain bzImage is
pretty simple, so maybe it isn't such a big deal.

HPA, Eric: if we don't go the "embed ELF" path, where's a good
backwards-compatible place to stash the note data?  If we do go with
"embed ELF", how should we go about doing it?  Arrange to put the ELF
headers before the 1M mark?


This sounds like another good reason to do the ELF image as the
postcompression image.  The interface to the embedded compression
routine is then unchanged, and we get the "full vmlinux" with any
notes that belongs there.

I'll try to get an implementation of that done -- it really shouldn't
be very hard.


Please explain what you're proposing again, because my memory of your
plan from last time wouldn't help in this case.  Are you proposing that
the bzImage contains compressed data that its expecting the bootloader
to decompress?  Won't that completely break backwards compatibility?  If
we don't care about backwards compatibility with old bootloaders, then
it doesn't matter what we do one way or the other.



No, not at all.

I'm proposing that the existing bzImage format be retained, but that the 
payload of the decompressor (already a gzip file) simply be vmlinux.gz 
-- i.e. a gzip compressed ELF file, notes and all.  A pointer in the 
header will point to the offset of the payload (this is new, obviously.)


The decompression stub is adjusted to expect an ELF image, instead of a 
raw binary.


Existing bootloaders (16- or 32-bit) simply load the bzImage the way 
they do now; new bootloaders have the option of accessing the vmlinux.gz 
directly if they either want to load it themselves or want to examine 
the notes.


-hpa

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: build error

2007-10-02 Thread Stephen Hemminger
On Tue, 2 Oct 2007 17:28:27 -0700
Randy Dunlap <[EMAIL PROTECTED]> wrote:

> On Tue, 2 Oct 2007 17:19:42 -0700 Stephen Hemminger wrote:
> 
> > On Tue, 2 Oct 2007 22:12:13 +0200
> > [EMAIL PROTECTED] wrote:
> > 
> > > [please CC: me, my subscribe mail was greylisted]
> > > 
> > > Morning!
> > > 
> > > My make run for 2.6.23-rc9 ends like this:
> > > 
> > >   GEN .version
> > >   CHK include/linux/compile.h
> > >   UPD include/linux/compile.h
> > >   CC  init/version.o
> > >   LD  init/built-in.o
> > >   LD  .tmp_vmlinux1
> > > kernel/built-in.o: In function `getnstimeofday':
> > > (.text+0x1e141): undefined reference to `__udivdi3'
> > > kernel/built-in.o: In function `do_gettimeofday':
> > > (.text+0x1e263): undefined reference to `__udivdi3'
> > > kernel/built-in.o: In function `timekeeping_resume':
> > > timekeeping.c:(.text+0x1e427): undefined reference to `__udivdi3'
> > > kernel/built-in.o: In function `update_wall_time':
> > > (.text+0x1e829): undefined reference to `__udivdi3'
> > > kernel/built-in.o: In function `update_wall_time':
> > > (.text+0x1ec4c): undefined reference to `__udivdi3'
> > > make: *** [.tmp_vmlinux1] Error 1
> > > 
> > > .config attached.
> > > 
> > > I have already read the diff from -rc8 and found nothing that helped me.
> > > 
> > > Any ideas? Further questions?
> > > 
> > > Wilfried
> > > 
> > 
> > What Gcc version? and config/architecture?
> 
> .config file was attached.  It says X86_32.
> 
> I can't reproduce it on 2 different systems & toolchains.

There were earlier reports of gcc 4.3 bogus optimization:
  http://lkml.org/lkml/2007/5/18/355
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kswapd min order, slub max order [was Re: -mm merge plans for 2.6.24]

2007-10-02 Thread Christoph Lameter
On Tue, 2 Oct 2007, Christoph Lameter wrote:

> The maximum order of allocation used by SLUB may have to depend on the 
> number of page structs in the system since small systems (128M was the 
> case that Peter found) can easier get into trouble. SLAB has similar 
> measures to avoid order 1 allocations for small systems below 32M.

A patch like this? This is based on the number of page structs on the 
system. Maybe it needs to be based on the number of MAX_ORDER blocks
for antifrag?


SLUB: Determine slub_max_order depending on the number of pages available

Determine the maximum order to be used for slabs and the mininum
desired number of objects in a slab from the amount of pages that
a system has available (like SLAB does for the order 1/0 distinction).

For systems with less than 128M only use order 0 allocations (SLAB does 
that for <32M only). The order 0 config is useful for small systems to 
minimize the memory used. Memory easily fragments since we have less than 
32k pages to play with. Order 0 insures that higher order allocations are 
minimized (Larger orders must still be used for objects that do not fit 
into order 0 pages).

Then step up to order 1 for systems < 256000 pages (1G)

Order 2 limit to systems < 100 page structs (4G)

Order 3 for systems larger than that.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/slub.c |   49 +
 1 file changed, 25 insertions(+), 24 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-10-02 09:26:16.0 -0700
+++ linux-2.6/mm/slub.c 2007-10-02 16:40:22.0 -0700
@@ -153,25 +153,6 @@ static inline void ClearSlabDebug(struct
 /* Enable to test recovery from slab corruption on boot */
 #undef SLUB_RESILIENCY_TEST
 
-#if PAGE_SHIFT <= 12
-
-/*
- * Small page size. Make sure that we do not fragment memory
- */
-#define DEFAULT_MAX_ORDER 1
-#define DEFAULT_MIN_OBJECTS 4
-
-#else
-
-/*
- * Large page machines are customarily able to handle larger
- * page orders.
- */
-#define DEFAULT_MAX_ORDER 2
-#define DEFAULT_MIN_OBJECTS 8
-
-#endif
-
 /*
  * Mininum number of partial slabs. These will be left on the partial
  * lists even if they are empty. kmem_cache_shrink may reclaim them.
@@ -1718,8 +1699,9 @@ static struct page *get_object_page(cons
  * take the list_lock.
  */
 static int slub_min_order;
-static int slub_max_order = DEFAULT_MAX_ORDER;
-static int slub_min_objects = DEFAULT_MIN_OBJECTS;
+static int slub_max_order;
+static int slub_min_objects = 4;
+static int manual;
 
 /*
  * Merge control. If this is set then no merging of slab caches will occur.
@@ -2237,7 +2219,7 @@ static struct kmem_cache *kmalloc_caches
 static int __init setup_slub_min_order(char *str)
 {
get_option (, _min_order);
-
+   manual = 1;
return 1;
 }
 
@@ -2246,7 +2228,7 @@ __setup("slub_min_order=", setup_slub_mi
 static int __init setup_slub_max_order(char *str)
 {
get_option (, _max_order);
-
+   manual = 1;
return 1;
 }
 
@@ -2255,7 +2237,7 @@ __setup("slub_max_order=", setup_slub_ma
 static int __init setup_slub_min_objects(char *str)
 {
get_option (, _min_objects);
-
+   manual = 1;
return 1;
 }
 
@@ -2566,6 +2548,16 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+/*
+ * Table to autotune the maximum slab order based on the number of pages
+ * that the system has available.
+ */
+static unsigned long __initdata phys_pages_for_order[PAGE_ALLOC_COSTLY_ORDER] 
= {
+   32768,  /* >128M if using 4K pages, >512M (16k), >2G (64k) */
+   256000, /* >1G if using 4k pages, >4G (16k), >16G (64k) */
+   100 /* >4G if using 4k pages, >16G (16k), >64G (64k) */
+};
+
 /
  * Basic setup of slabs
  ***/
@@ -2575,6 +2567,15 @@ void __init kmem_cache_init(void)
int i;
int caches = 0;
 
+   if (!manual) {
+   /* No manual parameters. Autotune for system */
+   for (i = 0; i < PAGE_ALLOC_COSTLY_ORDER; i++)
+   if (num_physpages > phys_pages_for_order[i]) {
+   slub_max_order++;
+   slub_min_objects <<= 1;
+   }
+   }
+
 #ifdef CONFIG_NUMA
/*
 * Must first have the slab cache available for the allocations of the
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23-rc7-mm1 -- powerpc rtas panic

2007-10-02 Thread Michael Ellerman
On Wed, 2007-10-03 at 10:26 +1000, Tony Breeds wrote:
> On Tue, Oct 02, 2007 at 06:28:19PM -0500, Linas Vepstas wrote:
> > On Mon, Sep 24, 2007 at 01:35:31PM +0100, Andy Whitcroft wrote:
> > > Seeing the following from an older power LPAR, pretty sure we had
> > > this in the previous -mm also:
> > 
> > I haven't forgetten about this ... and am looking at it now.
> > Seems that whenever I go to reserve the machine pSeries-102,
> > someone else is using it :-)
> 
> This panic is caused by "[POWERPC] pseries: Fix jumbled no_logging flag."
> (79c0108d1b9db4864ab77b2a95dfa04f2dcf264c), in the powerpc/for-2.6.24
> branch.  It looks to me that we have logging enabled too early now.
> 
> I think the following is a reasonable fix?
> 
> ---
> Explicitly enable RTAS error logging, when it should be ready.
> 
> 
> Signed-off-by: Tony Breeds <[EMAIL PROTECTED]>
> 
> ---
> 
>  arch/powerpc/platforms/pseries/rtasd.c |7 ++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/rtasd.c 
> b/arch/powerpc/platforms/pseries/rtasd.c
> index 30925d2..0df5d0d 100644
> --- a/arch/powerpc/platforms/pseries/rtasd.c
> +++ b/arch/powerpc/platforms/pseries/rtasd.c
> @@ -54,7 +54,10 @@ static unsigned int rtas_event_scan_rate;
>  static int full_rtas_msgs = 0;
>  
>  /* Stop logging to nvram after first fatal error */
> -static int no_more_logging;
> +static int no_more_logging = 1; /* Until we initialize everything,
> + * make sure we don't try logging
> + * anything */
> +

I realise it'll make the patch bigger, but this doesn't seem like a
particularly good name for the variable anymore.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person


signature.asc
Description: This is a digitally signed message part


Re: build error

2007-10-02 Thread Randy Dunlap
On Tue, 2 Oct 2007 17:19:42 -0700 Stephen Hemminger wrote:

> On Tue, 2 Oct 2007 22:12:13 +0200
> [EMAIL PROTECTED] wrote:
> 
> > [please CC: me, my subscribe mail was greylisted]
> > 
> > Morning!
> > 
> > My make run for 2.6.23-rc9 ends like this:
> > 
> >   GEN .version
> >   CHK include/linux/compile.h
> >   UPD include/linux/compile.h
> >   CC  init/version.o
> >   LD  init/built-in.o
> >   LD  .tmp_vmlinux1
> > kernel/built-in.o: In function `getnstimeofday':
> > (.text+0x1e141): undefined reference to `__udivdi3'
> > kernel/built-in.o: In function `do_gettimeofday':
> > (.text+0x1e263): undefined reference to `__udivdi3'
> > kernel/built-in.o: In function `timekeeping_resume':
> > timekeeping.c:(.text+0x1e427): undefined reference to `__udivdi3'
> > kernel/built-in.o: In function `update_wall_time':
> > (.text+0x1e829): undefined reference to `__udivdi3'
> > kernel/built-in.o: In function `update_wall_time':
> > (.text+0x1ec4c): undefined reference to `__udivdi3'
> > make: *** [.tmp_vmlinux1] Error 1
> > 
> > .config attached.
> > 
> > I have already read the diff from -rc8 and found nothing that helped me.
> > 
> > Any ideas? Further questions?
> > 
> > Wilfried
> > 
> 
> What Gcc version? and config/architecture?

.config file was attached.  It says X86_32.

I can't reproduce it on 2 different systems & toolchains.

---
~Randy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23-rc7-mm1 -- powerpc rtas panic

2007-10-02 Thread Tony Breeds
On Tue, Oct 02, 2007 at 06:28:19PM -0500, Linas Vepstas wrote:
> On Mon, Sep 24, 2007 at 01:35:31PM +0100, Andy Whitcroft wrote:
> > Seeing the following from an older power LPAR, pretty sure we had
> > this in the previous -mm also:
> 
> I haven't forgetten about this ... and am looking at it now.
> Seems that whenever I go to reserve the machine pSeries-102,
> someone else is using it :-)

This panic is caused by "[POWERPC] pseries: Fix jumbled no_logging flag."
(79c0108d1b9db4864ab77b2a95dfa04f2dcf264c), in the powerpc/for-2.6.24
branch.  It looks to me that we have logging enabled too early now.

I think the following is a reasonable fix?

---
Explicitly enable RTAS error logging, when it should be ready.


Signed-off-by: Tony Breeds <[EMAIL PROTECTED]>

---

 arch/powerpc/platforms/pseries/rtasd.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/rtasd.c 
b/arch/powerpc/platforms/pseries/rtasd.c
index 30925d2..0df5d0d 100644
--- a/arch/powerpc/platforms/pseries/rtasd.c
+++ b/arch/powerpc/platforms/pseries/rtasd.c
@@ -54,7 +54,10 @@ static unsigned int rtas_event_scan_rate;
 static int full_rtas_msgs = 0;
 
 /* Stop logging to nvram after first fatal error */
-static int no_more_logging;
+static int no_more_logging = 1; /* Until we initialize everything,
+ * make sure we don't try logging
+ * anything */
+
 
 static int error_log_cnt;
 
@@ -414,6 +417,8 @@ static int rtasd(void *unused)
memset(logdata, 0, rtas_error_log_max);
rc = nvram_read_error_log(logdata, rtas_error_log_max,
  _type, _log_cnt);
+   /* We can use rtas_log_buf now */
+   no_more_logging = 0;
 
if (!rc) {
if (err_type != ERR_FLAG_ALREADY_LOGGED) {

Yours Tony

  linux.conf.auhttp://linux.conf.au/ || http://lca2008.linux.org.au/
  Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: build error

2007-10-02 Thread Stephen Hemminger
On Tue, 2 Oct 2007 22:12:13 +0200
[EMAIL PROTECTED] wrote:

> [please CC: me, my subscribe mail was greylisted]
> 
> Morning!
> 
> My make run for 2.6.23-rc9 ends like this:
> 
>   GEN .version
>   CHK include/linux/compile.h
>   UPD include/linux/compile.h
>   CC  init/version.o
>   LD  init/built-in.o
>   LD  .tmp_vmlinux1
> kernel/built-in.o: In function `getnstimeofday':
> (.text+0x1e141): undefined reference to `__udivdi3'
> kernel/built-in.o: In function `do_gettimeofday':
> (.text+0x1e263): undefined reference to `__udivdi3'
> kernel/built-in.o: In function `timekeeping_resume':
> timekeeping.c:(.text+0x1e427): undefined reference to `__udivdi3'
> kernel/built-in.o: In function `update_wall_time':
> (.text+0x1e829): undefined reference to `__udivdi3'
> kernel/built-in.o: In function `update_wall_time':
> (.text+0x1ec4c): undefined reference to `__udivdi3'
> make: *** [.tmp_vmlinux1] Error 1
> 
> .config attached.
> 
> I have already read the diff from -rc8 and found nothing that helped me.
> 
> Any ideas? Further questions?
> 
> Wilfried
> 

What Gcc version? and config/architecture?

-- 
Stephen Hemminger <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-02 Thread Linus Torvalds


On Wed, 3 Oct 2007, Alan Cox wrote:
> 
> Smack seems a perfectly good simple LSM module, its clean, its based upon
> credible security models and sound theory (unlike AppArmor).

The problem with SELinux isn't the theory. It's the practice. 

IOW, it's too hard to use.

Apparently Ubuntu is giving up on it too, for that reason.

And what some people seem to have trouble admitting is that theory counts 
for nothing, if the practice isn't there.

So quite frankly, the SELinux people would look at whole lot smarter if 
they didn't blather on about "theory".

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Kernel 2.4 vs 2.6 Traffic Controller performance

2007-10-02 Thread Stephen Hemminger
On Wed, 3 Oct 2007 08:05:30 +0800
Sonny <[EMAIL PROTECTED]> wrote:

> Hello
> I hope this is the right place to ask this.Does any know if there is a
> substantial difference in the performance of the traffic controller
> between kernel 2.4 and 2.6. We tested it using 1 iperf server and use
> 250 and 500 clients, altering the burst. We use the top command to
> check the idle time of our router to see this. The results we got from
> the 2.4 kernel shows around 65-70% idle time while the 2.6 shows
> 60-65% idle time. We tried to use MRTG and we're not getting any
> results either. We want to know if we could improve the bandwidth by
> upgrading the kernel, else we would have to get a new bandwidth
> manager.  Could anyone have the similar test regarding this. Thanks in
> advance.

Some related thoughts:
1. Make sure you have the iperf yield fix in place. Otherwise iperf eats
   cpu.

2. Proper mailing lists are: [EMAIL PROTECTED] and [EMAIL PROTECTED]

3. The latest versions of 2.6 use different clock measurement that
should be better than older 2.4 (where there are three choices).
The new clock is finer resolution (at slightly higher overhead), which
should make accuracy higher but might increase cpu usage.

-- 
Stephen Hemminger <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 0/6] Patch series to add of_platform binding to xilinxfb

2007-10-02 Thread Antonino A. Daplas
On Mon, 2007-10-01 at 09:57 -0600, Grant Likely wrote:
> (resend due to mailer issues.  Apologies to anyone receiving this twice)
> 
> This patch series reworks the Xilinx framebuffer driver and then adds
> an of_platform bus binding.  The of_platform bus binding is needed to use
> the driver in arch/powerpc platforms.
> 
> Antonino,
> 
> Assuming there are no major issues, I'd like to get this patch series
> queued up for inclusion in 2.6.24.

Okay.

Tony


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-02 Thread Alan Cox
> situations. For example, I find SELinux to be so irrelevant to my usage 
> that I don't use it at all. I just don't have any other users on my 
> machine

That you know about...

The value of SELinux (or indeed any system compartmentalising access and
limiting damage) comes into play when you get breakage - eg via a web
browser exploit.

Yes SELinux is much more relevant to servers, and really comes into its
own when its used to write custom rulesets and enforce corporate policy
("No you can't run that screensaver that arrived by email").

Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23-rc9 boot failure (megaraid?)

2007-10-02 Thread FUJITA Tomonori
On Tue, 02 Oct 2007 15:38:13 -0500
James Bottomley <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-10-02 at 20:15 +0200, Adrian Bunk wrote:
> > Cc's added, the complete bug report is at
> >   http://lkml.org/lkml/2007/10/2/243
> > 
> > On Tue, Oct 02, 2007 at 12:48:26PM -0400, Burton Windle wrote:
> > > 2.6.23-rc9 fails to boot for me; 2.6.22.9 works fine.
> > >
> > > System is a Dell Poweredge with PERC 2/DC with RAID1 volume.
> > >...
> > 
> > Thanks for your report.
> > 
> > Diff'ing the dmesg's shows:
> > 
> > <--  snip  -->
> > 
> >  scsi0: scanning scsi channel 4 [P0] for physical devices.
> >  scsi0: scanning scsi channel 5 [P1] for physical devices.
> >  st: Version 20070203, fixed bufsize 32768, s/g segs 256
> > -sd 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB)
> > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> >  sd 0:0:0:0: [sda] Write Protect is off
> >  sd 0:0:0:0: [sda] Asking for cache data failed
> >  sd 0:0:0:0: [sda] Assuming drive cache: write through
> > -sd 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB)
> > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> >  sd 0:0:0:0: [sda] Write Protect is off
> >  sd 0:0:0:0: [sda] Asking for cache data failed
> >  sd 0:0:0:0: [sda] Assuming drive cache: write through
> >   sda: sda1
> > + sda: p1 exceeds device capacity
> > 
> > <--  snip  -->
> > 
> > -   case MEGA_BULK_DATA:
> > -   if (scb->cmd->use_sg == 0)
> > -   length = scb->cmd->request_bufflen;
> > -   else {
> > -   struct scatterlist *sgl =
> > -   (struct scatterlist *)scb->cmd->request_buffer;
> > -   length = sgl->length;
> > -   }
> > -   pci_unmap_page(adapter->dev, scb->dma_h_bulkdata,
> > -  length, scb->dma_direction);
> > -   break;
> > -
> 
> This is the problem piece I think.  We've reintroduced a very old bug:
> 
> commit 51c928c34fa7cff38df584ad01de988805877dba
> Author: James Bottomley <[EMAIL PROTECTED]>
> Date:   Sat Oct 1 09:38:05 2005 -0500
> 
> [SCSI] Legacy MegaRAID: Fix READ CAPACITY
> 
> Some Legacy megaraid cards can't actually cope with the scatter/gather
> version of the READ CAPACITY command (which is what we now send them
> since altering all SCSI internal I/O to go via the block layer).  Fix
> this (and a few other broken megaraid driver assumptions) by sending
> the non-sg version of the command if the sg list only has a single
> element.
> 
> Signed-off-by: James Bottomley <[EMAIL PROTECTED]>
> 
> So what we have to do is put back the check for use_sg == 1 and send
> that as a bulk transfer command.

Sorry again. Needs to check sg count before dma mapping.


diff --git a/drivers/scsi/megaraid.c b/drivers/scsi/megaraid.c
index 3907f67..ae0b220 100644
--- a/drivers/scsi/megaraid.c
+++ b/drivers/scsi/megaraid.c
@@ -1737,9 +1737,12 @@ mega_build_sglist(adapter_t *adapter, scb_t *scb, u32 
*buf, u32 *len)
Scsi_Cmnd   *cmd;
int sgcnt;
int idx;
+   int bulkdata;
 
cmd = scb->cmd;
 
+   bulkdata = (scsi_sg_count(cmd) == 1) ? 1 : 0;
+
/*
 * Copy Scatter-Gather list info into controller structure.
 *
@@ -1753,6 +1756,14 @@ mega_build_sglist(adapter_t *adapter, scb_t *scb, u32 
*buf, u32 *len)
 
*len = 0;
 
+   if (bulkdata && !adapter->has_64bit_addr) {
+   sg = scsi_sglist(cmd);
+   scb->dma_h_bulkdata = sg_dma_address(sg);
+   *buf = (u32)scb->dma_h_bulkdata;
+   *len = sg_dma_len(sg);
+   return 0;
+   }
+
scsi_for_each_sg(cmd, sg, sgcnt, idx) {
if (adapter->has_64bit_addr) {
scb->sgl64[idx].address = sg_dma_address(sg);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-02 Thread Alan Cox
On Tue, 02 Oct 2007 17:02:13 -0400
Bill Davidsen <[EMAIL PROTECTED]> wrote:

> Linus Torvalds wrote:
> > 
> > On Mon, 1 Oct 2007, Stephen Smalley wrote:
> >> You argued against pluggable schedulers, right?  Why is security
> >> different?
> > 
> > Schedulers can be objectively tested. There's this thing called 
> > "performance", that can generally be quantified on a load basis.
> > 
> > Yes, you can have crazy ideas in both schedulers and security. Yes, you 
> > can simplify both for a particular load. Yes, you can make mistakes in 
> > both. But the *discussion* on security seems to never get down to real 
> > numbers. 
> > 
> And yet you can make the exact same case for schedulers as security, you 
> can quantify the behavior, but if your only choice is A it doesn't help 
> to know that B is better.

To be fair the discussion on security does get down to real set theory
but at that point most people's eyes (mine included) glaze over somewhat.

You can reasonably quantify the behaviour and correctness of a security
model based upon mathematical principles - if anything its *easier* that
schedulers which are so much based on "feeling right".

Smack seems a perfectly good simple LSM module, its clean, its based upon
credible security models and sound theory (unlike AppArmor). I don't see
why it shouldn't go in.

Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Kernel 2.4 vs 2.6 Traffic Controller performance

2007-10-02 Thread Sonny
Hello
I hope this is the right place to ask this.Does any know if there is a
substantial difference in the performance of the traffic controller
between kernel 2.4 and 2.6. We tested it using 1 iperf server and use
250 and 500 clients, altering the burst. We use the top command to
check the idle time of our router to see this. The results we got from
the 2.4 kernel shows around 65-70% idle time while the 2.6 shows
60-65% idle time. We tried to use MRTG and we're not getting any
results either. We want to know if we could improve the bandwidth by
upgrading the kernel, else we would have to get a new bandwidth
manager.  Could anyone have the similar test regarding this. Thanks in
advance.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] Boot protocol changes

2007-10-02 Thread Jeremy Fitzhardinge
H. Peter Anvin wrote:
>> This series looks like a good start for Xen, but we still need to work
>> out where to stash the metadata which normally lives in ELF notes. 
>> Using ELF is convenient for Xen because it lets a large chunk of domain
>> builder code be reused; on the other hand, loading a plain bzImage is
>> pretty simple, so maybe it isn't such a big deal.
>>
>> HPA, Eric: if we don't go the "embed ELF" path, where's a good
>> backwards-compatible place to stash the note data?  If we do go with
>> "embed ELF", how should we go about doing it?  Arrange to put the ELF
>> headers before the 1M mark?
>>
>
> This sounds like another good reason to do the ELF image as the
> postcompression image.  The interface to the embedded compression
> routine is then unchanged, and we get the "full vmlinux" with any
> notes that belongs there.
>
> I'll try to get an implementation of that done -- it really shouldn't
> be very hard.

Please explain what you're proposing again, because my memory of your
plan from last time wouldn't help in this case.  Are you proposing that
the bzImage contains compressed data that its expecting the bootloader
to decompress?  Won't that completely break backwards compatibility?  If
we don't care about backwards compatibility with old bootloaders, then
it doesn't matter what we do one way or the other.

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] Boot protocol changes

2007-10-02 Thread H. Peter Anvin

Jeremy Fitzhardinge wrote:

Rusty Russell wrote:

Hi all,

Jeremy had some boot changes for bzImages, but buried in there was an
update to the boot protocol to support Xen and lguest (and kvm-lite).
I've copied those fairly simple patches, and if HPA is happy I'd like to
push them for 2.6.24 (after correcting for the Great Arch Merge of
course).


Ah, good.  I was thinking about reviving this work.  The main problem is
that sticking an ELF header at the 1 meg mark (the address of the
bzImage "payload") breaks 32-bit bootloaders which think they can just
jump to 32-bit code there.  I started a conversation with Eric at KS
about it, but we didn't reach any conclusions.

This series looks like a good start for Xen, but we still need to work
out where to stash the metadata which normally lives in ELF notes.  
Using ELF is convenient for Xen because it lets a large chunk of domain

builder code be reused; on the other hand, loading a plain bzImage is
pretty simple, so maybe it isn't such a big deal.

HPA, Eric: if we don't go the "embed ELF" path, where's a good
backwards-compatible place to stash the note data?  If we do go with
"embed ELF", how should we go about doing it?  Arrange to put the ELF
headers before the 1M mark?



This sounds like another good reason to do the ELF image as the 
postcompression image.  The interface to the embedded compression 
routine is then unchanged, and we get the "full vmlinux" with any notes 
that belongs there.


I'll try to get an implementation of that done -- it really shouldn't be 
very hard.


-hpa

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] Boot protocol changes

2007-10-02 Thread Jeremy Fitzhardinge
Rusty Russell wrote:
> Hi all,
>
>   Jeremy had some boot changes for bzImages, but buried in there was an
> update to the boot protocol to support Xen and lguest (and kvm-lite).
> I've copied those fairly simple patches, and if HPA is happy I'd like to
> push them for 2.6.24 (after correcting for the Great Arch Merge of
> course).

Ah, good.  I was thinking about reviving this work.  The main problem is
that sticking an ELF header at the 1 meg mark (the address of the
bzImage "payload") breaks 32-bit bootloaders which think they can just
jump to 32-bit code there.  I started a conversation with Eric at KS
about it, but we didn't reach any conclusions.

This series looks like a good start for Xen, but we still need to work
out where to stash the metadata which normally lives in ELF notes.  
Using ELF is convenient for Xen because it lets a large chunk of domain
builder code be reused; on the other hand, loading a plain bzImage is
pretty simple, so maybe it isn't such a big deal.

HPA, Eric: if we don't go the "embed ELF" path, where's a good
backwards-compatible place to stash the note data?  If we do go with
"embed ELF", how should we go about doing it?  Arrange to put the ELF
headers before the 1M mark?

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] Boot protocol changes

2007-10-02 Thread H. Peter Anvin

Rusty Russell wrote:

Hi all,

Jeremy had some boot changes for bzImages, but buried in there was an
update to the boot protocol to support Xen and lguest (and kvm-lite).
I've copied those fairly simple patches, and if HPA is happy I'd like to
push them for 2.6.24 (after correcting for the Great Arch Merge of
course).



Acked-by: H. Peter Anvin <[EMAIL PROTECTED]>

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/5] Revert lguest magic and use hook in head.S

2007-10-02 Thread Rusty Russell
Version 2.07 of the boot protocol uses 0x23C for the hardware_subarch
field, that for lguest is "1".  This allows us to use the standard
boot entry point rather than the "GenuineLguest" string hack.

This entry point also clears the BSS and copies the boot parameters
and commandline for us, saving more code.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>

---
 Documentation/lguest/lguest.c |   31 ---
 arch/i386/kernel/head.S   |8 
 drivers/lguest/lguest_asm.S   |9 +++--
 3 files changed, 15 insertions(+), 33 deletions(-)

diff -r 2fdc577cfe5c Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c Tue Oct 02 22:21:05 2007 +1000
+++ b/Documentation/lguest/lguest.c Tue Oct 02 23:00:09 2007 +1000
@@ -251,23 +251,6 @@ static void *get_pages(unsigned int num)
return addr;
 }
 
-/* To find out where to start we look for the magic Guest string, which marks
- * the code we see in lguest_asm.S.  This is a hack which we are currently
- * plotting to replace with the normal Linux entry point. */
-static unsigned long entry_point(const void *start, const void *end)
-{
-   const void *p;
-
-   /* The scan gives us the physical starting address.  We boot with
-* pagetables set up with virtual and physical the same, so that's
-* OK. */
-   for (p = start; p < end; p++)
-   if (memcmp(p, "GenuineLguest", strlen("GenuineLguest")) == 0)
-   return to_guest_phys(p + strlen("GenuineLguest"));
-
-   errx(1, "Is this image a genuine lguest?");
-}
-
 /* This routine is used to load the kernel or initrd.  It tries mmap, but if
  * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries),
  * it falls back to reading the memory in. */
@@ -303,7 +286,6 @@ static void map_at(int fd, void *addr, u
  * We return the starting address. */
 static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr)
 {
-   void *start = (void *)-1, *end = NULL;
Elf32_Phdr phdr[ehdr->e_phnum];
unsigned int i;
 
@@ -335,19 +317,13 @@ static unsigned long map_elf(int elf_fd,
verbose("Section %i: size %i addr %p\n",
i, phdr[i].p_memsz, (void *)phdr[i].p_paddr);
 
-   /* We track the first and last address we mapped, so we can
-* tell entry_point() where to scan. */
-   if (from_guest_phys(phdr[i].p_paddr) < start)
-   start = from_guest_phys(phdr[i].p_paddr);
-   if (from_guest_phys(phdr[i].p_paddr) + phdr[i].p_filesz > end)
-   end=from_guest_phys(phdr[i].p_paddr)+phdr[i].p_filesz;
-
/* We map this section of the file at its physical address. */
map_at(elf_fd, from_guest_phys(phdr[i].p_paddr),
   phdr[i].p_offset, phdr[i].p_filesz);
}
 
-   return entry_point(start, end);
+   /* The entry point is given in the ELF header. */
+   return ehdr->e_entry;
 }
 
 /*L:160 Unfortunately the entire ELF image isn't compressed: the segments
@@ -374,7 +350,8 @@ static unsigned long unpack_bzimage(int 
 
verbose("Unpacked size %i addr %p\n", len, img);
 
-   return entry_point(img, img + len);
+   /* The entry point for a bzImage is always the first byte */
+   return (unsigned long)img;
 }
 
 /*L:150 A bzImage, unlike an ELF file, is not meant to be loaded.  You're
@@ -1684,8 +1661,15 @@ int main(int argc, char *argv[])
*(u32 *)(boot + 0x228) = 4096;
concat(boot + 4096, argv+optind+2);
 
-   /* The guest type value of "1" tells the Guest it's under lguest. */
-   *(int *)(boot + 0x23c) = 1;
+   /* Boot protocol version: 2.07 supports the fields for lguest. */
+   *(u16 *)(boot + 0x206) = 0x207;
+
+   /* The hardware_subarch value of "1" tells the Guest it's an lguest. */
+   *(u32 *)(boot + 0x23c) = 1;
+
+   /* Set bit 6 of the loadflags (aka. KEEP_SEGMENTS) so the entry path
+* does not ttry to reload segment registers. */
+   *(u8 *)(boot + 0x211) |= (1 << 6);
 
/* We tell the kernel to initialize the Guest: this returns the open
 * /dev/lguest file descriptor. */
diff -r 2fdc577cfe5c arch/i386/lguest/boot.c
--- a/arch/i386/lguest/boot.c   Tue Oct 02 22:21:05 2007 +1000
+++ b/arch/i386/lguest/boot.c   Tue Oct 02 23:27:22 2007 +1000
@@ -938,18 +938,8 @@ static unsigned lguest_patch(u8 type, u1
 /*G:030 Once we get to lguest_init(), we know we're a Guest.  The paravirt_ops
  * structure in the kernel provides a single point for (almost) every routine
  * we have to override to avoid privileged instructions. */
-__init void lguest_init(void *boot)
-{
-   /* Copy boot parameters first: the Launcher put the physical location
-* in %esi, and head.S converted that to a virtual address and handed
-* it to us.  We use "__memcpy" because "memcpy" sometimes tries to do
-* 

[PATCH 5/5] lguest: loading bzImage directly

2007-10-02 Thread Rusty Russell
Now arch/i386/boot/compressed/head.S understands the hardware_platform field,
we can directly execute bzImages.  No more horrific unpacking code.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 Documentation/lguest/lguest.c|   97 --
 arch/i386/boot/compressed/head.S |6 ++
 drivers/lguest/lguest.c  |5 +
 3 files changed, 42 insertions(+), 66 deletions(-)

diff -r b0480fd71a72 Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c Tue Oct 02 22:28:13 2007 +1000
+++ b/Documentation/lguest/lguest.c Tue Oct 02 22:52:07 2007 +1000
@@ -326,74 +326,39 @@ static unsigned long map_elf(int elf_fd,
return ehdr->e_entry;
 }
 
-/*L:160 Unfortunately the entire ELF image isn't compressed: the segments
- * which need loading are extracted and compressed raw.  This denies us the
- * information we need to make a fully-general loader. */
-static unsigned long unpack_bzimage(int fd)
-{
-   gzFile f;
-   int ret, len = 0;
-   /* A bzImage always gets loaded at physical address 1M.  This is
-* actually configurable as CONFIG_PHYSICAL_START, but as the comment
-* there says, "Don't change this unless you know what you are doing".
-* Indeed. */
-   void *img = from_guest_phys(0x10);
-
-   /* gzdopen takes our file descriptor (carefully placed at the start of
-* the GZIP header we found) and returns a gzFile. */
-   f = gzdopen(fd, "rb");
-   /* We read it into memory in 64k chunks until we hit the end. */
-   while ((ret = gzread(f, img + len, 65536)) > 0)
-   len += ret;
-   if (ret < 0)
-   err(1, "reading image from bzImage");
-
-   verbose("Unpacked size %i addr %p\n", len, img);
-
-   /* The entry point for a bzImage is always the first byte */
-   return (unsigned long)img;
-}
-
 /*L:150 A bzImage, unlike an ELF file, is not meant to be loaded.  You're
- * supposed to jump into it and it will unpack itself.  We can't do that
- * because the Guest can't run the unpacking code, and adding features to
- * lguest kills puppies, so we don't want to.
- *
- * The bzImage is formed by putting the decompressing code in front of the
- * compressed kernel code.  So we can simple scan through it looking for the
- * first "gzip" header, and start decompressing from there. */
+ * supposed to jump into it and it will unpack itself.  We used to have to
+ * perform some hairy magic because the unpacking code scared me.
+ *
+ * Fortunately, Jeremy Fitzhardinge convinced me it wasn't that hard and wrote
+ * a small patch to jump over the tricky bits in the guest, so now we just read
+ * the funky header so we know where in the file to load, and away we go! */
 static unsigned long load_bzimage(int fd)
 {
-   unsigned char c;
-   int state = 0;
-
-   /* GZIP header is 0x1F 0x8B  ... . */
-   while (read(fd, , 1) == 1) {
-   switch (state) {
-   case 0:
-   if (c == 0x1F)
-   state++;
-   break;
-   case 1:
-   if (c == 0x8B)
-   state++;
-   else
-   state = 0;
-   break;
-   case 2 ... 8:
-   state++;
-   break;
-   case 9:
-   /* Seek back to the start of the gzip header. */
-   lseek(fd, -10, SEEK_CUR);
-   /* One final check: "compressed under UNIX". */
-   if (c != 0x03)
-   state = -1;
-   else
-   return unpack_bzimage(fd);
-   }
-   }
-   errx(1, "Could not find kernel in bzImage");
+   u8 hdr[1024];
+   int r;
+   /* Modern bzImages get loaded at 1M. */
+   void *p = from_guest_phys(0x10);
+
+   /* Go back to the start of the file and read the header.  It should be
+* a Linux boot header (see Documentation/i386/boot.txt) */
+   lseek(fd, 0, SEEK_SET);
+   read(fd, hdr, sizeof(hdr));
+
+   /* At offset 0x202, we expect the magic "HdrS" */
+   if (memcmp(hdr + 0x202, "HdrS", 4) != 0)
+   errx(1, "This doesn't look like a bzImage to me");
+
+   /* The byte at 0x1F1 tells us how many extra sectors of
+* header: skip over them all. */
+   lseek(fd, (unsigned long)(hdr[0x1F1]+1) * 512, SEEK_SET);
+
+   /* Now read everything into memory. in nice big chunks. */
+   while ((r = read(fd, p, 65536)) > 0)
+   p += r;
+
+   /* Finally, 0x214 tells us where to start the kernel. */
+   return *(unsigned long *)[0x214];
 }
 
 /*L:140 Loading the kernel is easy when it's a "vmlinux", but most kernels


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the 

Re: PROBLEM: high load average when idle

2007-10-02 Thread Mark Lord

Arjan van de Ven wrote:

On Tue, 02 Oct 2007 18:46:18 -0400

On a related note, {set/get}itimer() currently are buggy (since
2.6.11 or so), also due to this round_jiffies() thing I believe.


I very much believe that it is totally unrelated... most of all since
round_jiffies() wasn't in the kernel then an also isn't used anywhere
near these timers.


Ah, yes, you're correct.  The itimer routines do their *own* rounding.

-ml
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..

2007-10-02 Thread Mel Gorman
On Tue, 2007-10-02 at 18:09 -0400, Bill Davidsen wrote:
> Mel Gorman wrote:
> > On (02/10/07 14:15), Ingo Molnar didst pronounce:
> >> * Mel Gorman <[EMAIL PROTECTED]> wrote:
> >>
> >>> Dirt. Booting with "profile=sleep,2" is broken in 2.6.23-rc9 and 
> >>> 2.6.23-rc8 but working in 2.6.22. I was checking it out as part of a 
> >>> discussion in another thread and noticed it broken in -mm as well 
> >>> (2.6.23-rc8-mm2). Bisect is in progress but suggestions as to the 
> >>> prime candidates are welcome or preferably, pointing out that I'm an 
> >>> idiot because I missed twiddling some config change.
> >> Mel, does the patch below fix this bug for you? (Note: you will need to 
> >> enable CONFIG_SCHEDSTATS=y too.)
> >>
> > 
> > Nice one Ingo - got it first try. The problem commit was
> > dd41f596cda0d7d6e4a8b139ffdfabcefdd46528 and it's clear that the code 
> > removed
> > in this commit is put back by this latest patch.  When applied, 
> > profile=sleep
> > works as long as CONFIG_SCHEDSTAT is set.
> > 
> And if it isn't set? I can easily see building a new kernel with stats 
> off and forgetting to change the boot options.
> 

If CONFIG_SCHEDSTAT is off and profile=sleep is set, you see with Ingo's
patch and readprofile;

 0 *unknown*
 0 total  0.

That is a tad confusing hence my follow-up patch which would say
"/proc/profile" doesn't exist when readprofile is used and the warning
in dmesg.

-- 
Mel Gorman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git patches] net driver updates

2007-10-02 Thread David Miller
From: Jeff Garzik <[EMAIL PROTECTED]>
Date: Tue, 2 Oct 2007 13:41:50 -0400

> Please pull from the 'upstream' branch of
> master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git upstream

Pulled and pushed back out to net-2.6.24, thanks Jeff!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/5] add WEAK() for creating weak asm labels

2007-10-02 Thread Rusty Russell
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>

---
 include/linux/linkage.h |6 ++
 1 file changed, 6 insertions(+)

===
--- a/include/linux/linkage.h
+++ b/include/linux/linkage.h
@@ -34,6 +34,12 @@
   name:
 #endif
 
+#ifndef WEAK
+#define WEAK(name)\
+   .weak name;\
+   name:
+#endif
+
 #define KPROBE_ENTRY(name) \
   .pushsection .kprobes.text, "ax"; \
   ENTRY(name)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/5] i386: paravirt boot sequence

2007-10-02 Thread Rusty Russell
This patch uses the updated boot protocol to do paravirtualized boot.
If the boot version is >= 2.07, then it will do two things:

 1. Check the bootparams loadflags to see if we should reload the
segment registers and clear interrupts.  This is appropriate
for normal native boot and some paravirtualized environments, but
inapproprate for others.

 2. Check the hardware architecture, and dispatch to the appropriate
kernel entrypoint.  If the bootloader doesn't set this, then we
simply do the normal boot sequence.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>
Cc: James Bottomley <[EMAIL PROTECTED]>
---
 arch/i386/boot/compressed/head.S |   14 +--
 arch/i386/boot/compressed/misc.c |4 +++
 arch/i386/boot/header.S  |7 -
 arch/i386/kernel/head.S  |   47 ++
 4 files changed, 65 insertions(+), 7 deletions(-)

diff -r 5d471e4c931d arch/i386/boot/compressed/head.S
--- a/arch/i386/boot/compressed/head.S  Tue Oct 02 22:13:34 2007 +1000
+++ b/arch/i386/boot/compressed/head.S  Tue Oct 02 22:20:25 2007 +1000
@@ -27,19 +27,30 @@
 #include 
 #include 
 #include 
+#include 
 
 .section ".text.head","ax",@progbits
.globl startup_32
 
 startup_32:
-   cld
-   cli
+   /* check to see if KEEP_SEGMENTS flag is meaningful */
+   cmpw $0x207, BP_version(%esi)
+   jb 1f
+
+   /* test KEEP_SEGMENTS flag to see if the bootloader is asking
+* us to not reload segments */
+   testb $(1<<6), BP_loadflags(%esi)
+   jnz 2f
+
+1: cli
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
movl %eax,%ss
+
+2: cld
 
 /* Calculate the delta between where we were compiled to run
  * at and where we were actually loaded at.  This can only be done
diff -r 5d471e4c931d arch/i386/boot/compressed/misc.c
--- a/arch/i386/boot/compressed/misc.c  Tue Oct 02 22:13:34 2007 +1000
+++ b/arch/i386/boot/compressed/misc.c  Tue Oct 02 22:13:34 2007 +1000
@@ -246,6 +246,9 @@ static void putstr(const char *s)
 {
int x,y,pos;
char c;
+
+   if (RM_SCREEN_INFO.orig_video_mode == 0 && lines == 0 && cols == 0)
+   return;
 
x = RM_SCREEN_INFO.orig_x;
y = RM_SCREEN_INFO.orig_y;
diff -r 5d471e4c931d arch/i386/boot/header.S
--- a/arch/i386/boot/header.S   Tue Oct 02 22:13:34 2007 +1000
+++ b/arch/i386/boot/header.S   Tue Oct 02 22:13:34 2007 +1000
@@ -119,7 +119,7 @@ 1:
# Part 2 of the header, from the old setup.S
 
.ascii  "HdrS"  # header signature
-   .word   0x0206  # header version number (>= 0x0105)
+   .word   0x0207  # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
.globl realmode_swtch
 realmode_swtch:.word   0, 0# default_switch, SETUPSEG
@@ -214,6 +214,11 @@ cmdline_size:   .long   COMMAND_LINE_SIZ
 #added with boot protocol
 #version 2.06
 
+hardware_subarch:  .long 0 # subarchitecture, added with 
2.07
+   # default to 0 for normal x86 PC
+
+hardware_subarch_data: .quad 0
+
 # End of setup header #
 
.section ".inittext", "ax"
diff -r 5d471e4c931d arch/i386/kernel/head.S
--- a/arch/i386/kernel/head.S   Tue Oct 02 22:13:34 2007 +1000
+++ b/arch/i386/kernel/head.S   Tue Oct 02 22:21:01 2007 +1000
@@ -70,22 +70,30 @@ INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + 
  */
 .section .text.head,"ax",@progbits
 ENTRY(startup_32)
+   /* check to see if KEEP_SEGMENTS flag is meaningful */
+   cmpw $0x207, BP_version(%esi)
+   jb 1f
+
+   /* test KEEP_SEGMENTS flag to see if the bootloader is asking
+   us to not reload segments */
+   testb $(1<<6), BP_loadflags(%esi)
+   jnz 2f
 
 /*
  * Set segments to known values.
  */
-   cld
-   lgdt boot_gdt_descr - __PAGE_OFFSET
+1: lgdt boot_gdt_descr - __PAGE_OFFSET
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
+2:
 
 /*
  * Clear BSS first so that there are no surprises...
- * No need to cld as DF is already clear from cld above...
- */
+ */
+   cld
xorl %eax,%eax
movl $__bss_start - __PAGE_OFFSET,%edi
movl $__bss_stop - __PAGE_OFFSET,%ecx
@@ -119,6 +127,35 @@ 2:
movsl
 1:
 
+#ifdef CONFIG_PARAVIRT
+   cmpw $0x207, (boot_params + BP_version - __PAGE_OFFSET)
+   jb default_entry
+
+   /* Paravirt-compatible boot parameters.  Look to see what 

[PATCH 0/5] Boot protocol changes

2007-10-02 Thread Rusty Russell
Hi all,

Jeremy had some boot changes for bzImages, but buried in there was an
update to the boot protocol to support Xen and lguest (and kvm-lite).
I've copied those fairly simple patches, and if HPA is happy I'd like to
push them for 2.6.24 (after correcting for the Great Arch Merge of
course).

Thanks,
Rusty.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/5] update boot spec to 2.07

2007-10-02 Thread Rusty Russell
Proposed updates for version 2.07 of the boot protocol.  This includes:

load_flags.KEEP_SEGMENTS- flag to request/inhibit segment reloads
hardware_subarch- what subarchitecture we're booting under
hardware_subarch_data   - per-architecture data

The intention of these changes is to make booting a paravirtualized
kernel work via the normal Linux boot protocol.  The intention is that
the bzImage payload can be a properly formed ELF file, so that the
bootloader can use its ELF notes and Phdrs to get more metadata about
the kernel and its requirements.

The ELF file could be the uncompressed kernel vmlinux itself; it would
only take small buildsystem changes to implement this.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>

---
 Documentation/i386/boot.txt|   34 +-
 arch/i386/kernel/asm-offsets.c |7 +++
 include/asm-i386/bootparam.h   |9 +++--
 3 files changed, 47 insertions(+), 3 deletions(-)

diff -r cff7afab3bac Documentation/i386/boot.txt
--- a/Documentation/i386/boot.txt   Tue Oct 02 22:05:10 2007 +1000
+++ b/Documentation/i386/boot.txt   Tue Oct 02 22:05:10 2007 +1000
@@ -168,6 +168,8 @@ 0234/1  2.05+   relocatable_kernel Whether 
 0234/1 2.05+   relocatable_kernel Whether kernel is relocatable or not
 0235/3 N/A pad2Unused
 0238/4 2.06+   cmdline_sizeMaximum size of the kernel command line
+023C/4 2.07+   hardware_subarch Hardware subarchitecture
+0240/8 2.07+   hardware_subarch_data Subarchitecture-specific data
 
 (1) For backwards compatibility, if the setup_sects field contains 0, the
 real value is 4.
@@ -204,7 +206,7 @@ boot loaders can ignore those fields.
 
 The byte order of all fields is littleendian (this is x86, after all.)
 
-Field name:setup_secs
+Field name:setup_sects
 Type:  read
 Offset/size:   0x1f1/1
 Protocol:  ALL
@@ -356,6 +358,13 @@ Protocol:  2.00+
- If 0, the protected-mode code is loaded at 0x1.
- If 1, the protected-mode code is loaded at 0x10.
 
+  Bit 6 (write): KEEP_SEGMENTS
+   Protocol: 2.07+
+   - if 0, reload the segment registers in the 32bit entry point.
+   - if 1, do not reload the segment registers in the 32bit entry point.
+   Assume that %cs %ds %ss %es are all set to flat segments with
+   a base of 0 (or the equivalent for their environment).
+
   Bit 7 (write): CAN_USE_HEAP
Set this bit to 1 to indicate that the value entered in the
heap_end_ptr is valid.  If this field is clear, some setup code
@@ -479,6 +488,29 @@ Protocol:  2.06+
   zero. This means that the command line can contain at most
   cmdline_size characters. With protocol version 2.05 and earlier, the
   maximum size was 255.
+
+Field name:hardware_subarch
+Type:  write
+Offset/size:   0x23c/4
+Protocol:  2.07+
+
+  In a paravirtualized environment the hardware low level architectural
+  pieces such as interrupt handling, page table handling, and
+  accessing process control registers needs to be done differently.
+
+  This field allows the bootloader to inform the kernel we are in one
+  one of those environments.
+
+  0x   The default x86/PC environment
+  0x0001   lguest
+  0x0002   Xen
+
+Field name:hardware_subarch_data
+Type:  write
+Offset/size:   0x240/8
+Protocol:  2.07+
+
+  A pointer to data that is specific to hardware subarch
 
 
  THE KERNEL COMMAND LINE
diff -r cff7afab3bac arch/i386/kernel/asm-offsets.c
--- a/arch/i386/kernel/asm-offsets.cTue Oct 02 22:05:10 2007 +1000
+++ b/arch/i386/kernel/asm-offsets.cTue Oct 02 22:05:10 2007 +1000
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -145,4 +146,10 @@ void foo(void)
OFFSET(LGUEST_PAGES_regs_errcode, lguest_pages, regs.errcode);
OFFSET(LGUEST_PAGES_regs, lguest_pages, regs);
 #endif
+
+   BLANK();
+   OFFSET(BP_scratch, boot_params, scratch);
+   OFFSET(BP_loadflags, boot_params, hdr.loadflags);
+   OFFSET(BP_hardware_subarch, boot_params, hdr.hardware_subarch);
+   OFFSET(BP_version, boot_params, hdr.version);
 }
diff -r cff7afab3bac include/asm-i386/bootparam.h
--- a/include/asm-i386/bootparam.h  Tue Oct 02 22:05:10 2007 +1000
+++ b/include/asm-i386/bootparam.h  Tue Oct 02 22:05:10 2007 +1000
@@ -25,8 +25,9 @@ struct setup_header {
u16 kernel_version;
u8  type_of_loader;
u8  loadflags;
-#define LOADED_HIGH0x01
-#define CAN_USE_HEAP   0x80
+#define LOADED_HIGH(1<<0)
+#define KEEP_SEGMENTS  (1<<6)
+#define CAN_USE_HEAP   (1<<7)
u16 setup_move_size;
u32 code32_start;
u32 ramdisk_image;
@@ -38,6 +39,10 @@ struct setup_header {
u32 

Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..

2007-10-02 Thread Linus Torvalds


On Wed, 3 Oct 2007, Diego Calleja wrote:
> 
> Also...if someone dislikes something in http://kernelnewbies.org/Linux_2_6_23 
> ,
> or wants to fix my english, do it soon :)

Heh. The "remove sk98lin driver" bullet is sadly wrong. We had to 
reinstate it because it supported some cards that the skge driver doesn't 
handle.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23-rc7-mm1 -- powerpc rtas panic

2007-10-02 Thread Linas Vepstas
On Mon, Sep 24, 2007 at 01:35:31PM +0100, Andy Whitcroft wrote:
> Seeing the following from an older power LPAR, pretty sure we had
> this in the previous -mm also:

I haven't forgetten about this ... and am looking at it now.
Seems that whenever I go to reserve the machine pSeries-102,
someone else is using it :-)

--linas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PROBLEM: high load average when idle

2007-10-02 Thread Arjan van de Ven
On Tue, 02 Oct 2007 18:33:58 -0400
> Or, everybody wakes up at once right when we are taking a sample.  :)

nice try but we sample every timer tick; this code being timer driven
makes it what you say it is regardless of *which* timer tick it
happens at ;)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Point of gpl-only modules (flame)

2007-10-02 Thread Arjan van de Ven
On Tue, 02 Oct 2007 23:49:04 +0200
Jimmy <[EMAIL PROTECTED]> wrote:

> I know I'll be getting hell for this, I must be a masochist.
> 


DO NOT FEED THE TROLL.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-02 Thread Linus Torvalds


On Tue, 2 Oct 2007, Linus Torvalds wrote:
> 
> I don't know who came up with it, or why people continue to feed the 
> insane ideas. Why do people think that servers don't care about latency? 
> Why do people believe that desktop doesn't have multiple processors or 
> through-put intensive loads? Why are people continuing this *idiotic* 
> scheduler discussion?

Btw, one thing that is true: while both servers and desktop cares about 
latency, it's often easier to *see* the issues on the desktop (or hear 
them: audio skipping).

But that doesn't mean that the server people wouldn't care, and it doesn't 
mean that scheduling would be "fundamentally different" on servers or the
desktop.

In contrast, security really *is* fundamentally different in different 
situations. For example, I find SELinux to be so irrelevant to my usage 
that I don't use it at all. I just don't have any other users on my 
machine, so the security I care about is in firewalls etc. And that really 
*is* fundamentally different from a system that has shell access to its 
users. Which in turn is fundamentally different from one that has some 
legal reasons why it needs to have a particular kind of security. Which in 
turn is fundamentally different from 

You get the idea.

It boils down to: "scheduling is scheduling", and doesn't really change 
apart from the kind of decisions that are required by any scheduler (ie RT 
vs non-RT etc). Everybody wants the same thing in the end: low latency for 
loads where that matters, high bandwidth for loads where that matters. 
It's not a "one user has only one kind of load". Not at all.

Security, on the other hand, very much does depend on the circumstances 
and the wishes of the users (or policy-makers). And if we had one module 
that everybody would be happy with, I'd not make it pluggable either. But 
as it is, we _know_ that's not the case. 

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   5   6   7   8   9   >