date:20070301

Re: Bug #7674 (shutdown hd noise) EDIT: wrong address, sorry!

2007-03-01 Thread Francesco Pretto


I apologize to the lkml users for the unwanted noise. I made a mistake
with my mail client.

Francesco

2007/3/2, Francesco Pretto <[EMAIL PROTECTED]>:


I'll send you a message from the thread. You only have to reply to it
(with the reply function of your browser), changing the To: address to
linux-kernel@vger.kernel.org (you don't have to be subscribed; I'm not,
for example). Hopefully it will maintain the headers and merge
with the rest of the thread.

Bye


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] Fixes and cleanups for earlyprintk aka boot console.

2007-03-01 Thread Andrew Morton

On Tue, 20 Feb 2007 12:35:49 +0100 Gerd Hoffmann <[EMAIL PROTECTED]> wrote:

> The console subsystem already has an idea of a boot console, using the
> CON_BOOT flag.  The implementation has some flaws though.  The major
> problem is that presence of a boot console makes register_console()
> ignore any other console devices (unless explicitly specified on the
> kernel command line).
> 
> This patch fixes the console selection code to *not* consider a boot
> console a full-featured one, so the first non-boot console registering
> will become the default console instead.  This way the unregister call
> for the boot console in the register_console() function actually
> triggers and the handover from the boot console to the real console
> device works smoothly.  Added a printk for the handover, so you know
> which console device the output goes to when the boot console stops
> printing messages.
> 
> The disable_early_printk() call is obsolete with that patch, explicitly
> disabling the early console isn't needed any more as it works
> automagically with that patch.
> 
> I've walked through the tree, dropped all disable_early_printk()
> instances found below arch/ and tagged the consoles with CON_BOOT if
> needed.
> 
> The code is tested on x86 only so far.  It is probably a good idea to
> run it in -mm for a while to shake out any architecture issues which
> might show up.  Comments?

It blows up on powerpc:

drivers/built-in.o(.init.text+0x2080): In function `.console_init':
: undefined reference to `.disable_early_printk'

and the below patch might help.

But my confidence level isn't high so I'll drop it for now.  I have a feeling
this will need careful testing.

--- a/arch/x86_64/kernel/early_printk.c~fixes-and-cleanups-for-earlyprintk-aka-boot-console-fix
+++ a/arch/x86_64/kernel/early_printk.c
@@ -249,17 +249,3 @@ static int __init setup_early_printk(cha
 }
 
 early_param("earlyprintk", setup_early_printk);
-
-void __init disable_early_printk(void)
-{
-	if (!early_console_initialized || !early_console)
-		return;
-	if (!keep_early) {
-		printk("disabling early console\n");
-		unregister_console(early_console);
-		early_console_initialized = 0;
-	} else {
-		printk("keeping early console\n");
-	}
-}
-
diff -puN drivers/char/tty_io.c~fixes-and-cleanups-for-earlyprintk-aka-boot-console-fix drivers/char/tty_io.c
--- a/drivers/char/tty_io.c~fixes-and-cleanups-for-earlyprintk-aka-boot-console-fix
+++ a/drivers/char/tty_io.c
@@ -141,8 +141,6 @@ static DECLARE_MUTEX(allocated_ptys_lock
 static int ptmx_open(struct inode *, struct file *);
 #endif
 
-extern void disable_early_printk(void);
-
 static void initialize_tty_struct(struct tty_struct *tty);
 
 static ssize_t tty_read(struct file *, char __user *, size_t, loff_t *);
@@ -3889,13 +3887,6 @@ void __init console_init(void)
 	/* Setup the default TTY line discipline. */
 	(void) tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY);
 
-	/*
-	 * set up the console device so that later boot sequences can 
-	 * inform about problems etc..
-	 */
-#ifdef CONFIG_EARLY_PRINTK
-	disable_early_printk();
-#endif
 	call = __con_initcall_start;
 	while (call < __con_initcall_end) {
 		(*call)();
_


Re: Bug #7674 (shutdown hd noise)

2007-03-01 Thread Francesco Pretto


2007/3/2, Dan Gilliam <[EMAIL PROTECTED]>:

Hi Francesco,

I just tried to submit a plea to that address, but it's not letting me
post to it (refused).  Help!
Dan



I'll send you a message from the thread. You only have to reply to it
(with the reply function of your browser), changing the To: address to
linux-kernel@vger.kernel.org (you don't have to be subscribed; I'm not,
for example). Hopefully it will maintain the headers and merge
with the rest of the thread.

Bye

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Sure we will. And you believe that the newer controllers will be able 
> > to magically shrink the SG lists somehow? We will offload the 
> > coalescing of the page structs into bios in hardware or some such thing? 
> > And the vmscans etc too?
> 
> As far as pagecache page management goes, is that an issue for you?
> I don't want to know about how many billions of pages for some operation,
> just some profiles.

If there are billions of pages in the system and we are allocating and 
deallocating then pages need to be aged. If only a few pages are 
freeable then we run into issues.

> > > I understand you have controllers (or maybe it is a block layer limit)
> > > that don't work well with 4K pages, but work OK with 16K pages.
> > Really? This is the first that I have heard about it.
> Maybe that's the issue you're running into.

Oh, I am running into an issue on a system that does not yet exist? I am 
extrapolating from the problems that we commonly see now. Those will get 
worse the more memory increases.

> > > This is not something that we would introduce variable sized pagecache
> > > for, surely.
> > I am not sure where you get the idea that this is the sole reason why we 
> > need to be able to handle larger contiguous chunks of memory.
> I'm not saying that. You brought up this subject of variable sized pagecache.

You keep bringing the 4k/16k issue into this for some reason. I just 
want the ability to handle large amounts of memory. Larger page sizes are 
a way to accomplish that.

> Eventually, increasing x86 page size a bit might be an idea. We could even
> do it in software if CPU manufacturers don't do it for us.

A bit? Are we back to the 4k/16k issue? We need to reach 2M at minimum. 
Some way to handle contiguous memory segments of 1GB and larger 
effectively would be great.

> That doesn't buy us a great deal if you think there is this huge looming
> problem with struct page management though.

I am not the first one. See Rik's posts regarding the reasons for his 
new page replacement algorithms.

Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread

2007-03-01 Thread Gautham R Shenoy

> From: Paul E. McKenney <[EMAIL PROTECTED]>
> 
> Remove PF_NOFREEZE from the rcutorture thread, adding a try_to_freeze() call
> as required.
> 
> Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]>
> Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
> Acked-by: Pavel Machek <[EMAIL PROTECTED]>
> ---
>  kernel/rcutorture.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.20-mm2/kernel/rcutorture.c
> ===
> --- linux-2.6.20-mm2.orig/kernel/rcutorture.c 2007-02-25 12:07:15.0 +0100
> +++ linux-2.6.20-mm2/kernel/rcutorture.c  2007-02-25 12:49:23.0 +0100
> @@ -46,6 +46,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  MODULE_LICENSE("GPL");
>  MODULE_AUTHOR("Paul E. McKenney <[EMAIL PROTECTED]> and "
> @@ -585,7 +586,6 @@ rcu_torture_writer(void *arg)
> 
>   VERBOSE_PRINTK_STRING("rcu_torture_writer task started");
>   set_user_nice(current, 19);
> - current->flags |= PF_NOFREEZE;
> 
>   do {
>   schedule_timeout_uninterruptible(1);
> @@ -607,6 +607,7 @@ rcu_torture_writer(void *arg)
>   }
>   rcu_torture_current_version++;
>   oldbatch = cur_ops->completed();
> + try_to_freeze();
>   } while (!kthread_should_stop() && !fullstop);
>   VERBOSE_PRINTK_STRING("rcu_torture_writer task stopping");
>   while (!kthread_should_stop())

Paul, 
Any reasons for not try_to_freeze()'ing the fakewriter and the reader
threads?? (Ok, I admit, I haven't looked into the code for the reason
which might be obvious.)


thanks
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Re: 2.6.21-rc1: known regressions (part 2)

2007-03-01 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> But most likely, 9f4bd5dd is actually already bad, and what you are 
> seeing is two *different* bugs that just have the same symptoms 
> ("suspend doesn't work").

the situation is simpler than that: there is a /known/ bug, and i marked 
the bugfix commit as 'good'. I never met such a multiple-bugs scenario 
before and forgot that git-bisect could easily pick a tree without this 
essential bugfix and would not be able to make a distinction between the 
two types of badness.

I'll try what i've described in the previous mail: mark all bisection 
points that do not include f3ccb06f as 'good' - thus 'merging' the 
known-bad area with the first known-good commit, and thus eliminating it 
from the bisection space.

(but it might also be useful to have a "git-bisect must-include" kind of 
command that would allow this to be automated: mark a particular tree as 
an essential component of the search space.)

Ingo

Re: [Fastboot] [PATCH RFC 0/5] hard_smp_processor_id overhaul

2007-03-01 Thread Horms

On Thu, Mar 01, 2007 at 04:16:13PM +0900, Fernando Luis Vázquez Cao wrote:
> With the advent of kdump, the assumption that the boot CPU when running
> an UP kernel is always the CPU with a hardware ID of 0 (usually referred
> to as BSP on some architectures) does not hold true anymore. The reason
> being that the dump capture kernel boots on the crashed CPU (the CPU
> that invoked crash_kexec).
> 
> As a consequence, the hardcoding of hard_smp_processor_id() to 0 on UP
> systems (see "linux/smp.h") is not correct.
> 
> This patch-set does the following:
> 
> 1- Remove hardcoding of hard_smp_processor_id on UP systems.
> 
> 2- Ask the hardware when possible to obtain the hardware processor id on
> i386, x86_64, and ia64, independently of whether CONFIG_SMP is set or
> not.
> 
> 3- Move definition of hard_smp_processor_id for the UP case to asm/smp.h
> on alpha, m32r, powerpc, s390, sparc, sparc64, and um architectures. I
> guess that hardware features could be used to implement
> hard_smp_processor_id even in the UP case, but since I am not an expert
> in these architectures I just move the definition.
> 
> The patches have been tested on i386, x86_64, and ia64.

Hi Fernando,

These patches seem fine to me. Tested on ia64 (Tiger2)

Acked: Simon Horman <[EMAIL PROTECTED]>

Re: 2.6.21-rc1: known regressions (part 2)

2007-03-01 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Btw, you seem to have re-ordered the commits - the above is not the 
> order you did the bisection in. The known-good commit (f3ccb06..) is 
> in the middle. [...]

no - i simply picked them by hand, based on looking at gitk output, 
because bisection did not appear to find anything useful:

  9f4bd5dde81b5cb94e4f52f2f05825aa0422f1ff is first bad commit

And via that method i found a couple of more 'good' points - which 
git-bisect never picked up by itself. (and i did 3-4 separate git-bisect 
sessions, one of them was a "git-bisect start drivers/acpi/" - which is 
the main area of suspicion). I looked at git-bisect visualize more than 
once, and i've attached one of the bisection logs below.

i also think i know what happens. Firstly, my testing is reliable: as i 
mentioned in the other mail, i frequently re-visited commits to make 
sure that none of my bad/good decisions is spurious - but no, the test 
results are extremely reproducible: either the laptop resumes properly 
after flashing its disk light or it does not.

the problem i think is that i simply took git-bisect's behavior for 
granted (i used it many times already) but forgot about a very basic 
precondition: git-bisect will find only a /single/ good->bad transition.

If there is a bad->good transition combined with a good->bad transition 
then git-bisect will think it's the same 'badness', while it's a 
/former/ badness that it is homing in on - totally sending the bisection 
off into la-la-land.

so as i mentioned it in the first mail: i /know/ that this commit is a 
bad->good transition point:

  f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38

/and i only want to test commits that include this commit/ - because i 
know that without this commit git-bisect confuses the /other/ breakage 
with the new breakage. In the bisection log below, this choice of 
git-bisect:

  ee404566f97f9254433399fbbcfa05390c7c55f7

is 'bad' according to testing, but that's 'another' badness - and i 
missed it.

Now, having slept on it, the solution is very simple: whenever 
git-bisect picks a commit for which the following command comes up 
empty:

  git-log | grep f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38

then i'll mark it "git-bisect good" - artificially marking the older 
badness as a 'good' area. That way git-bisect will find the right 
good->bad transition point.
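
The marking rule described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct commit_state`, `naive_verdict()` and `fix_aware_verdict()` are hypothetical names, and the two flags stand in for "does this tree contain the known bugfix" and "did the resume test pass", which in practice come from the repository and from a manual test.

```c
#include <stdbool.h>

enum verdict { VERDICT_GOOD, VERDICT_BAD };

/* Hypothetical summary of one bisection point. */
struct commit_state {
    bool contains_fix;  /* is the known bugfix part of this tree's history? */
    bool resume_works;  /* did the suspend/resume test pass on this tree? */
};

/* Plain bisection: report exactly what the test showed. */
static enum verdict naive_verdict(struct commit_state c)
{
    return c.resume_works ? VERDICT_GOOD : VERDICT_BAD;
}

/*
 * Fix-aware rule: a tree without the known bugfix fails for the *old*
 * reason, so it is reported as good regardless of the test result.
 * Only trees that contain the fix are judged by the test.
 */
static enum verdict fix_aware_verdict(struct commit_state c)
{
    if (!c.contains_fix)
        return VERDICT_GOOD;
    return c.resume_works ? VERDICT_GOOD : VERDICT_BAD;
}
```

A tree that lacks the fix and fails the test is "bad" under the naive rule but "good" under the fix-aware rule, which is what keeps the older breakage out of the search. On git versions that support it, `git merge-base --is-ancestor` is one way to answer the `contains_fix` question without grepping the log.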

btw., that's why i tried to pick up commits by hand, making sure that 
commit f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 is always included - but 
got lost in the maze of the commit graph, and didn't realize that there 
is a simple solution. Nevertheless i wanted to dump the information i 
already gathered. Those commits were totally out of order, etc. - they 
were picked by a poor human who is much worse at walking graphs than 
git-bisect ;-)

Ingo

git-bisect start
# bad: [01363220f5d23ef68276db8974e46a502e43d01d] [PARISC] clocksource: Move update_cr16_clocksource later in boot
git-bisect bad 01363220f5d23ef68276db8974e46a502e43d01d
# good: [f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38] ACPI: Disable wake GPEs only once.
git-bisect good f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38
# bad: [ee404566f97f9254433399fbbcfa05390c7c55f7] sysctl: mips/au1000: remove sys_sysctl support
git-bisect bad ee404566f97f9254433399fbbcfa05390c7c55f7
# bad: [c827ba4cb49a30ce581201fd0ba2be77cde412c7] Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
git-bisect bad c827ba4cb49a30ce581201fd0ba2be77cde412c7
# bad: [68a696a01f482859a9fe937249e8b3d44252b610] Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-tc
git-bisect bad 68a696a01f482859a9fe937249e8b3d44252b610
# bad: [1c433fbda4896a6455d97b66a4f2646cbdd52a8c] [ALSA] soc - 0.13 ASoC headers
git-bisect bad 1c433fbda4896a6455d97b66a4f2646cbdd52a8c
# bad: [048b945077bdc7e8dff5d5810ff2a0ced3590ca9] [ALSA] echoaudio, add TLV support
git-bisect bad 048b945077bdc7e8dff5d5810ff2a0ced3590ca9
# bad: [c07584c83287ae5a13cc836f69a1d824ad068c66] [ALSA] hda-codec - Add support for Medion laptops
git-bisect bad c07584c83287ae5a13cc836f69a1d824ad068c66
# bad: [dbc6b6ad767c86907db373e85139b0e975ba7599] [ALSA] ASoC codecs: generic AC97 support
git-bisect bad dbc6b6ad767c86907db373e85139b0e975ba7599
# bad: [b66b3cfe6c2f6560f351278883a325b6ebc478f5] [ALSA] hda_intel: increase maximum DMA buffer size to 1024MB
git-bisect bad b66b3cfe6c2f6560f351278883a325b6ebc478f5
# bad: [12b131c4cf3eb1dc8a60082a434b7b100774c2e7] [ALSA] allow registering an alsa device with struct device pointer
git-bisect bad 12b131c4cf3eb1dc8a60082a434b7b100774c2e7
# bad: [e4f8e656d8c152c08cd44d0e3c21f009fab09952] [ALSA] usb-audio: allow pausing
git-bisect bad e4f8e656d8c152c08cd44d0e3c21f009fab09952
# bad: [1700f3080d98323e91864d67cb9f6d46f818ccf0] [ALSA] usb-audio: merge playback/capture hardware information structs
git-bisect bad 1700f3080d98323e91864d67cb9f6d46f818ccf0
# bad: [9f4bd5dde81b5cb94e4f52f2f05825aa0422f1ff] [ALSA] snd-emu

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Nick Piggin

On Thu, Mar 01, 2007 at 10:51:00PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > There was no talk about slightly. 1G page size would actually be quite 
> > > convenient for some applications.
> > 
> > But it is far from convenient for the kernel. So we have hugepages, so
> > we can stay out of the hair of those applications and they can stay out
> > of ours.
> 
> Huge pages cannot do I/O so we would get back to the gazillions of pages 
> to be handled for I/O. I'd love to have I/O support for huge pages. This 
> would address some of the issues.

Can't direct IO from a hugepage?

> > > Writing a terabyte of memory to disk with handling 256 billion page 
> > > structs? In case of a system with 1 petabyte of memory this may be rather 
> > > typical and necessary for the application to be able to save its state
> > > on disk.
> > 
> > But you will have newer IO controllers, faster CPUs...
> 
> Sure we will. And you believe that the newer controllers will be able 
> to magically shrink the SG lists somehow? We will offload the 
> coalescing of the page structs into bios in hardware or some such thing? 
> And the vmscans etc too?

As far as pagecache page management goes, is that an issue for you?
I don't want to know about how many billions of pages for some operation,
just some profiles.

> > Is it a problem or isn't it? Waving around the 256 billion number isn't
> > impressive because it doesn't really say anything.
> 
> It is the number of items that needs to be handled by the I/O layer and 
> likely by the SG engine.

The number is irrelevant, it is the rate that is important.
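
For scale, the page counts being argued about can be reproduced with a little arithmetic. The sketch below assumes 4 KiB base pages and binary units (1 TiB = 2^40 bytes, 1 PiB = 2^50 bytes); `pages_needed()` and the size macros are illustrative names, not kernel interfaces.

```c
#include <stdint.h>

#define KIB (1ULL << 10)
#define MIB (1ULL << 20)
#define GIB (1ULL << 30)
#define TIB (1ULL << 40)
#define PIB (1ULL << 50)

/* Number of pages of page_bytes each needed to cover mem_bytes (rounded up). */
static uint64_t pages_needed(uint64_t mem_bytes, uint64_t page_bytes)
{
    return (mem_bytes + page_bytes - 1) / page_bytes;
}
```

Under these assumptions `pages_needed(TIB, 4 * KIB)` is 268,435,456 (about 256 million), while `pages_needed(PIB, 4 * KIB)` is 274,877,906,944, which is the "256 billion" figure quoted above. With 2 MiB pages the petabyte case drops to 536,870,912 pages, and with 1 GiB pages to 1,048,576.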

> > I understand you have controllers (or maybe it is a block layer limit)
> > that don't work well with 4K pages, but work OK with 16K pages.
> 
> Really? This is the first that I have heard about it.
>

Maybe that's the issue you're running into.

> > This is not something that we would introduce variable sized pagecache
> > for, surely.
> 
> I am not sure where you get the idea that this is the sole reason why we 
> need to be able to handle larger contiguous chunks of memory.

I'm not saying that. You brought up this subject of variable sized pagecache.

> How about coming up with a response to the issue at hand? How do I write 
> back 1 Terabyte effectively? Ok this may be an exotic configuration today 
> but in one year this may be much more common. Memory sizes keep on 
> increasing and so is the number of page structs to be handled for I/O. At 
> some point we need a solution here.

Considering you're just handwaving about the actual problems, I
don't know. I assume you're sitting in front of some workload that has
gone wrong, so can't you elaborate?

Eventually, increasing x86 page size a bit might be an idea. We could even
do it in software if CPU manufacturers don't do it for us.

That doesn't buy us a great deal if you think there is this huge looming
problem with struct page management though.


Re: [patch 2.6.20-rc2] gpio_direction_output() needs an initial value

2007-03-01 Thread Milan Svoboda

> It's been pointed out that output GPIOs should have an initial value, to
> avoid signal glitching ... among other things, it can be some time before
> a driver is ready.  This patch corrects that oversight, fixing
> 
>  - documentation
>  - platforms supporting the GPIO interface
>  - users of that call (just one for now, others are pending)
> 
> Note that most platforms are clear about the hardware letting the output
> value be set before the pin direction is changed, but the s3c241x docs
> are vague on that topic ... so those chips might not avoid the glitches.
> 
> Signed-off-by: David Brownell <[EMAIL PROTECTED]>

Acked-by: Milan Svoboda <[EMAIL PROTECTED]>
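
The glitch the patch description refers to can be shown with a toy model. This is a sketch only: `struct fake_gpio` and both helpers are hypothetical and do not correspond to any platform's registers. It assumes a pin with an output latch and a direction bit, and records the level the pin drives at the moment it becomes an output.

```c
#include <stdbool.h>

/* Toy pin model: a direction bit plus an output latch. */
struct fake_gpio {
    bool is_output;
    int  latch;         /* value the pin drives once it is an output */
    int  first_driven;  /* level seen when the direction flipped; -1 if never */
};

static void make_output(struct fake_gpio *pin)
{
    if (!pin->is_output) {
        pin->is_output = true;
        pin->first_driven = pin->latch;
    }
}

/* Two-step sequence: flip direction, then set the value (can glitch). */
static int fake_direction_then_set(struct fake_gpio *pin, int value)
{
    make_output(pin);       /* briefly drives whatever was in the latch */
    pin->latch = !!value;
    return 0;
}

/* Direction change with an initial value: latch first, then flip. */
static int fake_direction_output(struct fake_gpio *pin, int value)
{
    pin->latch = !!value;
    make_output(pin);       /* first level on the wire is the wanted one */
    return 0;
}
```

Starting from a latch of 0 and asking for 1, the two-step helper records a first driven level of 0, while the single call with an initial value records 1. Whether real hardware allows the latch to be written before the direction changes is platform specific, as the note about the s3c241x docs above points out.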



Re: [PATCH] libata: Cable detection fixes

2007-03-01 Thread Michal Jaegermann

On Thu, Mar 01, 2007 at 08:33:17PM -0500, Jeff Garzik wrote:
> 
> That little change, buried in the middle of Alan's patch, changes the 
> probing order for a /lot/ of devices, possibly millions, when you 
> consider that it changes behavior of ata_piix (Intel SATA) as well as 
> all the not-yet-default PATA controllers.

Hm, I recently got my hands on hardware where 2.6.21-rc1 based
kernels from Fedora rawhide simply do not boot, as there is no
way to get to the disks.  I would not mind some change in behavior,
although so far I can boot at least some earlier kernels.

This looks like an ATIIXP issue and details are here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229621
Changelogs for kernels in question have this:

* Wed Feb 21 2007 Dave Jones <[EMAIL PROTECTED]>
- 2.6.21-rc1

   Michal

Re: belkin bulldog ups monitor vs 2.6.21-rc2

2007-03-01 Thread Gene Heskett

On Friday 02 March 2007, Con Kolivas wrote:
>On 02/03/07, Gene Heskett <[EMAIL PROTECTED]> wrote:
>> Greetings;
>>
>> I just rebooted to 2.6.21-rc2 and noted that getting x up and running
>> was about 15 seconds longer than usual.  When it got a bash shell
>> going I went to it and ran htop which showed that the bulldog monitor
>> was taking 90% of the cpu.  Killed it, then restarted it, but when I
>> ran the gui which ran fine and then stopped the gui, the daemon once
>> again went hog wild and had to be killed,  and I'm losing my kmail
>> composer focus for 30 seconds at a time now that amanda is making her
>> nightly run.
>>
>> There is nothing in the log about it other than from xinetd as it ran
>> the amanda server stuff.
>>
>> Not quite ready for prime time methinks.  Using the ck scheduler, this
>> is terrible performance, virtually no multitasking.  Back to
>> 2.6.20-ck1 in the morning if it lives the rest of the night.
>
>Hi Gene.
>
>I'm not sure if you're saying here that the performance is terrible on
>2.6.21-rc2 only with the -ck scheduler, or only 2.6.21-rc2, or that
>2.6.20-ck1 is terrible or that it fixes the problem.  Can you please
>clarify this?

I misspoke above now that I read it again, sorry Con.  I think I thought 
my fingers had put 'Comparing' in front of the 'Using' above. This time 
of the night, my mind has been known to be running a chapter or more 
ahead of (or in some cases behind) my fingers.

2.6.20-ck1 runs great.  2.6.21-rc2 was not only a dog, it fed amanda a 
bunch of lsd via bad data from tar: when tar was told to do a level 1 
while 21-rc2 (without your patch) was running, it actually did a level 0, 
and predictably ran out of vtape.  /usr/pix didn't change over 7GB of its 
contents overnight, in fact nothing changed there yesterday, but tar sure 
went on a rampage.

Sorry about the confusion.  I'm back in 2.6.20-ck1 and everything's cool.

>Regards,
>-ck

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

Re: [PATCH] slab: remove colouroff from struct slab

2007-03-01 Thread Andrew Morton

On Thu, 22 Feb 2007 14:37:38 +0200 (EET) Pekka J Enberg <[EMAIL PROTECTED]> wrote:

> As the color offset is always within the first page of the slab,
> virt_to_page() works just fine without slabp->colouroff.

kernel BUG at mm/slab.c:1658!
invalid opcode:  [#1]
SMP 
last sysfs file: /block/hdc/range
Modules linked in:
CPU:1
EIP:0060:[]Not tainted VLI
EFLAGS: 00010246   (2.6.21-rc2-mm1 #7)
EIP is at kmem_freepages+0xc8/0xd0
eax: 4000   ebx: c106e730   ecx:    edx: 
esi: 0001   edi: c21fcbe0   ebp: c2231e9c   esp: c2231e8c
ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
Process swapper (pid: 0, ti=c223 task=c222cac0 task.ti=c223)
Stack: c21fcbe0 c21fcbe0 f6252020 0002 c2231eac c017409c f62b7020 c1b74f80 
   c2231ec0 c013078a c1b74ffc  c05502c0 c2231ec8 c0130901 c2231ee0 
   c012495a c05519d0 0003 c04faf68 c0551a20 c2231efc c01242d7 000a 
Call Trace:
 [] show_trace_log_lvl+0x1a/0x30
 [] show_stack_log_lvl+0xa9/0xd0
 [] show_registers+0x1e9/0x2f0
 [] die+0x11a/0x250
 [] do_trap+0x91/0xc0
 [] do_invalid_op+0x97/0xb0
 [] error_code+0x7c/0x84
 [] kmem_rcu_free+0x1c/0x50
 [] __rcu_process_callbacks+0x6a/0x1c0
 [] rcu_process_callbacks+0x21/0x50
 [] tasklet_action+0x5a/0xe0
 [] __do_softirq+0x87/0x100
 [] do_softirq+0x57/0x60
 [] irq_exit+0x47/0x50
 [] smp_apic_timer_interrupt+0x55/0x90
 [] apic_timer_interrupt+0x33/0x38
 [] cpu_idle+0x7f/0xe0
 [] start_secondary+0x281/0x3c0
 [<>] 0x0
 ===
Code: fe ff 58 5b 5e 5f 5d c3 8b 03 89 f1 ba 09 00 00 00 f7 d9 c1 e8 1e 8d 04 
40 8d 04 c0 c1 e0 05 05 40 f1 4c c0 e8 9a dc fe ff eb 95 <0f> 0b eb fe 8d 74 26 
00 55 b9 6c dd 79 c0 89 e5 ba 52 fc 46 c0 
EIP: [] kmem_freepages+0xc8/0xd0 SS:ESP 0068:c2231e8c


#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.21-rc2-mm1
# Thu Mar  1 23:05:37 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_UTS_NS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
# CONFIG_CPUSETS is not set
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Processor type and features
#
# CONFIG_TICK_ONESHOT is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
CONFIG_MPENTIUMIII=y
# CONFIG_MPENTIUMM is not set
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWI

Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Ulrich Drepper

Andrew Morton wrote:
> Perhaps Ulrich can comment.

I was out of town, hence the delay.

I think that if there is no support for the syscall the correct answer
is to return ENOSYS.  In this case the current userlevel code would be
used and ENOSYS is also used to trigger the use of the compat code in
glibc in case the syscall does not exist at all.
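
The ENOSYS convention described above can be sketched as follows. This is a minimal illustration, not glibc's actual code: `alloc_fn`, `allocate_with_fallback()` and the three stub functions are hypothetical stand-ins for the native syscall path and the userlevel emulation.

```c
#include <errno.h>

/* Hypothetical signature for "reserve len bytes at offset on fd". */
typedef int (*alloc_fn)(int fd, long long offset, long long len);

/*
 * Try the native path first. ENOSYS means "no support here", so the
 * userlevel emulation is used instead. Any other error is returned
 * to the caller unchanged.
 */
static int allocate_with_fallback(alloc_fn native, alloc_fn emulate,
                                  int fd, long long offset, long long len)
{
    int ret = native(fd, offset, len);

    if (ret == -1 && errno == ENOSYS)
        return emulate(fd, offset, len);
    return ret;
}

/* Stubs for illustration only. */
static int emulate_calls;

static int native_unsupported(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    errno = ENOSYS;             /* kernel or filesystem lacks the call */
    return -1;
}

static int native_out_of_space(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    errno = ENOSPC;             /* a real failure: must not be masked */
    return -1;
}

static int emulate_in_userspace(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    emulate_calls++;            /* a real version would write the blocks */
    return 0;
}
```

With `native_unsupported` the call succeeds through the emulation, while with `native_out_of_space` the error is passed through and the emulation is never invoked. That distinction is why a dedicated "not supported" code matters: the caller can tell "use the compat path" apart from a genuine failure.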

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: belkin bulldog ups monitor vs 2.6.21-rc2

2007-03-01 Thread Gene Heskett

On Friday 02 March 2007, Gene Heskett wrote:
>Greetings;
>
>I just rebooted to 2.6.21-rc2 and noted that getting x up and running
> was about 15 seconds longer than usual.  When it got a bash shell going
> I went to it and ran htop which showed that the bulldog monitor was
> taking 90% of the cpu.  Killed it, then restarted it, but when I ran
> the gui which ran fine and then stopped the gui, the daemon once again
> went hog wild and had to be killed,  and I'm losing my kmail composer
> focus for 30 seconds at a time now that amanda is making her nightly
> run.
>
>There is nothing in the log about it other than from xinetd as it ran
> the amanda server stuff.
>
>Not quite ready for prime time methinks.  Using the ck scheduler, this
> is terrible performance, virtually no multitasking.  Back to 2.6.20-ck1
> in the morning if it lives the rest of the night.

Addendum: amanda finished early.  It seems tar thought every level was a 
level 0, so it ran out of storage after only 3 dle's were processed and 
backed up.  There are about 25 dle's.  It tried to put 11GB on an 8GB 
vtape, which, because it was a vtape, it could do.

So it appears something in the ext3 filesystem is sadly misinforming tar 
when it does the estimate scan vs. doing the real file reading.  Or is the 
scan updating the ctime?

I'm back on 2.6.20-ck1 & everything is copacetic again.  I'll find out if 
the filesystem is damaged tomorrow night, because if the ctimes are all 
screwed up, amanda will effectively be starting from scratch.  That is 
not exactly a Good Thing(TM).

I did find the ls -lt command, and the filesystem looks ok timewise when 
rebooted now.  I have no more ready clues without your able questions to 
guide me on this.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

Re: Kernel Null pointer dereference in sysfs_readdir()

2007-03-01 Thread Greg KH

On Thu, Mar 01, 2007 at 05:54:01PM -0800, Kunal Trivedi wrote:
> 5) OOPS messages from console.
><1>Unable to handle kernel NULL pointer dereference at virtual
> address 0018
><1> printing eip:
><4>e01a40c9
><1>*pde = 
><1>Oops:  [#1]
><4>SMP
><4>Modules linked in: ipt_state ip_conntrack iptable_filter
> cls_u32 iptable_mangle lm85 i2c_i801 w83627hf_wdt w83627hf i2c_sensor
> i2c_isa i2c_core slcmi ip_tables e7xxx_edac edac_mc
><4>CPU:2
><4>EIP:0060:[]Tainted: PF VLI
><4>EFLAGS: 00010286   (2.6.9-34.EL-i386_SMP)
><4>EIP is at sysfs_readdir+0xd9/0x210
><4>eax:    ebx: f7d6b104   ecx: 0006   edx: 0020
><4>esi: f7d6b100   edi: f7f1cb87   ebp: f7f1cb80   esp: ef432f48
><4>ds: 007b   es: 007b   ss: 0068
><4>Process sensors (pid: 2933, threadinfo=ef432000 task=f562c030)
><4>Stack: 0002  016c32f7 000a f7d6cc8c 0006
> f7ddbbc4 e017a670
><4>   ef432fa0 ed6e7280 e0409ba0 ed6e7280 f6f180b0 f6f18120
> e017a33f ef432fa0
><4>   e017a670 09ce61b4 ed6e7280 fff7  e017a81e
> 09ce6204 09ce61e4
><4>Call Trace:
><4> [] filldir64+0x0/0x140
><4> [] vfs_readdir+0xaf/0xd0
><4> [] filldir64+0x0/0x140
><4> [] sys_getdents64+0x6e/0xb6
><4> [] syscall_call+0x7/0xb
><4>Code: 26 00 89 f0 e8 89 e8 ff ff 89 c5 b9 ff ff ff ff 31 c0 89
> ef f2 ae f7 d1 49 89 4c 24 14 8b 46 20 85 c0 0f 84 22 01 00 00 8b 40
> 10 <8b> 50 18 0f b7 46 1c 89 54 24 08 8b 4c 24 24 c1 e8 0c 89 44 24
> 
> Please advise.

I suggest contacting the vendor providing the support for this old
kernel version, they should be able to help you out (although they might
ask you to not run a closed source driver in your kernel, as that
probably voids any support contract you might have.)

thanks,

greg k-h

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Andrew Morton

On Thu, 1 Mar 2007 22:51:00 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> I'd love to have I/O support for huge pages.

direct-IO works.

Re: belkin bulldog ups monitor vs 2.6.21-rc2

2007-03-01 Thread Con Kolivas


On 02/03/07, Gene Heskett <[EMAIL PROTECTED]> wrote:

> Greetings;
>
> I just rebooted to 2.6.21-rc2 and noted that getting x up and running was
> about 15 seconds longer than usual.  When it got a bash shell going I
> went to it and ran htop which showed that the bulldog monitor was taking
> 90% of the cpu.  Killed it, then restarted it, but when I ran the gui
> which ran fine and then stopped the gui, the daemon once again went hog
> wild and had to be killed,  and I'm losing my kmail composer focus for 30
> seconds at a time now that amanda is making her nightly run.
>
> There is nothing in the log about it other than from xinetd as it ran the
> amanda server stuff.
>
> Not quite ready for prime time methinks.  Using the ck scheduler, this is
> terrible performance, virtually no multitasking.  Back to 2.6.20-ck1 in
> the morning if it lives the rest of the night.


Hi Gene.

I'm not sure if you're saying here that the performance is terrible on
2.6.21-rc2 only with the -ck scheduler, or only 2.6.21-rc2, or that
2.6.20-ck1 is terrible or that it fixes the problem.  Can you please
clarify this?

Regards,
-ck

Re: [PATCH] scatterlist.h needs types.h

2007-03-01 Thread Jean Delvare

Hi Andrew,

On Thu, 1 Mar 2007 16:11:06 -0800, Andrew Morton wrote:
> On Thu, 1 Mar 2007 13:55:16 +0100
> Jean Delvare <[EMAIL PROTECTED]> wrote:
> 
> > Most architectures' scatterlist.h use the type dma_addr_t, but omit
> > to include  which defines it. This could lead to build
> > failures, so let's add the missing includes.
> 
> _does_ it actually lead to build errors?  If so, 2.6.21.  If not, 2.6.22.

No known build error at the moment, so 2.6.22 is fine with me.

I'm working on a patch cleaning up the inclusion of 
across the whole kernel, and this is how I've hit the problem. I'll
post that patch later today for comments.

Thanks,
-- 
Jean Delvare

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > There was no talk about slightly. 1G page size would actually be quite 
> > convenient for some applications.
> 
> But it is far from convenient for the kernel. So we have hugepages, so
> we can stay out of the hair of those applications and they can stay out
> of ours.

Huge pages cannot do I/O so we would get back to the gazillions of pages 
to be handled for I/O. I'd love to have I/O support for huge pages. This 
would address some of the issues.

> > Writing a terabyte of memory to disk with handling 256 billion page 
> > structs? In case of a system with 1 petabyte of memory this may be rather 
> > typical and necessary for the application to be able to save its state
> > on disk.
> 
> But you will have newer IO controllers, faster CPUs...

Sure we will. And you believe that the newer controllers will be able 
to magically shrink the SG lists somehow? We will offload the 
coalescing of the page structs into bios in hardware or some such thing? 
And the vmscans etc too?

> Is it a problem or isn't it? Waving around the 256 billion number isn't
> impressive because it doesn't really say anything.

It is the number of items that needs to be handled by the I/O layer and 
likely by the SG engine.

> I understand you have controllers (or maybe it is a block layer limit)
> that don't work well with 4K pages, but work OK with 16K pages.

Really? This is the first that I have heard about it.

> This is not something that we would introduce variable sized pagecache
> for, surely.

I am not sure where you get the idea that this is the sole reason why we 
need to be able to handle larger contiguous chunks of memory.

How about coming up with a response to the issue at hand? How do I write 
back 1 Terabyte effectively? OK, this may be an exotic configuration today, 
but in one year this may be much more common. Memory sizes keep on 
increasing and so does the number of page structs to be handled for I/O. At 
some point we need a solution here.
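For scale, the page counts in this exchange can be checked with a small sketch (pages_needed is a hypothetical helper, not a kernel function). With 4 KiB pages, 1 TiB works out to 2^28 pages (about 268 million) and 1 PiB to 2^38 pages (about 275 billion, i.e. 256 Gi), so the 256 billion figure corresponds to the petabyte case. With 1 GiB pages, 1 PiB would need only 2^20 entries.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages (and hence page structs / SG candidates) needed to
 * cover `bytes` of memory at a given page size, rounding up. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return (bytes + page_size - 1) / page_size;
}
```

For example, pages_needed(1ULL << 40, 4096) yields 1ULL << 28.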


Re: [PATCH 4/9] Vmi fix highpte

2007-03-01 Thread Jeremy Fitzhardinge

Zachary Amsden wrote:
> Yeah, actually that does work, since you pass the km_type, we can use
> that.  But I would rather not respin this for 2.6.21; getting this
> 100% right can be tricky, and we've already done a good deal of
> testing on this patch the way it is.

It seems fairly low risk to me; it's basically the same structure with
the same calls happening in the same order, but just slightly rearranged
in the source.  Of course, if I'd seen this patch earlier I could have
given you earlier feedback...

>   Do you have any objection to me creating a patch for -mm tree that
> implements kmap_atomic_pte the way you have described above and
> attaching it to the Xen patch series, but leaving the current patch as
> is for now?

Not particularly, but it seems odd to put something in knowing it's going
to be immediately replaced.  What's the urgency?

> Thanks, (and thanks for the suggestion - I was a little worried about
> how it would play with Xen when HIGHPTE support came around, but it
> looks like it will work for both of us with just one paravirt-op).

Yeah, the kpte_clear_flush change helped as well.  I have a patch to
make that into a pvop as well, since it's useful to do the clear+flush in
a single call.

J

Re: [PATCH] md: Fix for raid6 reshape.

2007-03-01 Thread Neil Brown

On Thursday March 1, [EMAIL PROTECTED] wrote:
> On Fri, 2 Mar 2007 15:56:55 +1100 NeilBrown <[EMAIL PROTECTED]> wrote:
> 
> > -   conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
> > +   conf->expand_progress = (sector_nr + i) * new_data_disks);
> 
> ahem.


It wasn't like that when I tested it, honest...
But the original got caught up with some other changes which were not
really related so I removed them all and just made this change
manually and totally messed it up (again).  Sorry.

Of course it should be

> > +   conf->expand_progress = (sector_nr + i) * new_data_disks;
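A rough sketch of why the multiplier changed, under stated assumptions: the definition of new_data_disks is not shown in this thread, so the code below assumes it is the device count minus the parity devices (one for RAID5, two for RAID6). parity_disks and expand_progress are hypothetical stand-ins, not the md driver's functions.

```c
#include <assert.h>

/* Assumption: RAID5 carries one parity block per stripe, RAID6 two. */
unsigned int parity_disks(int level)
{
    return level == 6 ? 2u : 1u;
}

/* Hypothetical stand-in for the reshape bookkeeping: progress is counted
 * in data sectors, so it scales with the number of data disks only. */
unsigned long long expand_progress(unsigned long long sector_nr,
                                   unsigned int i,
                                   unsigned int raid_disks, int level)
{
    unsigned int parity = parity_disks(level);
    unsigned int new_data_disks;

    assert(raid_disks > parity);
    new_data_disks = raid_disks - parity;
    return (sector_nr + i) * new_data_disks;
}
```

Under that assumption, the old expression (raid_disks-1) would overcount by one disk's worth of sectors on RAID6 while giving the same answer on RAID5.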

NeilBrown

Re: [PATCH] md: Fix for raid6 reshape.

2007-03-01 Thread Andrew Morton

On Fri, 2 Mar 2007 15:56:55 +1100 NeilBrown <[EMAIL PROTECTED]> wrote:

> - conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
> + conf->expand_progress = (sector_nr + i) * new_data_disks);

ahem.

Re: [PATCH 4/9] Vmi fix highpte

2007-03-01 Thread Zachary Amsden


Jeremy Fitzhardinge wrote:
> Jeremy Fitzhardinge wrote:
> > Hm, I don't think this interface will work for Xen.  In Xen, whenever a
> > pagetable page gets mapped, it must be mapped RO.  map_pt_hook gets
> > called after the mapping has already been created, so its too late for Xen.
> >
> > I was planning on adding kmap_atomic_pte() for use in pte_offset_map*(),
> > which would be wired through to paravirt_ops to allow Xen to make this a
> > RO mapping.  Would this be sufficient for you to do your vmi thing?
>
> Something like this (compiled, untested).
>
> J
>
> diff -r 972e84c265cf arch/i386/kernel/paravirt.c
> --- a/arch/i386/kernel/paravirt.c   Thu Mar 01 19:12:49 2007 -0800
> +++ b/arch/i386/kernel/paravirt.c   Thu Mar 01 19:38:42 2007 -0800
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* nop stub */
>  void _paravirt_nop(void)
> @@ -605,6 +606,8 @@ struct paravirt_ops paravirt_ops = {
>  
>  	.kpte_clear_flush = native_kpte_clear_flush,
>  
> +	.kmap_atomic_pte = native_kmap_atomic_pte,
> +
>  #ifdef CONFIG_X86_PAE
> .set_pte_atomic = native_set_pte_atomic,
> .set_pte_present = native_set_pte_present,
> diff -r 972e84c265cf arch/i386/mm/highmem.c
> --- a/arch/i386/mm/highmem.c   Thu Mar 01 19:12:49 2007 -0800
> +++ b/arch/i386/mm/highmem.c   Thu Mar 01 19:38:42 2007 -0800
> @@ -26,7 +26,7 @@ void kunmap(struct page *page)
>   * However when holding an atomic kmap is is not legal to sleep, so atomic
>   * kmaps are appropriate for short, tight code paths only.
>   */
> -void *kmap_atomic(struct page *page, enum km_type type)
> +void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot)
>  {
> enum fixed_addresses idx;
> unsigned long vaddr;
> @@ -41,9 +41,14 @@ void *kmap_atomic(struct page *page, enu
> return page_address(page);
>  
>  	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
> -   set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
> +   set_pte(kmap_pte-idx, mk_pte(page, prot));
>  
>  	return (void*) vaddr;
> +}
> +
> +void *kmap_atomic(struct page *page, enum km_type type)
> +{
> +   return _kmap_atomic(page, type, kmap_prot);
>  }
  


Yeah, actually that does work, since you pass the km_type, we can use 
that.  But I would rather not respin this for 2.6.21; getting this 100% 
right can be tricky, and we've already done a good deal of testing on 
this patch the way it is.  Do you have any objection to me creating a 
patch for -mm tree that implements kmap_atomic_pte the way you have 
described above and attaching it to the Xen patch series, but leaving 
the current patch as is for now?


Thanks, (and thanks for the suggestion - I was a little worried about 
how it would play with Xen when HIGHPTE support came around, but it 
looks like it will work for both of us with just one paravirt-op).


Zach

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Nick Piggin

On Thu, Mar 01, 2007 at 10:19:48PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > From the I/O controller and from the application.
> > 
> > Why doesn't the application need to deal with TLB entries?
> 
> Because it may only operate on a small section of the file and hopefully 
> splice the rest through? But yes support for mmapped I/O would be 
> necessary.

So you're talking about copying a file from one location to another?


> > > This would only be a temporary fix pushing the limits to the double or so?
> > 
> > And using slightly larger page sizes isn't?
> 
> There was no talk about slightly. 1G page size would actually be quite 
> convenient for some applications.

But it is far from convenient for the kernel. So we have hugepages, so
we can stay out of the hair of those applications and they can stay out
of ours.

> > > Amortized? The controller still would have to hunt down the 4kb page 
> > > pieces that we have to feed him right now. Result: Huge scatter gather 
> > > lists that may themselves create issues with higher page order.
> > 
> > What sort of numbers do you have for these controllers that aren't
> > very good at doing sg?
> 
> Writing a terabyte of memory to disk with handling 256 billion page 
> structs? In case of a system with 1 petabyte of memory this may be rather 
> typical and necessary for the application to be able to save its state
> on disk.

But you will have newer IO controllers, faster CPUs...

Is it a problem or isn't it? Waving around the 256 billion number isn't
impressive because it doesn't really say anything.

> > Isn't the issue was something like your IO controllers have only a
> > limited number of sg entries, which is fine with 16K pages, but with
> > 4K pages that doesn't give enough data to cover your RAID stripe?
> > 
> > We're never going to do a variable sized pagecache just because of that.
> 
> No, we need support for larger page sizes than 16k. 16k has not been fine 
> for a couple of years. We only agreed to 16k because that was the common 
> consensus. Best performance was always at 64k 4 years ago (but then we 
> have no numbers for higher page sizes yet). Now we would prefer much 
> larger sizes.

But you are in a tiny minority, so it is not so much a question of what
you prefer, but what you can make do with without being too intrusive.

I understand you have controllers (or maybe it is a block layer limit)
that don't work well with 4K pages, but work OK with 16K pages.
This is not something that we would introduce variable sized pagecache
for, surely.


Re: [PATCH 4/9] Vmi fix highpte

2007-03-01 Thread Jeremy Fitzhardinge

Zachary Amsden wrote:
> That doesn't quite work, since we need to know which of the two -
> KM_PTE0 or KM_PTE1 is being mapped.  But it could be moved to before
> the mapping, as you need, and take this as a parameter. 

Err, kmap_atomic_pte gets passed the type - KM_PTE0 or KM_PTE1.

J

Re: [PATCH 4/9] Vmi fix highpte

2007-03-01 Thread Zachary Amsden


Jeremy Fitzhardinge wrote:
> Jeremy Fitzhardinge wrote:
> > Hm, I don't think this interface will work for Xen.  In Xen, whenever a
> > pagetable page gets mapped, it must be mapped RO.  map_pt_hook gets
> > called after the mapping has already been created, so its too late for Xen.
> >
> > I was planning on adding kmap_atomic_pte() for use in pte_offset_map*(),
> > which would be wired through to paravirt_ops to allow Xen to make this a
> > RO mapping.  Would this be sufficient for you to do your vmi thing?
>
> Something like this (compiled, untested).
>
> J
>
> diff -r 972e84c265cf arch/i386/kernel/paravirt.c
> --- a/arch/i386/kernel/paravirt.c   Thu Mar 01 19:12:49 2007 -0800
> +++ b/arch/i386/kernel/paravirt.c   Thu Mar 01 19:38:42 2007 -0800
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* nop stub */
>  void _paravirt_nop(void)
> @@ -605,6 +606,8 @@ struct paravirt_ops paravirt_ops = {
>  
>  	.kpte_clear_flush = native_kpte_clear_flush,
>  
> +	.kmap_atomic_pte = native_kmap_atomic_pte,
> +
  


That doesn't quite work, since we need to know which of the two - 
KM_PTE0 or KM_PTE1 is being mapped.  But it could be moved to before the 
mapping, as you need, and take this as a parameter.


Zach

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > From the I/O controller and from the application.
> 
> Why doesn't the application need to deal with TLB entries?

Because it may only operate on a small section of the file and hopefully 
splice the rest through? But yes support for mmapped I/O would be 
necessary.

> > This would only be a temporary fix pushing the limits to the double or so?
> 
> And using slightly larger page sizes isn't?

There was no talk about slightly. 1G page size would actually be quite 
convenient for some applications.

> > Amortized? The controller still would have to hunt down the 4kb page 
> > pieces that we have to feed him right now. Result: Huge scatter gather 
> > lists that may themselves create issues with higher page order.
> 
> What sort of numbers do you have for these controllers that aren't
> very good at doing sg?

Writing a terabyte of memory to disk with handling 256 billion page 
structs? In case of a system with 1 petabyte of memory this may be rather 
typical and necessary for the application to be able to save its state
on disk.

> Isn't the issue was something like your IO controllers have only a
> limited number of sg entries, which is fine with 16K pages, but with
> 4K pages that doesn't give enough data to cover your RAID stripe?
> 
> We're never going to do a variable sized pagecache just because of that.

No, we need support for larger page sizes than 16k. 16k has not been fine 
for a couple of years. We only agreed to 16k because that was the common 
consensus. Best performance was always at 64k 4 years ago (but then we 
have no numbers for higher page sizes yet). Now we would prefer much 
larger sizes.

Re: [PATCH] <linux/sysdev.h> needs to include <linux/module.h>

2007-03-01 Thread Andrew Morton

On Sat, 24 Feb 2007 12:22:11 + Ralf Baechle <[EMAIL PROTECTED]> wrote:

> sysdev.h uses THIS_MODULE so should include <linux/module.h>.
> 
> Signed-off-by: Ralf Baechle <[EMAIL PROTECTED]>
> 
> diff --git a/include/linux/sysdev.h b/include/linux/sysdev.h
> index 389ccf8..e699ab2 100644
> --- a/include/linux/sysdev.h
> +++ b/include/linux/sysdev.h
> @@ -22,6 +22,7 @@
>  #define _SYSDEV_H_
>  
>  #include 
> +#include <linux/module.h>
>  #include 
>  


You can't just make changes like this without a lot of compile testing, I'm
afraid.

This causes a recursive inclusion and sched.h blows up:

In file included from include/linux/utsname.h:35,
 from include/asm/elf.h:12,
 from include/linux/elf.h:7,
 from include/linux/module.h:15,
 from include/linux/sysdev.h:25,
 from kernel/time/clocksource.c:28:
include/linux/sched.h:1648: warning: 'struct sysdev_class' declared inside 
parameter list
include/linux/sched.h:1648: warning: its scope is only this definition or 
declaration, which is probably not what you want


I think we can fix that by moving the declarations into cpu.h and getting
that unpleasant include out of sched.h.

Of course, this will probably make other things blow up and additional
sysdev.h includes will now be needed.  We'll see..
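As a side note on the warning itself, here is a minimal standalone sketch of how "declared inside parameter list" arises and one generic way to avoid it. This is a different technique from moving the declarations to another header, and demo_class / demo_create_entries are hypothetical names, not kernel symbols.

```c
/* Forward declaration at file scope.  Without this line, a compiler that
 * has not yet seen the struct treats the first mention inside the
 * prototype's parameter list as a new type scoped to that prototype only,
 * which is what the GCC warning complains about. */
struct demo_class;

int demo_create_entries(struct demo_class *cls);

/* The full definition can arrive later, e.g. from another header.
 * The single member here is purely illustrative. */
struct demo_class {
    const char *name;
};

/* Returns 0 when given a usable class, -1 otherwise. */
int demo_create_entries(struct demo_class *cls)
{
    return (cls && cls->name) ? 0 : -1;
}
```

A forward declaration only fixes the scoping of the type name; it does nothing about the order in which the headers pull each other in.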





diff -puN 
include/linux/cpu.h~linux-sysdevh-needs-to-include-linux-moduleh-up-fix 
include/linux/cpu.h
--- a/include/linux/cpu.h~linux-sysdevh-needs-to-include-linux-moduleh-up-fix
+++ a/include/linux/cpu.h
@@ -41,6 +41,9 @@ extern void cpu_remove_sysdev_attr(struc
 extern int cpu_add_sysdev_attr_group(struct attribute_group *attrs);
 extern void cpu_remove_sysdev_attr_group(struct attribute_group *attrs);
 
+extern struct sysdev_attribute attr_sched_mc_power_savings;
+extern struct sysdev_attribute attr_sched_smt_power_savings;
+extern int sched_create_sysfs_power_savings_entries(struct sysdev_class *cls);
 
 #ifdef CONFIG_HOTPLUG_CPU
 extern void unregister_cpu(struct cpu *cpu);
diff -puN 
include/linux/sched.h~linux-sysdevh-needs-to-include-linux-moduleh-up-fix 
include/linux/sched.h
--- a/include/linux/sched.h~linux-sysdevh-needs-to-include-linux-moduleh-up-fix
+++ a/include/linux/sched.h
@@ -1642,10 +1642,7 @@ static inline void arch_pick_mmap_layout
 extern long sched_setaffinity(pid_t pid, cpumask_t new_mask);
 extern long sched_getaffinity(pid_t pid, cpumask_t *mask);
 
-#include 
 extern int sched_mc_power_savings, sched_smt_power_savings;
-extern struct sysdev_attribute attr_sched_mc_power_savings, 
attr_sched_smt_power_savings;
-extern int sched_create_sysfs_power_savings_entries(struct sysdev_class *cls);
 
 extern void normalize_rt_tasks(void);
 
_


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Paul Mundt

On Fri, Mar 02, 2007 at 02:50:29PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Mar 2007 21:11:58 -0800 (PST)
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > The whole DRAM power story is a bedtime story for gullible children. Don't 
> > fall for it. It's not realistic. The hardware support for it DOES NOT 
> > EXIST today, and probably won't for several years. And the real fix is 
> > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> > is against the whole point of FBDIMM in the first place, but that's what 
> > you get when you ignore power in the first version!).
> > 
> 
> Note:
> I heard embedded people often design their own memory-power-off control on
> embedded Linux. (But it never seems to be posted to the list.) But I don't know
> whether they are interested in generic memory hotremove or not.
> 
Yes, this is not that uncommon. People tend to do this in a couple of
different ways. In some cases the system is too loaded to ever make doing
such a thing at run-time worthwhile, and in those cases these sorts of
things tend to be munged in with the suspend code. Unfortunately, it tends
to be quite difficult in practice to keep pages in one place, so people
rely on lame chip-select hacks and on limiting the amount of memory that
the kernel treats as RAM, so it never ends up being an issue. Having some
sort of a balance would certainly be nice, though.

Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Andrew Morton

On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> Just curious .. What does posix_fallocate() return ?

bookmark this:

http://www.opengroup.org/onlinepubs/009695399/nfindex.html

Upon successful completion, posix_fallocate() shall return zero;
otherwise, an error number shall be returned to indicate the error.
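In practice that means the return value itself carries the error, rather than -1 with errno set. A small sketch of a caller, where preallocate is a hypothetical wrapper and only posix_fallocate(), strerror() and fprintf() are real library calls:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Reserve the first `len` bytes of `fd`.  Returns 0 on success or the
 * error number reported by posix_fallocate(); the error comes back as
 * the return value, so it is passed to strerror() directly. */
int preallocate(int fd, off_t len)
{
    int err = posix_fallocate(fd, 0, len);

    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    return err;
}
```

Calling preallocate() on an invalid descriptor returns a non-zero error number.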


Re: [BUG 2.6.21-rc2] divide error: 0000

2007-03-01 Thread Willy Tarreau

On Thu, Mar 01, 2007 at 11:12:42PM +, Sean Young wrote:
> Apologies if this has already been reported.
> 
> If I call clock_gettime(CLOCK_THREAD_CPUTIME_ID, .. ) twice I get:
> 
> divide error:  [#1]
> Modules linked in: binfmt_misc rfcomm l2cap bluetooth sonypi speedstep_ich 
> speedstep_lib cpufreq_userspace cpufreq_stats cpufreq_powersave 
> cpufreq_ondemand freq_table cpufreq_conservative video thermal sbs processor 
> i2c_ec fan dock button battery ac af_packet ipv6 sbp2 lp usb_storage libusual 
> orinoco_cs orinoco hermes joydev tsdev usbhid pcmcia e100 mii psmouse 
> ohci1394 serio_raw yenta_socket rsrc_nonstatic pcmcia_core ieee1394 sr_mod 
> cdrom sg uhci_hcd parport_pc parport pcspkr evdev usbcore
> CPU:0
> EIP:0060:[]Not tainted VLI
> EFLAGS: 00010246   (2.6.21-rc2 #1)
> EIP is at sample_to_timespec+0x28/0x33
> eax: 63b5a669   ebx: fffa   ecx: 63b5a669   edx: fffa
> esi: d4a56fa4   edi: 3b9aca00   ebp: d4a56fa4   esp: d4a56f74
> ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
> Process x (pid: 3894, ti=d4a56000 task=dfe9aa50 task.ti=d4a56000)
> Stack:  fffe  c0127d49 d4a56fa4 63b5a669 fffa fffe
>0003  d4a56000 c0125bf3 b7f68ff4 b7f9fce0 fffe 0003
>c0103bfc fffe bfd6d5d8 b7f74ff4 0003  bfd6d5b8 0109
> Call Trace:
>  [] posix_cpu_clock_get+0x47/0xdc
>  [] sys_clock_gettime+0x80/0x82
>  [] syscall_call+0x7/0xb
>  [] svc_ioctl+0xc2/0x261
>  ===
> Code: 0b eb fe 57 56 53 89 cb 89 d1 8b 74 24 10 83 e0 03 83 f8 02 74 0c 89 f2 
> 89 c8 5b 5e 5f e9 ee 3f ff ff bf 00 ca 9a 3b 89 d0 89 da  f7 89 56 04 89 
> 06 5b 5e 5f c3 55 57 56 53 89 c7 89 d6 89 cb
> EIP: [] sample_to_timespec+0x28/0x33 SS:ESP 0068:d4a56f74
> 
> The instruction is:
> 
>   div %edi
> 
> And edi is 1e9 (0x3b9aca00). I don't understand why this results in a 
> divide error. 

It does this because 'div' does an unsigned divide of edx:eax by edi.
Here, edx=fffa and eax is 63b5a669. Clearly, such a number cannot
be divided by 1e9 to return a 32-bit value.

Given the values we see here, I suspect the code should have used a
signed divide (idiv). This means that something in the code implies
that the result is unsigned while it should be signed.
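The overflow condition can be sketched in portable C. div32_would_fault is a hypothetical helper that models the rule for a 32-bit 'div' (the quotient of edx:eax by the operand must fit in 32 bits); the values in the usage note are illustrative and are not taken from the register dump above.

```c
#include <assert.h>
#include <stdint.h>

/* Models x86 'div r32': the 64-bit value edx:eax is divided by a 32-bit
 * operand, and a divide error is raised when the divisor is zero or the
 * quotient does not fit in 32 bits. */
int div32_would_fault(uint32_t edx, uint32_t eax, uint32_t divisor)
{
    uint64_t dividend = ((uint64_t)edx << 32) | eax;

    if (divisor == 0)
        return 1;
    return dividend / divisor > UINT32_MAX;
}
```

A small negative 64-bit quantity such as -6 is 0xffffffff:0xfffffffa when read as unsigned, so dividing it by 1e9 with 'div' overflows, whereas a positive 5e9 (0x1:0x2a05f200) divides cleanly to 5.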

Regards,
Willy


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Nick Piggin

On Thu, Mar 01, 2007 at 09:53:42PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > You do not have to deal with TLB entries if you do buffered I/O.
> > 
> > Where does the data come from?
> 
> From the I/O controller and from the application.

Why doesn't the application need to deal with TLB entries?


> > > We currently have problems with the kernel limits of 128 SG 
> > > entries but the fundamental issue is that we can only do 2 Meg of I/O in 
> > > one go given the default limits of the block layer. Typically the number 
> > > of hardware SG entries is also limited. We never will be able to put a 
> > 
> > Seems like changing the default limits would be the easiest way to
> > fix it then?
> 
> This would only be a temporary fix pushing the limits to the double or so?

And using slightly larger page sizes isn't?

> > As far as hardware limits go, I don't think you need to scale that
> > number linearly with the amount of memory you have, or even with the
> > IO throughput. You should reach a point where your command overhead
> > is amortised sufficiently, and the controller will be pipelining the
> > commands.
> 
> Amortized? The controller still would have to hunt down the 4kb page 
> pieces that we have to feed him right now. Result: Huge scatter gather 
> lists that may themselves create issues with higher page order.

What sort of numbers do you have for these controllers that aren't
very good at doing sg?

Isn't the issue something like your IO controllers having only a
limited number of sg entries, which is fine with 16K pages, but with
4K pages that doesn't give enough data to cover your RAID stripe?

We're never going to do a variable sized pagecache just because of that.


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Badari Pulavarty



Amit K. Arora wrote:


> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
>
>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

I am wondering about return values from this syscall. Is it supposed to
return the number of bytes allocated? What about partial allocations? What
if the blocks already exist? What would the return values be in those cases?

Just curious .. What does posix_fallocate() return ?

Thanks,
Badari
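
On the posix_fallocate() question: POSIX specifies that it returns 0 on
success and a positive error number on failure, without setting errno. A
minimal userspace sketch follows; the wrapper name preallocate() is made up
here and says nothing about how sys_fallocate() will end up behaving:

```c
#define _XOPEN_SOURCE 600	/* for posix_fallocate() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Preallocate len bytes starting at offset. posix_fallocate() returns
 * 0 on success or a positive error number (EBADF, ENOSPC, ...) on
 * failure; it does not return -1 and does not set errno.
 */
static int preallocate(int fd, off_t offset, off_t len)
{
	int err;

	assert(len > 0);	/* callers of this sketch pass a positive length */
	err = posix_fallocate(fd, offset, len);
	if (err != 0)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
	return err;
}
```

Note that there is no "bytes allocated" return value in that interface: the
call either succeeds for the whole range or reports an error number.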








Re: [patch 2.6.20-rc2] gpio_direction_output() needs an initial value

2007-03-01 Thread Andrew Victor

hi David,

> It's been pointed out that output GPIOs should have an initial value, to
> avoid signal glitching ... among other things, it can be some time before
> a driver is ready.  This patch corrects that oversight, fixing

For the AT91 changes:
  Acked-by: Andrew Victor <[EMAIL PROTECTED]>


> --- g26.orig/drivers/spi/atmel_spi.c  2007-02-28 12:47:43.0 -0800
> +++ g26/drivers/spi/atmel_spi.c   2007-03-01 15:29:30.0 -0800

> - gpio_direction_output(npcs_pin);
> + gpio_direction_output(npcs_pin, !(spi->mode & SPI_CS_HIGH));
>   }

As mentioned previously (by Walter Tuppa), wouldn't it be better to just
change this to:
 cs_deactivate(spi);
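
For reference, the initial value in the quoted hunk is simply the level that
deasserts chip select. A small sketch of that expression outside the kernel;
SPI_CS_HIGH is redefined locally with the value I believe <linux/spi/spi.h>
uses, and cs_inactive_level() is a made-up name:

```c
#include <assert.h>

/* Assumed to match SPI_CS_HIGH in <linux/spi/spi.h>; redefined here
 * so the sketch builds outside the kernel tree. */
#define SPI_CS_HIGH 0x04

/*
 * Level that deasserts chip select for a given SPI mode: an
 * active-low chip select (the default) idles high, an active-high
 * one (SPI_CS_HIGH set) idles low.
 */
static int cs_inactive_level(unsigned int mode)
{
	int level = !(mode & SPI_CS_HIGH);

	assert(level == 0 || level == 1);
	return level;
}
```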


Regards,
  Andrew Victor



Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > You do not have to deal with TLB entries if you do buffered I/O.
> 
> Where does the data come from?

From the I/O controller and from the application.

> > We currently have problems with the kernel limits of 128 SG 
> > entries but the fundamental issue is that we can only do 2 Meg of I/O in 
> > one go given the default limits of the block layer. Typically the number 
> > of hardware SG entries is also limited. We never will be able to put a 
> 
> Seems like changing the default limits would be the easiest way to
> fix it then?

This would only be a temporary fix pushing the limits to the double or so?
 
> As far as hardware limits go, I don't think you need to scale that
> number linearly with the amount of memory you have, or even with the
> IO throughput. You should reach a point where your command overhead
> is amortised sufficiently, and the controller will be pipelining the
> commands.

Amortized? The controller still would have to hunt down the 4kb page 
pieces that we have to feed him right now. Result: Huge scatter gather 
lists that may themselves create issues with higher page order.


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread KAMEZAWA Hiroyuki

On Thu, 1 Mar 2007 21:11:58 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> The whole DRAM power story is a bedtime story for gullible children. Don't 
> fall for it. It's not realistic. The hardware support for it DOES NOT 
> EXIST today, and probably won't for several years. And the real fix is 
> elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> is against the whole point of FBDIMM in the first place, but that's what 
> you get when you ignore power in the first version!).
> 

First of all, we already have memory hot-add. So I want to implement
hot-removal of hot-added memory, at least. (In this case, we don't have to
write invasive patches to memory-init-core.)

Our (Fujitsu's) product, an ia64 NUMA server, has a feature to offline memory.
It supports dynamic reconfiguration of nodes, i.e. node hotplug.

But there is no *shipped* firmware for hotplug yet. RHEL4 couldn't boot on
such hotplug-capable firmware, so the firmware team was not in a hurry.
It will be shipped after RHEL5 comes out.
IMHO, firmware which supports memory hot-add is ready to support
memory hot-remove if the OS can handle it.

Note:
I have heard that embedded people often design their own memory power-off
control on embedded Linux (but it never seems to be posted to the list).
I don't know whether they are interested in generic memory hot-remove or not.

Thanks,
-Kame


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Fri, 2 Mar 2007, Nick Piggin wrote:

> So what do you mean by efficient? I guess you aren't talking about CPU
> efficiency, because even if you make the IO subsystem submit larger
> physical IOs, you still have to deal with 256 billion TLB entries, the
> pagecache has to deal with 256 billion struct pages, so does the
> filesystem code to build the bios.

Re the page cache: It also needs to be able to handle large page sizes, of 
course. Scanning gazillions of page structs in vmscan.c will make the 
system slow as a dog. The number of page structs needs to be drastically 
reduced for large I/O. I think this can be done by allowing compound 
pages to be handled throughout the VM. The defrag issues then become very 
pressing indeed.

We have discussed the idea of going to a kernel with a 2M base page size on 
x86_64 but that step is a bit drastic and the overhead for small files 
would be tremendous.

Support for compound pages already exists in the page allocator and the 
slab allocator. Maybe we could extend that support to the I/O subsystem? 
We would also then have more contiguous writes, which would further improve 
I/O efficiency.


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Nick Piggin

On Thu, Mar 01, 2007 at 09:40:45PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > So what do you mean by efficient? I guess you aren't talking about CPU
> > efficiency, because even if you make the IO subsystem submit larger
> > physical IOs, you still have to deal with 256 billion TLB entries, the
> > pagecache has to deal with 256 billion struct pages, so does the
> > filesystem code to build the bios.
> 
> You do not have to deal with TLB entries if you do buffered I/O.

Where does the data come from?

> For mmapped I/O you would want to transparently use 2M TLBs if the 
> page size is large.
> 
> > So you are having problems with your IO controller's handling of sg
> > lists?
> 
> We currently have problems with the kernel limits of 128 SG 
> entries but the fundamental issue is that we can only do 2 Meg of I/O in 
> one go given the default limits of the block layer. Typically the number 
> of hardware SG entries is also limited. We never will be able to put a 

Seems like changing the default limits would be the easiest way to
fix it then?

As far as hardware limits go, I don't think you need to scale that
number linearly with the amount of memory you have, or even with the
IO throughput. You should reach a point where your command overhead
is amortised sufficiently, and the controller will be pipelining the
commands.


belkin bulldog ups monitor vs 2.6.21-rc2

2007-03-01 Thread Gene Heskett

Greetings;

I just rebooted to 2.6.21-rc2 and noted that getting x up and running was 
about 15 seconds longer than usual.  When it got a bash shell going I 
went to it and ran htop which showed that the bulldog monitor was taking 
90% of the cpu.  Killed it, then restarted it, but when I ran the gui 
which ran fine and then stopped the gui, the daemon once again went hog 
wild and had to be killed,  and I'm losing my kmail composer focus for 30 
seconds at a time now that amanda is making her nightly run.

There is nothing in the log about it other than from xinetd as it ran the 
amanda server stuff.

Not quite ready for prime time methinks.  Using the ck scheduler, this is 
terrible performance, virtually no multitasking.  Back to 2.6.20-ck1 in 
the morning if it lives the rest of the night.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Fri, 2 Mar 2007, Nick Piggin wrote:

> So what do you mean by efficient? I guess you aren't talking about CPU
> efficiency, because even if you make the IO subsystem submit larger
> physical IOs, you still have to deal with 256 billion TLB entries, the
> pagecache has to deal with 256 billion struct pages, so does the
> filesystem code to build the bios.

You do not have to deal with TLB entries if you do buffered I/O.

For mmapped I/O you would want to transparently use 2M TLBs if the 
page size is large.

> So you are having problems with your IO controller's handling of sg
> lists?

We currently have problems with the kernel limits of 128 SG 
entries but the fundamental issue is that we can only do 2 Meg of I/O in 
one go given the default limits of the block layer. Typically the number 
of hardware SG entries is also limited. We never will be able to put a 


[PATCH] mv643xx ethernet driver

2007-03-01 Thread Giridhar Pemmasani

During initialization, the mv643xx driver registers its IRQ before setting up
the tx/rx rings. This causes a kernel oops because mv643xx_poll, which gets
called right after the IRQ is registered, calls netif_rx_complete, which
accesses the rx ring (I don't have the oops message anymore; I just remember
this sequence of calls). The attached (tested) patch first initializes the
rx/tx rings and then registers the IRQ.

Giri

Signed-off-by: Giridhar Pemmasani <[EMAIL PROTECTED]>

--- drivers/net/mv643xx_eth.c   2006-11-29 16:57:37.0 -0500
+++ ../linux-2.6.20.orig/drivers/net/mv643xx_eth.c  2007-02-23 09:38:21.0 -0500
@@ -778,14 +778,6 @@
unsigned int size;
int err;
 
-   err = request_irq(dev->irq, mv643xx_eth_int_handler,
-   IRQF_SHARED | IRQF_SAMPLE_RANDOM, dev->name, dev);
-   if (err) {
-   printk(KERN_ERR "Can not assign IRQ number to MV643XX_eth%d\n",
-   port_num);
-   return -EAGAIN;
-   }
-
eth_port_init(mp);
 
memset(&mp->timeout, 0, sizeof(struct timer_list));
@@ -797,8 +789,7 @@
GFP_KERNEL);
if (!mp->rx_skb) {
printk(KERN_ERR "%s: Cannot allocate Rx skb ring\n", dev->name);
-   err = -ENOMEM;
-   goto out_free_irq;
+   return -ENOMEM;
}
mp->tx_skb = kmalloc(sizeof(*mp->tx_skb) * mp->tx_ring_size,
GFP_KERNEL);
@@ -852,13 +843,8 @@
dev->name, size);
printk(KERN_ERR "%s: Freeing previously allocated TX queues...",
dev->name);
-   if (mp->rx_sram_size)
-   iounmap(mp->p_tx_desc_area);
-   else
-   dma_free_coherent(NULL, mp->tx_desc_area_size,
-   mp->p_tx_desc_area, mp->tx_desc_dma);
err = -ENOMEM;
-   goto out_free_tx_skb;
+   goto out_free_tx_ring;
}
memset((void *)mp->p_rx_desc_area, 0, size);
 
@@ -866,6 +852,14 @@
 
mv643xx_eth_rx_refill_descs(dev);   /* Fill RX ring with skb's */
 
+   err = request_irq(dev->irq, mv643xx_eth_int_handler,
+ IRQF_SHARED | IRQF_SAMPLE_RANDOM, dev->name, dev);
+   if (err) {
+   printk(KERN_ERR "Can not assign IRQ number to MV643XX_eth%d\n",
+  port_num);
+   goto out_free_rx_ring;
+   }
+
/* Clear any pending ethernet port interrupts */
mv_write(MV643XX_ETH_INTERRUPT_CAUSE_REG(port_num), 0);
mv_write(MV643XX_ETH_INTERRUPT_CAUSE_EXTEND_REG(port_num), 0);
@@ -891,12 +885,22 @@
 
return 0;
 
+out_free_rx_ring:
+   if (mp->rx_sram_size)
+   iounmap(mp->p_rx_desc_area);
+   else
+   dma_free_coherent(NULL, mp->rx_desc_area_size,
+ mp->p_rx_desc_area, mp->rx_desc_dma);
+out_free_tx_ring:
+   if (mp->tx_sram_size)
+   iounmap(mp->p_tx_desc_area);
+   else
+   dma_free_coherent(NULL, mp->tx_desc_area_size,
+ mp->p_tx_desc_area, mp->tx_desc_dma);
 out_free_tx_skb:
kfree(mp->tx_skb);
 out_free_rx_skb:
kfree(mp->rx_skb);
-out_free_irq:
-   free_irq(dev->irq, dev);
 
return err;
 }
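
The ordering this patch establishes (allocate everything the interrupt path
touches, register the IRQ last, unwind in reverse on failure) can be modelled
in userspace. This is only a toy sketch: toy_port, toy_poll, toy_request_irq
and toy_open are invented names, not driver code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's private state. */
struct toy_port {
	int *rx_ring;		/* NULL until the ring is allocated */
	int irq_registered;
};

/* What an interrupt arriving right after registration would do. */
static int toy_poll(const struct toy_port *p)
{
	assert(p->rx_ring != NULL);	/* the oops, in miniature */
	return p->rx_ring[0];
}

/* Stand-in for request_irq(); fail != 0 simulates an error. */
static int toy_request_irq(struct toy_port *p, int fail)
{
	if (fail)
		return -1;
	p->irq_registered = 1;
	(void)toy_poll(p);	/* the handler may run immediately */
	return 0;
}

static int toy_open(struct toy_port *p, int fail_irq)
{
	int err;

	/* Set up the ring first ... */
	p->rx_ring = calloc(16, sizeof(*p->rx_ring));
	if (!p->rx_ring)
		return -1;

	/* ... and only then let interrupts in. */
	err = toy_request_irq(p, fail_irq);
	if (err)
		goto out_free_rx_ring;
	return 0;

out_free_rx_ring:
	free(p->rx_ring);
	p->rx_ring = NULL;
	return err;
}
```

Swapping the two steps in toy_open() would trip the assertion in toy_poll(),
which is the userspace analogue of the oops described above.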


Re: [ckrm-tech] [PATCH 1/2] rcfs core patch

2007-03-01 Thread Balbir Singh


Srivatsa Vaddagiri wrote:

> Heavily based on Paul Menage's (inturn cpuset) work. The big difference
> is that the patch uses task->nsproxy to group tasks for resource control
> purpose (instead of task->containers).
>
> The patch retains the same user interface as Paul Menage's patches. In
> particular, you can have multiple hierarchies, each hierarchy giving a 
> different composition/view of task-groups.
>
> (Ideally this patch should have been split into 2 or 3 sub-patches, but
> will do that on a subsequent version post)



With this, don't we end up with a lot of duplication between cpusets and rcfs?



> Signed-off-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>
> Signed-off-by : Paul Menage <[EMAIL PROTECTED]>
>
> ---
>
>  linux-2.6.20-vatsa/include/linux/init_task.h |4 
>  linux-2.6.20-vatsa/include/linux/nsproxy.h   |5 
>  linux-2.6.20-vatsa/init/Kconfig  |   22 
>  linux-2.6.20-vatsa/init/main.c   |1 
>  linux-2.6.20-vatsa/kernel/Makefile   |1 
>
> ---


The diffstat does not look quite right.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

Re: patch 3 / 3: fix floppy mount bug in kernel 2.6.21-rc1

2007-03-01 Thread Stephane Eranian

Andrew,

On Thu, Mar 01, 2007 at 04:47:42PM -0800, Andrew Morton wrote:
> On Thu, 01 Mar 2007 15:32:22 +0100
> "Uwe Bugla" <[EMAIL PROTECTED]> wrote:
> 
> > Hi folks,
> > this patch fixes the floppy mount bug (i. e. regression) in kernel 
> > 2.6.21-rc1. It was inspired by Stephane Eranian. It was tested on an Intel 
> > P4 1800 MHz
> > (Intel ICH4 chipset) and on an AMD Athlon XP 1800 MHz (Silicon Integrated 
> > Systems chipset 740, 5513).
> > My deep thanks and respect go to:
> > Stephane Eranian, Linus Torvalds, Jiri Slaby. You are truthfully real men 
> > and reliable, accurate, fine chaps. It feels great to have you in this 
> > world-wide community!
> > Would you still call the whole i386 architecture "a small number of 
> > machines", Mister Andrew Morton? If yes, in how far please?
> > 
> > Signed-off-by: Uwe Bugla <[EMAIL PROTECTED]>
> > 
> > --- a/arch/i386/kernel/process.c
> > +++ b/arch/i386/kernel/process.c
> > @@ -154,6 +154,7 @@
> > current_thread_info()->status |= TS_POLLING;
> > } else {
> > /* loop is done by the caller */
> > +   local_irq_enable();
> > cpu_relax();
> > }
> >  }
> 
> Linus reverted the offending patch "[PATCH] i386: add idle notifier"
> on Feb 26, so this fix should no longer be needed, and 2.6.21-rc2 should
> be working again.
> 
> Hopefully Stephane will fold this fix into any future version of that patch,
> if appropriate.

Well, given that nobody really liked this idle notifier, I am trying to do
it differently on all architectures, which unfortunately is not an easy thing
to do.

What I did not really like in all of this is that people come up with
arguments without providing the data to back them up, e.g., increased
interrupt latency (by how much?).

-- 
-Stephane

Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios

2007-03-01 Thread Christoph Hellwig

On Thu, Mar 01, 2007 at 09:09:42PM -0800, Andrew Morton wrote:
> > or document that drivers need to handle it specially and give them a
> > way to find out about them. (Or do the horrible slab refcounting hack
> > I wrote up above)
> 
> OK.  So you're proposing that XFS and ext3 simply stop using slab for this
> memory?

Yes.

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Jeremy Fitzhardinge

Linus Torvalds wrote:
> Virtualization in general. We don't know what it is - in IBM machines it's 
> a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
> KVM, it's obviously a host Linux kernel/user-process combination.
>
> The point being that in the guests, hotunplug is almost useless (for 
> bigger ranges), and we're much better off just telling the virtualization 
> hosts on a per-page level whether we care about a page or not, than to 
> worry about fragmentation.
>
> And in hosts, we usually don't care EITHER, since it's usually done in a 
> hypervisor.
>   

The paravirt_ops patches I just posted implement all the machinery
required to create a pseudo-physical to machine address mapping under
the kernel.  This is used under Xen because it directly exposes the
pagetables to its guests, but there's no reason why you couldn't use
this layer to implement the same mapping without an underlying
hypervisor.  This allows the kernel to see a normal linear "physical"
address space which is in fact mapped over a discontiguous set of
machine ("real physical") pages.

Andrew and I discussed using it for a kdump kernel, so that you could
load it into a random bunch of pages, and set things up so that it sees
itself as being contiguous.

The mapping is pretty simple.  It intercepts __pte (__pmd, etc) to map
the "physical" page to the real machine page, and pte_val does the
reverse mapping.
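
A toy model of that interception, for illustration only. The table contents
and all toy_* names are invented; the real paravirt_ops hooks look different,
but the forward and reverse translation has this shape:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_NR_PAGES   4

/* Hypothetical table: pseudo-physical frame -> machine frame. */
static const uint64_t toy_p2m[TOY_NR_PAGES] = { 7, 2, 9, 4 };

/* Reverse lookup: machine frame -> pseudo-physical frame. */
static uint64_t toy_m2p(uint64_t mfn)
{
	for (uint64_t pfn = 0; pfn < TOY_NR_PAGES; pfn++)
		if (toy_p2m[pfn] == mfn)
			return pfn;
	assert(!"machine frame not owned by this kernel");
	return 0;
}

/* Like __pte(): the kernel hands in a "physical" frame, the pte
 * that gets stored carries the machine frame; flag bits pass through. */
static uint64_t toy_make_pte(uint64_t val)
{
	uint64_t flags = val & ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1);
	uint64_t pfn = val >> TOY_PAGE_SHIFT;

	assert(pfn < TOY_NR_PAGES);
	return (toy_p2m[pfn] << TOY_PAGE_SHIFT) | flags;
}

/* Like pte_val(): translate the stored machine frame back. */
static uint64_t toy_pte_val(uint64_t pte)
{
	uint64_t flags = pte & ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1);

	return (toy_m2p(pte >> TOY_PAGE_SHIFT) << TOY_PAGE_SHIFT) | flags;
}
```

The kernel proper only ever sees the values going into toy_make_pte() and
coming out of toy_pte_val(), so its view of "physical" memory stays linear.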

You could implement this today as a fairly simple, thin paravirt_ops
backend.  The main tricky part is making sure all the device drivers are
correct in using bus addresses (which are mapped to real machine
addresses), and that they don't assume that adjacent kernel virtual
pages are physically adjacent.

J

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Linus Torvalds

On Thu, 1 Mar 2007, Andrew Morton wrote:
>
> On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> 
> wrote:
> 
> > In other words, I really don't see a huge upside. I see *lots* of 
> > downsides, but upsides? Not so much. Almost everybody who wants unplug 
> > wants virtualization, and right now none of the "big virtualization" 
> > people would want to have kernel-level anti-fragmentation anyway since 
> > they'd need to do it on their own.
> 
> Agree with all that, but you're missing the other application: power
> saving.  FBDIMMs take eight watts a pop.

This is a hardware problem. Let's see how long it takes for Intel to 
realize that FBDIMM's were a hugely bad idea from a power perspective.

Yes, the same issues exist for other DRAM forms too, but to a *much* 
smaller degree.

Also, IN PRACTICE you're never ever going to see this anyway. Almost 
everybody wants bank interleaving, because it's a huge performance win on 
many loads. That, in turn, means that your memory will be spread out over 
multiple DIMM's even for a single page, much less any bigger area.

In other words - forget about DRAM power savings. It's not realistic. And 
if you want low-power, don't use FBDIMM's. It really *is* that simple.

(And yes, maybe FBDIMM controllers in a few years won't use 8 W per 
buffer. I kind of doubt that, since FBDIMM fairly fundamentally is highish 
voltage swings at high frequencies.)

Also, on a *truly* idle system, we'll see the power savings whatever we 
do, because the working set will fit in D$, and to get those DRAM power 
savings in reality you need to have the DRAM controller shut down on its 
own anyway (ie sw would only help a bit).

The whole DRAM power story is a bedtime story for gullible children. Don't 
fall for it. It's not realistic. The hardware support for it DOES NOT 
EXIST today, and probably won't for several years. And the real fix is 
elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
is against the whole point of FBDIMM in the first place, but that's what 
you get when you ignore power in the first version!).

Linus

Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios

2007-03-01 Thread Andrew Morton

On Fri, 2 Mar 2007 05:03:51 + Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> On Thu, Mar 01, 2007 at 09:00:44PM -0800, Andrew Morton wrote:
> > In that case we're talking about different things.
> > 
> > I thought the proposal was to continue to use slab pages, but to take a ref
> > on them as they're added to the bio, drop that ref in bi_end_io()?
> 
> That would give you silent memory corruption in case the networking code
> holds a reference after the memory gets returned to slab and reused.

Well, given that bi_end_io() is called after the "io" has completed, I'm
assuming that networking has completely finished with the memory by the
time bi_end_io() gets called.

I guess one can envisage situations where that might not happen, but they'd
be terribly buggy ones, surely.

> We need to either stop allowing to pass slab memory to the block layer,
> or document that drivers need to handle it specially and give them a
> way to find out about them. (Or do the horrible slab refcounting hack
> I wrote up above)

OK.  So you're proposing that XFS and ext3 simply stop using slab for this
memory?


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Nick Piggin

On Thu, Mar 01, 2007 at 08:31:24PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
> > > in order to reduce overhead of managing page structs for large I/O and 
> > > large memory applications. We need appropriate measures to deal with the 
> > > fragmentation problem.
> > 
> > I don't understand why, out of any architecture, ia64 would have to hack
> > around this in software :(
> 
> Ummm... We have x86_64 platforms with the 4k page problem. 4k pages are 
> very useful for the large number of small files that are around. But for 
> the large streams of data you would want other methods of handling these.
> 
> If I want to write 1 terabyte (2^50) to disk then the I/O subsystem has 
> to handle 2^(50-12) = 2^38 = 256 million page structs! This limits I/O 
> > bandwidths and leads to huge scatter gather lists (and we are limited in 
> > terms of the number of items on those lists in many drivers). Our future 
> > platforms have up to several petabytes of memory. There needs to be some 
> way to handle these capacities in an efficient way. We cannot wait 
> an hour for the terabyte to reach the disk.

I guess you mean 256 billion page structs.

So what do you mean by efficient? I guess you aren't talking about CPU
efficiency, because even if you make the IO subsystem submit larger
physical IOs, you still have to deal with 256 billion TLB entries, the
pagecache has to deal with 256 billion struct pages, so does the
filesystem code to build the bios.

So you are having problems with your IO controller's handling of sg
lists?
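
For reference, the page counts in this exchange can be checked with a couple
of shifts. A minimal sketch; pages_needed() is a made-up helper and sizes are
given as powers of two:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of pages needed to cover 2^size_shift bytes with
 * 2^page_shift-byte pages.
 */
static uint64_t pages_needed(unsigned int size_shift, unsigned int page_shift)
{
	assert(size_shift >= page_shift && size_shift - page_shift < 64);
	return UINT64_C(1) << (size_shift - page_shift);
}
```

One terabyte is 2^40 bytes, so pages_needed(40, 12) is 2^28, about 268
million 4K pages. The quoted 2^50 is a petabyte, and pages_needed(50, 12) is
2^38, about 275 billion, which is where the "256 billion" figure comes from.
With 2M pages, pages_needed(40, 21) is 2^19 = 524288.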



Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios

2007-03-01 Thread Christoph Hellwig

On Thu, Mar 01, 2007 at 09:00:44PM -0800, Andrew Morton wrote:
> In that case we're talking about different things.
> 
> I thought the proposal was to continue to use slab pages, but to take a ref
> on them as they're added to the bio, drop that ref in bi_end_io()?

That would give you silent memory corruption in case the networking code
holds a reference after the memory gets returned to slab and reused.

We need to either stop allowing to pass slab memory to the block layer,
or document that drivers need to handle it specially and give them a
way to find out about them. (Or do the horrible slab refcounting hack
I wrote up above)


Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios

2007-03-01 Thread Andrew Morton

On Fri, 2 Mar 2007 04:49:10 + Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> On Thu, Mar 01, 2007 at 08:48:06PM -0800, Andrew Morton wrote:
> > On Fri, 2 Mar 2007 04:30:39 + Christoph Hellwig <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > But in this case we'd really need to enforce this, and add a
> > > > BUG_ON(PageSlab(page)) in bio_add_page to trip up everyone submitting
> > > > this kind of page.
> > 
> > That would be
> > 
> > BUG_ON(PageSlab(page) && page_count(page) == 0)?
> 
> No, all slab pages.  Currently they all have a reference count of
> zero, but we generally don't want people to pass in pages that
> come from a non-refcounted allocator.

In that case we're talking about different things.

I thought the proposal was to continue to use slab pages, but to take a ref
on them as they're added to the bio, drop that ref in bi_end_io()?


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Andrew Morton

On Thu, 1 Mar 2007 20:33:04 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> On Thu, 1 Mar 2007, Andrew Morton wrote:
> 
> > Sorry, but this is crap.  zones and nodes are distinct, physical concepts
> > and you're kidding yourself if you think you can somehow fudge things to 
> > make
> > one of them just go away.
> > 
> > Think: ZONE_DMA32 on an Opteron machine.  I don't think there is a sane way
> > in which we can fudge away the distinction between
> > bus-addresses-which-have-the-32-upper-bits-zero and
> > memory-which-is-local-to-each-socket.
> 
> Of course you can. Add a virtual DMA and DMA32 zone/node and extract the 
> relevant memory from the base zone/node.

You're using terms which I've never seen described anywhere.

Please, just stop here.  Give us a complete design proposal which we can
understand and review.


[PATCH] md: Fix for raid6 reshape.

2007-03-01 Thread NeilBrown

### Comments for Changeset

Recent patch for raid6 reshape had a change missing that showed up in
subsequent review.

Many places in the raid5 code used "conf->raid_disks-1" to mean
"number of data disks".  With raid6 that had to be changed to
"conf->raid_disks - conf->max_degraded" or similar.  One place was missed.

This bug means that if a raid6 reshape were aborted in the middle the
recorded position would be wrong.  On restart it would either fail (as
the position wasn't on an appropriate boundary) or would leave a section
of the array unreshaped, causing data corruption.
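
To illustrate the arithmetic: a sketch outside the kernel, assuming
max_degraded is 1 for raid5 and 2 for raid6. The function names are made up
and this is not the driver code itself:

```c
#include <assert.h>

/*
 * Number of data disks in an array: total disks minus the number
 * the array can lose (assumed: 1 for raid5, 2 for raid6).
 */
static int data_disks(int raid_disks, int max_degraded)
{
	assert(raid_disks > max_degraded);
	return raid_disks - max_degraded;
}

/* Array-relative reshape position for a given per-device sector count. */
static unsigned long long expand_progress(unsigned long long dev_sectors,
					  int raid_disks, int max_degraded)
{
	return dev_sectors * data_disks(raid_disks, max_degraded);
}
```

For a 6-disk raid6, 1024 per-device sectors correspond to 4096 array sectors,
whereas the old "raid_disks-1" form would have recorded 5120.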


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2007-03-02 15:47:51.0 +1100
+++ ./drivers/md/raid5.c2007-03-02 15:48:35.0 +1100
@@ -3071,7 +3071,7 @@ static sector_t reshape_request(mddev_t 
release_stripe(sh);
}
spin_lock_irq(&conf->device_lock);
-   conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
+   conf->expand_progress = (sector_nr + i) * new_data_disks;
spin_unlock_irq(&conf->device_lock);
/* Ok, those stripe are ready. We can start scheduling
 * reads on the source stripes.

[PATCH 4/4] coredump: documentation for proc entry

2007-03-01 Thread Kawai, Hidehiro

This patch adds the documentation for
/proc/<pid>/coredump_omit_anonymous_shared.

Signed-off-by: Hidehiro Kawai <[EMAIL PROTECTED]>
---
 Documentation/filesystems/proc.txt |   38 +++
 1 files changed, 38 insertions(+)

Index: linux-2.6.20-mm2/Documentation/filesystems/proc.txt
===
--- linux-2.6.20-mm2.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.20-mm2/Documentation/filesystems/proc.txt
@@ -41,6 +41,7 @@ Table of Contents
   2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
   2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score
   2.13 /proc/<pid>/oom_score - Display current oom-killer score
+  2.14 /proc/<pid>/coredump_omit_anonymous_shared - Core dump coordinator
 
 --
 Preface
@@ -1982,6 +1983,43 @@ This file can be used to check the curre
 any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
 process should be killed in an out-of-memory situation.
 
+2.14 /proc/<pid>/coredump_omit_anonymous_shared - Core dump coordinator
+-
+When a process is dumped, all anonymous memory is written to a core file as
+long as the size of the core file isn't limited. But sometimes we don't want
+to dump some memory segments, for example, huge shared memory.
+
+The /proc/<pid>/coredump_omit_anonymous_shared is a flag which enables you to
+omit anonymous shared memory segments from a core file when it is generated.
+When the <pid> process is dumped, the core dump routine decides whether a
+given memory segment should be dumped into a core file or not based on the
+type of the memory segment and the flag.
+
+If you have written a non-zero value to this proc file, anonymous shared
+memory segments are not dumped. There are three types of anonymous shared
+memory:
+
+  - IPC shared memory
+  - the memory segments created by mmap(2) with MAP_ANONYMOUS and MAP_SHARED
+    flags
+  - the memory segments created by mmap(2) with MAP_SHARED flag, and the
+    mapped file has already been unlinked
+
+Because the current core dump routine doesn't distinguish these segments, you
+can only choose between dumping all anonymous shared memory segments or none.
+
+If you don't want to dump any anonymous shared memory segments attached to
+pid 1234, write 1 to the process's proc file.
+
+  $ echo 1 > /proc/1234/coredump_omit_anonymous_shared
+
+When a new process is created, the process inherits the flag status from its
+parent. It is useful to set the flag before the program runs.
+For example:
+
+  $ echo 1 > /proc/self/coredump_omit_anonymous_shared
+  $ ./some_program
+
 --
 Summary
 --
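
As a rough illustration of the three kinds of anonymous shared memory listed
in the documentation above, here is a minimal user-space sketch in C. The
helper names (map_sysv_shm, map_anon_shared, map_unlinked_file) and SEG_SIZE
are made up for this note and are not part of the patch. It only shows how
each kind of segment comes into existence; it does not touch the proposed
proc file.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

#define SEG_SIZE 4096   /* arbitrary size chosen for the sketch */

/* 1. IPC (System V) shared memory. Returns NULL on failure. */
static void *map_sysv_shm(int *shmid_out)
{
    int id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
    void *p;

    if (id < 0)
        return NULL;
    p = shmat(id, NULL, 0);
    /* Mark for removal now; the segment lives until the last detach. */
    shmctl(id, IPC_RMID, NULL);
    if (p == (void *)-1)
        return NULL;
    *shmid_out = id;
    return p;
}

/* 2. mmap(2) with MAP_ANONYMOUS and MAP_SHARED. */
static void *map_anon_shared(void)
{
    void *p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* 3. mmap(2) with MAP_SHARED on a file that has already been unlinked. */
static void *map_unlinked_file(void)
{
    char path[] = "/tmp/anon_shared_XXXXXX";
    int fd = mkstemp(path);
    void *p;

    if (fd < 0)
        return NULL;
    unlink(path);               /* link count of the inode drops to zero */
    if (ftruncate(fd, SEG_SIZE) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping keeps the file alive */
    return p == MAP_FAILED ? NULL : p;
}
```

With the flag set, all three mappings would be skipped together, since the
dump routine sees each of them as a shared mapping of an unlinked file.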



[PATCH 3/4] coredump: ELF-FDPIC: enable to omit anonymous shared memory

2007-03-01 Thread Kawai, Hidehiro

This patch makes it possible to omit anonymous shared memory from an
ELF-FDPIC formatted core file when it is generated.

The debug messages from maydump() in fs/binfmt_elf_fdpic.c are changed
accordingly so that we can tell which kinds of memory segments are
dumped and which are not.

Signed-off-by: Hidehiro Kawai <[EMAIL PROTECTED]>
---
 fs/binfmt_elf_fdpic.c |   25 -
 1 files changed, 16 insertions(+), 9 deletions(-)

Index: linux-2.6.20-mm2/fs/binfmt_elf_fdpic.c
===
--- linux-2.6.20-mm2.orig/fs/binfmt_elf_fdpic.c
+++ linux-2.6.20-mm2/fs/binfmt_elf_fdpic.c
@@ -1168,7 +1168,7 @@ static int dump_seek(struct file *file, 
  *
  * I think we should skip something. But I am not sure how. H.J.
  */
-static int maydump(struct vm_area_struct *vma)
+static int maydump(struct vm_area_struct *vma, struct mm_struct *mm)
 {
/* Do not dump I/O mapped devices or special mappings */
if (vma->vm_flags & (VM_IO | VM_RESERVED)) {
@@ -1184,15 +1184,22 @@ static int maydump(struct vm_area_struct
return 0;
}
 
-   /* Dump shared memory only if mapped from an anonymous file. */
+   /*
+* Dump shared memory only if mapped from an anonymous file and
+* /proc//coredump_omit_anonymous_shared flag is not set.
+*/
if (vma->vm_flags & VM_SHARED) {
-   if (vma->vm_file->f_path.dentry->d_inode->i_nlink == 0) {
+   if (vma->vm_file->f_path.dentry->d_inode->i_nlink) {
kdcore("%08lx: %08lx: no (share)", vma->vm_start, 
vma->vm_flags);
+   return 0;
+   }
+   if (mm->coredump_omit_anon_shared) {
+   kdcore("%08lx: %08lx: no (anon-share)", vma->vm_start, 
vma->vm_flags);
+   return 0;
+   } else {
+   kdcore("%08lx: %08lx: yes (anon-share)", vma->vm_start, 
vma->vm_flags);
return 1;
}
-
-   kdcore("%08lx: %08lx: no (share)", vma->vm_start, 
vma->vm_flags);
-   return 0;
}
 
 #ifdef CONFIG_MMU
@@ -1451,7 +1458,7 @@ static int elf_fdpic_dump_segments(struc
for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
unsigned long addr;
 
-   if (!maydump(vma))
+   if (!maydump(vma, mm))
continue;
 
for (addr = vma->vm_start;
@@ -1506,7 +1513,7 @@ static int elf_fdpic_dump_segments(struc
for (vml = current->mm->context.vmlist; vml; vml = vml->next) {
struct vm_area_struct *vma = vml->vma;
 
-   if (!maydump(vma))
+   if (!maydump(vma, mm))
continue;
 
if ((*size += PAGE_SIZE) > *limit)
@@ -1715,7 +1722,7 @@ static int elf_fdpic_core_dump(long sign
phdr.p_offset = offset;
phdr.p_vaddr = vma->vm_start;
phdr.p_paddr = 0;
-   phdr.p_filesz = maydump(vma) ? sz : 0;
+   phdr.p_filesz = maydump(vma, current->mm) ? sz : 0;
phdr.p_memsz = sz;
offset += phdr.p_filesz;
phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;



[PATCH 2/4] coredump: ELF: enable to omit anonymous shared memory

2007-03-01 Thread Kawai, Hidehiro

This patch makes it possible to omit anonymous shared memory from an ELF
formatted core file when it is generated.

Signed-off-by: Hidehiro Kawai <[EMAIL PROTECTED]>
---
 fs/binfmt_elf.c |   12 +---
 1 files changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6.20-mm2/fs/binfmt_elf.c
===
--- linux-2.6.20-mm2.orig/fs/binfmt_elf.c
+++ linux-2.6.20-mm2/fs/binfmt_elf.c
@@ -1191,9 +1191,15 @@ static int maydump(struct vm_area_struct
if (vma->vm_flags & (VM_IO | VM_RESERVED))
return 0;
 
-   /* Dump shared memory only if mapped from an anonymous file. */
-   if (vma->vm_flags & VM_SHARED)
-   return vma->vm_file->f_path.dentry->d_inode->i_nlink == 0;
+   /*
+* Dump shared memory only if mapped from an anonymous file and
+* /proc//coredump_omit_anonymous_shared flag is not set.
+*/
+   if (vma->vm_flags & VM_SHARED) {
+   if (vma->vm_file->f_path.dentry->d_inode->i_nlink)
+   return 0;
+   return vma->vm_mm->coredump_omit_anon_shared == 0;
+   }
 
/* If it hasn't been written to, don't write it out */
if (!vma->anon_vma)
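
The decision made by the patched maydump() can be summarised with a small
user-space model. This is only a sketch: struct vma_model, the *_MODEL flag
values and maydump_model() are invented for illustration and stand in for the
kernel's vm_area_struct, VM_* flags and maydump(). The final check assumes the
unpatched tail of the function (dump a private mapping only if it has been
written to).

```c
#include <stdbool.h>

#define VM_SHARED_MODEL 0x1u   /* stand-in for VM_SHARED */
#define VM_IO_MODEL     0x2u   /* stand-in for VM_IO | VM_RESERVED */

struct vma_model {
    unsigned int flags;
    unsigned int backing_nlink;  /* link count of the mapped file's inode */
    bool has_anon_pages;         /* stand-in for vma->anon_vma != NULL */
};

/* Returns true if the segment would be written to the core file. */
static bool maydump_model(const struct vma_model *vma, bool omit_anon_shared)
{
    /* I/O mapped devices and special mappings are never dumped. */
    if (vma->flags & VM_IO_MODEL)
        return false;

    if (vma->flags & VM_SHARED_MODEL) {
        /* Shared mapping of a file that still has a name: skip it. */
        if (vma->backing_nlink)
            return false;
        /* Anonymous shared memory: dump it unless the flag is set. */
        return !omit_anon_shared;
    }

    /* Private mapping: dump it only if it has been written to. */
    return vma->has_anon_pages;
}
```

Only the anonymous shared case depends on the flag; every other outcome is the
same as before the patch.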



Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios

2007-03-01 Thread Christoph Hellwig

On Thu, Mar 01, 2007 at 08:48:06PM -0800, Andrew Morton wrote:
> On Fri, 2 Mar 2007 04:30:39 + Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> 
> > But in this case we'd really need to enforce this, and add a
> > BUG_ON(PageSlab(page)) in bio_add_page to trip everyone submitting
> > this kind of page.
> 
> That would be
> 
>   BUG_ON(PageSlab(page) && page_count(page) == 0)?

No, all slab pages.  Currently they all have a reference count of
zero, but we generally don't want people to pass in pages that
come from a non-refcounted allocator.


[PATCH 1/4] coredump: add an interface to control the core dump routine

2007-03-01 Thread Kawai, Hidehiro

This patch adds an interface to set/reset a flag which determines
whether anonymous shared memory segments should be dumped or not when
a core file is generated.

The /proc/<pid>/coredump_omit_anonymous_shared file is provided to access
the flag. You can read or change the flag status for a particular process
by reading from or writing to the file.

The flag status is inherited by the child process when it is created.

The flag is stored into coredump_omit_anon_shared member of mm_struct,
which shares bytes with dumpable member because these two are adjacent
bit fields. In order to avoid write-write race between the two, we use
a global spin lock.
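
To make the bit field point concrete, here is a small user-space sketch. Two
adjacent bit fields normally live in the same storage unit, so updating one is
a read-modify-write of the bytes that also hold the other, and two unlocked
writers can lose each other's update. struct mm_model and the *_model helpers
are hypothetical stand-ins for mm_struct and the kernel setters, and a C11
atomic_flag spin loop stands in for the kernel's global spin lock.

```c
#include <stdatomic.h>

/* Two adjacent bit fields, as described for mm_struct. */
struct mm_model {
    unsigned int dumpable:2;
    unsigned int coredump_omit_anon_shared:1;
};

/* One global lock covering both fields (stand-in for the spin lock). */
static atomic_flag dump_bits_lock = ATOMIC_FLAG_INIT;

static void dump_bits_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&dump_bits_lock,
                                             memory_order_acquire))
        ;   /* spin until the previous holder releases the lock */
}

static void dump_bits_lock_release(void)
{
    atomic_flag_clear_explicit(&dump_bits_lock, memory_order_release);
}

static void set_dumpable_model(struct mm_model *mm, unsigned int value)
{
    dump_bits_lock_acquire();
    mm->dumpable = value;                       /* read-modify-write */
    dump_bits_lock_release();
}

static void set_omit_anon_shared_model(struct mm_model *mm, unsigned int value)
{
    dump_bits_lock_acquire();
    mm->coredump_omit_anon_shared = (value != 0);
    dump_bits_lock_release();
}
```

Because both setters take the same lock, an update of one field can never
overwrite a concurrent update of its neighbour.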

The smp_wmb() at the update of dumpable is removed because set_dumpable()
takes and releases a spin lock, which has the effect of a memory
barrier.

Signed-off-by: Hidehiro Kawai <[EMAIL PROTECTED]>
---
 fs/exec.c   |   12 ++--
 fs/proc/base.c  |  103 ++
 include/linux/binfmts.h |4 +
 include/linux/sched.h   |   33 
 kernel/fork.c   |3 +
 kernel/sys.c|   62 +++---
 security/commoncap.c|2 
 security/dummy.c|2 
 8 files changed, 174 insertions(+), 47 deletions(-)

Index: linux-2.6.20-mm2/fs/proc/base.c
===
--- linux-2.6.20-mm2.orig/fs/proc/base.c
+++ linux-2.6.20-mm2/fs/proc/base.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 /* NOTE:
@@ -1753,6 +1754,104 @@ static const struct inode_operations pro
 
 #endif
 
+#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
+static ssize_t proc_coredump_omit_anon_shared_read(struct file *file,
+  char __user *buf,
+  size_t count,
+  loff_t *ppos)
+{
+   struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+   struct mm_struct *mm;
+   char buffer[PROC_NUMBUF];
+   size_t len;
+   loff_t __ppos = *ppos;
+   int ret;
+
+   ret = -ESRCH;
+   if (!task)
+   goto out_no_task;
+
+   ret = 0;
+   mm = get_task_mm(task);
+   if (!mm)
+   goto out_no_mm;
+
+   len = snprintf(buffer, sizeof(buffer), "%u\n",
+  mm->coredump_omit_anon_shared);
+   if (__ppos >= len)
+   goto out;
+   if (count > len - __ppos)
+   count = len - __ppos;
+
+   ret = -EFAULT;
+   if (copy_to_user(buf, buffer + __ppos, count))
+   goto out;
+
+   ret = count;
+   *ppos = __ppos + count;
+
+ out:
+   mmput(mm);
+ out_no_mm:
+   put_task_struct(task);
+ out_no_task:
+   return ret;
+}
+
+static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
+   const char __user *buf,
+   size_t count,
+   loff_t *ppos)
+{
+   struct task_struct *task;
+   struct mm_struct *mm;
+   char buffer[PROC_NUMBUF], *end;
+   unsigned int val;
+   int ret;
+
+   ret = -EFAULT;
+   memset(buffer, 0, sizeof(buffer));
+   if (count > sizeof(buffer) - 1)
+   count = sizeof(buffer) - 1;
+   if (copy_from_user(buffer, buf, count))
+   goto out_no_task;
+
+   ret = -EINVAL;
+   val = (unsigned int)simple_strtoul(buffer, &end, 0);
+   if (*end == '\n')
+   end++;
+   if (end - buffer == 0)
+   goto out_no_task;
+
+   ret = -ESRCH;
+   task = get_proc_task(file->f_dentry->d_inode);
+   if (!task)
+   goto out_no_task;
+
+   ret = end - buffer;
+   mm = get_task_mm(task);
+   if (!mm)
+   goto out_no_mm;
+
+   if (down_write_trylock(&coredump_settings_sem)) {
+   set_coredump_omit_anon_shared(mm, (val != 0));
+   up_write(&coredump_settings_sem);
+   } else
+   ret = -EBUSY;
+
+   mmput(mm);
+ out_no_mm:
+   put_task_struct(task);
+ out_no_task:
+   return ret;
+}
+
+static struct file_operations proc_coredump_omit_anon_shared_operations = {
+   .read   = proc_coredump_omit_anon_shared_read,
+   .write  = proc_coredump_omit_anon_shared_write,
+};
+#endif
+
 /*
  * /proc/self:
  */
@@ -1972,6 +2071,10 @@ static struct pid_entry tgid_base_stuff[
 #ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, fault_inject),
 #endif
+#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
+   REG("coredump_omit_anonymous_shared", S_IRUGO|S_IWUSR,
+   coredump_omit_anon_shared),
+#endif
 #ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io",   S_IRUGO, pid_io_accounting),
 #endif
Index: linux-2.6.20-mm2/include/linux/sched.h
==
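
The parsing done by the write handler above can be tried out in user space.
The following is a sketch under stated assumptions: parse_flag_write() is a
hypothetical name, strtoul() stands in for simple_strtoul(), memcpy() stands
in for copy_from_user(), and NUMBUF is an assumed stand-in for PROC_NUMBUF.
It mirrors the handler's rules: clamp the input, parse with base 0, accept
one trailing newline, and reject input from which nothing was consumed.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define NUMBUF 13   /* assumed stand-in for PROC_NUMBUF */

/*
 * Returns the number of bytes consumed and stores 0 or 1 in *flag,
 * or returns -EINVAL if nothing could be parsed.
 */
static long parse_flag_write(const char *user, size_t count, int *flag)
{
    char buffer[NUMBUF];
    char *end;
    unsigned long val;

    memset(buffer, 0, sizeof(buffer));
    if (count > sizeof(buffer) - 1)
        count = sizeof(buffer) - 1;     /* keep room for the NUL */
    memcpy(buffer, user, count);

    val = strtoul(buffer, &end, 0);     /* base 0: decimal, octal or hex */
    if (*end == '\n')
        end++;                          /* swallow one trailing newline */
    if (end == buffer)
        return -EINVAL;                 /* nothing was consumed */

    *flag = (val != 0);                 /* any non-zero value sets the flag */
    return end - buffer;
}
```

For example, the two bytes written by "echo 1" parse to a set flag with a
return value of 2.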

Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios

2007-03-01 Thread Andrew Morton

On Fri, 2 Mar 2007 04:30:39 + Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> But in this case we'd really need to enforce this, and add a
> BUG_ON(PageSlab(page)) in bio_add_page to trip everyone submitting
> this kind of page.

That would be

BUG_ON(PageSlab(page) && page_count(page) == 0)?


> > So we have a few options to look at:
> > 
> > a) kludge things in AOE.  Unpleasing, and might cause memory leaks
> >(although it won't, because the caller hasn't run bi_end_io yet).
> > 
> > b) Take a ref on slab pages in slab.  A bit costly, perhaps.
> > 
> > c) teach ext3 and XFS to take a ref on these pages as they are added to
> >the BIOs, undo that ref in bi_end_io.
> > 
> > I think c)?
> 
> Yes.  I'm perfectly fine with this as long as we document and enforce
> this.

And write the patch ;)

Re: Bug in on_each_cpu?

2007-03-01 Thread Ernie Petrides

On Thursday, 1-Mar-2007 at 7:22 PST, Andrew Morton wrote:

> On Thu, 01 Mar 2007 03:47:39 -0800 Zachary Amsden <[EMAIL PROTECTED]> wrote:
> 
> > Rusty Russell wrote:
> > > On Thu, 2007-03-01 at 03:34 -0800, Zachary Amsden wrote:
> > >   
> > >> What would be really, really nice would be to statically check all 
> > >> callsites that issue irq disables actually keep irqs disabled.  
> > >> Presumably, there was a reason they disabled irqs, and re-enabling them 
> > >> underneath their noses, even if it is to avoid a race, breaks the logic 
> > >> behind that reason.
> > >> 
> > >
> > > For the moment, how about a BUG_ON() in on_each_cpu()?
> > >   
> > 
> > Sounds quite decent.  But why does on_each_cpu need to disable 
> > interrupts at all?  It just calls func(), then re-enables interrupts.  
> > So whatever was going to happen during func() that might not be 
> > interrupt safe could just be done in the callee, avoiding the rather 
> > expensive mess of disabling and re-enabling interrupts for those cases 
> > where it doesn't matter.
> 
> The handler for smp_call_function() is called with local interrupts
> disabled (from the IPI handler).
> 
> So to provide a consistent call environment for that handler, on_each_cpu()
> will also disable local interrupts when making the direct call on this CPU.

And further, this "consistent call environment" is *required* for correct
operation of certain callers, e.g. invalidate_bh_lrus(), whose callback
function is invalidate_bh_lru().  If invalidate_bh_lru() is called without
IRQs blocked, it might be interrupted by an IPI that causes nested execution
of that same function on behalf of another cpu's call to on_each_cpu(), and
this can lead to duplicate brelse() calls on a buf head (and ultimately to
ext3 journaling crashes due to invalid concurrent use of that buf head).
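
The nested execution described above can be modelled deterministically in user
space. This is a toy sketch, not kernel code: struct buf_model, lru_slot,
pending_ipi and the functions below are all invented for illustration, with a
one-entry slot standing in for the per-CPU LRU and a function pointer standing
in for an incoming IPI.

```c
#include <stddef.h>

struct buf_model { int refcount; };

static struct buf_model *lru_slot;   /* one-entry stand-in for the LRU */
static int irqs_disabled;            /* simulated local interrupt state */
static void (*pending_ipi)(void);    /* simulated IPI waiting for delivery */

/* Run the pending "IPI" handler, but only if interrupts are enabled. */
static void deliver_pending_ipi(void)
{
    if (!irqs_disabled && pending_ipi) {
        void (*handler)(void) = pending_ipi;

        pending_ipi = NULL;
        handler();
    }
}

/* Model of the callback: read the slot, release the buffer, clear the slot. */
static void invalidate_lru_model(void)
{
    struct buf_model *bh = lru_slot;

    deliver_pending_ipi();           /* window in which an IPI may arrive */
    if (bh) {
        bh->refcount--;              /* stand-in for brelse() */
        lru_slot = NULL;
    }
}

/* Returns the final reference count; 0 is correct, -1 is a double release. */
static int refcount_after_invalidate(int disable_irqs)
{
    struct buf_model bh = { .refcount = 1 };

    lru_slot = &bh;
    pending_ipi = invalidate_lru_model;
    irqs_disabled = disable_irqs;
    invalidate_lru_model();          /* the direct call on this CPU */
    irqs_disabled = 0;
    deliver_pending_ipi();           /* a deferred IPI now finds an empty slot */
    lru_slot = NULL;
    return bh.refcount;
}
```

With interrupts left enabled the nested call releases the buffer first and the
outer call releases it again; with interrupts disabled the IPI runs afterwards
and finds nothing to release.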

Cheers.  -ernie

[PATCH 0/4] coredump: core dump masking support v4

2007-03-01 Thread Kawai, Hidehiro

Hi,

This patch series is version 4 of the core dump masking feature,
which provides a per-process flag not to dump anonymous shared
memory segments.

In the previous version, the flag value was passed around the core
dump functions as an argument to use the same setting while dumping.
In this version, instead of doing that, a r/w semaphore prevents the
setting from being changed while dumping.

This patch series can be applied against 2.6.20-mm2.
The supported core file formats are ELF and ELF-FDPIC. ELF has been
tested, but ELF-FDPIC has not been built and tested because I don't
have the test environment.


Background:
Some software programs share huge memory among hundreds of
processes. If a failure occurs on one of these processes, they can
be signaled by a monitoring process to generate core files and
restart the service. However, this can develop into a system-wide
failure, such as a long system slowdown or a disk space shortage,
because the total size of the core files is huge.

To avoid the above situation we can limit the core file size with
setrlimit(2) or ulimit(1). But this method can lose important data
such as the stack, because core dumping is terminated halfway.
So I suggest keeping shared memory segments from being dumped for
particular processes. Because the shared memory attached to these
processes is common to all of them, we don't need to dump it every time.


Usage:
To keep all anonymous shared memory segments of pid 1234 from being dumped:

  $ echo 1 > /proc/1234/coredump_omit_anonymous_shared

When a new process is created, the process inherits the flag status
from its parent. It is useful to set the core dump flags before the
program runs. For example:

  $ echo 1 > /proc/self/coredump_omit_anonymous_shared
  $ ./some_program
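
A launcher written in C could do the same thing as the shell commands above
before it starts the workload. The sketch below is only an illustration:
write_proc_flag() is a hypothetical helper, and the proc file it would be
pointed at exists only on a kernel carrying this patch series.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write "1\n" or "0\n" to the given flag file.
 * Returns 0 on success and -1 on any failure.
 */
static int write_proc_flag(const char *path, int enable)
{
    const char *val = enable ? "1\n" : "0\n";
    size_t len = strlen(val);
    ssize_t n;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    n = write(fd, val, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```

A call such as write_proc_flag("/proc/self/coredump_omit_anonymous_shared", 1)
made before starting the program would correspond to the echo line above. On a
kernel without the patches the open fails and the helper returns -1.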


ChangeLog:
v4:
  - in maydump(), retrieve the core dump setting from mm_struct
directly, instead of its additional argument
  - writing to /proc/<pid>/coredump_omit_anonymous_shared returns
EBUSY while core dumping.

v3:
http://groups.google.com/group/linux.kernel/browse_frm/thread/706d2ae41c1cb2de/
  - remove `/proc/<pid>/core_flags' proc entry
  - add `/proc/<pid>/coredump_anonymous_shared' as a named flag
  - remove kernel.core_flags_enable sysctl parameter

v2:
http://groups.google.com/group/linux.kernel/browse_frm/thread/cb254465971d4a42/
http://groups.google.com/group/linux.kernel/browse_frm/thread/da78f2702e06fa11/
  - rename `coremask' to `core_flags'
  - change `core_flags' member in mm_struct to a bit field
next to `dumpable'
  - introduce a global spin lock to protect adjacent two bit fields
(core_flags and dumpable) from race condition
  - fix a bug that the generated core file can be corrupted when
core dumping and updating core_flags occur concurrently
  - add kernel.core_flags_enable sysctl parameter to enable/disable
flags in /proc/<pid>/core_flags
  - support ELF-FDPIC binary format, but not tested

v1:
http://groups.google.com/group/linux.kernel/browse_frm/thread/1381fc54d716e3e6/

-- 
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory
E-mail: [EMAIL PROTECTED]


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
> > in order to reduce overhead of managing page structs for large I/O and 
> > large memory applications. We need appropriate measures to deal with the 
> > fragmentation problem.
> 
> I don't understand why, out of any architecture, ia64 would have to hack
> around this in software :(

Ummm... We have x86_64 platforms with the 4k page problem. 4k pages are 
very useful for the large number of small files that are around. But for 
the large streams of data you would want other methods of handling these.

If I want to write 1 terabyte (2^40 bytes) to disk then the I/O subsystem has
to handle 2^(40-12) = 2^28 = 256 million page structs! This limits I/O
bandwidths and leads to huge scatter gather lists (and we are limited in
terms of the number of items on those lists in many drivers). Our future
platforms have up to several petabytes of memory. There needs to be some
way to handle these capacities in an efficient way. We cannot wait
an hour for the terabyte to reach the disk.
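
The page count is easy to check with a few lines of C. This is just a worked
calculation; pages_needed() is a hypothetical helper and the 2 MiB case is an
arbitrary example of a larger unit, not something proposed in this thread.

```c
#include <stdint.h>

/* Number of pages of size 2^page_shift needed to cover `bytes`. */
static uint64_t pages_needed(uint64_t bytes, unsigned int page_shift)
{
    uint64_t page_size = (uint64_t)1 << page_shift;

    return (bytes + page_size - 1) >> page_shift;   /* round up */
}
```

For 1 TiB (2^40 bytes) and 4 KiB pages (shift 12) this gives 2^28 =
268,435,456 pages. With 2 MiB units (shift 21) the same transfer needs
2^19 = 524,288 entries.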

> > We need to reduce the real hardware zones as much as possible. Most high 
> > performance architectures have no need for additional DMA zones f.e. and
> > do not have to deal with the complexities that arise there.
> 
> And then you want to add something else on top of them?

zones are basically managing a number of MAX_ORDER chunks. The adding of
something here is dealing with the categorization of these MAX_ORDER
chunks in order to ensure movability and thus defragmentability of
most of them. Or the upper layer may limit the number of those chunks
assigned to a certain container.

> > Yes that would mean merging nodes and zones. So "nones".
> 
> Yes, this is what Andrew just said. But you then wanted to add virtual zones
> or something on top. I just don't understand why. You agree that merging
> nodes and zones is a good idea. Did I miss the important post where some
> bright person discovered why merging zones and "virtual zones" is a bad
> idea?

Hmmm.. I usually talk about the "virtual zones" as virtual nodes. But we 
are basically at the same point there. Node level controls and APIs exist and 
can even be used from user space. A container could just be a special node 
and then the allocations to this container could be controlled via the 
existing APIs.

A virtual zone/node would be assigned a number of MAX_ORDER blocks from 
real zones/nodes. Then it may hopefully be managed like a real node. In 
the original zone/node these MAX_ORDER blocks would show up as 
unavailable. The "upper" layer therefore is the existing node/zone layer. 
The virtual zones/nodes just steal memory from the real ones.


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Thu, 1 Mar 2007, Andrew Morton wrote:

> Sorry, but this is crap.  zones and nodes are distinct, physical concepts
> and you're kidding yourself if you think you can somehow fudge things to make
> one of them just go away.
> 
> Think: ZONE_DMA32 on an Opteron machine.  I don't think there is a sane way
> in which we can fudge away the distinction between
> bus-addresses-which-have-the-32-upper-bits-zero and
> memory-which-is-local-to-each-socket.

Of course you can. Add a virtual DMA and DMA32 zone/node and extract the 
relevant memory from the base zone/node.


[PATCH 003 of 3] knfsd: Remove CONFIG_IPV6 ifdefs from sunrpc server code.

2007-03-01 Thread NeilBrown


They don't really save that much, and aren't worth the hassle.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./include/linux/sunrpc/svc.h |2 --
 ./net/sunrpc/svcsock.c   |   13 +++--
 2 files changed, 3 insertions(+), 12 deletions(-)

diff .prev/include/linux/sunrpc/svc.h ./include/linux/sunrpc/svc.h
--- .prev/include/linux/sunrpc/svc.h2007-03-02 14:20:13.0 +1100
+++ ./include/linux/sunrpc/svc.h2007-03-02 15:14:11.0 +1100
@@ -194,9 +194,7 @@ static inline void svc_putu32(struct kve
 
 union svc_addr_u {
 struct in_addr addr;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
 struct in6_addraddr6;
-#endif
 };
 
 /*

diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c  2007-03-02 15:12:52.0 +1100
+++ ./net/sunrpc/svcsock.c  2007-03-02 15:14:11.0 +1100
@@ -131,13 +131,13 @@ static char *__svc_print_addr(struct soc
NIPQUAD(((struct sockaddr_in *) addr)->sin_addr),
htons(((struct sockaddr_in *) addr)->sin_port));
break;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+
case AF_INET6:
snprintf(buf, len, "%x:%x:%x:%x:%x:%x:%x:%x, port=%u",
NIP6(((struct sockaddr_in6 *) addr)->sin6_addr),
htons(((struct sockaddr_in6 *) addr)->sin6_port));
break;
-#endif
+
default:
snprintf(buf, len, "unknown address type: %d", addr->sa_family);
break;
@@ -449,9 +449,7 @@ svc_wake_up(struct svc_serv *serv)
 
 union svc_pktinfo_u {
struct in_pktinfo pkti;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
struct in6_pktinfo pkti6;
-#endif
 };
 
 static void svc_set_cmsg_data(struct svc_rqst *rqstp, struct cmsghdr *cmh)
@@ -467,7 +465,7 @@ static void svc_set_cmsg_data(struct svc
cmh->cmsg_len = CMSG_LEN(sizeof(*pki));
}
break;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+
case AF_INET6: {
struct in6_pktinfo *pki = CMSG_DATA(cmh);
 
@@ -479,7 +477,6 @@ static void svc_set_cmsg_data(struct svc
cmh->cmsg_len = CMSG_LEN(sizeof(*pki));
}
break;
-#endif
}
return;
 }
@@ -730,13 +727,11 @@ static inline void svc_udp_get_dest_addr
rqstp->rq_daddr.addr.s_addr = pki->ipi_spec_dst.s_addr;
break;
}
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
case AF_INET6: {
struct in6_pktinfo *pki = CMSG_DATA(cmh);
ipv6_addr_copy(&rqstp->rq_daddr.addr6, &pki->ipi6_addr);
break;
}
-#endif
}
 }
 
@@ -976,11 +971,9 @@ static inline int svc_port_is_privileged
case AF_INET:
return ntohs(((struct sockaddr_in *)sin)->sin_port)
< PROT_SOCK;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
case AF_INET6:
return ntohs(((struct sockaddr_in6 *)sin)->sin6_port)
< PROT_SOCK;
-#endif
default:
return 0;
}

Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios

2007-03-01 Thread Christoph Hellwig

On Thu, Mar 01, 2007 at 07:22:45PM -0800, Andrew Morton wrote:
> Well I spose slab _could_ take a ref on these pages.

What it would need to do is:

 - add a reference for every object touching this page
 - don't give the page back to the page allocator or reuse any
   single object inside it until there are no more references to the page.

I don't think this is a very good idea, although the networking references
tend to be rather short-term ones, making this not that bad a burden.

> Networking internally maintains caller memory lifetimes, and it assumes
> that the caller allocated memory via __alloc_pages() - because it uses
> get_page() and put_page().
> 
> BIO, however, does not internally manage caller memory lifetime.  This is
> because the caller's ->bi_end_io is always called, so the caller can do it.
> 
> So where we've come unstuck is in a module which has gone and fed BIO
> memory into networking.  The differing design philosophies are clashing.
> 
> I'm surprised this doesn't happen in other places - aren't there any other
> drivers which take a BIO and stuff it down the network?
> 
> Anyway, where's the bug?
> 
> Really, I'd say it's XFS (and ext3).  Even though BIO doesn't presently
> manage page lifetimes, it _could_.  After all, the function is called
> bio_add_page(), not bio_add_virtual_address().  It's a bit hacky to kmalloc
> some memory, run virt_to_page() and to then present that page to BIO even
> though the caller (thanks to the slab optimisation) doesn't actually have
> control of that page's lifetime.

That was the conclusion I came to when this was brought up initially.
Fixing up XFS would be easyish and only waste a tiny amount of memory,
and the same is true for ext3 (I did in fact suggest just using get_free_page
for this case but got shot down for stupid reasons when the slab debug
alignment issues in that area came up)

But in this case we'd really need to enforce this, and add a
BUG_ON(PageSlab(page)) in bio_add_page to trip everyone submitting
this kind of page.

> So we have a few options to look at:
> 
> a) kludge things in AOE.  Unpleasing, and might cause memory leaks
>(although it won't, because the caller hasn't run bi_end_io yet).
> 
> b) Take a ref on slab pages in slab.  A bit costly, perhaps.
> 
> c) teach ext3 and XFS to take a ref on these pages as they are added to
>the BIOs, undo that ref in bi_end_io.
> 
> I think c)?

Yes.  I'm perfectly fine with this as long as we document and enforce
this.
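
Option c) together with the proposed enforcement can be sketched as a
user-space model. Everything here is invented for illustration: page_model,
bio_model and the *_model functions only imitate the roles of struct page,
struct bio, get_page()/put_page() and ->bi_end_io, and a return value of -1
stands in for the BUG_ON(PageSlab(page)) check.

```c
#include <stddef.h>

#define BIO_MAX_MODEL 4

struct page_model {
    int count;      /* stand-in for the page reference count */
    int is_slab;    /* stand-in for PageSlab(page) */
};

struct bio_model {
    struct page_model *pages[BIO_MAX_MODEL];
    size_t nr;
};

/*
 * Filesystem-side helper: take a reference on the page as it is added.
 * Slab pages are refused, mirroring the proposed BUG_ON().
 */
static int fs_add_page_model(struct bio_model *bio, struct page_model *page)
{
    if (page->is_slab)
        return -1;
    if (bio->nr == BIO_MAX_MODEL)
        return -1;
    page->count++;                    /* stand-in for get_page() */
    bio->pages[bio->nr++] = page;
    return 0;
}

/* Completion handler: drop the references taken when the pages were added. */
static void fs_end_io_model(struct bio_model *bio)
{
    while (bio->nr)
        bio->pages[--bio->nr]->count--;   /* stand-in for put_page() */
}
```

While the I/O is in flight each page carries one extra reference, so a lower
layer that does its own get/put on the page never sees the count reach zero
early.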

[PATCH 002 of 3] knfsd: Avoid checksum checks when collecting metadata for a UDP packet.

2007-03-01 Thread NeilBrown


When recv_msg is called with a size of 0 and MSG_PEEK (and
sunrpc/svcsock.c does), it is clear that we are only interested in
metadata (from/to addresses) and not the data, so don't do any
checksum checking at this point.  Leave that until the data is
requested.
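
The same calling pattern is available to user space through recvmsg(2), which
makes it easy to try out. The sketch below is a user-space analogue only, not
the svcsock code: peek_sender() and demo_peek() are hypothetical names, and
the demo uses a pair of UDP sockets on the loopback interface.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Fetch only the sender address of the next datagram: no data buffer,
 * MSG_PEEK set, so the datagram stays queued. Returns 0 or -1.
 */
static int peek_sender(int fd, struct sockaddr_in *peer)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));   /* msg_iov = NULL, msg_iovlen = 0 */
    msg.msg_name = peer;
    msg.msg_namelen = sizeof(*peer);
    return recvmsg(fd, &msg, MSG_PEEK) < 0 ? -1 : 0;
}

/* Send one datagram over loopback, peek at it, then read it for real. */
static int demo_peek(void)
{
    struct sockaddr_in rx_addr, tx_addr, peer;
    socklen_t len;
    char data[16];
    int ok = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    if (rx < 0 || tx < 0)
        goto out;
    memset(&rx_addr, 0, sizeof(rx_addr));
    rx_addr.sin_family = AF_INET;
    rx_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(rx, (struct sockaddr *)&rx_addr, sizeof(rx_addr)) < 0)
        goto out;
    len = sizeof(rx_addr);
    if (getsockname(rx, (struct sockaddr *)&rx_addr, &len) < 0)
        goto out;
    if (sendto(tx, "hello", 5, 0,
               (struct sockaddr *)&rx_addr, sizeof(rx_addr)) != 5)
        goto out;
    len = sizeof(tx_addr);
    if (getsockname(tx, (struct sockaddr *)&tx_addr, &len) < 0)
        goto out;

    memset(&peer, 0, sizeof(peer));
    if (peek_sender(rx, &peer) < 0 || peer.sin_port != tx_addr.sin_port)
        goto out;
    /* The payload is still there after the peek. */
    if (recv(rx, data, sizeof(data), 0) != 5 || memcmp(data, "hello", 5))
        goto out;
    ok = 0;
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return ok;
}
```

The peek copies no payload, which is why verifying the checksum at that point
buys nothing; the later full read is where the data is actually consumed.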

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./net/ipv4/udp.c |3 +++
 ./net/ipv6/udp.c |4 
 2 files changed, 7 insertions(+)

diff .prev/net/ipv4/udp.c ./net/ipv4/udp.c
--- .prev/net/ipv4/udp.c2007-03-02 14:20:13.0 +1100
+++ ./net/ipv4/udp.c2007-03-02 15:13:50.0 +1100
@@ -846,6 +846,9 @@ try_again:
goto csum_copy_err;
copy_only = 1;
}
+   if (len == 0 &&  (flags & MSG_PEEK))
+   /* avoid checksum concerns when just getting metadata */
+   copy_only = 1;
 
if (copy_only)
err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),

diff .prev/net/ipv6/udp.c ./net/ipv6/udp.c
--- .prev/net/ipv6/udp.c2007-03-02 14:20:13.0 +1100
+++ ./net/ipv6/udp.c2007-03-02 15:13:50.0 +1100
@@ -151,6 +151,10 @@ try_again:
copy_only = 1;
}
 
+   if (len == 0 &&  (flags & MSG_PEEK))
+   /* avoid checksum concerns when just getting metadata */
+   copy_only = 1;
+
if (copy_only)
err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),
  msg->msg_iov, copied   );

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Andrew Morton

On Thu, 1 Mar 2007 20:06:25 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:

> No merge them to one thing and handle them as one. No difference between 
> zones and nodes anymore.

Sorry, but this is crap.  zones and nodes are distinct, physical concepts
and you're kidding yourself if you think you can somehow fudge things to make
one of them just go away.

Think: ZONE_DMA32 on an Opteron machine.  I don't think there is a sane way
in which we can fudge away the distinction between
bus-addresses-which-have-the-32-upper-bits-zero and
memory-which-is-local-to-each-socket.

No matter how hard those hands are waving.

[PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of code-copying

2007-03-01 Thread NeilBrown


The sunrpc server code needs to know the source and destination address
for UDP packets so it can reply properly. 
It currently copies code out of the network stack to pick the pieces out
of the skb.
This is ugly and causes compile problems with the IPv6 stuff.

So, rip that out and use recv_msg instead.  This is a much cleaner
interface, but has a slight cost in that the checksum is now checked
before the copy, so we don't benefit from doing both at the same time.
This can probably be fixed.
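
For readers unfamiliar with the packet-info control message that the new code
relies on, here is a user-space sketch of reading a datagram's destination
address through recvmsg(2) and IP_PKTINFO. It is an analogue under stated
assumptions, not the kernel code: recv_with_dest() and demo_pktinfo() are
hypothetical names, the demo runs over loopback, and it reads ipi_addr (the
destination in the packet header) where the patch reads ipi_spec_dst.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* struct in_pktinfo on glibc */
#endif
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Receive one datagram; report its source and destination addresses. */
static ssize_t recv_with_dest(int fd, void *buf, size_t len,
                              struct sockaddr_in *src, struct in_addr *dst)
{
    union {
        char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
        struct cmsghdr align;       /* forces suitable alignment */
    } control;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *cmh;
    ssize_t n;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = src;
    msg.msg_namelen = sizeof(*src);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return -1;
    for (cmh = CMSG_FIRSTHDR(&msg); cmh; cmh = CMSG_NXTHDR(&msg, cmh)) {
        if (cmh->cmsg_level == IPPROTO_IP && cmh->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo pki;

            memcpy(&pki, CMSG_DATA(cmh), sizeof(pki));
            *dst = pki.ipi_addr;
            return n;
        }
    }
    return -1;                      /* no packet info was delivered */
}

static int demo_pktinfo(void)
{
    struct sockaddr_in rx_addr, src;
    struct in_addr dst;
    socklen_t len = sizeof(rx_addr);
    char data[16];
    int on = 1, ok = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    if (rx < 0 || tx < 0)
        goto out;
    /* Ask the stack to attach packet info to received datagrams. */
    if (setsockopt(rx, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on)) < 0)
        goto out;
    memset(&rx_addr, 0, sizeof(rx_addr));
    rx_addr.sin_family = AF_INET;
    rx_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(rx, (struct sockaddr *)&rx_addr, sizeof(rx_addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&rx_addr, &len) < 0)
        goto out;
    if (sendto(tx, "hello", 5, 0,
               (struct sockaddr *)&rx_addr, sizeof(rx_addr)) != 5)
        goto out;
    if (recv_with_dest(rx, data, sizeof(data), &src, &dst) != 5)
        goto out;
    if (dst.s_addr != htonl(INADDR_LOOPBACK))
        goto out;
    ok = 0;
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return ok;
}
```

The source address arrives through msg_name and the destination through the
control message, so nothing has to be picked out of the packet headers by
hand.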


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./net/sunrpc/svcsock.c |   63 -
 1 file changed, 31 insertions(+), 32 deletions(-)

diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c  2007-03-02 14:20:14.0 +1100
+++ ./net/sunrpc/svcsock.c  2007-03-02 15:12:52.0 +1100
@@ -721,45 +721,23 @@ svc_write_space(struct sock *sk)
}
 }
 
-static void svc_udp_get_sender_address(struct svc_rqst *rqstp,
-   struct sk_buff *skb)
+static inline void svc_udp_get_dest_address(struct svc_rqst *rqstp,
+   struct cmsghdr *cmh)
 {
switch (rqstp->rq_sock->sk_sk->sk_family) {
case AF_INET: {
-   /* this seems to come from net/ipv4/udp.c:udp_recvmsg */
-   struct sockaddr_in *sin = svc_addr_in(rqstp);
-
-   sin->sin_family = AF_INET;
-   sin->sin_port = skb->h.uh->source;
-   sin->sin_addr.s_addr = skb->nh.iph->saddr;
-   rqstp->rq_addrlen = sizeof(struct sockaddr_in);
-   /* Remember which interface received this request */
-   rqstp->rq_daddr.addr.s_addr = skb->nh.iph->daddr;
-   }
+   struct in_pktinfo *pki = CMSG_DATA(cmh);
+   rqstp->rq_daddr.addr.s_addr = pki->ipi_spec_dst.s_addr;
break;
+   }
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
case AF_INET6: {
-   /* this is derived from net/ipv6/udp.c:udpv6_recvmesg */
-   struct sockaddr_in6 *sin6 = svc_addr_in6(rqstp);
-
-   sin6->sin6_family = AF_INET6;
-   sin6->sin6_port = skb->h.uh->source;
-   sin6->sin6_flowinfo = 0;
-   sin6->sin6_scope_id = 0;
-   if (ipv6_addr_type(&sin6->sin6_addr) &
-   IPV6_ADDR_LINKLOCAL)
-   sin6->sin6_scope_id = IP6CB(skb)->iif;
-   ipv6_addr_copy(&sin6->sin6_addr,
-   &skb->nh.ipv6h->saddr);
-   rqstp->rq_addrlen = sizeof(struct sockaddr_in);
-   /* Remember which interface received this request */
-   ipv6_addr_copy(&rqstp->rq_daddr.addr6,
-   &skb->nh.ipv6h->saddr);
-   }
+   struct in6_pktinfo *pki = CMSG_DATA(cmh);
+   ipv6_addr_copy(&rqstp->rq_daddr.addr6, &pki->ipi6_addr);
break;
+   }
 #endif
}
-   return;
 }
 
 /*
@@ -771,7 +749,15 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
struct svc_sock *svsk = rqstp->rq_sock;
struct svc_serv *serv = svsk->sk_server;
struct sk_buff  *skb;
+   char buffer[CMSG_SPACE(sizeof(union svc_pktinfo_u))];
+   struct cmsghdr *cmh = (struct cmsghdr *)buffer;
int err, len;
+   struct msghdr msg = {
+   .msg_name = svc_addr(rqstp),
+   .msg_control = cmh,
+   .msg_controllen = sizeof(buffer),
+   .msg_flags = MSG_DONTWAIT,
+   };
 
if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags))
/* udp sockets need large rcvbuf as all pending
@@ -797,7 +783,9 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
}
 
clear_bit(SK_DATA, &svsk->sk_flags);
-   while ((skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err)) == NULL) {
+   while ((err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
+ 0, 0, MSG_PEEK)) < 0 ||
+  (skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err)) == NULL) {
if (err == -EAGAIN) {
svc_sock_received(svsk);
return err;
@@ -805,6 +793,7 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
/* possibly an icmp error */
dprintk("svc: recvfrom returned error %d\n", -err);
}
+   rqstp->rq_addrlen = sizeof(rqstp->rq_addr);
if (skb->tstamp.off_sec == 0) {
struct timeval tv;
 
@@ -827,7 +816,7 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
 
rqstp->rq_prot = IPPROTO_UDP;
 
-   svc_udp_get_se

[PATCH 000 of 3] knfsd: Resolve IPv6 related link error

2007-03-01 Thread NeilBrown

Current mainline has a compile linkage problem if both
  CONFIG_IPV6=m
  CONFIG_SUNRPC=y

because net/sunrpc/svcsock.c conditionally used a function defined in the IPv6 
module.

These three patches resolve the issue.

The problem is caused because svcsock needs to get the source and
destination address for a udp packet, but doesn't want to just use
sock_recvmsg like userspace would as it wants to be able to use the
data directly out of the skbuff rather than copying it (when practical).

Currently it copies code from udp.c (both ipv4/ and ipv6/) and this
causes the problem.

This patch changes it to use kernel_recvmsg with a length of 0 and
flags of MSG_PEEK to get the addresses but leave the data untouched.
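
For illustration only, the same idea can be tried from userspace. The sketch
below is not the svcsock code: it assumes Linux, a plain recvmsg() instead of
kernel_recvmsg(), and the IP_PKTINFO socket option, and the names
peek_udp_addresses() and udp_peek_demo() are made up for this example. It
peeks with an empty iovec, so only the sender and destination addresses are
returned and the payload stays queued.

```c
#define _GNU_SOURCE             /* struct in_pktinfo on glibc */
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Peek at the next datagram on fd without consuming it.  On success,
 * *src holds the sender and *dst the destination address taken from the
 * IP_PKTINFO control message.  Returns 0 on success, -1 otherwise.
 * Blocks until a datagram is queued.
 */
static int peek_udp_addresses(int fd, struct sockaddr_in *src,
                              struct in_addr *dst)
{
    union {                     /* union keeps the buffer cmsghdr-aligned */
        char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
        struct cmsghdr align;
    } control;
    struct msghdr msg;
    struct cmsghdr *cmh;

    assert(src != NULL && dst != NULL);
    memset(&msg, 0, sizeof(msg));   /* msg_iov = NULL, msg_iovlen = 0 */
    msg.msg_name = src;
    msg.msg_namelen = sizeof(*src);
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(fd, &msg, MSG_PEEK) < 0)
        return -1;

    for (cmh = CMSG_FIRSTHDR(&msg); cmh != NULL; cmh = CMSG_NXTHDR(&msg, cmh)) {
        if (cmh->cmsg_level == IPPROTO_IP && cmh->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo pki;

            memcpy(&pki, CMSG_DATA(cmh), sizeof(pki));
            *dst = pki.ipi_addr;    /* destination from the IP header */
            return 0;
        }
    }
    return -1;                  /* no pktinfo: option not enabled? */
}

/* Send one datagram to ourselves over loopback and peek at its addresses. */
static int udp_peek_demo(void)
{
    struct sockaddr_in self, src;
    struct in_addr dst;
    socklen_t len = sizeof(self);
    char payload[4];
    int on = 1, rc = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&self, 0, sizeof(self));
    self.sin_family = AF_INET;
    self.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* port 0: kernel picks */

    if (setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on)) == 0 &&
        bind(fd, (struct sockaddr *)&self, sizeof(self)) == 0 &&
        getsockname(fd, (struct sockaddr *)&self, &len) == 0 &&
        sendto(fd, "ping", 4, 0, (struct sockaddr *)&self, sizeof(self)) == 4 &&
        peek_udp_addresses(fd, &src, &dst) == 0 &&
        src.sin_port == self.sin_port &&
        dst.s_addr == htonl(INADDR_LOOPBACK) &&
        /* the peek consumed nothing, so the payload is still queued */
        recv(fd, payload, sizeof(payload), 0) == 4 &&
        memcmp(payload, "ping", 4) == 0)
        rc = 0;

    close(fd);
    return rc;
}
```

udp_peek_demo() returns 0 when the peek reported the loopback sender and
destination and the later recv() still saw the whole payload.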

A small problem here is that kernel_recvmsg always checks the
checksum, so in the case of a large packet we will check the checksum
at a different time to when we copy it out into a buffer, which is not ideal.

So the second patch of this series avoids the check when recv_msg is
called with size==0 and flags==MSG_PEEK.  This change should be acked
by someone on netdev before going upstream!!!  The rest of the series
is still appropriate without the patch, it is just a small
optimisation.

Finally the last patch removes all the
  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
from sunrpc as it really isn't needed and just hides this sort of problem.

Patches 1 and 3 are suitable for 2.6.21.  Patch 2 needs confirmation.

Thanks,
NeilBrown

 [PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of code-copying
 [PATCH 002 of 3] knfsd: Avoid checksum checks when collecting metadata for a UDP packet.
 [PATCH 003 of 3] knfsd: Remove CONFIG_IPV6 ifdefs from sunrpc server code.

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Paul Mundt

On Fri, Mar 02, 2007 at 04:57:51AM +0100, Nick Piggin wrote:
> On Thu, Mar 01, 2007 at 07:05:48PM -0800, Christoph Lameter wrote:
> > On Thu, 1 Mar 2007, Andrew Morton wrote:
> > > For prioritisation purposes I'd judge that memory hot-unplug is of similar
> > > value to the antifrag work (because memory hot-unplug permits DIMM
> > > poweroff).
> > 
> > I would say that anti-frag / defrag enables memory unplug.
> 
> Well that really depends. If you want to have any sort of guaranteed
> amount of unplugging or shrinking (or hugepage allocating), then antifrag
> doesn't work because it is a heuristic.
> 
> One thing that worries me about anti-fragmentation is that people might
> actually start _using_ higher order pages in the kernel. Then fragmentation
> comes back, and it's worse because now it is not just the fringe hugepage or
> unplug users (who can anyway work around the fragmentation by allocating
> from reserve zones).
> 
There are two sides to that: the ability to use higher order pages in the
kernel also means that it's possible to use larger TLB entries while
keeping the base page size small, too. There are already many places in
the kernel that attempt to use the largest possible size when setting up
the entries, and this is something that those of us with tiny
software-managed TLBs are a huge fan of -- some platforms have even opted
to do perverse things such as scanning for contiguous PTEs and bumping to
the next order automatically at set_pte() time.
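
To make the "scan for contiguous PTEs" idea concrete, here is a rough
userspace model. It is only a sketch under stated assumptions, not any
platform's set_pte() code: the page table is modelled as a flat array of
frame numbers, 0 marks an empty slot, and largest_mappable_order() is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: pfn[i] is the physical frame mapped at virtual
 * page i, or 0 when the slot is empty.  Return the largest order (up to
 * max_order) such that the naturally aligned 2^order-page block
 * containing page idx is fully populated, physically contiguous and
 * physically aligned, so one larger TLB entry could cover it.
 */
static unsigned int largest_mappable_order(const unsigned long *pfn,
                                           size_t npages, size_t idx,
                                           unsigned int max_order)
{
    unsigned int order, best = 0;

    assert(pfn != NULL && idx < npages);
    for (order = 1; order <= max_order; order++) {
        size_t span = (size_t)1 << order;
        size_t start = idx & ~(span - 1);   /* align down to the block */
        size_t i;

        if (start + span > npages)
            break;                          /* block runs off the table */
        if (pfn[start] == 0 || (pfn[start] & (span - 1)) != 0)
            break;                          /* empty or misaligned base */
        for (i = 1; i < span; i++)
            if (pfn[start + i] != pfn[start] + i)
                break;
        if (i < span)
            break;                          /* hole or discontiguity */
        best = order;
    }
    return best;
}
```

Stopping at the first failing order is safe here because every larger
aligned block contains the smaller one, so it cannot be contiguous and
aligned if the smaller block is not.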

Unplug is also interesting from a power management point of view.
Powering off is still more attractive than self-refresh, for example, but
could also be used at run-time depending on the workload.

Re: 2.6.21-rc1 and 2.6.21-rc2 kwin dies silently

2007-03-01 Thread Sid Boyce

Avi Kivity wrote:

Sid Boyce wrote:
> That's very much appreciated. The point is that all vanilla kernels up
> to 2.6.20+ have not had the problems now seen on 2.6.20-rc1 and
> 2.6.20-rc2 and like other problems reported, sic framebuffer, etc.,
> there is a distinct likelihood that it's related to those kernels and
> worth reporting here where it will also be seen by the openSUSE kernel
> developers.

Try running an strace on kwin and reporting the result.

Modified /opt/kde3/bin/startkde as below, but got no output, not even 
an empty file.

strace -s 256 -f kwin --lock -o /home/lancelot/KWIN.out &

Perhaps that line is never executed.

Try running kwin from your konsole after it dies, with the strace of 
course. Oh, and put the '-o ...' before the kwin command, not after.

Oops!, above text should read the same as the subject line,  problems 
seen on 2.6.21-rc1 and 2.6.21-rc2.
The strace is huge 2737627 2007-03-02 03:28 KWIN.out.  Further digging 
shows kwin, kicker and klauncher and perhaps other kdeinit stuff also 
die - no desktop icons after those 3 are started from the commandline. 
Moving kdesktop_lock out of /opt/kde3/bin, everything comes back after 
the video is blanked -- no password required.
I shall run like that (2.6.21-rc2-git1 currently) and wait for openSUSE 
to upgrade to 2.6.21. I can send the straces of kicker and kwin on if 
you think it's still worth it.

Thanks and Regards
Sid.

--
Sid Boyce ... Hamradio License G3VBV, Licensed Private Pilot
Emeritus IBM/Amdahl Mainframes and Sun/Fujitsu Servers Tech Support Specialist, 
Cricket Coach
Microsoft Windows Free Zone - Linux used for all Computing Tasks


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Nick Piggin

On Thu, Mar 01, 2007 at 08:06:25PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > I would say that anti-frag / defrag enables memory unplug.
> > 
> > Well that really depends. If you want to have any sort of guaranteed
> > amount of unplugging or shrinking (or hugepage allocating), then antifrag
> > doesn't work because it is a heuristic.
> 
> We would need additional measures such as real defrag and make more 
> structure movable.
> 
> > One thing that worries me about anti-fragmentation is that people might
> > actually start _using_ higher order pages in the kernel. Then fragmentation
> > comes back, and it's worse because now it is not just the fringe hugepage or
> > unplug users (who can anyway work around the fragmentation by allocating
> > from reserve zones).
> 
> Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
> in order to reduce overhead of managing page structs for large I/O and 
> large memory applications. We need appropriate measures to deal with the 
> fragmentation problem.

I don't understand why, out of any architecture, ia64 would have to hack
around this in software :(

> > > That's a value judgement that I doubt. Zone based balancing is bad and has
> > > been repeatedly patched up so that it works with the usual loads.
> > 
> > Shouldn't we fix it instead of deciding it is broken and add another layer
> > on top that supposedly does better balancing?
> 
> We need to reduce the real hardware zones as much as possible. Most high 
> performance architectures have no need for additional DMA zones f.e. and
> do not have to deal with the complexities that arise there.

And then you want to add something else on top of them?

> > But just because zones are hardware _now_ doesn't mean they have to stay
> > that way. The upshot is that a lot of work for zones is already there.
> 
> Well you cannot get there without the nodes. The control of memory 
> allocations with user space support etc only comes with the nodes.
> 
> > > A. moveable/unmovable
> > > B. DMA restrictions
> > > C. container assignment.
> > 
> > There are alternatives to adding a new layer of virtual zones. We could try
> > using zones, even.
> 
> No, merge them into one thing and handle them as one. No difference between
> zones and nodes anymore.
>  
> > zones aren't perfect right now, but they are quite similar to what you
> > want (ie. blocks of memory). I think we should first try to generalise what
> > we have rather than adding another layer.
> 
> Yes that would mean merging nodes and zones. So "nones".

Yes, this is what Andrew just said. But you then wanted to add virtual zones
or something on top. I just don't understand why. You agree that merging
nodes and zones is a good idea. Did I miss the important post where some
bright person discovered why merging zones and "virtual zones" is a bad
idea?


Re: [PATCH 4/9] Vmi fix highpte

2007-03-01 Thread Jeremy Fitzhardinge

Jeremy Fitzhardinge wrote:
> Hm, I don't think this interface will work for Xen.  In Xen, whenever a
> pagetable page gets mapped, it must be mapped RO.  map_pt_hook gets
> called after the mapping has already been created, so its too late for Xen.
>
> I was planning on adding kmap_atomic_pte() for use in pte_offset_map*(),
> which would be wired through to paravirt_ops to allow Xen to make this a
> RO mapping.  Would this be sufficient for you to do your vmi thing?
>   

Something like this (compiled, untested).

J

diff -r 972e84c265cf arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c   Thu Mar 01 19:12:49 2007 -0800
+++ b/arch/i386/kernel/paravirt.c   Thu Mar 01 19:38:42 2007 -0800
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* nop stub */
 void _paravirt_nop(void)
@@ -605,6 +606,8 @@ struct paravirt_ops paravirt_ops = {
 
.kpte_clear_flush = native_kpte_clear_flush,
 
+   .kmap_atomic_pte = native_kmap_atomic_pte,
+
 #ifdef CONFIG_X86_PAE
.set_pte_atomic = native_set_pte_atomic,
.set_pte_present = native_set_pte_present,
diff -r 972e84c265cf arch/i386/mm/highmem.c
--- a/arch/i386/mm/highmem.c   Thu Mar 01 19:12:49 2007 -0800
+++ b/arch/i386/mm/highmem.c   Thu Mar 01 19:38:42 2007 -0800
@@ -26,7 +26,7 @@ void kunmap(struct page *page)
  * However when holding an atomic kmap is is not legal to sleep, so atomic
  * kmaps are appropriate for short, tight code paths only.
  */
-void *kmap_atomic(struct page *page, enum km_type type)
+void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot)
 {
enum fixed_addresses idx;
unsigned long vaddr;
@@ -41,9 +41,14 @@ void *kmap_atomic(struct page *page, enu
return page_address(page);
 
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+   set_pte(kmap_pte-idx, mk_pte(page, prot));
 
return (void*) vaddr;
+}
+
+void *kmap_atomic(struct page *page, enum km_type type)
+{
+   return _kmap_atomic(page, type, kmap_prot);
 }
 
 void kunmap_atomic(void *kvaddr, enum km_type type)
diff -r 972e84c265cf arch/i386/xen/enlighten.c
--- a/arch/i386/xen/enlighten.c Thu Mar 01 19:12:49 2007 -0800
+++ b/arch/i386/xen/enlighten.c Thu Mar 01 19:38:42 2007 -0800
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -499,6 +500,11 @@ static void xen_release_pt(u32 pfn)
ClearPagePinned(page);
make_lowmem_page_readwrite(__va(PFN_PHYS(pfn)));
}
+}
+
+static void *xen_kmap_atomic_pte(struct page *page, enum km_type type)
+{
+   return _kmap_atomic(page, type, PAGE_KERNEL_RO);
 }
 
 static __init void xen_pagetable_setup_start(pgd_t *base)
@@ -688,6 +694,8 @@ static const struct paravirt_ops xen_par
 
.kpte_clear_flush = xen_kpte_clear_flush,
 
+   .kmap_atomic_pte = xen_kmap_atomic_pte,
+
.pte_val = xen_pte_val,
.pgd_val = xen_pgd_val,
 
diff -r 972e84c265cf include/asm-i386/highmem.h
--- a/include/asm-i386/highmem.h   Thu Mar 01 19:12:49 2007 -0800
+++ b/include/asm-i386/highmem.h   Thu Mar 01 19:38:42 2007 -0800
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* declarations for highmem.c */
 extern unsigned long highstart_pfn, highend_pfn;
@@ -67,10 +68,20 @@ extern void FASTCALL(kunmap_high(struct 
 
 void *kmap(struct page *page);
 void kunmap(struct page *page);
+void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot);
 void *kmap_atomic(struct page *page, enum km_type type);
 void kunmap_atomic(void *kvaddr, enum km_type type);
 void *kmap_atomic_pfn(unsigned long pfn, enum km_type type);
 struct page *kmap_atomic_to_page(void *ptr);
+
+static inline void *native_kmap_atomic_pte(struct page *page, enum km_type type)
+{
+   return kmap_atomic(page, type);
+}
+
+#ifndef CONFIG_PARAVIRT
+#define kmap_atomic_pte(page, type)    native_kmap_atomic_pte(page, type)
+#endif
 
 #define flush_cache_kmaps()    do { } while (0)
 
diff -r 972e84c265cf include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h   Thu Mar 01 19:12:49 2007 -0800
+++ b/include/asm-i386/paravirt.h   Thu Mar 01 19:38:42 2007 -0800
@@ -15,6 +15,9 @@
 
 #ifndef __ASSEMBLY__
 #include 
+#include 
+
+struct page;
 
 #define paravirt_type(type)        [paravirt_typenum] "i" (type)
 #define paravirt_clobber(clobber)  [paravirt_clobber] "i" (clobber)
@@ -372,6 +375,8 @@ struct paravirt_ops
 
pte_t (*ptep_get_and_clear)(pte_t *ptep);
 
+   void *(*kmap_atomic_pte)(struct page *page, enum km_type type);
+
 #ifdef CONFIG_X86_PAE
void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
void (*set_pte_present)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte);
@@ -695,6 +700,13 @@ static inline void paravirt_init_pda(str
 #define paravirt_alloc_pd_clone(pfn, clonepfn, start, count) \
PVOP_VCALL4(alloc_pd_cl

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Balbir Singh


Linus Torvalds wrote:


On Fri, 2 Mar 2007, Balbir Singh wrote:

My personal opinion is that while I'm not a huge fan of virtualization,
these kinds of things really _can_ be handled more cleanly at that layer,
and not in the kernel at all. Afaik, it's what IBM already does, and has
been doing for a while. There's no shame in looking at what already works,
especially if it's simpler.

Could you please clarify as to what "that layer" means - is it the
firmware/hardware for virtualization? or does it refer to user space?


Virtualization in general. We don't know what it is - in IBM machines it's 
a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
KVM, it's obviously a host Linux kernel/user-process combination.




Thanks for clarifying.

The point being that in the guests, hotunplug is almost useless (for 
bigger ranges), and we're much better off just telling the virtualization 
hosts on a per-page level whether we care about a page or not, than to 
worry about fragmentation.


And in hosts, we usually don't care EITHER, since it's usually done in a 
hypervisor.



It would also be useful to have a resource controller like per-container
RSS control (container refers to a task grouping) within the kernel or
non-virtualized environments as well.


.. but this has again no impact on anti-fragmentation.



Yes, I agree that anti-fragmentation and resource management are independent
of each other. I must admit to being a bit selfish here, in that my main
interest is in resource management and we would love to see a well
written and easy to understand resource management infrastructure and
controllers to control CPU and memory usage. Since the issue of
per-container RSS control came up, I wanted to ensure that we do not mix
up resource control and anti-fragmentation.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

Re: [Fastboot] [PATCH RFC 0/5] hard_smp_processor_id overhaul

2007-03-01 Thread Vivek Goyal

On Thu, Mar 01, 2007 at 09:06:48AM -0500, Benjamin LaHaise wrote:
> On Thu, Mar 01, 2007 at 04:16:13PM +0900, Fernando Luis Vázquez Cao wrote:
> > As a consequence, the hardcoding of hard_smp_processor_id() to 0 on UP
> > systems (see "linux/smp.h") is not correct.
> > 
> > This patch-set does the following:
> > 
> > 1- Remove hardcoding of hard_smp_processor_id on UP systems.
> 
> NAK.  This has to be configurable, as many embedded systems don't even 
> have APICs.  Please rework the patch set so that there is not any overhead 
> for existing UP systems.

Fernando did a code audit and found no instance of hard_smp_processor_id
being used in the non-APIC case. So are the embedded systems you are
referring to patching the kernel?

Anyway, I think providing a hard_smp_processor_id() definition for UP systems
without an APIC does no harm.

Thanks
Vivek

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > I would say that anti-frag / defrag enables memory unplug.
> 
> Well that really depends. If you want to have any sort of guaranteed
> amount of unplugging or shrinking (or hugepage allocating), then antifrag
> doesn't work because it is a heuristic.

We would need additional measures such as real defrag and make more 
structure movable.

> One thing that worries me about anti-fragmentation is that people might
> actually start _using_ higher order pages in the kernel. Then fragmentation
> comes back, and it's worse because now it is not just the fringe hugepage or
> unplug users (who can anyway work around the fragmentation by allocating
> from reserve zones).

Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
in order to reduce overhead of managing page structs for large I/O and 
large memory applications. We need appropriate measures to deal with the 
fragmentation problem.

> > That's a value judgement that I doubt. Zone based balancing is bad and has
> > been repeatedly patched up so that it works with the usual loads.
> 
> Shouldn't we fix it instead of deciding it is broken and add another layer
> on top that supposedly does better balancing?

We need to reduce the real hardware zones as much as possible. Most high 
performance architectures have no need for additional DMA zones f.e. and
do not have to deal with the complexities that arise there.

> But just because zones are hardware _now_ doesn't mean they have to stay
> that way. The upshot is that a lot of work for zones is already there.

Well you cannot get there without the nodes. The control of memory 
allocations with user space support etc only comes with the nodes.

> > A. moveable/unmovable
> > B. DMA restrictions
> > C. container assignment.
> 
> There are alternatives to adding a new layer of virtual zones. We could try
> using zones, even.

No, merge them into one thing and handle them as one. No difference between
zones and nodes anymore.

> zones aren't perfect right now, but they are quite similar to what you
> want (ie. blocks of memory). I think we should first try to generalise what
> we have rather than adding another layer.

Yes that would mean merging nodes and zones. So "nones".


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Andrew Morton

On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote:

> In other words, I really don't see a huge upside. I see *lots* of 
> downsides, but upsides? Not so much. Almost everybody who wants unplug 
> wants virtualization, and right now none of the "big virtualization" 
> people would want to have kernel-level anti-fragmentation anyway since
> they'd need to do it on their own.

Agree with all that, but you're missing the other application: power
saving.  FBDIMMs take eight watts a pop.  If we can turn them off when the
system is unloaded we save either four or all eight watts (assuming we can
get Intel to part with the information which is needed to do this.  I fear
an ACPI method will ensue).

There's a whole lot of complexity and work in all of this, but 24*8 watts
is a lot of watts, and it's worth striving for.


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Nick Piggin

On Thu, Mar 01, 2007 at 07:05:48PM -0800, Christoph Lameter wrote:
> On Thu, 1 Mar 2007, Andrew Morton wrote:
> > For prioritisation purposes I'd judge that memory hot-unplug is of similar
> > value to the antifrag work (because memory hot-unplug permits DIMM
> > poweroff).
> 
> I would say that anti-frag / defrag enables memory unplug.

Well that really depends. If you want to have any sort of guaranteed
amount of unplugging or shrinking (or hugepage allocating), then antifrag
doesn't work because it is a heuristic.

One thing that worries me about anti-fragmentation is that people might
actually start _using_ higher order pages in the kernel. Then fragmentation
comes back, and it's worse because now it is not just the fringe hugepage or
unplug users (who can anyway work around the fragmentation by allocating
from reserve zones).

> > Our basic unit of memory management is the zone.  Right now, a zone maps
> > onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
> 
> That's a value judgement that I doubt. Zone based balancing is bad and has
> been repeatedly patched up so that it works with the usual loads.

Shouldn't we fix it instead of deciding it is broken and add another layer
on top that supposedly does better balancing?

> > suspect that a good way to solve both per-container RSS and mem hotunplug
> > is to split the zone concept away from its hardware limitations: create a
> > "software zone" and a "hardware zone".  All the existing page allocator and
> > reclaim code remains basically unchanged, and it operates on "software
> > zones".  Each software zones always lies within a single hardware zone. 
> > The software zones are resizeable.  For per-container RSS we give each
> > container one (or perhaps multiple) resizeable software zones.
> 
> Resizable software zones? Are they contiguous or not? If not then we
> add another layer to the defrag problem.

I think Andrew is proposing that we work out what the problem is first.
I don't know what the defrag problem is, but I know that fragmentation
is unavoidable unless you have fixed size areas for each different size
of unreclaimable allocation.

> > NUMA and cpusets screwed up: they've gone and used nodes as their basic
> > unit of memory management whereas they should have used zones.  This will
> > need to be untangled.
> 
> zones have hardware characteristics at its core. In a NUMA setting zones 
> determine the performance of loads from those areas. I would like to have
> zones and nodes merged. Maybe extend node numbers into the negative area
> -1 = DMA -2 DMA32 etc? All systems then manage the "nones" (node / zones 
> merged). One could create additional "virtual" nones after the real nones
> that have hardware characteristics behind them. The virtual nones would be 
> something like the software zones? Contain MAX_ORDER portions of hardware 
> nones?

But just because zones are hardware _now_ doesn't mean they have to stay
that way. The upshot is that a lot of work for zones is already there.

> > Anyway, that's just a shot in the dark.  Could be that we implement unplug
> > and RSS control by totally different means.  But I do wish that we'd sort
> > out what those means will be before we potentially complicate the story a
> > lot by adding antifragmentation.
> 
> Hmmm My shot:
> 
> 1. Merge zones/nodes
> 
> 2. Create new virtual zones/nodes that are subsets of MAX_order blocks of 
> the real zones/nodes. These may then have additional characteristics such
> as 
> 
> A. moveable/unmovable
> B. DMA restrictions
> C. container assignment.

There are alternatives to adding a new layer of virtual zones. We could try
using zones, even.

zones aren't perfect right now, but they are quite similar to what you
want (ie. blocks of memory). I think we should first try to generalise what
we have rather than adding another layer.


Re: + fully-honor-vdso_enabled.patch added to -mm tree

2007-03-01 Thread Paul Mundt

On Thu, Mar 01, 2007 at 08:52:07PM +0300, Oleg Nesterov wrote:
> > --- a/arch/i386/kernel/sysenter.c~fully-honor-vdso_enabled
> > +++ a/arch/i386/kernel/sysenter.c
> > @@ -22,6 +22,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >
> >  /*
> >   * Should the kernel map a VDSO page into processes and pass its
> > @@ -105,10 +107,25 @@ int arch_setup_additional_pages(struct l
> >  {
> > struct mm_struct *mm = current->mm;
> > unsigned long addr;
> > +   unsigned long flags;
> > int ret;
> >
> > +   switch (vdso_enabled) {
> > +   case 0:  /* none */
> > +   return 0;
> 
> This means we don't initialize mm->context.vdso and ->sysenter_return.
> 
> Is it ok? For example, setup_rt_frame() uses VDSO_SYM(&__kernel_rt_sigreturn),
> sysenter_past_esp pushes ->sysenter_return on stack.
> 
The setup_rt_frame() case is fairly straightforward: both PPC and SH
already check to make sure there's a valid context before trying to use
VDSO_SYM(). I'm not sure why x86 doesn't.

Though I wonder if there's any point in checking binfmt->hasvdso here?
There shouldn't be a valid mm->context.vdso in the !hasvdso case..

Someone else will have to comment on ->sysenter_return.

Signed-off-by: Paul Mundt <[EMAIL PROTECTED]>

--

 arch/i386/kernel/signal.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/i386/kernel/signal.c b/arch/i386/kernel/signal.c
index 4f99e87..f778d34 100644
--- a/arch/i386/kernel/signal.c
+++ b/arch/i386/kernel/signal.c
@@ -350,7 +350,7 @@ static int setup_frame(int sig, struct k_sigaction *ka,
goto give_sigsegv;
}
 
-   if (current->binfmt->hasvdso)
+   if (current->binfmt->hasvdso && current->mm->context.vdso)
restorer = (void *)VDSO_SYM(&__kernel_sigreturn);
else
restorer = (void *)&frame->retcode;

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Linus Torvalds

On Fri, 2 Mar 2007, Balbir Singh wrote:
>
> > My personal opinion is that while I'm not a huge fan of virtualization,
> > these kinds of things really _can_ be handled more cleanly at that layer,
> > and not in the kernel at all. Afaik, it's what IBM already does, and has
> > been doing for a while. There's no shame in looking at what already works,
> > especially if it's simpler.
> 
> Could you please clarify as to what "that layer" means - is it the
> firmware/hardware for virtualization? or does it refer to user space?

Virtualization in general. We don't know what it is - in IBM machines it's 
a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
KVM, it's obviously a host Linux kernel/user-process combination.

The point being that in the guests, hotunplug is almost useless (for 
bigger ranges), and we're much better off just telling the virtualization 
hosts on a per-page level whether we care about a page or not, than to 
worry about fragmentation.

And in hosts, we usually don't care EITHER, since it's usually done in a 
hypervisor.

> It would also be useful to have a resource controller like per-container
> RSS control (container refers to a task grouping) within the kernel or
> non-virtualized environments as well.

.. but this has again no impact on anti-fragmentation.

In other words, I really don't see a huge upside. I see *lots* of 
downsides, but upsides? Not so much. Almost everybody who wants unplug 
wants virtualization, and right now none of the "big virtualization" 
people would want to have kernel-level anti-fragmentation anyway since
they'd need to do it on their own.

Linus

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread KAMEZAWA Hiroyuki

On Thu, 1 Mar 2007 16:09:15 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Thu, 1 Mar 2007 10:12:50 +
> [EMAIL PROTECTED] (Mel Gorman) wrote:
> 
> > Any opinion on merging these patches into -mm
> > for wider testing?
> 
> I'm a little reluctant to make changes to -mm's core mm unless those
> changes are reasonably certain to be on track for mainline, so let's talk
> about that.
> 
> What worries me is memory hot-unplug and per-container RSS limits.  We
> don't know how we're going to do either of these yet, and it could well be
> that the anti-frag work significantly complexicates whatever we end up
> doing there.
> 
> For prioritisation purposes I'd judge that memory hot-unplug is of similar
> value to the antifrag work (because memory hot-unplug permits DIMM
> poweroff).

About memory hot-unplug: I'm writing a new memory-unplug patch set to show
my overview and roadmap. I'm debugging it now, and I think I will be able to
post it as an RFC within a week.

At least, ZONE_MOVABLE (or something that partitions memory) is necessary for
memory hot-unplug such as DIMM power-off. (I'm currently using my own
ZONE_MOVABLE patch, but it is O.K. to migrate to Mel's one if it's ready to
be merged.)


> Our basic unit of memory management is the zone.  Right now, a zone maps
> onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone".  All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones".  Each software zones always lies within a single hardware zone. 
> The software zones are resizeable.  For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.
> 
> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.
> 
Hmm... software zones seem attractive.
I remember someone posted a pseudo-zone (pzone) patch in the past.

-Kame


Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios

2007-03-01 Thread Andrew Morton

On Fri, 2 Mar 2007 02:29:19 + Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> On Thu, Mar 01, 2007 at 05:42:04PM -0800, Andrew Morton wrote:
> > Something funny is going on here.
> 
> Not so funny for those who've tried to sort out the issue over
> the past years and just got ignored..
> 
> > Generally, one should increment the refcount of a page when it is put into
> > some container.  That means that the page should get +1 when it is added to
> > a bio.  (direct-io does this, but the mpage.c pagecache code cheats, and
> > relies upon PG_locked and PG_writeback protecting the page).
> 
> It's a slab page, and slab pages aren't refcounted (which is a good thing
> as you don't own the whole page)

ah, I see.

> > Similarly, the network code (or its caller) should be incrementing the
> > page's refcount as the page goes into a container (ie: the skb) and
> > decrementing it as the page is removed.
> > 
> > But someone somewhere is breaking those rules.  Who?
> 
> slab code.  

Well I spose slab _could_ take a ref on these pages.

> > So.  Who is breaking refcounting protocol here?  Perhaps it is AOE, failing
> > to increment the refcount on pages as they are added to an skb?
> > 
> > (Do we know which callsite in XFS is adding zero-ref pages to a BIO, btw?)
> 
> For example all log I/O is done from kmalloc'ed pages.
> 
> Anyway, to rehash what I've been trying to get clarified for ages:
> 
> 
>  (1) should we allow passing slab pages into bios
> 
> and
> 
>  (2) if yes what's the way lower layers are supposed to handle them
>  for any possible refcounting operations like networking or rdma.
> 
> There's also a potential caller in ext3 that can send down kmalloc'ed
> buffers: journal_write_metadata_buffer() in need_copy_out && !done_copy_out
> case.  But apparently that's an almost dead code path as I've never
> seen anyone tripping this one, it's always XFS that people report.

OK.  Let's go through it.

Networking internally maintains caller memory lifetimes, and it assumes
that the caller allocated memory via __alloc_pages() - because it uses
get_page() and put_page().

BIO, however, does not internally manage caller memory lifetime.  This is
because the caller's ->bi_end_io is always called, so the caller can do it.

So where we've come unstuck is in a module which has gone and fed BIO
memory into networking.  The differing design philosophies are clashing.

I'm surprised this doesn't happen in other places - aren't there any other
drivers which take a BIO and stuff it down the network?

Anyway, where's the bug?

Really, I'd say it's XFS (and ext3).  Even though BIO doesn't presently
manage page lifetimes, it _could_.  After all, the function is called
bio_add_page(), not bio_add_virtual_address().  It's a bit hacky to kmalloc
some memory, run virt_to_page() and to then present that page to BIO even
though the caller (thanks to the slab optimisation) doesn't actually have
control of that page's lifetime.

So we have a few options to look at:

a) kludge things in AOE.  Unpleasing, and might cause memory leaks
   (although it won't, because the caller hasn't run bi_end_io yet).

b) Take a ref on slab pages in slab.  A bit costly, perhaps.

c) teach ext3 and XFS to take a ref on these pages as they are added to
   the BIOs, undo that ref in bi_end_io.

I think c)?
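
For what it's worth, option c) can be modelled in a few lines of plain
userspace C. This is only a sketch under stated assumptions: the demo_* types
and functions are made up for illustration and are not the kernel's
struct page, struct bio, get_page() or put_page(). It shows the submitter
pinning each page when it goes into the container and unpinning it at
completion, so that a lower layer doing its own get/put never sees the count
reach zero in between.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and struct bio; not kernel types. */
struct demo_page {
	int refcount;
};

#define DEMO_BIO_MAX_PAGES 4

struct demo_bio {
	struct demo_page *pages[DEMO_BIO_MAX_PAGES];
	size_t nr_pages;
};

static void demo_get_page(struct demo_page *page)
{
	page->refcount++;
}

static void demo_put_page(struct demo_page *page)
{
	/* Dropping a ref on a zero-ref page is the failure being discussed. */
	assert(page->refcount > 0);
	page->refcount--;
}

/* Submitter side: pin the page for as long as the bio holds it. */
static int demo_bio_add_page(struct demo_bio *bio, struct demo_page *page)
{
	if (bio->nr_pages == DEMO_BIO_MAX_PAGES)
		return -1;
	demo_get_page(page);
	bio->pages[bio->nr_pages++] = page;
	return 0;
}

/* Completion side (the bi_end_io analogue): undo the refs taken above. */
static void demo_bio_end_io(struct demo_bio *bio)
{
	size_t i;

	for (i = 0; i < bio->nr_pages; i++)
		demo_put_page(bio->pages[i]);
	bio->nr_pages = 0;
}

/* Lower layer (the skb analogue): get on the way in, put on the way out. */
static void demo_lower_layer_send(struct demo_page *page)
{
	demo_get_page(page);
	/* ... the payload would be transmitted here ... */
	demo_put_page(page);
}
```

With the ref taken in demo_bio_add_page(), a page that starts at zero sits at
1 while the bio is in flight, goes to 2 and back to 1 inside the lower layer,
and returns to 0 only after demo_bio_end_io().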


Re: [PATCH] rcutorture: Mark rcu_torture_init as __init

2007-03-01 Thread Paul E. McKenney

On Thu, Mar 01, 2007 at 11:29:03AM -0800, Josh Triplett wrote:

Acked-by: Paul E. McKenney <[EMAIL PROTECTED]>
> Signed-off-by: Josh Triplett <[EMAIL PROTECTED]>
> ---
> The corresponding rcu_torture_cleanup cannot get marked as __exit, because
> rcu_torture_init uses it to clean up if init fails.
> 
>  kernel/rcutorture.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
> index 7258bcb..df49eca 100644
> --- a/kernel/rcutorture.c
> +++ b/kernel/rcutorture.c
> @@ -866,7 +866,7 @@ rcu_torture_cleanup(void)
>   rcu_torture_print_module_parms("End of test: SUCCESS");
>  }
>  
> -static int
> +static int __init
>  rcu_torture_init(void)
>  {
>   int i;
> 



Re: [PATCH RFC 0/5] hard_smp_processor_id overhaul

2007-03-01 Thread Fernando Luis Vázquez Cao

On Thu, 2007-03-01 at 09:06 -0500, Benjamin LaHaise wrote:
> On Thu, Mar 01, 2007 at 04:16:13PM +0900, Fernando Luis Vázquez Cao wrote:
> > As a consequence, the hardcoding of hard_smp_processor_id() to 0 on UP
> > systems (see "linux/smp.h") is not correct.
> > 
> > This patch-set does the following:
> > 
> > 1- Remove hardcoding of hard_smp_processor_id on UP systems.
> 
> NAK.  This has to be configurable, as many embedded systems don't even 
> have APICs.  Please rework the patch set so that there is not any overhead 
> for existing UP systems.
In i386 (with the exception of voyager) and x86_64,
hard_smp_processor_id is not used anywhere in the kernel when there are
no APICs available.

Regarding the overhead, hard_smp_processor_id is used mostly during
initialization and doesn't seem to be used in any fast path in i386,
x86_64, and ia64. All the other architectures are not affected by this
patch, because I kept the hardcoding of hard_smp_processor_id on UP
kernels, and just moved the definition to asm/smp.h because it should be
handled by architecture-specific code.

So unless strictly necessary I would not like to make these patches
dependent on kdump.


Re: [PATCH 4/9] Vmi fix highpte

2007-03-01 Thread Jeremy Fitzhardinge

Zachary Amsden wrote:
> Provide a PT map hook for HIGHPTE kernels to designate where they are mapping
> page tables.  This information is required so the physical address of PTE
> updates can be determined; otherwise, the mm layer would have to carry the
> physical address all the way to each PTE modification callsite, which is
> even more hideous than the macros required to provide the proper hooks.
>
> So let's not mess up arch neutral code to achieve this, but keep the horror
> in an #ifdef HIGHPTE in include/asm-i386/pgtable.h.  I had to use macros
> here because some types are not yet defined in all the include paths for
> this header.
>
> This patch is absolutely required for HIGHPTE kernels to operate properly
> with VMI.
>   

Hm, I don't think this interface will work for Xen.  In Xen, whenever a
pagetable page gets mapped, it must be mapped RO.  map_pt_hook gets
called after the mapping has already been created, so it's too late for Xen.

I was planning on adding kmap_atomic_pte() for use in pte_offset_map*(),
which would be wired through to paravirt_ops to allow Xen to make this a
RO mapping.  Would this be sufficient for you to do your vmi thing?

J

Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Christoph Lameter

On Thu, 1 Mar 2007, Andrew Morton wrote:

> What worries me is memory hot-unplug and per-container RSS limits.  We
> don't know how we're going to do either of these yet, and it could well be
> that the anti-frag work significantly complexicates whatever we end up
> doing there.

Right now it seems that the per-container RSS limits differ from the 
statistics calculated per zone. There would be a conceptual overlap, but 
the containers are optional and track numbers differently. There is no RSS 
counter in a zone, for example.

Memory hot-unplug would directly tap into the anti-frag work. Essentially 
only the zone with movable pages would be unpluggable without additional 
measures. Making slab items and other allocations that are fixed movable 
requires work anyway. A new zone concept will not help.

> For prioritisation purposes I'd judge that memory hot-unplug is of similar
> value to the antifrag work (because memory hot-unplug permits DIMM
> poweroff).

I would say that anti-frag / defrag enables memory unplug.

> And I'd judge that per-container RSS limits are of considerably more value
> than antifrag (in fact per-container RSS might be a superset of antifrag,
> in the sense that per-container RSS and containers could be abused to fix
> the i-cant-get-any-hugepages problem, dunno).

They relate? How can a container perform antifrag? Meaning a container 
reserves a portion of a hardware zone and becomes a software zone.

> So some urgent questions are: how are we going to do mem hotunplug and
> per-container RSS?

Separately. There is no need to mingle these two together.

> Our basic unit of memory management is the zone.  Right now, a zone maps
> onto some hardware-imposed thing.  But the zone-based MM works *well*.  I

That's a value judgement that I doubt. Zone based balancing is bad and has 
been repeatedly patched up so that it works with the usual loads.

> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone".  All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones".  Each software zone always lies within a single hardware zone. 
> The software zones are resizeable.  For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.

Resizable software zones? Are they contiguous or not? If not then we
add another layer to the defrag problem.

> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.

So subzones indeed. How about calling the MAX_ORDER entities that Mel's 
patches create "software zones"?

> NUMA and cpusets screwed up: they've gone and used nodes as their basic
> unit of memory management whereas they should have used zones.  This will
> need to be untangled.

Zones have hardware characteristics at their core. In a NUMA setting zones 
determine the performance of loads from those areas. I would like to have
zones and nodes merged. Maybe extend node numbers into the negative area:
-1 = DMA, -2 = DMA32, etc.? All systems then manage the "nones" (nodes / zones 
merged). One could create additional "virtual" nones after the real nones 
that have hardware characteristics behind them. The virtual nones would be 
something like the software zones? Contain MAX_ORDER portions of hardware 
nones?

> Anyway, that's just a shot in the dark.  Could be that we implement unplug
> and RSS control by totally different means.  But I do wish that we'd sort
> out what those means will be before we potentially complicate the story a
> lot by adding antifragmentation.

Hmmm... My shot:

1. Merge zones/nodes

2. Create new virtual zones/nodes that are subsets of MAX_ORDER blocks of
the real zones/nodes. These may then have additional characteristics such as:

A. moveable/unmovable
B. DMA restrictions
C. container assignment.
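
To make point 2 a bit more concrete, here is a rough userspace sketch of what
such a virtual zone descriptor might carry. Everything in it is an assumption
for illustration: the demo_* names, the block size of 1024 pages, and the
overlap rule are invented here and do not come from any posted patch. The
unplug check simply encodes the idea that a physical range can only go away if
nothing unmovable lives in it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed size of one MAX_ORDER block in pages; illustrative only. */
#define DEMO_BLOCK_PAGES 1024UL

/* Hypothetical "real" zone/node with hardware behind it. */
struct demo_real_zone {
	unsigned long start_pfn;
	unsigned long nr_pages;
};

/* Hypothetical virtual zone: a run of whole blocks inside one real zone. */
struct demo_virtual_zone {
	const struct demo_real_zone *parent;
	unsigned long start_pfn;	/* aligned to DEMO_BLOCK_PAGES */
	unsigned long nr_blocks;
	bool movable;			/* A. movable/unmovable */
	bool dma_capable;		/* B. DMA restrictions */
	int container_id;		/* C. container, -1 if unassigned */
};

static unsigned long demo_vzone_end(const struct demo_virtual_zone *vz)
{
	return vz->start_pfn + vz->nr_blocks * DEMO_BLOCK_PAGES;
}

/* A virtual zone must be block-aligned and lie within its parent. */
static bool demo_vzone_valid(const struct demo_virtual_zone *vz)
{
	assert(vz->parent != NULL);
	return vz->start_pfn % DEMO_BLOCK_PAGES == 0 &&
	       vz->start_pfn >= vz->parent->start_pfn &&
	       demo_vzone_end(vz) <=
			vz->parent->start_pfn + vz->parent->nr_pages;
}

/* A pfn range is unpluggable only if no unmovable virtual zone overlaps it. */
static bool demo_range_unpluggable(const struct demo_virtual_zone *zones,
				   size_t nr_zones,
				   unsigned long start_pfn,
				   unsigned long nr_pages)
{
	unsigned long end_pfn = start_pfn + nr_pages;
	size_t i;

	for (i = 0; i < nr_zones; i++) {
		bool overlaps = zones[i].start_pfn < end_pfn &&
				demo_vzone_end(&zones[i]) > start_pfn;

		if (overlaps && !zones[i].movable)
			return false;
	}
	return true;
}
```

Whether the three characteristics belong in one descriptor or in separate
layers is exactly the open question; the sketch only shows that they can be
expressed as attributes of MAX_ORDER-sized subsets.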


Re: 2.6.20-rc1: CIFS cheers, NFS4 jeers

2007-03-01 Thread Florin Iucha

On Wed, Feb 28, 2007 at 09:52:34PM -0800, Andrew Morton wrote:
> On Mon, 26 Feb 2007 00:45:00 -0600 [EMAIL PROTECTED] (Florin Iucha) wrote:
> 
> > Hello, it's me and my 70 GB of photos again.
[snip]
> > Running 'top', one core is idle and the other is 99% waiting, while
> > the 'cp' program is in 'D' state.  Also, after NFSv4 stalls, invocations
> > of 'lsof' stall as well.  I can 'ssh' into the box without problems.
> 
> and
> 
> > The kernel on the client is 2.6.21-rc1 (but it echoes problems I
> > reported in December with 2.6.20 series as well) as can be seen from
> > the kernel logs.
> > 
> > I have corrected the links:
> > 
> >http://iucha.net/21-rc1/before.1
> >http://iucha.net/21-rc1/after.1
> >http://iucha.net/21-rc1/config-2.6.21-rc1
> > 
> 
> The relevant part is:
> 
> [ 1215.657827] cp D 00f86f105704 0  2859   2843 (NOTLB)
> [ 1215.657833]  81007343faa8 0082  81007343fb58
> [ 1215.657837]  0002 81007343faa8 0008 81007e578ee0
> [ 1215.657842]  810002f4a080 2150 81007e5790b8 00017343fb50
> [ 1215.657847] Call Trace:
> [ 1215.657852]  [] io_schedule+0x28/0x34
> [ 1215.657856]  [] sync_page+0x41/0x45
> [ 1215.657859]  [] __wait_on_bit+0x45/0x77
> [ 1215.657862]  [] sync_page+0x0/0x45
> [ 1215.657867]  [] wait_on_page_bit+0x6e/0x75
> [ 1215.657870]  [] wake_bit_function+0x0/0x2a
> [ 1215.657874]  [] pagevec_lookup_tag+0x22/0x2b
> [ 1215.657878]  [] wait_on_page_writeback_range+0x6e/0x142
> [ 1215.657885]  [] filemap_fdatawait+0x20/0x22
> [ 1215.657889]  [] filemap_write_and_wait+0x29/0x38
> [ 1215.657894]  [] nfs_setattr+0xa0/0x11a
> [ 1215.657897]  [] link_path_walk+0xe8/0xfc
> [ 1215.657902]  [] autoremove_wake_function+0x0/0x38
> [ 1215.657907]  [] poison_obj+0x27/0x32
> [ 1215.657910]  [] current_fs_time+0x3f/0x41
> [ 1215.657913]  [] __user_walk_fd+0x53/0x62
> [ 1215.657918]  [] notify_change+0x129/0x238
> [ 1215.657923]  [] do_utimes+0xfc/0x126
> [ 1215.657928]  [] _raw_spin_lock+0xf3/0xf9
> [ 1215.657933]  [] sys_futimesat+0x45/0x56
> [ 1215.657937]  [] sys_utimes+0x14/0x16
> [ 1215.657941]  [] system_call+0x7e/0x83
> 
> seems that we've simply lost an IO completion.
> 
> Was 2.6.19 OK?

I just tested, 2.6.19 is OK!  Kernel log output after the cp and sync
completed is at

http://iucha.net/19/before
http://iucha.net/19/after  (after echo t > /proc/sysrq-trigger)

When I get a chance I will try again, and report if it fails.  But so far
it seems fine: df and lsof work as expected.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163



[PATCH 7/9] Fix nohz compile.patch

2007-03-01 Thread Zachary Amsden

More goo from hrtimers integration.  We do compile and run properly with NO_HZ
enabled.  There was a period when we didn't because of a missing export, but
that was since fixed.

And with the clocksource code now firmly in place, we can get rid of code
that fixes up the wallclock, since this is done in the common infrastructure.
This actually fixes a timer bug as well, that was caused by do_settimeofday
no longer being callable with interrupts disabled due to the use of
on_each_cpu().

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r 5d41588419ab arch/i386/Kconfig
--- a/arch/i386/Kconfig Tue Feb 27 17:24:55 2007 -0800
+++ b/arch/i386/Kconfig Tue Feb 27 17:25:44 2007 -0800
@@ -220,7 +220,7 @@ config PARAVIRT
 
 config VMI
bool "VMI Paravirt-ops support"
-   depends on PARAVIRT && !NO_HZ
+   depends on PARAVIRT
default y
help
  VMI provides a paravirtualized interface to multiple hypervisors
diff -r 5d41588419ab arch/i386/kernel/vmi.c
--- a/arch/i386/kernel/vmi.c   Tue Feb 27 17:24:55 2007 -0800
+++ b/arch/i386/kernel/vmi.c   Tue Feb 27 18:46:26 2007 -0800
@@ -934,6 +934,7 @@ void __init vmi_init(void)
 #ifdef CONFIG_X86_IO_APIC
no_timer_check = 1;
 #endif
+   no_sync_cmos_clock = 1;
 
local_irq_restore(flags & X86_EFLAGS_IF);
 }
diff -r 5d41588419ab arch/i386/kernel/vmitime.c
--- a/arch/i386/kernel/vmitime.c   Tue Feb 27 17:24:55 2007 -0800
+++ b/arch/i386/kernel/vmitime.c   Tue Feb 27 18:47:51 2007 -0800
@@ -153,13 +153,6 @@ static void vmi_get_wallclock_ts(struct 
ts->tv_sec = wallclock;
 }
 
-static void update_xtime_from_wallclock(void)
-{
-   struct timespec ts;
-   vmi_get_wallclock_ts(&ts);
-   do_settimeofday(&ts);
-}
-
 unsigned long vmi_get_wallclock(void)
 {
struct timespec ts;
@@ -197,18 +190,10 @@ void __init vmi_time_init(void)
set_intr_gate(LOCAL_TIMER_VECTOR, apic_vmi_timer_interrupt);
 #endif
 
-   no_sync_cmos_clock = 1;
-
-   vmi_get_wallclock_ts(&xtime);
-   set_normalized_timespec(&wall_to_monotonic,
-   -xtime.tv_sec, -xtime.tv_nsec);
-
real_cycles_accounted_system = read_real_cycles();
-   update_xtime_from_wallclock();
per_cpu(process_times_cycles_accounted_cpu, 0) = read_available_cycles();
 
cycles_per_sec = vmi_timer_ops.get_cycle_frequency();
-
cycles_per_jiffy = cycles_per_sec;
(void)do_div(cycles_per_jiffy, HZ);
cycles_per_alarm = cycles_per_sec;

[PATCH 6/9] Pit override.patch

2007-03-01 Thread Zachary Amsden

The time_init_hook in paravirt-ops no longer functions in the correct manner
after the integration of the hrtimers code.  The problem is that now the
call path for time initialization is:

  time_init :
   late_time_init = hpet_time_init;

  late_time_init -> hpet_time_init:
   setup_pit_timer (BAD)
   do_time_init --> (via paravirt.h)
  time_init_hook --> (via arch_hooks.h)
  time_init_hook (in SUBARCH/setup.c)

If this isn't confusing enough, the paravirt case goes through an indirect
function pointer in the paravirt-ops table.  The problem is, by the time
the paravirt hook is called, the pit timer is already enabled.

But paravirt guests have their own timer, and don't want to use the PIT.
Rather than intensify the struggle for power going on here, just make it
all nice and simple and just unconditionally do all timer setup in
the late_time_init hook.  This also has the advantage of enabling timers
in the same place in all code paths, so everyone has the same bugs and
we don't have outliers who break other code because they turn on timer
too early or too late.

So the paravirt-ops time init function is now by default hpet_time_init,
which is the time init function used for native hardware.  Paravirt
guests have the chance to override this when they set up the paravirt-ops
table, and should need no change.
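
As a userspace illustration of the resulting flow (not the kernel code itself),
the selection can be modelled like this. The demo_* names and the counters are
hypothetical and exist only so the behaviour can be observed; the point is that
early init merely records a function pointer taken from the ops table, and the
one late hook is where a timer actually gets started, native or guest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, userspace-only model of the ops table; not the kernel's. */
struct demo_paravirt_ops {
	void (*time_init)(void);
};

static int demo_native_timer_started;
static int demo_guest_timer_started;

/* Native default: plays the role of hpet_time_init(). */
static void demo_native_time_init(void)
{
	demo_native_timer_started++;
}

/* What a guest might install in place of the native function. */
static void demo_guest_time_init(void)
{
	demo_guest_timer_started++;
}

static struct demo_paravirt_ops demo_ops = {
	.time_init = demo_native_time_init,
};

/* Plays the role of late_time_init: called later in boot. */
static void (*demo_late_time_init)(void);

/* Returns the chosen init function without calling it. */
static void (*demo_choose_time_init(void))(void)
{
	return demo_ops.time_init;
}

/* Early boot: only record which function to run later. */
static void demo_time_init(void)
{
	demo_late_time_init = demo_choose_time_init();
}

/* Later in boot: the single place where timers get enabled. */
static void demo_run_late_time_init(void)
{
	assert(demo_late_time_init != NULL);
	demo_late_time_init();
}
```

If a guest replaces demo_ops.time_init before demo_time_init() runs, the native
function is never called, which is the property the patch is after for the PIT.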

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r 2ae8eb19b227 arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c   Tue Feb 27 16:28:10 2007 -0800
+++ b/arch/i386/kernel/paravirt.c   Tue Feb 27 17:08:11 2007 -0800
@@ -494,7 +494,7 @@ struct paravirt_ops paravirt_ops = {
.memory_setup = machine_specific_memory_setup,
.get_wallclock = native_get_wallclock,
.set_wallclock = native_set_wallclock,
-   .time_init = time_init_hook,
+   .time_init = hpet_time_init,
.init_IRQ = native_init_IRQ,
 
.cpuid = native_cpuid,
diff -r 2ae8eb19b227 arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c   Tue Feb 27 16:28:10 2007 -0800
+++ b/arch/i386/kernel/time.c   Tue Feb 27 16:50:01 2007 -0800
@@ -262,14 +262,22 @@ void notify_arch_cmos_timer(void)
 
 extern void (*late_time_init)(void);
 /* Duplicate of time_init() below, with hpet_enable part added */
-static void __init hpet_time_init(void)
+void __init hpet_time_init(void)
 {
if (!hpet_enable())
setup_pit_timer();
-   do_time_init();
-}
-
+   time_init_hook();
+}
+
+/*
+ * This is called directly from init code; we must delay timer setup in the
+ * HPET case as we can't make the decision to turn on HPET this early in the
+ * boot process.
+ *
+ * The chosen time_init function will usually be hpet_time_init, above, but
+ * in the case of virtual hardware, an alternative function may be substituted.
+ */
 void __init time_init(void)
 {
-   late_time_init = hpet_time_init;
-}
+   late_time_init = choose_time_init();
+}
diff -r 2ae8eb19b227 include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h   Tue Feb 27 16:28:10 2007 -0800
+++ b/include/asm-i386/paravirt.h   Tue Feb 27 17:07:23 2007 -0800
@@ -186,9 +186,9 @@ static inline int set_wallclock(unsigned
return paravirt_ops.set_wallclock(nowtime);
 }
 
-static inline void do_time_init(void)
-{
-   return paravirt_ops.time_init();
+static inline void (*choose_time_init(void))(void)
+{
+   return paravirt_ops.time_init;
 }
 
 /* The paravirtualized CPUID instruction. */
diff -r 2ae8eb19b227 include/asm-i386/time.h
--- a/include/asm-i386/time.h   Tue Feb 27 16:28:10 2007 -0800
+++ b/include/asm-i386/time.h   Tue Feb 27 16:50:45 2007 -0800
@@ -28,13 +28,16 @@ static inline int native_set_wallclock(u
return retval;
 }
 
+extern void (*late_time_init)(void);
+extern void hpet_time_init(void);
+
 #ifdef CONFIG_PARAVIRT
 #include 
 #else /* !CONFIG_PARAVIRT */
 
 #define get_wallclock() native_get_wallclock()
 #define set_wallclock(x) native_set_wallclock(x)
-#define do_time_init() time_init_hook()
+#define choose_time_init() hpet_time_init
 
 #endif /* CONFIG_PARAVIRT */
 

[PATCH 9/9] Vmi smp fixes.patch

2007-03-01 Thread Zachary Amsden

Critical fixes for SMP.

Fix a couple functions which needed to be __devinit and fix a bogus
parameter to AP startup that just so happened to work because the
low virtual mapping of memory was still established.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r baf2e278a482 arch/i386/kernel/vmi.c
--- a/arch/i386/kernel/vmi.c   Thu Mar 01 18:08:53 2007 -0800
+++ b/arch/i386/kernel/vmi.c   Thu Mar 01 18:10:18 2007 -0800
@@ -525,13 +525,14 @@ void vmi_pmd_clear(pmd_t *pmd)
 #endif
 
 #ifdef CONFIG_SMP
-struct vmi_ap_state ap;
 extern void setup_pda(void);
 
-static void __init /* XXX cpu hotplug */
+static void __devinit
 vmi_startup_ipi_hook(int phys_apicid, unsigned long start_eip,
 unsigned long start_esp)
 {
+   struct vmi_ap_state ap;
+
/* Default everything to zero.  This is fine for most GPRs. */
memset(&ap, 0, sizeof(struct vmi_ap_state));
 
@@ -570,7 +571,7 @@ vmi_startup_ipi_hook(int phys_apicid, un
/* Protected mode, paging, AM, WP, NE, MP. */
ap.cr0 = 0x80050023;
ap.cr4 = mmu_cr4_features;
-   vmi_ops.set_initial_ap_state(__pa(&ap), phys_apicid);
+   vmi_ops.set_initial_ap_state((u32)&ap, phys_apicid);
 }
 #endif
 
diff -r baf2e278a482 arch/i386/kernel/vmitime.c
--- a/arch/i386/kernel/vmitime.c   Thu Mar 01 18:08:53 2007 -0800
+++ b/arch/i386/kernel/vmitime.c   Thu Mar 01 18:08:53 2007 -0800
@@ -243,7 +243,7 @@ void __init vmi_timer_setup_boot_alarm(v
 
 /* Initialize the time accounting variables for an AP on an SMP system.
  * Also, set the local alarm for the AP. */
-void __init vmi_timer_setup_secondary_alarm(void)
+void __devinit vmi_timer_setup_secondary_alarm(void)
 {
int cpu = smp_processor_id();
 

[PATCH 8/9] Vmi apic ops.diff

2007-03-01 Thread Zachary Amsden

Use para_fill instead of directly setting the APIC ops to the result of the
vmi_get_function call - this allows one to implement a VMI ROM without
implementing APIC functions, just using the native APIC functions.

While doing this, I realized that there is a lot more cleanup that should
have been done.  Basically, we should never assume that the ROM implements
a specific set of functions, and always allow fallback to the native
implementation.

This is critical for future compatibility.
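
The fallback rule can be shown with a small userspace model. This is a sketch
under assumptions, not the VMI code: the demo_* names are hypothetical, and a
NULL entry in a lookup table stands in for "the ROM does not implement this
call", whereas the real code inspects relocation types returned by the ROM.
The ops table starts out filled with native functions, and an entry is only
overwritten when the lookup yields an implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demo_op_fn)(int);

enum demo_call { DEMO_CALL_A, DEMO_CALL_B, DEMO_CALL_MAX };

/* Native implementations: the defaults every op starts with. */
static int demo_native_a(int x) { return x + 1; }
static int demo_native_b(int x) { return x + 2; }

/* The only call this pretend ROM implements. */
static int demo_rom_a(int x) { return x * 10; }

struct demo_ops {
	demo_op_fn a;
	demo_op_fn b;
};

static struct demo_ops demo_pv_ops = { demo_native_a, demo_native_b };

/* Hypothetical ROM contents: call B is left unimplemented. */
static demo_op_fn demo_rom_table[DEMO_CALL_MAX] = { demo_rom_a, NULL };

/* Return a pointer to the ROM function, or NULL if unimplemented. */
static demo_op_fn demo_get_function(enum demo_call call)
{
	assert(call < DEMO_CALL_MAX);
	return demo_rom_table[call];
}

/* Overwrite the native op only when the ROM provides one. */
#define demo_fill(opname, call)				\
do {							\
	demo_op_fn fn = demo_get_function(call);	\
	if (fn)						\
		demo_pv_ops.opname = fn;		\
} while (0)

static void demo_activate(void)
{
	demo_fill(a, DEMO_CALL_A);
	demo_fill(b, DEMO_CALL_B);
}
```

After demo_activate(), op a dispatches to the ROM version while op b still
points at the native one, so a ROM that implements fewer calls keeps working.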

Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>
Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r 0ba8434a5c7e arch/i386/kernel/vmi.c
--- a/arch/i386/kernel/vmi.c   Thu Mar 01 16:49:27 2007 -0800
+++ b/arch/i386/kernel/vmi.c   Thu Mar 01 16:49:33 2007 -0800
@@ -54,6 +54,7 @@ static int disable_tsc;
 static int disable_tsc;
 static int disable_mtrr;
 static int disable_noidle;
+static int disable_vmi_timer;
 
 /* Cached VMI operations */
 struct {
@@ -662,12 +663,12 @@ void vmi_bringup(void)
 void vmi_bringup(void)
 {
/* We must establish the lowmem mapping for MMU ops to work */
-   if (vmi_rom)
+   if (vmi_ops.set_linear_mapping)
vmi_ops.set_linear_mapping(0, __PAGE_OFFSET, max_low_pfn, 0);
 }
 
 /*
- * Return a pointer to the VMI function or a NOP stub
+ * Return a pointer to a VMI function or NULL if unimplemented
  */
 static void *vmi_get_function(int vmicall)
 {
@@ -678,12 +679,13 @@ static void *vmi_get_function(int vmical
if (rel->type == VMI_RELOCATION_CALL_REL)
return (void *)rel->eip;
else
-   return (void *)vmi_nop;
+   return NULL;
 }
 
 /*
  * Helper macro for making the VMI paravirt-ops fill code readable.
- * For unimplemented operations, fall back to default.
+ * For unimplemented operations, fall back to default, unless nop
+ * is returned by the ROM.
  */
 #define para_fill(opname, vmicall) \
 do {   \
@@ -692,8 +694,28 @@ do {   \
if (rel->type != VMI_RELOCATION_NONE) { \
BUG_ON(rel->type != VMI_RELOCATION_CALL_REL);   \
paravirt_ops.opname = (void *)rel->eip; \
+   } else if (rel->type == VMI_RELOCATION_NOP) \
+   paravirt_ops.opname = (void *)vmi_nop;  \
+} while (0)
+
+/*
+ * Helper macro for making the VMI paravirt-ops fill code readable.
+ * For cached operations which do not match the VMI ROM ABI and must
+ * go through a translation stub.  Ignore NOPs, since it is not clear
+ * a NOP VMI function corresponds to a NOP paravirt-op when the
+ * functions are not in 1-1 correspondence.
+ */
+#define para_wrap(opname, wrapper, cache, vmicall) \
+do {   \
+   reloc = call_vrom_long_func(vmi_rom, get_reloc, \
+   VMI_CALL_##vmicall);\
+   BUG_ON(rel->type == VMI_RELOCATION_JUMP_REL);   \
+   if (rel->type == VMI_RELOCATION_CALL_REL) { \
+   paravirt_ops.opname = wrapper;  \
+   vmi_ops.cache = (void *)rel->eip;   \
}   \
 } while (0)
+
 
 /*
  * Activate the VMI interface and switch into paravirtualized mode
@@ -731,13 +753,8 @@ static inline int __init activate_vmi(vo
 *  rdpmc is not yet used in Linux
 */
 
-   /* CPUID is special, so very special */
-   reloc = call_vrom_long_func(vmi_rom, get_reloc, VMI_CALL_CPUID);
-   if (rel->type != VMI_RELOCATION_NONE) {
-   BUG_ON(rel->type != VMI_RELOCATION_CALL_REL);
-   vmi_ops.cpuid = (void *)rel->eip;
-   paravirt_ops.cpuid = vmi_cpuid;
-   }
+   /* CPUID is special, so very special it gets wrapped like a present */
+   para_wrap(cpuid, vmi_cpuid, cpuid, CPUID);
 
para_fill(clts, CLTS);
para_fill(get_debugreg, GetDR);
@@ -754,6 +771,7 @@ static inline int __init activate_vmi(vo
para_fill(restore_fl, SetInterruptMask);
para_fill(irq_disable, DisableInterrupts);
para_fill(irq_enable, EnableInterrupts);
+
/* irq_save_disable !!! sheer pain */
patch_offset(&irq_save_disable_callout[IRQ_PATCH_INT_MASK],
 (char *)paravirt_ops.save_fl);
@@ -761,26 +779,18 @@ static inline int __init activate_vmi(vo
 (char *)paravirt_ops.irq_disable);
 
para_fill(wbinvd, WBINVD);
+   para_fill(read_tsc, RDTSC);
+
+   /* The following we emulate with trap and emulate for now */
/* paravirt_ops.read_msr = vmi_rdmsr */
/* paravirt_ops.write_msr = vmi_wrmsr */
-   para_fill(read_tsc, RDTSC);
/* paravirt_ops.rdpmc = vmi_rdpmc */
 
-   /* TR interface doesn't pass TR value

[PATCH 5/9] Paravirt drop udelay op

2007-03-01 Thread Zachary Amsden

Not respecting udelay causes problems with any virtual hardware that is
passed through to real hardware.  This can be noticed by any device that
interacts with the real world in real time - like AP startup, which takes
real time.  Or keyboard LEDs, which should blink in real-time.  Or floppy
drives, but only when passed through to a real floppy controller on OSes
which can't sufficiently buffer the floppy commands to emulate a zero
latency floppy.  Or IDE drives, when connecting to a physical CDROM.

This was mostly a hack to get the kernel to boot faster, but it introduced
a number of misvirtualization bugs, and Alan and Pavel argued pretty strongly
against it.  We were the only client, and now want to clean up this cruft.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r 135d1b73c878 arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c   Tue Feb 27 16:23:56 2007 -0800
+++ b/arch/i386/kernel/paravirt.c   Tue Feb 27 16:25:26 2007 -0800
@@ -538,7 +538,6 @@ struct paravirt_ops paravirt_ops = {
 
.set_iopl_mask = native_set_iopl_mask,
.io_delay = native_io_delay,
-   .const_udelay = __const_udelay,
 
 #ifdef CONFIG_X86_LOCAL_APIC
.apic_write = native_apic_write,
diff -r 135d1b73c878 arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c   Tue Feb 27 16:23:56 2007 -0800
+++ b/arch/i386/kernel/smpboot.c   Tue Feb 27 16:27:16 2007 -0800
@@ -33,11 +33,6 @@
  * Dave Jones  :   Report invalid combinations of Athlon CPUs.
 *  Rusty Russell   :   Hacked into shape for new "hotplug" boot process. */
 
-
-/* SMP boot always wants to use real time delay to allow sufficient time for
- * the APs to come online */
-#define USE_REAL_TIME_DELAY
-
 #include 
 #include 
 #include 
diff -r 135d1b73c878 arch/i386/kernel/vmi.c
--- a/arch/i386/kernel/vmi.c   Tue Feb 27 16:23:56 2007 -0800
+++ b/arch/i386/kernel/vmi.c   Tue Feb 27 16:28:00 2007 -0800
@@ -48,7 +48,6 @@ typedef u64 __attribute__((regparm(2))) 
 
 static struct vrom_header *vmi_rom;
 static int license_gplok;
-static int disable_nodelay;
 static int disable_pge;
 static int disable_pse;
 static int disable_sep;
@@ -801,9 +800,6 @@ static inline int __init activate_vmi(vo
 
para_fill(set_iopl_mask, SetIOPLMask);
paravirt_ops.io_delay = (void *)vmi_nop;
-   if (!disable_nodelay) {
-   paravirt_ops.const_udelay = (void *)vmi_nop;
-   }
 
para_fill(set_lazy_mode, SetLazyMode);
 
@@ -947,9 +943,7 @@ static int __init parse_vmi(char *arg)
if (!arg)
return -EINVAL;
 
-   if (!strcmp(arg, "disable_nodelay"))
-   disable_nodelay = 1;
-   else if (!strcmp(arg, "disable_pge")) {
+   if (!strcmp(arg, "disable_pge")) {
clear_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
disable_pge = 1;
} else if (!strcmp(arg, "disable_pse")) {
diff -r 135d1b73c878 include/asm-i386/delay.h
--- a/include/asm-i386/delay.h  Tue Feb 27 16:23:56 2007 -0800
+++ b/include/asm-i386/delay.h  Tue Feb 27 16:26:01 2007 -0800
@@ -16,13 +16,6 @@ extern void __const_udelay(unsigned long
 extern void __const_udelay(unsigned long usecs);
 extern void __delay(unsigned long loops);
 
-#if defined(CONFIG_PARAVIRT) && !defined(USE_REAL_TIME_DELAY)
-#define udelay(n) paravirt_ops.const_udelay((n) * 0x10c7ul)
-
-#define ndelay(n) paravirt_ops.const_udelay((n) * 5ul)
-
-#else /* !PARAVIRT || USE_REAL_TIME_DELAY */
-
 /* 0x10c7 is 2**32 / 100 (rounded up) */
 #define udelay(n) (__builtin_constant_p(n) ? \
((n) > 2 ? __bad_udelay() : __const_udelay((n) * 0x10c7ul)) : \
@@ -32,7 +25,6 @@ extern void __delay(unsigned long loops)
 #define ndelay(n) (__builtin_constant_p(n) ? \
((n) > 2 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \
__ndelay(n))
-#endif
 
 void use_tsc_delay(void);
 
diff -r 135d1b73c878 include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h   Tue Feb 27 16:23:56 2007 -0800
+++ b/include/asm-i386/paravirt.h   Tue Feb 27 16:25:39 2007 -0800
@@ -117,7 +117,6 @@ struct paravirt_ops
void (*set_iopl_mask)(unsigned mask);
 
void (*io_delay)(void);
-   void (*const_udelay)(unsigned long loops);
 
 #ifdef CONFIG_X86_LOCAL_APIC
void (*apic_write)(unsigned long reg, unsigned long v);

[PATCH 4/9] Vmi fix highpte

2007-03-01 Thread Zachary Amsden

Provide a PT map hook for HIGHPTE kernels to designate where they are mapping
page tables.  This information is required so the physical address of PTE
updates can be determined; otherwise, the mm layer would have to carry the
physical address all the way to each PTE modification callsite, which is
even more hideous than the macros required to provide the proper hooks.

So let's not mess up arch-neutral code to achieve this, but keep the horror
in an #ifdef HIGHPTE in include/asm-i386/pgtable.h.  I had to use macros
here because some types are not yet defined in all the include paths for
this header.

This patch is absolutely required for HIGHPTE kernels to operate properly
with VMI.
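
The slot selection done by the hook can be sketched in isolation.  This is
only an illustration based on the comment in vmi_map_pt_hook() in the diff
below: the enum values here are hypothetical stand-ins (the real ones come
from the kernel's kmap type enum), and pt_slot_for() is a made-up name.  The
only property assumed is that KM_PTE1 immediately follows KM_PTE0, which the
patch's own arithmetic relies on.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's kmap types.  The real values
 * differ; only their adjacency (KM_PTE1 == KM_PTE0 + 1) matters here. */
enum km_type_sketch { KM_PTE0 = 7, KM_PTE1 = 8 };

/* Slot 0 is used for the linear mapping of physical memory, so the two
 * HIGHPTE kmap types land in slots 1 and 2. */
static int pt_slot_for(int type)
{
	/* Mirrors the BUG_ON() in the patch: no other type is valid. */
	assert(type == KM_PTE0 || type == KM_PTE1);
	return (type - KM_PTE0) + 1;
}
```

With these stand-in values, pt_slot_for(KM_PTE0) gives slot 1 and
pt_slot_for(KM_PTE1) gives slot 2.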

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r 87bf6b2d338d arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c   Tue Feb 27 14:14:34 2007 -0800
+++ b/arch/i386/kernel/paravirt.c   Tue Feb 27 14:14:36 2007 -0800
@@ -553,6 +553,8 @@ struct paravirt_ops paravirt_ops = {
.flush_tlb_kernel = native_flush_tlb_global,
.flush_tlb_single = native_flush_tlb_single,
 
+   .map_pt_hook = (void *)native_nop,
+
.alloc_pt = (void *)native_nop,
.alloc_pd = (void *)native_nop,
.alloc_pd_clone = (void *)native_nop,
diff -r 87bf6b2d338d arch/i386/kernel/vmi.c
--- a/arch/i386/kernel/vmi.cTue Feb 27 14:14:34 2007 -0800
+++ b/arch/i386/kernel/vmi.cTue Feb 27 16:23:37 2007 -0800
@@ -370,6 +370,24 @@ static void vmi_check_page_type(u32 pfn,
 #define vmi_check_page_type(p,t) do { } while (0)
 #endif
 
+static void vmi_map_pt_hook(int type, pte_t *va, u32 pfn)
+{
+   /*
+* Internally, the VMI ROM must map virtual addresses to physical
+* addresses for processing MMU updates.  By the time MMU updates
+* are issued, this information is typically already lost.
+* Fortunately, the VMI provides a cache of mapping slots for active
+* page tables.
+*
+* We use slot zero for the linear mapping of physical memory, and
+* in HIGHPTE kernels, slot 1 and 2 for KM_PTE0 and KM_PTE1.
+* 
+*  args: SLOT VA COUNT PFN
+*/
+   BUG_ON(type != KM_PTE0 && type != KM_PTE1);
+   vmi_ops.set_linear_mapping((type - KM_PTE0)+1, (u32)va, 1, pfn);
+}
+
 static void vmi_allocate_pt(u32 pfn)
 {
vmi_set_page_type(pfn, VMI_PAGE_L1);
@@ -813,6 +831,7 @@ static inline int __init activate_vmi(vo
vmi_ops.allocate_page = vmi_get_function(VMI_CALL_AllocatePage);
vmi_ops.release_page = vmi_get_function(VMI_CALL_ReleasePage);
 
+   paravirt_ops.map_pt_hook = vmi_map_pt_hook;
paravirt_ops.alloc_pt = vmi_allocate_pt;
paravirt_ops.alloc_pd = vmi_allocate_pd;
paravirt_ops.alloc_pd_clone = vmi_allocate_pd_clone;
diff -r 87bf6b2d338d include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h   Tue Feb 27 14:14:34 2007 -0800
+++ b/include/asm-i386/paravirt.h   Tue Feb 27 16:21:22 2007 -0800
@@ -131,6 +131,8 @@ struct paravirt_ops
void (*flush_tlb_kernel)(void);
void (*flush_tlb_single)(u32 addr);
 
+   void (fastcall *map_pt_hook)(int type, pte_t *va, u32 pfn);
+
void (*alloc_pt)(u32 pfn);
void (*alloc_pd)(u32 pfn);
void (*alloc_pd_clone)(u32 pfn, u32 clonepfn, u32 start, u32 count);
@@ -354,6 +356,8 @@ static inline void startup_ipi_hook(int 
 #define __flush_tlb_global() paravirt_ops.flush_tlb_kernel()
 #define __flush_tlb_single(addr) paravirt_ops.flush_tlb_single(addr)
 
+#define paravirt_map_pt_hook(type, va, pfn) paravirt_ops.map_pt_hook(type, va, pfn)
+
 #define paravirt_alloc_pt(pfn) paravirt_ops.alloc_pt(pfn)
 #define paravirt_release_pt(pfn) paravirt_ops.release_pt(pfn)
 
diff -r 87bf6b2d338d include/asm-i386/pgtable.h
--- a/include/asm-i386/pgtable.hTue Feb 27 14:14:34 2007 -0800
+++ b/include/asm-i386/pgtable.hTue Feb 27 16:19:54 2007 -0800
@@ -263,6 +263,7 @@ static inline pte_t pte_mkhuge(pte_t pte
  */
 #define pte_update(mm, addr, ptep) do { } while (0)
 #define pte_update_defer(mm, addr, ptep)   do { } while (0)
+#define paravirt_map_pt_hook(slot, va, pfn)do { } while (0)
 #endif
 
 /*
@@ -469,10 +470,24 @@ extern pte_t *lookup_address(unsigned lo
 #endif
 
 #if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address) \
-   ((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
-#define pte_offset_map_nested(dir, address) \
-   ((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
+#define pte_offset_map(dir, address)   \
+({ \
+   pte_t *__ptep;  \
+   unsigned pfn = pmd_val(*(dir)) >> PAGE_SHIFT;   \
+   __ptep = (pte_t *)kmap_atomic(pfn_to_page(pfn),KM_PTE0);\
+   paravirt_map_pt_hook(KM_PTE0,__ptep, pfn);  \
+   __ptep

Re: + extend-print_symbol-capability.patch added to -mm tree

2007-03-01 Thread Randy Dunlap

On Thu, 01 Mar 2007 18:17:56 -0800 [EMAIL PROTECTED] wrote:

> Today's print_symbol function dumps a kernel symbol with printk.  This
> patch extends the functionality of kallsyms.c so that the symbol lookup
> function may be used without the printk.  This is useful for modules that
> want to dump symbols elsewhere, for example, to debugfs.  I intend to use
> the new function call in the GFS2 file system (which will be a separate
> patch).

Hey, I've needed this one in the past.  Thanks.

> ---
> 
>  include/linux/kallsyms.h |   10 ++
>  kernel/kallsyms.c|   21 ++---
>  2 files changed, 24 insertions(+), 7 deletions(-)
> 
> diff -puN kernel/kallsyms.c~extend-print_symbol-capability kernel/kallsyms.c
> --- a/kernel/kallsyms.c~extend-print_symbol-capability
> +++ a/kernel/kallsyms.c

> @@ -288,6 +285,15 @@ void __print_symbol(const char *fmt, uns
>   else
>   sprintf(buffer, "%s+%#lx/%#lx", name, offset, size);
>   }
> +}
> +
> +/* Replace "%s" in format with address, or returns -errno. */

Please fix the comment above...

> +void __print_symbol(const char *fmt, unsigned long address)
> +{
> + char buffer[KSYM_SYMBOL_LEN];
> +
> + lookup_symbol(address, buffer);
> +
>   printk(fmt, buffer);
>  }


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

[PATCH 3/9] Vmi cpu cycles.patch

2007-03-01 Thread Zachary Amsden

In order to share the common code in tsc.c which does CPU kHz calibration, we
need to make an accurate value of CPU speed available to the tsc.c code.
This value loses a lot of precision in a VM because of the timing differences
with real hardware, but we need it to be as precise as possible so the guest
can make accurate time calculations with the cycle counters.
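
As a rough userspace illustration of the conversion that vmi_cpu_khz() in the
diff below performs with do_div(), the cycle frequency reported in Hz is
simply divided down to kHz.  The function name hz_to_khz() is hypothetical
and plain 64-bit division stands in for do_div().

```c
#include <stdint.h>

/* Convert a cycle frequency in Hz to kHz by integer division.  The
 * remainder (up to 999 Hz) is discarded, just as the patch ignores the
 * remainder returned by do_div(). */
static unsigned long hz_to_khz(uint64_t cycles_per_sec)
{
	return (unsigned long)(cycles_per_sec / 1000);
}
```

For example, a reported frequency of 2,400,000,000 cycles/sec yields
2,400,000 kHz.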

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r b8b315c897bb arch/i386/kernel/vmi.c
--- a/arch/i386/kernel/vmi.cTue Feb 27 14:04:43 2007 -0800
+++ b/arch/i386/kernel/vmi.cTue Feb 27 14:06:46 2007 -0800
@@ -874,6 +874,7 @@ static inline int __init activate_vmi(vo
paravirt_ops.setup_secondary_clock = vmi_timer_setup_secondary_alarm;
 #endif
paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles;
+   paravirt_ops.get_cpu_khz = vmi_cpu_khz;
}
if (!disable_noidle)
para_fill(safe_halt, Halt);
diff -r b8b315c897bb arch/i386/kernel/vmitime.c
--- a/arch/i386/kernel/vmitime.cTue Feb 27 14:04:43 2007 -0800
+++ b/arch/i386/kernel/vmitime.cTue Feb 27 14:06:46 2007 -0800
@@ -177,6 +177,15 @@ unsigned long long vmi_get_sched_cycles(
return read_available_cycles();
 }
 
+unsigned long vmi_cpu_khz(void)
+{
+   unsigned long long khz;
+
+   khz = vmi_timer_ops.get_cycle_frequency();
+   (void)do_div(khz, 1000);
+   return khz;
+}
+
 void __init vmi_time_init(void)
 {
unsigned long long cycles_per_sec, cycles_per_msec;
@@ -206,7 +215,6 @@ void __init vmi_time_init(void)
(void)do_div(cycles_per_alarm, alarm_hz);
cycles_per_msec = cycles_per_sec;
(void)do_div(cycles_per_msec, 1000);
-   cpu_khz = cycles_per_msec;
 
printk(KERN_WARNING "VMI timer cycles/sec = %llu ; cycles/jiffy = %llu ;"
   "cycles/alarm = %llu\n", cycles_per_sec, cycles_per_jiffy,
diff -r b8b315c897bb include/asm-i386/vmi_time.h
--- a/include/asm-i386/vmi_time.h   Tue Feb 27 14:04:43 2007 -0800
+++ b/include/asm-i386/vmi_time.h   Tue Feb 27 14:06:46 2007 -0800
@@ -50,6 +50,7 @@ extern unsigned long vmi_get_wallclock(v
 extern unsigned long vmi_get_wallclock(void);
 extern int vmi_set_wallclock(unsigned long now);
 extern unsigned long long vmi_get_sched_cycles(void);
+extern unsigned long vmi_cpu_khz(void);
 
 #ifdef CONFIG_X86_LOCAL_APIC
 extern void __init vmi_timer_setup_boot_alarm(void);
diff -r b8b315c897bb arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c   Tue Feb 27 14:04:43 2007 -0800
+++ b/arch/i386/kernel/paravirt.c   Tue Feb 27 14:08:59 2007 -0800
@@ -522,6 +522,7 @@ struct paravirt_ops paravirt_ops = {
.read_tsc = native_read_tsc,
.read_pmc = native_read_pmc,
.get_scheduled_cycles = native_read_tsc,
+   .get_cpu_khz = native_calculate_cpu_khz,
.load_tr_desc = native_load_tr_desc,
.set_ldt = native_set_ldt,
.load_gdt = native_load_gdt,
diff -r b8b315c897bb arch/i386/kernel/tsc.c
--- a/arch/i386/kernel/tsc.cTue Feb 27 14:04:43 2007 -0800
+++ b/arch/i386/kernel/tsc.cTue Feb 27 14:09:23 2007 -0800
@@ -117,7 +117,7 @@ unsigned long long sched_clock(void)
return cycles_2_ns(this_offset);
 }
 
-static unsigned long calculate_cpu_khz(void)
+unsigned long native_calculate_cpu_khz(void)
 {
unsigned long long start, end;
unsigned long count;
diff -r b8b315c897bb include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h   Tue Feb 27 14:04:43 2007 -0800
+++ b/include/asm-i386/paravirt.h   Tue Feb 27 14:10:25 2007 -0800
@@ -95,6 +95,7 @@ struct paravirt_ops
u64 (*read_tsc)(void);
u64 (*read_pmc)(void);
u64 (*get_scheduled_cycles)(void);
+   unsigned long (*get_cpu_khz)(void);
 
void (*load_tr_desc)(void);
void (*load_gdt)(const struct Xgt_desc_struct *);
@@ -275,6 +276,7 @@ static inline void halt(void)
 #define rdtscll(val) (val = paravirt_ops.read_tsc())
 
 #define get_scheduled_cycles(val) (val = paravirt_ops.get_scheduled_cycles())
+#define calculate_cpu_khz() (paravirt_ops.get_cpu_khz())
 
 #define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
 
diff -r b8b315c897bb include/asm-i386/timer.h
--- a/include/asm-i386/timer.h  Tue Feb 27 14:04:43 2007 -0800
+++ b/include/asm-i386/timer.h  Tue Feb 27 14:11:35 2007 -0800
@@ -7,6 +7,7 @@
 
 void setup_pit_timer(void);
 unsigned long long native_sched_clock(void);
+unsigned long native_calculate_cpu_khz(void);
 
 /* Modifiers for buggy PIT handling */
 extern int pit_latch_buggy;
@@ -17,6 +18,7 @@ extern int recalibrate_cpu_khz(void);
 
 #ifndef CONFIG_PARAVIRT
 #define get_scheduled_cycles(val) rdtscll(val)
+#define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
 #endif

[PATCH 1/9] Vmi timer fixes round two.patch

2007-03-01 Thread Zachary Amsden

Critical bugfixes for the VMI-Timer code.

1) Do not set up a one-shot alarm if we are keeping the periodic alarm
armed.  Additionally, since the periodic alarm can run at a lower rate
than HZ, fix up the guard for the no-idle-hz mode accordingly.  This
fixes the bug where the no-idle-hz mode might have a higher interrupt
rate than the non-idle case.

2) The interrupt handler can no longer adjust xtime due to nested lock
acquisition.  Drop this.  We don't need to check the wallclock time at
every tick; it can be done in userspace instead.

3) Add a bypass to disable noidle operation.  This is useful as a last
minute workaround, or testing measure.

4) The code to skip the IO_APIC timer testing (no_timer_check) should be
conditional on IO_APIC, not SMP, since UP kernels can have this configured
in as well.
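
The guard change in item 1 can be sketched outside the kernel as follows.
This is a simplified model, not the code in the patch: tick_before_eq() is
written in the style of the kernel's time_before_eq(), can_stop_tick() is a
hypothetical name, and hz / alarm_hz are plain parameters standing in for HZ
and CONFIG_VMI_ALARM_HZ.  The RCU and softirq checks that the real guard
also makes are left out.

```c
#include <stdbool.h>

/* Wraparound-safe "a is before or equal to b" for free-running tick
 * counters, in the style of the kernel's time_before_eq(). */
static bool tick_before_eq(unsigned long a, unsigned long b)
{
	return (long)(b - a) >= 0;
}

/* Returns true when stopping the periodic tick is worthwhile: the next
 * timer must lie further away than one periodic-alarm interval.  If it
 * does not, the periodic alarm would fire at least as soon anyway, and
 * arming a one-shot alarm would only add interrupts. */
static bool can_stop_tick(unsigned long next_timer, unsigned long now,
			  unsigned long hz, unsigned long alarm_hz)
{
	unsigned long alarm_period = hz / alarm_hz; /* jiffies per alarm */

	return !tick_before_eq(next_timer, now + alarm_period);
}
```

With hz = 1000 and alarm_hz = 100, a timer due 5 jiffies from now keeps the
periodic alarm running, while one due 11 jiffies from now allows the tick to
be stopped.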

Signed-off-by: Dan Hecht <[EMAIL PROTECTED]>
Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r f62ebe3ba01c arch/i386/kernel/vmi.c
--- a/arch/i386/kernel/vmi.cTue Feb 27 14:01:28 2007 -0800
+++ b/arch/i386/kernel/vmi.cTue Feb 27 14:12:46 2007 -0800
@@ -54,6 +54,7 @@ static int disable_sep;
 static int disable_sep;
 static int disable_tsc;
 static int disable_mtrr;
+static int disable_noidle;
 
 /* Cached VMI operations */
 struct {
@@ -255,7 +256,6 @@ static void vmi_nop(void)
 }
 
 /* For NO_IDLE_HZ, we stop the clock when halting the kernel */
-#ifdef CONFIG_NO_IDLE_HZ
 static fastcall void vmi_safe_halt(void)
 {
int idle = vmi_stop_hz_timer();
@@ -266,7 +266,6 @@ static fastcall void vmi_safe_halt(void)
local_irq_enable();
}
 }
-#endif
 
 #ifdef CONFIG_DEBUG_PAGE_TYPE
 
@@ -742,12 +741,7 @@ static inline int __init activate_vmi(vo
 (char *)paravirt_ops.save_fl);
patch_offset(&irq_save_disable_callout[IRQ_PATCH_DISABLE],
 (char *)paravirt_ops.irq_disable);
-#ifndef CONFIG_NO_IDLE_HZ
-   para_fill(safe_halt, Halt);
-#else
-   vmi_ops.halt = vmi_get_function(VMI_CALL_Halt);
-   paravirt_ops.safe_halt = vmi_safe_halt;
-#endif
+
para_fill(wbinvd, WBINVD);
/* paravirt_ops.read_msr = vmi_rdmsr */
/* paravirt_ops.write_msr = vmi_wrmsr */
@@ -881,6 +875,12 @@ static inline int __init activate_vmi(vo
 #endif
custom_sched_clock = vmi_sched_clock;
}
+   if (!disable_noidle)
+   para_fill(safe_halt, Halt);
+   else {
+   vmi_ops.halt = vmi_get_function(VMI_CALL_Halt);
+   paravirt_ops.safe_halt = vmi_safe_halt;
+   }
 
/*
 * Alternative instruction rewriting doesn't happen soon enough
@@ -914,9 +914,11 @@ void __init vmi_init(void)
 
local_irq_save(flags);
activate_vmi();
-#ifdef CONFIG_SMP
+
+#ifdef CONFIG_X86_IO_APIC
no_timer_check = 1;
 #endif
+
local_irq_restore(flags & X86_EFLAGS_IF);
 }
 
@@ -942,7 +944,8 @@ static int __init parse_vmi(char *arg)
} else if (!strcmp(arg, "disable_mtrr")) {
clear_bit(X86_FEATURE_MTRR, boot_cpu_data.x86_capability);
disable_mtrr = 1;
-   }
+   } else if (!strcmp(arg, "disable_noidle"))
+   disable_noidle = 1;
return 0;
 }
 
diff -r f62ebe3ba01c arch/i386/kernel/vmitime.c
--- a/arch/i386/kernel/vmitime.cTue Feb 27 14:01:28 2007 -0800
+++ b/arch/i386/kernel/vmitime.cTue Feb 27 14:12:01 2007 -0800
@@ -276,15 +276,12 @@ static void vmi_account_real_cycles(unsi
 
cycles_not_accounted = cur_real_cycles - real_cycles_accounted_system;
while (cycles_not_accounted >= cycles_per_jiffy) {
-   /* systems wide jiffies and wallclock. */
+   /* systems wide jiffies. */
do_timer(1);
 
cycles_not_accounted -= cycles_per_jiffy;
real_cycles_accounted_system += cycles_per_jiffy;
}
-
-   if (vmi_timer_ops.wallclock_updated())
-   update_xtime_from_wallclock();
 
write_sequnlock(&xtime_lock);
 }
@@ -380,7 +377,6 @@ int vmi_stop_hz_timer(void)
unsigned long seq, next;
unsigned long long real_cycles_expiry;
int cpu = smp_processor_id();
-   int idle;
 
BUG_ON(!irqs_disabled());
if (sysctl_hz_timer != 0)
@@ -388,13 +384,13 @@ int vmi_stop_hz_timer(void)
 
cpu_set(cpu, nohz_cpu_mask);
smp_mb();
+
if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
-   (next = next_timer_interrupt(), time_before_eq(next, jiffies))) {
+   (next = next_timer_interrupt(), 
+time_before_eq(next, jiffies + HZ/CONFIG_VMI_ALARM_HZ))) {
cpu_clear(cpu, nohz_cpu_mask);
-   next = jiffies;
-   idle = 0;
-   } else
-   idle = 1;
+   return 0;
+   }
 
/* Convert jiffies to the real cycle counter. */
do {
@@ -404,17 +400,13 @@ int vmi_stop_hz_timer(void)
} while (read_seqretry(&xtime_lock,

[PATCH 2/9] Sched clock paravirt op fix.patch

2007-03-01 Thread Zachary Amsden

The custom_sched_clock hook is broken.  The result from sched_clock needs to be
in nanoseconds, not in CPU cycles.  The TSC is insufficient for this purpose,
because TSC is poorly defined in a virtual environment, and mostly represents
real world time instead of scheduled process time (which can be interrupted
without notice when a virtual machine is descheduled).

To make the scheduler consistent, we must expose a different notion of time,
namely scheduled time.  So deprecate this custom_sched_clock hack and turn it
into a paravirt-op, as it should have been all along.  This allows the tsc.c
code which converts cycles to nanoseconds to be shared by all paravirt-ops
backends.
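
For readers unfamiliar with the conversion being shared, here is a minimal
sketch of a fixed-point cycles-to-nanoseconds conversion of the kind tsc.c
performs in cycles_2_ns().  The names make_cyc2ns_scale(), cycles_to_ns()
and SCALE_SHIFT, and the shift value of 10, are assumptions made for this
illustration; the kernel's actual helper may differ in detail.

```c
#include <stdint.h>

#define SCALE_SHIFT 10 /* assumed fixed-point shift for this sketch */

/* Precompute nanoseconds-per-cycle as a fixed-point factor:
 * (10^6 << SCALE_SHIFT) / kHz.  cpu_khz must be non-zero. */
static uint64_t make_cyc2ns_scale(unsigned long cpu_khz)
{
	return ((uint64_t)1000000 << SCALE_SHIFT) / cpu_khz;
}

/* Convert a cycle count to nanoseconds with one multiply and one shift.
 * The product can overflow 64 bits for very large cycle counts; this
 * sketch does not guard against that. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t scale)
{
	return (cycles * scale) >> SCALE_SHIFT;
}
```

Because only the source of the cycle count changes (raw TSC versus scheduled
cycles), a backend needs to supply just the counter and the frequency; the
scaling arithmetic stays common.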

It is unfortunate to add a new paravirt-op, but this is a very distinct
abstraction which is clearly different for all virtual machine implementations,
and it gets rid of an ugly indirect function which I ashamedly admit I hacked
in to try to get this to work earlier, and then even got in the wrong units.

Please apply.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r d58e6ddfdfa9 arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c   Thu Feb 15 23:52:41 2007 -0800
+++ b/arch/i386/kernel/paravirt.c   Fri Feb 16 00:04:39 2007 -0800
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* nop stub */
 static void native_nop(void)
@@ -520,6 +521,7 @@ struct paravirt_ops paravirt_ops = {
.write_msr = native_write_msr,
.read_tsc = native_read_tsc,
.read_pmc = native_read_pmc,
+   .get_scheduled_cycles = native_read_tsc,
.load_tr_desc = native_load_tr_desc,
.set_ldt = native_set_ldt,
.load_gdt = native_load_gdt,
diff -r d58e6ddfdfa9 arch/i386/kernel/tsc.c
--- a/arch/i386/kernel/tsc.cThu Feb 15 23:52:41 2007 -0800
+++ b/arch/i386/kernel/tsc.cFri Feb 16 00:06:34 2007 -0800
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mach_timer.h"
 
@@ -108,9 +109,6 @@ unsigned long long sched_clock(void)
 {
unsigned long long this_offset;
 
-   if (unlikely(custom_sched_clock))
-   return (*custom_sched_clock)();
-
/*
 * Fall back to jiffies if there's no TSC available:
 */
@@ -119,7 +117,7 @@ unsigned long long sched_clock(void)
return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
 
/* read the Time Stamp Counter: */
-   rdtscll(this_offset);
+   get_scheduled_cycles(this_offset);
 
/* return the value in ns */
return cycles_2_ns(this_offset);
diff -r d58e6ddfdfa9 arch/i386/kernel/vmi.c
--- a/arch/i386/kernel/vmi.cThu Feb 15 23:52:41 2007 -0800
+++ b/arch/i386/kernel/vmi.cFri Feb 16 00:02:48 2007 -0800
@@ -873,7 +873,7 @@ static inline int __init activate_vmi(vo
paravirt_ops.setup_boot_clock = vmi_timer_setup_boot_alarm;
paravirt_ops.setup_secondary_clock = vmi_timer_setup_secondary_alarm;
 #endif
-   custom_sched_clock = vmi_sched_clock;
+   paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles;
}
if (!disable_noidle)
para_fill(safe_halt, Halt);
diff -r d58e6ddfdfa9 arch/i386/kernel/vmitime.c
--- a/arch/i386/kernel/vmitime.cThu Feb 15 23:52:41 2007 -0800
+++ b/arch/i386/kernel/vmitime.cFri Feb 16 00:02:48 2007 -0800
@@ -172,7 +172,7 @@ int vmi_set_wallclock(unsigned long now)
return -1;
 }
 
-unsigned long long vmi_sched_clock(void)
+unsigned long long vmi_get_sched_cycles(void)
 {
return read_available_cycles();
 }
diff -r d58e6ddfdfa9 include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h   Thu Feb 15 23:52:41 2007 -0800
+++ b/include/asm-i386/paravirt.h   Fri Feb 16 00:07:22 2007 -0800
@@ -94,6 +94,7 @@ struct paravirt_ops
 
u64 (*read_tsc)(void);
u64 (*read_pmc)(void);
+   u64 (*get_scheduled_cycles)(void);
 
void (*load_tr_desc)(void);
void (*load_gdt)(const struct Xgt_desc_struct *);
@@ -273,6 +274,8 @@ static inline void halt(void)
 
 #define rdtscll(val) (val = paravirt_ops.read_tsc())
 
+#define get_scheduled_cycles(val) (val = paravirt_ops.get_scheduled_cycles())
+
 #define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
 
 #define rdpmc(counter,low,high) do {   \
diff -r d58e6ddfdfa9 include/asm-i386/time.h
--- a/include/asm-i386/time.h   Thu Feb 15 23:52:41 2007 -0800
+++ b/include/asm-i386/time.h   Fri Feb 16 00:02:48 2007 -0800
@@ -30,7 +30,6 @@ static inline int native_set_wallclock(u
 
 #ifdef CONFIG_PARAVIRT
 #include 
-extern unsigned long long native_sched_clock(void);
 #else /* !CONFIG_PARAVIRT */
 
 #define get_wallclock() native_get_wallclock()
diff -r d58e6ddfdfa9 include/asm-i386/timer.h
--- a/include/asm-i386/timer.h  Thu Feb 15 23:52:41 2007 -0800
+++ b/include/asm-i386/timer.h  Fri Feb 16 00:05:13 2007 -0800
@@ -4,13 +4,19 @@
 #include 
 
 #define TICK_SIZE (tick_nsec / 1000)
+
 void setup_pit_timer(v

[PATCH 0/9] Bugfix patches for i386/vmi/paravirt-ops

2007-03-01 Thread Zachary Amsden

Andi, Linus, we have some critical bugfixes for the VMI paravirt-ops code.
Please apply.  If there are objections to certain pieces, they can be
reworked, but they are pretty much all needed for correctness.  We are
hoping to get these into the next 2.6.21-rc release.

We had quite a few difficulties debugging after the integration of the
hrtimers code, which is why this took so long.  Andrew, I've added you to
the list in case any further hrtimers integration issues pop up.

Thanks,

Zach

Is the clockevent resolution fine-grained enough?

2007-03-01 Thread Marko Rauhamaa


It would appear the new clockevent API has a one-nanosecond resolution.
It certainly looks sufficiently fine-grained, but I'm afraid it's too
coarse for some applications.

In our application, we need periodic clock interrupts at about 100 kHz.
If the (programmable) period must be rounded to the nearest nanosecond,
each interrupt can be off by up to 0.5 ns, giving a cumulative error of

   100,000 interrupts/s * 0.5 ns = 50 µs/s

We need to maintain the cumulative error within, say, 1 ms/day, or
about 11.6 ns/s. (The error is not measured against real time, but between
different parts of our hardware that are run off of the same clock.)

For our needs, we have built our own "clockevent" system that has a
nominal one-femtosecond precision. The nanosecond resolution would be
sufficient if there were a way to "nudge" the next interrupt by a
nanosecond from the interrupt handler.
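
To make the "nudge" idea concrete, here is a minimal sketch of a fractional
accumulator.  It is not taken from our system or from the clockevent API;
the struct and function names are hypothetical.  The desired period is kept
in femtoseconds, each interval is programmed in whole nanoseconds, and the
sub-nanosecond remainder is carried into the next interval, so the rounding
error stays below 1 ns in total instead of growing by up to 0.5 ns per
interrupt.

```c
#include <stdint.h>

#define FS_PER_NS 1000000ULL /* femtoseconds per nanosecond */

struct frac_timer {
	uint64_t period_fs;  /* desired period in femtoseconds */
	uint64_t residue_fs; /* sub-nanosecond remainder carried forward */
};

/* Returns the next interval in whole nanoseconds.  Whenever the carried
 * remainder reaches a full nanosecond, that interval is one nanosecond
 * longer: this is the "nudge". */
static uint64_t next_interval_ns(struct frac_timer *t)
{
	uint64_t total_fs = t->period_fs + t->residue_fs;

	t->residue_fs = total_fs % FS_PER_NS;
	return total_fs / FS_PER_NS;
}
```

With a desired period of 10000.5 ns, successive calls return 10000, 10001,
10000, 10001, ... so that any two consecutive intervals sum to exactly
20001 ns.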


Marko

-- 
Marko Rauhamaa  mailto:[EMAIL PROTECTED] http://pacujo.net/marko/

Re: 2.6.21-rc2 radeon backlight

2007-03-01 Thread Andrew Morton

On Wed, 28 Feb 2007 08:32:43 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> the backlight on my thinkpad still (2.6.20 worked fine) doesn't come
> on if i have the radeon backlight enabled. without it, i guess it's
> the ibm acpi modules that controls the backlight and it seems to work
> fine.
> 

Unclear.  Are you saying that the backlight comes on OK if you use the IBM
acpi module?

