Kernel traces coming back with trash/clutter

2007-04-24 Thread Mark Hull-Richter

I am experimenting with the kernel (CentOSv4.4 x86_64, 2.6.9-42.0.10)
and I have added a number of traces in some relatively sensitive code
in the page cache and some i/o functions.

I am getting this odd content in the trace log (dmesg), and I cannot
figure out what it is or why it is there.

4296757675 pdflush(80): do_writepages: mapopswrtpgs a0195ff5
4296757675 pdflush(80): mpage_writepages w/b index 49728 pages 256000
7
7
7
7
7__bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max
802525d8 generic_make_request(bio 01017c745300) 50729472, 704
__make_request(q 0101b9293870, bio 01017c745300: sdc; 50729600, 704)
ll_new_hw_segment: 70 + 29  88
7
7
7
7
__bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max
802525d8 generic_make_request(bio 01017c745a80) 50730176, 704
__make_request(q 0101b9293870, bio 01017c745a80: sdc; 50730304, 704)
4296757684 swapper(0): dl_mv2dsp: sdc start 50710368 secs 1408

(The lines with the 7s in them are long - I wrapped them for ease of
reading and to keep the width down somewhat.)

Any feedback that might illuminate this would be welcome.  Please CC
me personally as I am not yet able to subscribe to this list
(apologies).

Thanks.

--
Mark Hull-Richter, Linux Kernel Engineer
DATAllegro (www.datallegro.com)
85 Enterprise, Second Floor, Aliso Viejo, CA  92656
949-680-3082 - Office 949-330-7691 - fax
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)

2007-04-24 Thread Linus Torvalds


On Tue, 24 Apr 2007, Pavel Machek wrote:
> > 
> > If the code just moved somewhere else, it's not less code.
> 
> It is not just moved. It is in userspace, where we can use liblzf /
> gcrypt / ( and vbetool for s2ram/s2both) as libraries. We have about
> 7000 LoC of userland code (that is not libraries).

If it's in user land, we also have 

 - communication difficulties between two parts, and all the *crap* that 
   tends to entail (ie legacy interfaces forever, and upgrading one 
   without the other etc)

 - people who work on the kernel part are working blind (ie they are at 
   the mercy of whatever userland does, and it's not a contained 
   subsystem). This just ends up becoming worse when you then interact 
   with ten different versions of the user-land stuff, thanks to small 
   tweaks by five different vendors, and a hundred random people.

And don't tell me that doesn't happen. Maybe it doesn't happen _now_, 
because people who use it all get the patches from one place, but the 
moment we start talking about integration into the standard kernel, that 
means that the kernel needs to work regardless of whether somebody uses 
SuSE, RH, Fedora, Ubuntu or cooked his own distro entirely using some 
development version of the suspend user-space tools.

This is why I don't believe in the whole kernel-line-counting thing. I'm 
personally 100% convinced that it's better to have ten times as many lines 
in the kernel, if it means that you can just forget about version skew and 
bad user-space interfaces etc.

So if you want to enumerate good points, you'd damn well also face the 
_problems_.

This is why there's a lot to be said for

	echo mem > /sys/power/state

and being able to follow the path through _one_ object (the kernel) over 
trying to figure out the interaction between many different parts with 
different versions.

> I believe uswsusp user/kernel separation is clean enough. Kernel
> provides snapshot image and resume image. (Thanks go to Rafael for
> very clean interface).

Now, *that* is the kind of argument that matters.

Quite frankly, if you want to convince me, it's not by lines of kernel 
code, but by talking about easy-to-understand interfaces that actually 
do one thing and do it well (and by one thing, I mean one _whole_ 
thing). Because I care a lot less about lines of code than about 
maintainable interfaces that people can think about and debug.

I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the 
whole thing. I think they've _all_ caused problems for the true suspend 
(suspend-to-ram), and the last thing I want to see is three or four 
different suspend-to-disk implementations.  So unlike Ingo, I don't think 
"let's just integrate them all side-by-side and maintain them and look who 
wins" is really a good idea.

How many different magic ioctl's does the thing introduce? Is it really 
just *two* entry-points (and how simple are they, interface-wise), and 
nothing else?

Linus


Re: old ISA DMA bug in 2.6.12?

2007-04-24 Thread Robert Hancock

Bob Tracy wrote:

I was enjoying yet another session of beating my head against the wall
trying to do useful things with old hardware :-), and managed to cause a
kernel panic by simply trying to mount a cdrom in the context of a DSL-N
installation.

The SCSI host adapter is an Adaptec AHA-1542B, and when I try to mount a
cdrom, I manage to run afoul of the BAD_DMA() check in aha1542.c: the
buffer returned is not in the lower 16 MB of memory.

The same 2.6.12 kernel + hardware combination works fine as long as I
confine my I/O to the hard disk that's also attached to the AHA-1542B.


Looks like the aha1542 driver doesn't set the DMA mask, so the kernel 
will default to thinking it can do 32-bit DMA when it should be 24-bit.
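
The arithmetic behind the 16 MB limit can be sketched in plain userspace C.
This is only an illustration of what a 24-bit versus 32-bit DMA mask means,
not the aha1542 driver's code; dma_mask_for_bits() and dma_buffer_fits() are
hypothetical helper names invented for the example.

```c
#include <stdint.h>

/* Mask with the low `bits` address bits set, e.g. 24 -> 0x00FFFFFF. */
static uint64_t dma_mask_for_bits(unsigned bits)
{
    return bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* Nonzero if the whole buffer [addr, addr + len) is reachable under `mask`. */
static int dma_buffer_fits(uint64_t addr, uint64_t len, uint64_t mask)
{
    if (len == 0)
        return addr <= mask;
    /* Overflow-safe form of: addr + len - 1 <= mask */
    return addr <= mask && len - 1 <= mask - addr;
}
```

With a 24-bit mask the highest reachable byte is 0xFFFFFF, just below 16 MB,
so a buffer starting at 0x01000000 fails the check, while the same buffer
passes under a 32-bit mask. That matches the symptom described above: the
buffer handed to the driver can land above what the ISA card can address.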


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: Kernel traces coming back with trash/clutter

2007-04-24 Thread John Anthony Kazos Jr.
 I am experimenting with the kernel (CentOSv4.4 x86_64, 2.6.9-42.0.10)
 and I have added a number of traces in some relatively sensitive code
 in the page cache and some i/o functions.
 
 I am getting this odd content in the trace log (dmesg), and I cannot
 figure out what it is or why it is there.
 
 4296757675 pdflush(80): do_writepages: mapopswrtpgs a0195ff5
 4296757675 pdflush(80): mpage_writepages w/b index 49728 pages 256000
 7
 7
 7
 7
 7__bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max
 802525d8 generic_make_request(bio 01017c745300) 50729472, 704
 __make_request(q 0101b9293870, bio 01017c745300: sdc; 50729600, 704)
 ll_new_hw_segment: 70 + 29  88
 7
 7
 7
 7
 __bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max
 802525d8 generic_make_request(bio 01017c745a80) 50730176, 704
 __make_request(q 0101b9293870, bio 01017c745a80: sdc; 50730304, 704)
 4296757684 swapper(0): dl_mv2dsp: sdc start 50710368 secs 1408
 
 (The lines with the 7s in them are long - I wrapped them for ease of
 reading and to keep the width down somewhat.)
 
 Any feedback that might illuminate this would be welcome.  Please CC
 me personally as I am not yet able to subscribe to this list
 (apologies).

7 is KERN_DEBUG in include/linux/kernel.h, used with printk. Are you 
using printk in the following forms?

printk(KERN_DEBUG "A debug message.\n");

...or...

const char msg_debug[] = KERN_DEBUG "A debug message.\n";
printk(msg_debug);

Perhaps you have something looping that's outputting KERN_DEBUG with a 
null message? Or one of your diagnostic printk statements includes 
KERN_DEBUG with no actual message?

Remember, if you have a string in a variable without a KERN_* 
prefix, you can do this:

printk(KERN_DEBUG "%s\n", debug_message);
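
A userspace sketch of one way a stray level marker could end up visible in
the log. In 2.6-era kernels KERN_DEBUG is the string literal "<7>", glued to
the format string by ordinary string-literal concatenation. The helper below,
strip_line_prefixes(), is hypothetical and only approximates what a log
reader does: it drops a "<N>" marker at the start of a line and leaves one
found anywhere else as plain text. Whether this is what happened in the
reported trace is an assumption, not something the thread confirms.

```c
#include <stddef.h>

/* In 2.6-era kernels KERN_DEBUG is the string literal "<7>". */
#define KERN_DEBUG "<7>"

/*
 * Hypothetical helper: copy `log` into `out`, dropping a "<N>" level
 * prefix only when it sits at the start of a line. A prefix in the
 * middle of a line is copied through as ordinary text.
 */
static void strip_line_prefixes(const char *log, char *out, size_t outsz)
{
    size_t n = 0;
    int at_line_start = 1;

    if (outsz == 0)
        return;
    while (*log && n + 1 < outsz) {
        if (at_line_start && log[0] == '<' &&
            log[1] >= '0' && log[1] <= '7' && log[2] == '>') {
            log += 3;              /* skip the level marker */
            at_line_start = 0;
            continue;
        }
        at_line_start = (*log == '\n');
        out[n++] = *log++;
    }
    out[n] = '\0';
}
```

Feeding it KERN_DEBUG "part one " KERN_DEBUG "part two\n" yields
"part one <7>part two\n": the second marker survives because it does not
start a line. A printk whose message lacks a trailing "\n", followed by
another KERN_DEBUG printk, would produce the same pattern.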


Re: [PATCH] use mutex instead of semaphore in RocketPort driver

2007-04-24 Thread Robert Hancock

Matthias Kaehlcke wrote:

On Tue, Apr 24, 2007 at 07:53:04PM +0200, Oliver Neukum wrote:


On Tuesday, 24 April 2007 19:49, Matthias Kaehlcke wrote:

@@ -1706,7 +1706,7 @@ static int rp_write(struct tty_struct *tty,
 	if (count <= 0 || rocket_paranoia_check(info, "rp_write"))
 		return 0;
 
-	down_interruptible(&info->write_sem);
+	mutex_lock_interruptible(&info->write_mtx);

This is a bug. It is also present in the current code, but nevertheless
it is a bug. If you use an interruptible lock, you must be ready to deal
with interrupts, which are ignored by this code.


I fear I don't have the experience/knowledge to fix this bug, thanks
for your remark.


I'm a bit confused now about the interruptible locks. I thought using
them means that the process will be woken up when receiving a
signal. What role do interrupts play when using interruptible locks?


You are correct, interrupts aren't involved. However if the wait is 
interrupted by a signal, mutex_lock_interruptible will return a nonzero 
return code which needs to be checked for (and likely -ERESTARTSYS or 
-EINTR returned), otherwise the code will blindly continue as though it 
has locked the mutex even though it has not.
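
The pattern can be sketched in userspace C. Everything named fake_* below is
a hypothetical stand-in that only mimics the return-value contract described
above (0 on success, nonzero if a signal interrupted the wait); it is not the
kernel mutex API, and write_with_lock() is an invented function in the style
of rp_write(), not the driver's actual fix.

```c
#include <errno.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512   /* kernel-internal errno, absent from userspace headers */
#endif

/* Hypothetical stand-in for a kernel mutex; not the real struct mutex. */
struct fake_mutex {
    int locked;
    int signal_pending;   /* simulates a signal arriving during the wait */
};

/* Mimics the contract: 0 on success, -EINTR if interrupted by a signal. */
static int fake_mutex_lock_interruptible(struct fake_mutex *m)
{
    if (m->signal_pending)
        return -EINTR;
    m->locked = 1;
    return 0;
}

static void fake_mutex_unlock(struct fake_mutex *m)
{
    m->locked = 0;
}

/* Write path that bails out instead of running unlocked. */
static int write_with_lock(struct fake_mutex *m, int count)
{
    if (count <= 0)
        return 0;
    if (fake_mutex_lock_interruptible(m))
        return -ERESTARTSYS;   /* lock NOT held: do not touch shared state */
    /* ... critical section, only reached while holding the lock ... */
    fake_mutex_unlock(m);
    return count;
}
```

The point is the early return: without it, the code after the lock call
would run whether or not the lock was actually taken.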


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Peter Williams

Rogan Dawes wrote:

Chris Friesen wrote:

Rogan Dawes wrote:

I guess my point was if we somehow get to an odd number of 
nanoseconds, we'd end up with rounding errors. I'm not sure if your 
algorithm will ever allow that.


And Ingo's point was that when it takes thousands of nanoseconds for a 
single context switch, an error of half a nanosecond is down in the 
noise.


Chris


My concern was that since Ingo said that this is a closed economy, with 
a fixed sum/total, if we lose a nanosecond here and there, eventually 
we'll lose them all.


Some folks have uptimes of multiple years.

Of course, I could (very likely!) be full of it! ;-)


And won't be using any new scheduler on these computers anyhow as 
that would involve bringing the system down to install the new kernel. :-)
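
For what it's worth, the "lost nanosecond" worry can be made concrete with a
small sketch. This is not how CFS accounts time; split_ns() is a hypothetical
helper showing only the general point that an integer split keeps a fixed
total intact as long as one side is computed by subtraction rather than by a
second division.

```c
#include <stdint.h>

struct ns_split {
    uint64_t first;
    uint64_t second;
};

/* Split total_ns into two shares that always add back up to total_ns. */
static struct ns_split split_ns(uint64_t total_ns)
{
    struct ns_split s;

    s.first = total_ns / 2;           /* rounds down for odd totals */
    s.second = total_ns - s.first;    /* takes the leftover nanosecond */
    return s;
}
```

Computing both halves as total_ns / 2 would drop one nanosecond on every odd
total; the subtraction form never does, however long the uptime.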


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce


Re: Kernel traces coming back with trash/clutter

2007-04-24 Thread Mark Hull-Richter

On 4/24/07, John Anthony Kazos Jr. [EMAIL PROTECTED] wrote:


 I am getting this odd content in the trace log (dmesg), and I cannot
 figure out what it is or why it is there.

 7
 7
 7
 7
 7__bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max
 802525d8 generic_make_request(bio 01017c745300) 50729472, 704

7 is KERN_DEBUG in include/linux/kernel.h, used with printk. Are you
using printk in the following forms?

printk(KERN_DEBUG "A debug message.\n");


Yes, exclusively.


Perhaps you have something looping that's outputting KERN_DEBUG with a
null message? Or one of your diagnostic printk statements includes
KERN_DEBUG with no actual message?


No, they are all KERN_DEBUG<space>"some string here", almost all with
some formatted output as well.  Could I be overloading the printk
output buffer, as in possibly too tightly repeated/looped code to be
able to output it all?


Remember, if you have a string in a variable without a KERN_*
prefix, you can do this:

printk(KERN_DEBUG "%s\n", debug_message);


Haven't tried that one - they're all of the form above.

Thanks again.

--
Mark Hull-Richter, Linux Kernel Engineer
DATAllegro (www.datallegro.com)
85 Enterprise, Second Floor, Aliso Viejo, CA  92656
949-680-3082 - Office 949-330-7691 - fax
[This message is NOT SPAM and is sent in strict accordance with
Google, Yahoo, AOL, Netscape and Earthlink Terms of Service.  If you
are NOT receiving this through a group and do not want any more emails
from me, please reply to me and let me know.  If you are receiving
this second-hand, this sender disclaims all responsibility for your
response.]


Re: [RFC] [PATCH] cpufreq: allow full selection of default governors

2007-04-24 Thread Dave Jones
On Tue, Apr 24, 2007 at 03:05:36PM -0700, Nish Aravamudan wrote:
  On 4/24/07, Dave Jones [EMAIL PROTECTED] wrote:
   On Tue, Apr 24, 2007 at 09:03:23PM +0000, William Heimbigner wrote:
    The following patches should allow selection of conservative,
    powersave, and ondemand in the kernel configuration.
  
   This has been rejected several times already.
   Ondemand and conservative aren't viable governors for all cpufreq
   implementations (ie, ones with high switching latencies).
  
  This piques my curiosity -- some governors don't work with some
  cpufreq implementations. Are those implementations in the kernel or in
  userspace? If in the kernel, then perhaps there should be some
  dependency expressed there in Kconfig between cpufreq implementation
  and the available governors

it can't be solved that easily. powernow-k8 for example is fine to
use with ondemand on newer systems, where the latency is low.
On older models however, it isn't.

   Also, see the
   comment in the Kconfig a few lines above where you are adding this.
  
  Are these governors unfixable? If

tbh, I've forgotten the original issues that caused the comment
to be placed there. Dominik ?

  Just looking for more info -- feel free to just point me at the archives.

cpufreq-list archives are at http://lists.linux.org.uk/mailman/listinfo/cpufreq
(though only available to list members)

Dave

-- 
http://www.codemonkey.org.uk


Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-24 Thread Karel Zak
On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote:
 The following extra security measures are taken for unprivileged
 mounts:
 
  - usermounts are limited by a sysctl tunable
  - force nosuid,nodev mount options on the created mount

 The original userspace "user=" solution also implies the "noexec"
 option by default (you can override the default with the "exec" option).
 
 It means the kernel based solution is not fully compatible ;-(

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 
 Red Hat Czech s.r.o.
 Purkynova 99/71, 612 45 Brno, Czech Republic
 Reg.id: CZ27690016


Re: Question about Reiser4

2007-04-24 Thread lkml777
On Sun, 22 Apr 2007 19:00:46 -0700, Eric Hopper
 [EMAIL PROTECTED] said:

 I did.  That whole thread is some guy spouting off a ludicrous Bonnie++
 benchmark showing that compressing long strings of 0s results in things
 taking up very little space and being very fast.

I think you are deliberately being stupid here.

You are claiming that REISER4's good speed results when using
compression actually have a simple explanation and THEREFORE all good
results for the filesystem, even those results that have nothing to do
with compression, are negated.

NOTHING COULD BE FURTHER FROM THE TRUTH.

Your conclusion is a total travesty of logic.

As I understand it, the default Reiser4 DOES NOT USE any compression at
all, not even tail compression, but saves space by eliminating block
alignment wastage (tail compression is an option).

So let's LOSE the statistics that involve compression. The results now
look like this:

.-------------------------.
| FILESYSTEM | TIME |DISK |
| TYPE       |(secs)|USAGE|
.-------------------------.
|REISER4     | 3462 | 692 |
|EXT2        | 4092 | 816 |
|JFS         | 4225 | 806 |
|EXT4        | 4408 | 816 |
|EXT3        | 4421 | 816 |
|XFS         | 4625 | 779 |
|REISER3     | 6178 | 793 |
|FAT32       |12342 | 988 |
|NTFS-3g     |10414 | 772 |
.-------------------------.

These results are still EXTREMELY GOOD for REISER4.

These results still say that Reiser4 is a truly remarkable filesystem,
as stated in:

http://linuxhelp.150m.com/resources/fs-benchmarks.htm
http://m.domaindlx.com/LinuxHelp/fs-benchmarks.htm

So why do I see an anti-Reiser religion in all that you people say?

You concentrate on the fact that bonnie++'s use of files that are
mainly zeroes will make the results using compression less good than
they are.

I can't see anywhere where this has been denied.

In fact the other set of statistics that you just ignore, states that in
more realistic situations, the compression speedup is slightly negative.

What is wrong here is:

1. You say that the Bonnie++ tests using compression are subject to
   interpretation. No argument here.
2. You ignore the tests that confirm your statement. You are clearly not
   interested in the actual results or their interpretation.
3. You, by some incredibly twisted logic, then state that Reiser4 is
   therefore not good, even though it is clearly the best filesystem when
   NOT using compression.

This of course is completely deceitful logic.

That the speed advantage from compression would be small is clear from
the OTHER data that you ignore, namely:

.-------------------------------------------------.
|File         |Disk |Copy |Copy |Tar  |Unzip| Del |
|System       |Usage|655MB|655MB|Gzip |UnTar| 2.5 |
|Type         | (MB)| (1) | (2) |655MB|655MB| Gig |
.-------------------------------------------------.
|REISER4 lzo  | 278 | 138 |  56 |  80 |  34 |  84 |
|REISER4 gzip | 213 | 148 |  68 |  83 |  48 |  70 |
|REISER4      | 692 | 148 |  55 |  67 |  25 |  56 |
|EXT4         | 816 | 174 |  70 |  74 |  42 |  50 |
.-------------------------------------------------.


 So, the speed increase with compression (on very compressible kernel sources) 
 is slightly negative,
 
 but the speed is still comparable to that of EXT4.
 
  On Sun, 22 Apr 2007 19:00:46 -0700, Eric Hopper
  [EMAIL PROTECTED] said:
 
   I know that this whole effort has been put in disarray by the
   prosecution of Hans Reiser, but I'm curious as to its status. Is
   Reiser4 going to be going into the Linus kernel anytime soon? Is there
   somewhere I should be looking to find this out without wasting bandwidth
   here?
 
  There was a thread the other day, that talked about Reiser4.
 
  It took a while but I have found it (actually two)
 
  http://lkml.org/lkml/2007/4/5/360
  http://lkml.org/lkml/2007/4/9/4
 
  You may want to check them out.
 
 I did.  That whole thread is some guy spouting off a ludicrous Bonnie++
 benchmark showing that compressing long strings of 0s results in things
 taking up very little space and being very fast.
 
 Such things will produce lots of flames and no useful information
 whatsoever as is evinced by the half conspiracy theory, half truth the
 thread degenerated into in the second message you linked to.
 
-- 
  
  [EMAIL PROTECTED]

-- 
http://www.fastmail.fm - mmm... Fastmail...



Re: regression with gammu on 2.6.21-rc7

2007-04-24 Thread Greg KH
On Mon, Apr 23, 2007 at 10:10:22AM +0200, Wolfgang Erig wrote:
 Hello Greg,

Please don't take me out of the cc:, otherwise I might miss this (as I
did...)

 On Sun, Apr 22, 2007 at 10:47:17PM -0700, Greg KH wrote:
  On Fri, Apr 20, 2007 at 10:58:53AM +0200, Wolfgang Erig wrote:
   Hello,
   
   I have a regression with 2.6.21-rc7-g80d74d51.
   The utility gammu to talk to my mobile does not work anymore.
   With 2.6.20 gammu runs fine.
   
   Distribution is the latest Debian/testing
   
   Wolfgang
   
   $ gammu --backup backup
   Press Ctrl+C to break...
   I/O possible
 the problem is here because gammu stops working.
 Maybe a problem in gammu, but with 2.6.20 gammu works fine.
   $ uname -a
   Linux max 2.6.21-rc7-g80d74d51 #9 SMP Wed Apr 18 21:41:41 CEST 2007 i686 
   GNU/Linux
   $ tail messages 
   Apr 20 08:04:36 max kernel: ACPI: PCI Interrupt :00:1b.0[A] - GSI 16 
   (level, low) - IRQ 16
   Apr 20 08:04:36 max kernel: extern: link up, 100Mbps, full-duplex, lpa 
   0x45E1
   Apr 20 08:04:36 max kernel: intern:  setting half-duplex.
   Apr 20 08:09:02 max kernel: usb 2-2: USB disconnect, address 3
   Apr 20 08:09:02 max kernel: pl2303 ttyUSB0: pl2303 converter now 
   disconnected from ttyUSB0
   Apr 20 08:09:02 max kernel: pl2303 2-2:1.0: device disconnected
   Apr 20 08:10:24 max kernel: usb 2-2: new full speed USB device using 
   uhci_hcd and address 4
   Apr 20 08:10:25 max kernel: usb 2-2: configuration #1 chosen from 1 choice
   Apr 20 08:10:25 max kernel: pl2303 2-2:1.0: pl2303 converter detected
   Apr 20 08:10:25 max kernel: usb 2-2: pl2303 converter now attached to 
   ttyUSB0
  
  That looks ok, I'm guessing you yanked it out and then back in?
 Yes.
 This is included only to see which device is connected.
  Or is the problem that the device was removed?
 No, no problem with removal.
 I see no hint for a problem in the usb-layer.

I don't see any problems here.

If you enable debugging in the pl2303 driver, do you get any errors?
You can do this by:
	modprobe pl2303 debug=1
or if the module is built in or already loaded:
	echo 1 > /sys/module/pl2303/parameters/debug


Also, if you know how to use git, doing a 'git bisect' to try to track
down the problem commit would be very helpful.

thanks,

greg k-h


Re: [1/3] 2.6.21-rc7: known regressions (v2)

2007-04-24 Thread Greg KH
On Tue, Apr 24, 2007 at 11:32:53AM +0200, Wolfgang Erig wrote:
 On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote:
  On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote:
   This email lists some known regressions in Linus' tree compared to 2.6.20.
   
   If you find your name in the Cc header, you are either submitter of one
   of the bugs, maintainer of an affected subsystem or driver, a patch
   of yours caused a breakage or I'm considering you in any other way
   possibly involved with one or more of these issues.
   
   Due to the huge amount of recipients, please trim the Cc when answering.
   
   
   Subject: gammu no longer works
   References : http://lkml.org/lkml/2007/4/20/84
   Submitter  : Wolfgang Erig [EMAIL PROTECTED]
   Status : unknown
  
  I've asked for more information about this, and so far am not sure it's
  a real problem.
 
 It is a real problem for me.
 I tried this on 2 different boxes with the same behaviour.
 No sync between my Nokia mobile and Linux with the latest kernel :(

Sorry, I didn't see your response, have followed up on lkml now.

thanks,

greg k-h


ChunkFS - measuring cross-chunk references

2007-04-24 Thread Karuna sagar K

On 4/24/07, Theodore Tso [EMAIL PROTECTED] wrote:

On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:

.

It would also be good to distinguish between directories referencing
files in another chunk, and directories referencing subdirectories in
another chunk (which would be simpler to handle, given the topological
restrictions on directories, as compared to files and hard links).



Modified the tool to distinguish between
1. cross references between directories and files
2. cross references between directories and sub directories
3. cross references within a file (due to huge file size)

Below is the result from / partition of ext3 file system:

Number of files = 221794
Number of directories = 24457
Total size = 8193116 KB
Total data stored = 7187392 KB
Size of block groups = 131072 KB
Number of inodes per block group = 16288
No. of cross references between directories and sub-directories = 7791
No. of cross references between directories and file = 657
Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)

Thanks for the suggestions.


There may also be special things we will need to do to handle
scenarios such as BackupPC, where if it looks like a directory
contains a huge number of hard links to a particular chunk, we'll need
to make sure that directory is either created in the right chunk
(possibly with hints from the application) or migrated to the right
chunk (but this might cause the inode number of the directory to
change --- maybe we allow this as long as the directory has never been
stat'ed, so that the inode number has never been observed).

The other thing which we should consider is that chunkfs really
requires a 64-bit inode number space, which means either we only allow
it on 64-bit systems, or we need to consider a migration so that even
on 32-bit platforms, stat() functions like stat64(), insofar that it
uses a stat structure which returns a 64-bit ino_t.

   - Ted




Thanks,
Karuna


cref.tar.bz2
Description: BZip2 compressed data


Re: [1/3] 2.6.21-rc7: known regressions (v2)

2007-04-24 Thread Adrian Bunk
On Tue, Apr 24, 2007 at 05:14:28PM -0700, Greg KH wrote:
 On Tue, Apr 24, 2007 at 11:32:53AM +0200, Wolfgang Erig wrote:
  On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote:
   On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote:
This email lists some known regressions in Linus' tree compared to 
2.6.20.

If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affected subsystem or driver, a patch
of yours caused a breakage or I'm considering you in any other way
possibly involved with one or more of these issues.

Due to the huge amount of recipients, please trim the Cc when answering.


Subject: gammu no longer works
References : http://lkml.org/lkml/2007/4/20/84
Submitter  : Wolfgang Erig [EMAIL PROTECTED]
Status : unknown
   
   I've asked for more information about this, and so far am not sure it's
   a real problem.
  
  It is a real problem for me.
  I tried this on 2 different boxes with the same behaviour.
  No sync between my Nokia mobile and Linux with the latest kernel :(
 
 Sorry, I didn't see your response, have followed up on lkml now.

It turned out this was actually a bug in Gammu that will be fixed in 
the next release of Gammu.

 thanks,
 
 greg k-h

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Willy Tarreau wrote:
On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
 On Tuesday 24 April 2007, Ingo Molnar wrote:
 * David Lang [EMAIL PROTECTED] wrote:
   (Btw., to protect against such mishaps in the future i have changed
   the SysRq-N [SysRq-Nice] implementation in my tree to not only
   change real-time tasks to SCHED_OTHER, but to also renice negative
   nice levels back to 0 - this will show up in -v6. That way you'd
   only have had to hit SysRq-N to get the system out of the wedge.)
 
  if you are trying to unwedge a system it may be a good idea to renice
  all tasks to 0, it could be that a task at +19 is holding a lock that
  something else is waiting for.
 
 Yeah, that's possible too, but +19 tasks are getting a small but
 guaranteed share of the CPU so eventually it ought to release it. It's
 still a possibility, but i think i'll wait for a specific incident to
 happen first, and then react to that incident :-)
 
 Ingo

 In the instance I created, even the SysRq+b was ignored, and ISTR that's
 supposed to initiate a reboot, is it not?  So it was well and truly wedged.

On many machines I use this on, I have to release Alt while still holding B.
Don't know why, but it works like this.

Willy

Yeah, Willy, and pardon a slight bit of sarcasm here but that's how we get the 
reputation for needing virgins to sacrifice, regular experienced girls just 
wouldn't do.

This isn't APL running on an IBM 5120, so it should Just Work(TM) and not need 
a seance or something to conjure up the right spell.  Besides, the reset 
button is only about 6 feet away...  I get some exercise that way by getting 
up to push it. :)

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
It is so soon that I am done for, I wonder what I was begun for.
-- Epitaph, Cheltenham Churchyard


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Willy Tarreau wrote:
On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
 On Tuesday 24 April 2007, Ingo Molnar wrote:
 * David Lang [EMAIL PROTECTED] wrote:
   (Btw., to protect against such mishaps in the future i have changed
   the SysRq-N [SysRq-Nice] implementation in my tree to not only
   change real-time tasks to SCHED_OTHER, but to also renice negative
   nice levels back to 0 - this will show up in -v6. That way you'd
   only have had to hit SysRq-N to get the system out of the wedge.)
 
  if you are trying to unwedge a system it may be a good idea to renice
  all tasks to 0, it could be that a task at +19 is holding a lock that
  something else is waiting for.
 
 Yeah, that's possible too, but +19 tasks are getting a small but
 guaranteed share of the CPU so eventually it ought to release it. It's
 still a possibility, but i think i'll wait for a specific incident to
 happen first, and then react to that incident :-)
 
 Ingo

 In the instance I created, even the SysRq+b was ignored, and ISTR thats
 supposed to initiate a reboot is it not?  So it was well and truly wedged.

On many machines I use this on, I have to release Alt while still holding B.
Don't know why, but it works like this.

Willy

Yeah, Willy, and pardon a slight bit of sarcasm here but that's how we get the 
reputation for needing virgins to sacrifice, regular experienced girls just 
wouldn't do.

This isn't APL running on an IBM 5120, so it should Just Work(TM) and not need 
a sceance or something to conjure up the right spell.  Besides, the reset 
button is only about 6 feet away...  I get some execsize that way by getting 
up to push it. :)

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
It is so soon that I am done for, I wonder what I was begun for.
-- Epitaph, Cheltenham Churchyard


Re: [00/17] Large Blocksize Support V3

2007-04-24 Thread H. Peter Anvin
FWIW, this would also let zisofs remove the ugly hacks we currently 
employ to deal with compression blocks.


-hpa



Re: [patch 1/7] libata: check for AN support

2007-04-24 Thread Olivier Galibert
On Tue, Apr 24, 2007 at 01:53:27PM -0700, Kristen Carlson Accardi wrote:
 Check to see if an ATAPI device supports Asynchronous Notification.
 If so, enable it.
 
 changes from last version: 
 * fix typo in ata_id_has_AN and make word 76 test more clear
 * If we fail to set the AN feature, just print a warning and continue
  
 Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]
 
 @@ -299,6 +305,8 @@ struct ata_taskfile {
  #define ata_id_queue_depth(id)   (((id)[75] & 0x1f) + 1)
  #define ata_id_removeable(id)    ((id)[0] & (1 << 7))
  #define ata_id_has_dword_io(id)  ((id)[50] & (1 << 0))
 +#define ata_id_has_AN(id)\
 + (((id[76] != 0x) && (id[76] != 0x)) && ((id)[78] & (1 << 5)))

(id)[76] I guess ?  Sorry for being a pain :/

  OG.
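
The parenthesization point can be illustrated with a small user-space sketch.
The macro name id_word76 and the helper word76_at are hypothetical and are not
taken from libata; the sketch only shows why a macro parameter is wrapped in
parentheses before it is subscripted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a libata macro: read word 76 of an
 * IDENTIFY-style buffer.  The parameter is parenthesized so that an
 * argument such as "base + offset" is subscripted as a whole.  Written
 * as id[76], the same argument would expand to base + offset[76],
 * which subscripts the integer and does not compile.
 */
#define id_word76(id) ((id)[76])

/* Return word 76 of a buffer that starts 'offset' words into 'base'. */
static uint16_t word76_at(const uint16_t *base, int offset)
{
	assert(base != NULL);
	return id_word76(base + offset);
}
```

With a 128-word buffer whose element 80 is set, word76_at(buf, 4) reads that
element, because (buf + 4)[76] is buf[80].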


Re: [00/17] Large Blocksize Support V3

2007-04-24 Thread Jörn Engel
On Tue, 24 April 2007 15:21:05 -0700, [EMAIL PROTECTED] wrote:
 
 This patchset modifies the Linux kernel so that larger block sizes than
 page size can be supported. Larger block sizes are handled by using
 compound pages of an arbitrary order for the page cache instead of
 single pages with order 0.

I like to see this.

 2. 32/64k blocksize is also used in flash devices. Same issues.

Actually most chips I encounter these days already have 128KiB.  And
some people seem to do some kind of raid-0 in the drivers to increase
bandwidth.  FS-visible blocksize is also increased by that.

 Unsupported
 - Mmapping blocks larger than page size

Bummer.  Can this change in the future?

 Issues:
 - There are numerous places where the kernel can no longer assume that the
   page cache consists of PAGE_SIZE pages that have not been fixed yet.
 - Defrag warning: The patch set can fragment memory very fast.
   It is likely that Mel Gorman's anti-frag patches and some more
   work by him on defragmentation may be needed if one wants to use
   super sized pages.
   If you run a 2.6.21 kernel with this patch and start a kernel compile
   on a 4k volume with a concurrent copy operation to a 64k volume on
   a system with only 1 Gig then you will go boom (ummm no ... OOM) fast.
   How well Mel's antifrag/defrag methods address this issue still has to
   be seen.

only 1 Gig :)

With my LogFS hat on, I don't care too much whether data is cached in
terms of pages or blocks.  What matters to me most is to get fed
blocksize'd chunks on writeback and be able to read blocksize'd chunks.
Compressing 64KiB at a time gives somewhere around 10% (don't remember
exact number) better compression when compared to 4KiB.  JFFS2 can
benefit from this as well.

That should also be sufficient for cross-platform compatibility,
shouldn't it?

Better performance for the pagecache is also nice to have, no doubt.
But if system stability remains an issue, I'd rather keep slow and
stable.

Jörn

-- 
More computing sins are committed in the name of efficiency (without
necessarily achieving it) than for any other single reason - including
blind stupidity.
-- W. A. Wulf


Re: [1/3] 2.6.21-rc7: known regressions (v2)

2007-04-24 Thread Greg KH
On Wed, Apr 25, 2007 at 02:29:58AM +0200, Adrian Bunk wrote:
 On Tue, Apr 24, 2007 at 05:14:28PM -0700, Greg KH wrote:
  On Tue, Apr 24, 2007 at 11:32:53AM +0200, Wolfgang Erig wrote:
   On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote:
On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote:
 This email lists some known regressions in Linus' tree compared to 
 2.6.20.
 
 If you find your name in the Cc header, you are either submitter of 
 one
 of the bugs, maintainer of an affected subsystem or driver, a patch
 of you caused a breakage or I'm considering you in any other way
 possibly involved with one or more of these issues.
 
 Due to the huge amount of recipients, please trim the Cc when 
 answering.
 
 
 Subject: gammu no longer works
 References : http://lkml.org/lkml/2007/4/20/84
 Submitter  : Wolfgang Erig [EMAIL PROTECTED]
 Status : unknown

I've asked for more information about this, and so far am not sure it's
a real problem.
   
   It is a real problem for me.
   I tried this on 2 different boxes with the same behaviour.
   No sync between my Nokia mobile and Linux with the latest kernel :(
  
  Sorry, I didn't see your response, have followed up on lkml now.
 
 It turned out this was actually a bug in Gammu that will be fixed in 
 the next release of Gammu.

Ah, ok, thanks for letting me know.

But how was the kernel version change triggering it?

greg k-h


Re: AppArmor FAQ

2007-04-24 Thread Joshua Brindle

Crispin Cowan wrote:

David Wagner wrote:
  

James Morris  wrote:
  

[...] you can change the behavior of the application and then bypass 
policy entirely by utilizing any mechanism other than direct filesystem 
access: IPC, shared memory, Unix domain sockets, local IP networking, 
remote networking etc.

  

[...]
  


Just look at their code and their own description of AppArmor.

  

My gosh, you're right.  What the heck?  With all due respect to the
developers of AppArmor, I can't help thinking that that's pretty lame.
I think this raises substantial questions about the value of AppArmor.
What is the point of having a jail if it leaves gaping holes that
malicious code could use to escape?

And why isn't this documented clearly, with the implications fully
explained?

I would like to hear the AppArmor developers defend this design decision.
  


It was a simplicity trade off at the time, when AppArmor was mostly
aimed at servers, and there was no HAL or DBUS. Now it is definitely a
limitation that we are addressing. We are working on a mediation system
for what kind of IPC a confined process can do
http://forge.novell.com/pipermail/apparmor-dev/2007-April/000503.html

  

Also, things like:

   share_mem /usr/bin/firefox r,# /bin/foo can share memory with 
/usr/bin/firefox for read only

clearly show that you aren't using native abstractions for IPC. The 
native abstraction for shared memory would be the key used when creating 
the shared memory segment. The same goes for message queues which are 
noticeably missing from the simplified IPC model.


This, of course, begs the question of whether you are using native
abstractions for profiles at all: processes have nothing to do with the
binary they started from after they've been started. The binary on disk
could be something entirely different from what the running process was
started from.



Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-24 Thread Eric W. Biederman
Karel Zak [EMAIL PROTECTED] writes:

 On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote:
 The following extra security measures are taken for unprivileged
 mounts:
 
  - usermounts are limited by a sysctl tunable
  - force nosuid,nodev mount options on the created mount

  The original userspace user= solution also implies the noexec
  option by default (you can override the default by exec option).
  
  It means the kernel based solution is not fully compatible ;-(

Why noexec?  Either it was a silly or arbitrary decision, or
our kernel design may be incomplete.

Now I can see not wanting to support executables if you are locking
down a system.  The classic "don't execute a program from a CD just because
the CD was stuck in the drive" problem.

So I can see how refusing to execute code from an untrusted source could
prevent exploitation of other problems, and we certainly don't want to
execute such code automatically.

Eric


Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)

2007-04-24 Thread Olivier Galibert
On Tue, Apr 24, 2007 at 04:41:58PM -0700, Linus Torvalds wrote:
 How many different magic ioctl's does the thing introduce? Is it really 
 just *two* entry-points (and how simple are they, interface-wise), and 
 nothing else?

Aren't you a little late to the party here?  The userland version is
the one that currently is in the kernel, after all the people who said
doing it in userland is not necessarily a good idea got happily
ignored.  Suspend2, which is the continuation of the fully-in-kernel one,
is the one that has been constantly rejected by Pavel, lately by
saying it should be done in userspace, and hence never merged.

Incidentally, it's 13 ioctls, and it's documented in
Documentation/power/userland-swsusp.txt on a hard drive near you.  I
especially like the "get the available swap space in bytes" one that
can only handle 32 bits.

  OG.


Re: regression with gammu on 2.6.21-rc7

2007-04-24 Thread Ray Lee

On 4/24/07, Greg KH [EMAIL PROTECTED] wrote:

Also, if you know how to use git, doing a 'git bisect' to try to track
down the problem commit would be very helpful.


Has to do with SIGIO, see this blog post:

http://blog.cihar.com/archives/2007/04/24/kernel_2_6_21_hits_gammu/

Ray


Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.

2007-04-24 Thread H. Peter Anvin

Andi Kleen wrote:

On Tuesday 24 April 2007 23:50:26 David Miller wrote:

From: Ashok Raj [EMAIL PROTECTED]
Date: Tue, 24 Apr 2007 14:38:35 -0700


It's not clear if we have a very generic device breakage.. most devices
on these platforms are going to be more recent, (except maybe some
legacy fd)... 

I'm not so sure, there are some modern sound cards that have
a 31-bit DMA addressing limitation because they use the 31st
bit as a status bit in their DMA descriptors :-)


There's also a 2GB only megaraid RAID controller that's pretty popular 
because Dell shipped it for a long time.




You can probably find almost any possible bitmask if you look long 
enough.  Hardware vendors are notorious for this kind of optimization.


-hpa


Re: [Intel IOMMU][patch 3/8] Generic hardware support for Intel IOMMU.

2007-04-24 Thread Shaohua Li
On Tue, 2007-04-24 at 21:27 +0200, Andi Kleen wrote:
 On Tuesday 24 April 2007 08:03:02 Ashok Raj wrote:
 
  +#ifdef CONFIG_DMAR
  +#ifdef CONFIG_SMP
  +static void dmar_msi_set_affinity(unsigned int irq, cpumask_t mask)
 
 
 Why does it need an own interrupt type?
 
  +
  +config IOVA_GUARD_PAGE
  +   bool "Enables guard page when allocating IO Virtual Address for IOMMU"
  +   depends on DMAR
  +
  +config IOVA_NEXT_CONTIG
  +   bool "Keeps IOVA allocations consequent between allocations"
  +   depends on DMAR && EXPERIMENTAL
 
 Needs reference to Intel and better description
 
 The file should have a high level description what it is good for etc.
 
 Need high level overview over what locks protects what and if there
 is a locking order.
 
 It doesn't seem to enable sg merging? Since you have enough space 
 that should work.
We actually have a patch to do sg merge. In my test, it doesn't have any
performance gain.

  +static char *fault_reason_strings[] =
  +{
  +   "Software",
  +   "Present bit in root entry is clear",
  +   "Present bit in context entry is clear",
  +   "Invalid context entry",
  +   "Access beyond MGAW",
  +   "PTE Write access is not set",
  +   "PTE Read access is not set",
  +   "Next page table ptr is invalid",
  +   "Root table address invalid",
  +   "Context table ptr is invalid",
  +   "non-zero reserved fields in RTP",
  +   "non-zero reserved fields in CTP",
  +   "non-zero reserved fields in PTE",
  +   "Unknown"
  +};
  +
  +#define MAX_FAULT_REASON_IDX   (12)
 
 
 You got 14 of them. better use ARRAY_SIZE
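
A minimal user-space sketch of the ARRAY_SIZE suggestion follows. The table is
shortened and the function name fault_reason_text is hypothetical; the point is
only that the bound is derived from the table itself, so it cannot drift away
from the number of entries the way a hand-maintained constant can.

```c
#include <assert.h>
#include <stddef.h>

/* Same idea as the kernel's ARRAY_SIZE(); defined here for user space. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Shortened, illustrative table; the last entry is the fallback. */
static const char *const fault_reasons[] = {
	"Software",
	"Present bit in root entry is clear",
	"Present bit in context entry is clear",
	"Unknown",
};

/* Map a fault code to its text; out-of-range codes get the fallback. */
static const char *fault_reason_text(unsigned int code)
{
	const size_t last = ARRAY_SIZE(fault_reasons) - 1;

	/* The table must hold at least the fallback entry. */
	assert(ARRAY_SIZE(fault_reasons) >= 1);
	return fault_reasons[code < last ? code : last];
}
```

Adding or removing an entry changes the bound automatically, so no separate
maximum-index define has to be kept in sync.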
 
  +#define IOMMU_NAME_LEN (7)
  +
  +struct iommu {
 
 call it intel_iommu or somesuch even when it's private.
 
  +static int __init intel_iommu_setup(char *str)
  +{
  +   if (!str)
  +   return -EINVAL;
  +   while (*str) {
  +   if (!strncmp(str, "off", 3)) {
  +   dmar_disabled = 1;
  +   printk(KERN_INFO"Intel-IOMMU: disabled\n");
  +   }
  +   str += strcspn(str, ",");
  +   while (*str == ',')
  +   str++;
  +   }
  +   return 0;
  +}
  +__setup("intel_iommu=", intel_iommu_setup);
 
 Why can't you just use the normal iommu=off for this? 
iommu=off disables all IOMMUs; intel_iommu=off disables only the Intel IOMMU.
Isn't it possible that people want to use another IOMMU, like swiotlb?
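
For reference, the parsing loop in the quoted intel_iommu_setup() can be tried
out in user space. This is only a sketch: the function name option_disables is
hypothetical, and it returns a flag instead of setting dmar_disabled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * User-space sketch of the quoted parsing loop.  Returns 1 if any
 * comma-separated token starts with "off", 0 otherwise.  Like the
 * quoted strncmp() check, it also accepts longer tokens such as
 * "offline".
 */
static int option_disables(const char *str)
{
	int disabled = 0;

	assert(str != NULL);
	while (*str) {
		if (!strncmp(str, "off", 3))
			disabled = 1;
		str += strcspn(str, ",");	/* skip to the next comma or the end */
		while (*str == ',')		/* step over one or more commas */
			str++;
	}
	return disabled;
}
```

Because only the first three characters of each token are compared, a stricter
parser would also check that the token ends right after "off".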
  +
  +#define MIN_PGTABLE_PAGES  (10)
  +static mempool_t *pgtable_mempool;
  +#define MIN_DOMAIN_REQ (20)
  +static mempool_t *domain_mempool;
  +#define MIN_DEVINFO_REQ(20)
  +static mempool_t *devinfo_mempool;
 
 Lots of mempools. How much memory does this pin?
 
  +
  +#define alloc_pgtable_page() mempool_alloc(pgtable_mempool, GFP_ATOMIC)
  +#define free_pgtable_page(vaddr) mempool_free(vaddr, pgtable_mempool)
  +#define alloc_domain_mem() mempool_alloc(domain_mempool, GFP_ATOMIC)
  +#define free_domain_mem(vaddr) mempool_free(vaddr, domain_mempool)
  +#define alloc_devinfo_mem() mempool_alloc(devinfo_mempool, GFP_ATOMIC)
  +#define free_devinfo_mem(vaddr) mempool_free(vaddr, devinfo_mempool)
 
 Do we need the macros? Better expand them in the caller.
 
  +static void __iommu_flush_cache(struct iommu *iommu, void *addr, int size)
  +{
  +   if (!ecap_coherent(iommu->ecap))
  +   clflush_cache_range(addr, size);
  +}
  +
  +#define iommu_flush_cache_entry(iommu, addr) \
  +   __iommu_flush_cache(iommu, addr, 8)
  +#define iommu_flush_cache_page(iommu, addr) \
  +   __iommu_flush_cache(iommu, addr, PAGE_SIZE_4K)
 
 Similar.
 
 And the 8 should be probably something more descriptive (sizeof?)
 
  +/* context entry handling */
  +static struct context_entry * device_to_context_entry(struct iommu *iommu,
  +   u8 bus, u8 devfn)
  +{
  +   struct root_entry *root;
  +   struct context_entry *context;
  +   unsigned long phy_addr;
  +   unsigned long flags;
  +
  +   spin_lock_irqsave(&iommu->lock, flags);
  +   root = &iommu->root_entry[bus];
  +   if (!root_present(*root)) {
  +   phy_addr = (unsigned long)alloc_pgtable_page();
 
 A GFP_ATOMIC mempool is rather useless. mempool only works if it can block
 for someone else freeing memory and if it can't do that it's not failsafe.
 I'm afraid you need to revise the allocation strategy -- best would be
 to somehow move the memory allocations outside the spinlock paths
 and preallocate if possible.
The problem is that pci_map_single and friends are usually called with
interrupts disabled or a spinlock held, so we must use GFP_ATOMIC.

 Same problem in other code.
 
  +   if (!dma_pte_present(*pte)) {
  +   tmp = alloc_pgtable_page();
 
 Please don't name variable tmp. I know some other code does it, but it's
 just bad style imho.
 
 
  +   /* Make sure hardware complete it */
  +   start_time = jiffies;
  +   while (1) {
  +   sts = dmar_readl(iommu->reg, DMAR_GSTS_REG);
  +   if (sts & DMA_GSTS_RTPS)
  +   break;
  +   if (time_after(jiffies, start_time + DMAR_OPERATION_TIMEOUT))
  +

Re: [1/3] 2.6.21-rc7: known regressions (v2)

2007-04-24 Thread Adrian Bunk
On Tue, Apr 24, 2007 at 05:51:11PM -0700, Greg KH wrote:
 On Wed, Apr 25, 2007 at 02:29:58AM +0200, Adrian Bunk wrote:
  On Tue, Apr 24, 2007 at 05:14:28PM -0700, Greg KH wrote:
   On Tue, Apr 24, 2007 at 11:32:53AM +0200, Wolfgang Erig wrote:
On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote:
 On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote:
  This email lists some known regressions in Linus' tree compared to 
  2.6.20.
  
  If you find your name in the Cc header, you are either submitter of 
  one
  of the bugs, maintainer of an affected subsystem or driver, a 
  patch
  of you caused a breakage or I'm considering you in any other way
  possibly involved with one or more of these issues.
  
  Due to the huge amount of recipients, please trim the Cc when 
  answering.
  
  
  Subject: gammu no longer works
  References : http://lkml.org/lkml/2007/4/20/84
  Submitter  : Wolfgang Erig [EMAIL PROTECTED]
  Status : unknown
 
 I've asked for more information about this, and so far am not sure 
 it's
 a real problem.

It is a real problem for me.
I tried this on 2 different boxes with the same behaviour.
No sync between my Nokia mobile and Linux with the latest kernel :(
   
   Sorry, I didn't see your response, have followed up on lkml now.
  
  It turned out this was actually a bug in Gammu that will be fixed in 
  the next release of Gammu.
 
 Ah, ok, thanks for letting me know.
 
 But how was the kernel version change triggering it?

I don't know, perhaps a side effect of Eric's work in kernel/signal.c?

The bug in Gammu was:
- Gammu wrongly set FASYNC in a fcntl() call.
- The unhandled SIGIO terminated Gammu in 2.6.21-rc.

Gammu being terminated by the SIGIO seems to be expected and documented 
behavior, and the surprising thing is that it wasn't terminated with 
earlier kernels.
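
For programs that do want signal-driven I/O, the usual pattern is to install a
SIGIO handler before setting the flag, since the default action for SIGIO is to
terminate the process. The following is a sketch under Linux/glibc assumptions
and is not Gammu's code; the names on_sigio, sigio_seen and enable_async_io are
hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* assumption: Linux/glibc feature macros */
#endif
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t sigio_seen;

/* Record that SIGIO arrived; real work belongs outside the handler. */
static void on_sigio(int signo)
{
	(void)signo;
	sigio_seen = 1;
}

/*
 * Enable signal-driven I/O on fd.  The handler is installed before
 * O_ASYNC (the flag also known as FASYNC) is set, so an early SIGIO
 * cannot kill the process.  Returns 0 on success, -1 on error.
 */
static int enable_async_io(int fd)
{
	struct sigaction sa;
	int flags;

	assert(fd >= 0);
	sa.sa_handler = on_sigio;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGIO, &sa, NULL) < 0)
		return -1;

	if (fcntl(fd, F_SETOWN, getpid()) < 0)	/* deliver SIGIO to this process */
		return -1;
	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_ASYNC) < 0 ? -1 : 0;
}
```

A program that has no use for SIGIO should simply not set the flag, which is
the fix described for Gammu above.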

 greg k-h

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed



[PATCH 0/2] Multiqueue network device support

2007-04-24 Thread Peter P Waskiewicz Jr
This is a redesign and repost of the multiqueue network device support patches.
The new API for base drivers allows multiqueue-capable devices to manage their
individual queues in the network stack.  The stack now handles both
non-multiqueue and multiqueue devices on the same codepath.  Also, allocation
and deallocation of the queues is handled by the kernel instead of the driver.

The PRIO qdisc is now modified to run in single-queue mode on multiqueue
devices by default.  A modification to tc is in another patchset being sent
that allows multiqueue behavior to be turned on for PRIO.

Documentation is also included describing in more detail how this works, as
well as how a base driver can use the API to implement multiple queues.

These patches can also be pulled from my git repository at:

git-pull git://lost.foo-projects.org/~ppwaskie/git/netdev-2.6.22 mq

--
Peter P. Waskiewicz Jr.
[EMAIL PROTECTED]


[PATCH 1/2] Adding documentation for the new multiqueue API.

2007-04-24 Thread Peter P Waskiewicz Jr
From: Peter P Waskiewicz Jr [EMAIL PROTECTED]

Signed-off-by: Peter P. Waskiewicz Jr [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 Documentation/networking/multiqueue.txt |   97 +++
 1 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
new file mode 100644
index 000..0bc5222
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,97 @@
+
+   HOWTO for multiqueue network device support
+   ===
+
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO for multiqueue devices
+
+
+Intro: Kernel support for multiqueue devices
+-
+
+Kernel support for multiqueue devices is only an API that is presented to the
+netdevice layer for base drivers to implement.  This feature is part of the
+core networking stack, and all network devices will be running on the
+multiqueue-aware stack.  If a base driver only has one queue, then these
+changes are transparent to that driver.
+
+
+Section 1: Base driver requirements for implementing multiqueue support
+---
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device.  The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today.  Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational.  netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+Finally, the base driver should indicate that it is a multiqueue device.  The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization.  Below is an example from e1000:
+
+#ifdef CONFIG_E1000_MQ
+   if ( (adapter->hw.mac.type == e1000_82571) ||
+        (adapter->hw.mac.type == e1000_82572) ||
+        (adapter->hw.mac.type == e1000_80003es2lan))
+           netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
+
+Section 2: Qdisc support for multiqueue devices
+---
+
+Currently two qdiscs support multiqueue devices: the default qdisc, pfifo_fast,
+and the PRIO qdisc.  The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb->queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
+
+pfifo_fast, being the default qdisc when a device is brought online, will not
+assign a queue mapping, therefore the skb will have a value of zero.  We
+cannot assume anything about the device itself, how many queues it really has,
+etc.  Therefore sending all traffic to queue 0 is the safest thing to do here.
+
+The PRIO qdisc naturally plugs into a multiqueue device.  Upon load of the
+qdisc, PRIO will make a best-effort assignment of queue to PRIO band to evenly
+distribute traffic flows.  The algorithm can be found in prio_tune() in
+net/sched/sch_prio.c.  Once the association is made, any skb that is
+classified will have skb->queue_mapping set, which will allow the driver to
+properly queue skb's to multiple queues.
+
+
+Section 3: Brief howto using PRIO for multiqueue devices
+
+
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs.  To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio multiqueue
+
+This will create 3 bands, 0 being highest priority, and associate those bands
+to the queues on your NIC.  Assuming eth0 has 2 Tx queues, the band mapping
+would look like:
+
+band 0 = queue 0
+band 1 = queue 0
+band 2 = queue 1
+
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands.  For example, ssh traffic will always try to
+go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0.  ICMP traffic (pings) falls into the normal
+traffic classification, which is band 1.  Therefore pings will be sent out
+queue 1 on the NIC.
+
+The behavior of tc filters remains the same, where it will override TOS priority
+classification.
+
+
+Author: Peter P. Waskiewicz Jr. [EMAIL PROTECTED]
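
The three-band, two-queue table in the howto can be reproduced with a simple
proportional mapping. This is an assumption made for illustration: it is not
the prio_tune() algorithm from the patch, and the function name band_to_queue
is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical even spread of PRIO bands over Tx queues.  This is NOT
 * the prio_tune() code from the patch; it is only one mapping that
 * yields band 0 -> queue 0, band 1 -> queue 0, band 2 -> queue 1 for
 * three bands and two queues.
 */
static int band_to_queue(int band, int bands, int queues)
{
	assert(bands > 0 && queues > 0);
	assert(band >= 0 && band < bands);
	return band * queues / bands;	/* integer division rounds down */
}
```

With a single queue every band maps to queue 0, which matches the
non-multiqueue case described in the document.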


[PATCH] IPROUTE: Modify tc for new PRIO multiqueue behavior

2007-04-24 Thread Peter P Waskiewicz Jr
From: Peter P Waskiewicz Jr [EMAIL PROTECTED]

Modified tc so PRIO can now have a multiqueue parameter passed to it.  This
will turn on multiqueue behavior if a device has more than 1 queue.  Also,
running "tc qdisc ls dev <dev>" will display whether multiqueue is on or off.

Signed-off-by: Peter P. Waskiewicz Jr [EMAIL PROTECTED]
---

 include/linux/pkt_sched.h |1 +
 tc/q_prio.c   |9 ++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..bab0b9e 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -99,6 +99,7 @@ struct tc_prio_qopt
 {
int bands;  /* Number of bands */
__u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
+   unsigned short multiqueue;  /* 0 for no mq, 1 for mq */
 };
 
 /* TBF section */
diff --git a/tc/q_prio.c b/tc/q_prio.c
index d696e1b..55cb207 100644
--- a/tc/q_prio.c
+++ b/tc/q_prio.c
@@ -29,7 +29,7 @@
 
 static void explain(void)
 {
-   fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...\n");
+   fprintf(stderr, "Usage: ... prio [multiqueue] bands NUMBER priomap P1 P2...\n");
 }
 
 #define usage() return(-1)
@@ -39,7 +39,7 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
int ok=0;
int pmap_mode = 0;
int idx = 0;
-   struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }};
+   struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 },0};
 
while (argc > 0) {
if (strcmp(*argv, "bands") == 0) {
@@ -57,7 +57,9 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
return -1;
}
pmap_mode = 1;
-   } else if (strcmp(*argv, "help") == 0) {
+   } else if (strcmp(*argv, "multiqueue") == 0)
+   opt.multiqueue = 1;
+   else if (strcmp(*argv, "help") == 0) {
explain();
return -1;
} else {
@@ -105,6 +107,7 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
if (RTA_PAYLOAD(opt)  < sizeof(*qopt))
return -1;
qopt = RTA_DATA(opt);
+   fprintf(f, "multiqueue %s  ", qopt->multiqueue ? "on" : "off");
fprintf(f, "bands %u priomap ", qopt->bands);
for (i=0; i<=TC_PRIO_MAX; i++)
fprintf(f, " %d", qopt->priomap[i]);



RE: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Li, Tong N
 Could you explain for the audience the technical definition of fairness
 and what sorts of error metrics are commonly used? There seems to be
 some disagreement, and you're neutral enough of an observer that your
 statement would help.

The definition for proportional fairness assumes that each thread has a
weight, which, for example, can be specified by the user, or something mapped
from thread priorities, nice values, etc. A scheduler achieves ideal
proportional fairness if (1) it is work-conserving, i.e., it never
leaves a processor idle if there are runnable threads, and (2) for any
two threads, i and j, in any time interval, the ratio of their CPU time
is greater than or equal to the ratio of their weights, assuming that
thread i is continuously runnable in the entire interval and both
threads have fixed weights throughout the interval. A corollary of this
is that if both threads i and j are continuously runnable with fixed
weights in the time interval, then the ratio of their CPU time should be
equal to the ratio of their weights. This definition is pretty
restrictive since it requires the properties to hold for any thread in
any interval, which is not feasible. In practice, all algorithms try to
approximate this ideal scheduler (often referred to as Generalized
Processor Sharing or GPS). Two error metrics are often used:

(1) lag(t): for any interval [t1, t2], the lag of a thread at time t \in
[t1, t2] is S'(t1, t) - S(t1, t), where S' is the CPU time the thread
would receive in the interval [t1, t] under the ideal scheduler and S is
the actual CPU time it receives under the scheduler being evaluated.

(2) The second metric doesn't really have an agreed-upon name. Some call
it fairness measure and some call it something else. Anyway, different from
lag, which is kind of an absolute measure for one thread, this metric
(call it F) defines a relative measure between two threads over any time
interval:

F(t1, t2) = S_i(t1, t2) / w_i - S_j(t1, t2) / w_j,

where S_i and S_j are the CPU time the two threads receive in the
interval [t1, t2] and w_i and w_j are their weights, assuming both
weights don't change throughout the interval.

The goal of a proportional-share scheduling algorithm is to minimize the
above metrics. If the lag function is bounded by a constant for any
thread in any time interval, then the algorithm is considered to be
fair. You may notice that the second metric is actually weaker than the
first. In fact, if an algorithm achieves a constant lag bound, it must
also achieve a constant bound for the second metric, but the reverse is
not necessarily true. But in some settings, people have focused on the
second metric and still consider an algorithm to be fair as long as the
second metric is bounded by a constant.
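
To make the two metrics concrete, here is a small sketch in C. The function
names are hypothetical, and ideal_share() assumes a single CPU with a fixed set
of continuously runnable threads, which is a simplification of the general GPS
model described above.

```c
#include <assert.h>

/*
 * lag = S' - S: the CPU time the ideal scheduler would have given the
 * thread minus the CPU time it actually received (same interval and
 * same units for both).
 */
static double lag(double ideal_cpu, double actual_cpu)
{
	return ideal_cpu - actual_cpu;
}

/*
 * Ideal share of a thread that is runnable for the whole interval on
 * one CPU, when the runnable set and all weights stay fixed: the
 * interval length scaled by the thread's fraction of the total weight.
 */
static double ideal_share(double interval, double weight, double total_weight)
{
	assert(weight > 0 && total_weight >= weight);
	return interval * weight / total_weight;
}

/* F = S_i / w_i - S_j / w_j for two threads over the same interval. */
static double relative_fairness(double s_i, double w_i, double s_j, double w_j)
{
	assert(w_i > 0 && w_j > 0);
	return s_i / w_i - s_j / w_j;
}
```

For example, take two threads with weights 2 and 1 over a 90 ms interval. The
ideal shares are 60 ms and 30 ms. If they actually receive 50 ms and 40 ms, the
lags are 10 ms and -10 ms, and F = 50/2 - 40/1 = -15.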

 
 On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
  I understand that via experiments we can show a design is reasonably
  fair in the common case, but IMHO, to claim that a design is fair, there
  needs to be some kind of formal analysis on the fairness bound, and this
  bound should be proven to be constant. Even if the bound is not
  constant, at least this analysis can help us better understand and
  predict the degree of fairness that users would experience (e.g., would
  the system be less fair if the number of threads increases? What happens
  if a large number of threads dynamically join and leave the system?).
 
 Carrying out this sort of analysis on various policies would help, but
 I'd expect most of them to be difficult to analyze. cfs' current
 ->fair_key computation should be simple enough to analyze, at least
 ignoring nice numbers, though I've done nothing rigorous in this area.
 

If we can derive some invariants from the algorithm, it'd help the
analysis. An example is the deficit round-robin (DRR) algorithm in
networking. Its analysis relies on the fact that the number of rounds
any two flows (threads, in our case) go through in any time interval
differs by at most one.
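
As a rough illustration of that invariant, here is a toy DRR loop in C. It is a sketch under simplifying assumptions: every flow is always backlogged, all work units have the same size, and the names (drr_flow, drr_visit, drr_run) are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

struct drr_flow {
    unsigned quantum;      /* credit added per round, proportional to weight */
    unsigned deficit;      /* unused credit carried over to the next round */
    unsigned rounds;       /* how many rounds this flow has gone through */
    unsigned long served;  /* total service received so far */
};

/* One visit is one round for this flow: add the quantum, then serve
 * whole units while enough credit remains. */
static void drr_visit(struct drr_flow *f, unsigned unit)
{
    assert(unit > 0); /* a zero-sized unit would never drain the deficit */
    f->rounds++;
    f->deficit += f->quantum;
    while (f->deficit >= unit) {
        f->deficit -= unit;
        f->served += unit;
    }
}

/* Visit the flows in fixed round-robin order, starting at flow 0. */
static void drr_run(struct drr_flow *flows, size_t n,
                    unsigned visits, unsigned unit)
{
    for (unsigned v = 0; v < visits; v++)
        drr_visit(&flows[v % n], unit);
}
```

Because the flows are visited in a fixed cyclic order, stopping the loop at any point leaves round counts that differ by at most one, and the leftover deficit of each flow stays below one unit. Those two facts are what bound the service difference between flows.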

Hope you didn't get bored by all of this. :)

  tong


[PATCH 2/2] NET: [UPDATED] Multiqueue network device support implementation.

2007-04-24 Thread Peter P Waskiewicz Jr
From: Peter P Waskiewicz Jr [EMAIL PROTECTED]

Update: Fixed band2queue mapping logic - it was reversed with prio2band.
Added support in the PRIO qdisc to allow tc to turn on multiqueue behavior,
while keeping original PRIO behavior by default.  Fixed where
skb->queue_mapping is being reset (prior to q->enqueue() ).

Added an API and associated supporting routines for multiqueue network devices.
This allows network devices supporting multiple TX queues to configure each
queue within the netdevice and manage each queue independently.  Changes to the
PRIO Qdisc also allow a user to map multiple flows to individual TX queues,
taking advantage of each queue on the device.

Signed-off-by: Peter P. Waskiewicz Jr [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 include/linux/etherdevice.h |3 +-
 include/linux/netdevice.h   |   66 ++-
 include/linux/pkt_sched.h   |1 +
 include/linux/skbuff.h  |2 +
 net/core/dev.c  |   28 +++---
 net/core/skbuff.c   |3 ++
 net/ethernet/eth.c  |9 +++---
 net/sched/sch_generic.c |4 +--
 net/sched/sch_prio.c|   66 +++
 9 files changed, 162 insertions(+), 20 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 745c988..446de39 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void   eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int eth_header_cache(struct neighbour *neigh,
 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 static inline void eth_copy_and_sum (struct sk_buff *dest, 
 const unsigned char *src, 
 int len, int base)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 584c199..6829880 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+   /* Give a control state for each queue.  This struct may contain
+* per-queue locks in the future.
+*/
+   unsigned long   state;
+};
+
 /*
  * Network device statistics. Akin to the 2.0 ether stats but
  * with byte counters.
@@ -326,6 +334,7 @@ struct net_device
 #define NETIF_F_GSO            2048    /* Enable software GSO. */
 #define NETIF_F_LLTX           4096    /* LockLess TX */
 #define NETIF_F_INTERNAL_STATS 8192    /* Use stats structure in net_device */
+#define NETIF_F_MULTI_QUEUE    16384   /* Has multiple TX/RX queues */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -538,6 +547,14 @@ struct net_device
struct device   dev;
/* space for optional statistics and wireless sysfs groups */
struct attribute_group  *sysfs_groups[3];
+
+   /* To retrieve statistics per subqueue - FOR FUTURE USE */
+   struct net_device_stats* (*get_subqueue_stats)(struct net_device *dev,
+   int queue_index);
+
+   /* The TX queue control structures */
+   struct net_device_subqueue  *egress_subqueue;
+   int egress_subqueue_count;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -679,6 +696,48 @@ static inline int netif_running(const struct net_device *dev)
return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+   clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+ u16 queue_index)
+{
+   return test_bit(__LINK_STATE_XOFF,
+   &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   if (test_and_clear_bit(__LINK_STATE_XOFF,
+  

Re: Kernel traces coming back with trash/clutter

2007-04-24 Thread John Anthony Kazos Jr.
   I am getting this odd content in the trace log (dmesg), and I cannot
   figure out what it is or why it is there.
  
   7
   7
   7
   7
   7__bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max
   802525d8 generic_make_request(bio 01017c745300) 50729472, 704
  
  Perhaps you have something looping that's outputting KERN_DEBUG with a
  null message? Or one of your diagnostic printk statements includes
  KERN_DEBUG with no actual message?
  
 No, they are all KERN_DEBUG<space>"some string here", almost all with
 some formatted output as well.  Could I be overloading the printk
 output buffer, as in possibly too tightly repeated/looped code to be
 able to output it all?

It is possible, I suppose. Is what you're working on open-source? If so, 
you could send it to me and I could try and reproduce it here and track it 
down. If you want me to, that is. (If you do send, please include a 
.config.)

Otherwise, I couldn't tell you what it might be. Make sure all your 
messages end with '\n', make sure you're not accidentally using the wrong 
formatting codes and it's backing over previous output with ^H or 
something. You could confirm or rule out the possibility of overflowing 
the printk buffers by writing a dummy module with a tight loop of nothing 
but printk statements with counters to see if you can get it to asplode.


Reasons to merge suspend2.

2007-04-24 Thread Nigel Cunningham
Hi all.

I've been working on this email on and off for a while, but since Pavel
raised the issue again, I thought I should make a concerted effort to
finish it...

In this email, I'm going to outline the problems with the current design
(uswsusp and swsusp) and the ways in which Suspend2 overcomes those
limitations, before going on to outline the additional advantages
Suspend2 has for users and address objections previously raised against
merging Suspend2.

A) Problems with the current design.


1) Ordering of operations.

The current [u]swsusp design doesn't do things in discrete, well ordered
stages. Storage for the image is not allocated until after the atomic
copy has been done. This means that the process can fail when we are a
significant portion of the way into suspending, and it means it can fail
when the user will seriously expect it to run to completion. The
solution to this issue is simple: separate preparing to suspend from
actually writing the image. In the preparation step, ensure, so far as
you are able, that there will be sufficient memory and sufficient
storage to complete the process, and don't write anything or do any
atomic copying until after that has been done.

The only valid objection I can think of is that you can't know for
certain prior to doing the atomic copy how much memory & storage will be
needed for allocations by driver suspend methods. That can be addressed
by a simple extension of the driver model, wherein drivers could report
how many pages they will need. (If slab will be needed, the worst case
can be assumed). Rafael's notify patches (recently posted) also help in
that area.

Once processes are frozen, all significant memory usage can be accounted
for, because the process doing the suspending will be the only one
allocating memory.

2) Limit on image size.

The current implementation limits the size of an image to an absolute
maximum of half the amount of ram. This is certainly an improvement over
the old days where it sought to free everything it could, but it's still
not good enough. Current memory freeing code doesn't free the exact
amount requested; often far more than has been requested is freed. This
does not only result in a smaller image. It also means the system is
proportionately less responsive on resume at whatever stage that those
pages are needed again. A full image is certainly not needed by
everyone. Those with huge amounts of memory, very fast storage devices
or particular memory usage patterns may, quite rightly, not want to
store the whole lot in an image. This doesn't mean, however, that those
who want or need (from their perspective) a full image of memory
shouldn't be able to have it. It just adds to the argument for making it
tunable (which swsusp has done too).

3) Lack of provision for tuning to individual needs.

Swsusp historically included very little provision whatsoever for the
user to tune their configuration. This has recently begun to change, and
I applaud that. But it needs to go further. Suspending to disk is not a
one-size-fits-all situation. People have different hardware
configurations, with the result being that some people benefit from
compression while others do better without it. Some people want
encryption in a particular configuration while others don't care about
encryption at all. Some people want to limit the image size, others
don't. Sometimes a user might want to reboot instead of powering down
(dual booting). All of this should be doable, without having to hack the
code or recompile the kernel, and should be as simple as possible.
Suspend2, via its /sys/power/suspend2 interface and hibernate-script
porcelain, makes this easy.

4) No support for multiple swap devices / non swap storage.

Until recently, [u]swsusp supported a single swap partition only.
Support for a swap file has been added, but [u]swsusp still supports
only one swap device at a time. For most people, this is adequate, but
this doesn't mean everyone should be forced to fit this mould.

[u]swsusp also lacks support for storage to non-swap. Particularly in
systems that rely on swap for normal activity, this can make [u]swsusp
less reliable. The amount of swap available varies according to
workload, so sometimes the user will be unable to suspend. To address
this raciness/competition against other swap usage, Suspend2 supports
writing to a generic file, either a partition or a file on an ordinary
partition.

B) Further advantages of Suspend2.
==

1) Improvements over swsusp.


a) Modular design.

Parts of Suspend2 implement support for storing an image in swap or in a
file, using cryptoapi for compression and/or encryption and talking to a
userspace user interface via a netlink socket. Suspend2 works just fine
without CONFIG_SWAP, CONFIG_NET and/or CONFIG_CRYPTOAPI, however,
because it uses a modular design wherein support for these subsystems is
abstracted 

RE: [PATCH] x86_64/acpi: make kernel to be compiled when CONFIG_ACPI_NUMA is set and power management with acpi is not enabled

2007-04-24 Thread Yinghai Lu

-Original Message-
From: Len Brown [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 10, 2007 1:33 AM


Let me know if you have one that doesn't.


Please check this one. It will not compile.


grep ACPI .config
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_ACPI=y
CONFIG_ACPI_NUMA=y
# CONFIG_PNPACPI is not set
# CONFIG_BLK_DEV_IDEACPI is not set
CONFIG_SATA_ACPI=y

YH 


#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.21-rc7
# Tue Apr 24 18:27:37 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=-smp
CONFIG_LOCALVERSION_AUTO=y
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_CPUSETS is not set
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_KMOD is not set
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED=anticipatory

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_NODES_SHIFT=6
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
CONFIG_DISCONTIGMEM_MANUAL=y
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_DISCONTIGMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_NEED_MULTIPLE_NODES=y
# CONFIG_SPARSEMEM_STATIC is not set
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MIGRATION=y
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y
CONFIG_NR_CPUS=255
# CONFIG_HOTPLUG_CPU is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
# CONFIG_HPET_EMULATE_RTC is not set
CONFIG_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x20
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
# CONFIG_REORDER is not set
CONFIG_K8_NB=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y
CONFIG_GENERIC_PENDING_IRQ=y

#
# Power management options
#
# CONFIG_PM is not set
CONFIG_ACPI=y
CONFIG_ACPI_NUMA=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=y
# 

Re: [PATCH 2/2] Align ZONE_MOVABLE to a MAX_ORDER_NR_PAGES boundary

2007-04-24 Thread Yasunori Goto
Looks good. :-)
Thanks.

Acked-by: Yasunori Goto [EMAIL PROTECTED]


 
 The boot memory allocator makes assumptions on the alignment of zone
 boundaries even though the buddy allocator has no requirements on the
 alignment of zones. This may cause boot problems in situations where
 ZONE_MOVABLE is populated because the bootmem allocator assumes zones are
 at least order-log2(BITS_PER_LONG) aligned. As the two potential users
 (huge pages and memory hot-remove) of ZONE_MOVABLE would prefer a higher
 alignment, this patch aligns the start of the zone instead of fixing the
 different assumptions made by the bootmem allocator.
 
 This patch rounds the start of ZONE_MOVABLE in each node to a
 MAX_ORDER_NR_PAGES boundary. If the rounding pushes the start of ZONE_MOVABLE
 above the end of the node then the zone will contain no memory and will not
 be used at runtime. The value is rounded up instead of down as it is
 better to have the kernel-portion of memory larger than requested instead
 of smaller. The impact is that the kernel-usable portion of memory becomes a
 minimum guarantee instead of the exact size requested by the user.
 
 
 Signed-off-by: Mel Gorman [EMAIL PROTECTED]
 Acked-by: Andy Whitcroft [EMAIL PROTECTED]
 ---
 
  page_alloc.c |5 +
  1 files changed, 5 insertions(+)
 
 diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
 linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c 
 linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c
 --- linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c  2007-04-24 
 09:38:30.0 +0100
 +++ linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c 2007-04-24 
 11:15:40.0 +0100
 @@ -3642,6 +3642,11 @@ restart:
   usable_nodes--;
   if (usable_nodes && required_kernelcore > usable_nodes)
   goto restart;
 + 
 + /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
 + for (nid = 0; nid < MAX_NUMNODES; nid++)
 + zone_movable_pfn[nid] =
 + roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
  }
  
  /**
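
For anyone following the arithmetic, the rounding in the hunk above can be sketched in user space as below. This is only an illustration: roundup_ul mirrors the kernel's roundup() macro, and the 1024-page value is an assumed example (MAX_ORDER_NR_PAGES is 1 << (MAX_ORDER - 1), which gives 1024 when MAX_ORDER is 11); the helper names are hypothetical.

```c
#include <assert.h>

/* Same arithmetic as the kernel's roundup() macro: the smallest
 * multiple of y that is greater than or equal to x. */
static unsigned long roundup_ul(unsigned long x, unsigned long y)
{
    assert(y != 0);
    return ((x + y - 1) / y) * y;
}

/* Assumed example value, not read from any kernel configuration. */
#define EXAMPLE_MAX_ORDER_NR_PAGES 1024UL

/* Round a candidate ZONE_MOVABLE start pfn up to the next boundary. */
static unsigned long align_movable_start(unsigned long pfn)
{
    return roundup_ul(pfn, EXAMPLE_MAX_ORDER_NR_PAGES);
}
```

Rounding up rather than down is what turns the requested kernel-usable size into a minimum guarantee: the start of ZONE_MOVABLE can only move higher, so the kernel portion can only grow.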

-- 
Yasunori Goto 




Re: [PATCH 06/25] xen: Core Xen implementation

2007-04-24 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
 On Monday 23 April 2007 23:56:44 Jeremy Fitzhardinge wrote:
   
 Core Xen Implementation

 This patch is a rollup of all the core pieces of the Xen
 implementation, including booting, memory management, interrupts, time
 and so on.
 

 The patch is definitely too big.
   

Yes.  It was originally smaller patches, which I tried to keep in a
state where everything was incrementally buildable, but it got too hard
to keep it all together.

I guess I can break it down into functional groups, and put the config
stuff at the end.

 +#ifdef CONFIG_XEN
 +/* Xen only supports sysenter/sysexit in ring0 guests,
 +   and only if it the guest asks for it.  So for now,
 +   this should never be used. */
 +ENTRY(xen_sti_sysexit)
 +CFI_STARTPROC
 +ud2
 +CFI_ENDPROC
 +ENDPROC(xen_sti_sysexit)
 

 Put that elsewhere? It doesn't need to be here.
   

Yes, I can drop it.  It's not needed in this kernel.

 +++ b/arch/i386/xen/enlighten.c
 @@ -0,0 +1,727 @@
 

 Comments describing what all the files do? 
   

OK.

 +unsigned maskedx = ~0;
 +if (*eax == 1)
 +maskedx = ~((1 << X86_FEATURE_APIC) |
 +(1 << X86_FEATURE_ACPI) |
 +(1 << X86_FEATURE_ACC));
 

 Why ACC? 

 And why doesn't Xen mask those by itself?
   

Because it doesn't care whether they're set or not.  I'm suppressing
them here to prevent the kernel from trying to use these features.  I
suppress ACC in particular to stop the P4 thermal interrupt code from
trying to do anything.

I'll comment it.

 And you got apic functions later which would be never called?
 Why are the hooks needed then? 
   

They aren't.  They're only for VMI.  I've only got them to make sure
that there are no stray APIC usages.

 +
 +static unsigned long xen_save_fl(void)
 +{
 +struct vcpu_info *vcpu;
 +unsigned long flags;
 +
 +preempt_disable();
 +vcpu = x86_read_percpu(xen_vcpu);
 +/* flag has opposite sense of mask */
 +flags = !vcpu->evtchn_upcall_mask;
 +preempt_enable();
 

 If you use get_cpu/put_cpu it will be optimized away on PREEMPT  !SMP
 (more occurrences)
   

Won't preempt_disable disappear as well?  I don't need the CPU number.

 +static void xen_restore_fl(unsigned long flags)
 +{
 +struct vcpu_info *vcpu;
 +
 +preempt_disable();
 +
 +/* convert from IF type flag */
 +flags = !(flags & X86_EFLAGS_IF);
 +vcpu = x86_read_percpu(xen_vcpu);
 +vcpu->evtchn_upcall_mask = flags;
 +if (flags == 0) {
 +barrier(); /* unmask then check (avoid races) */
 

 Don't you need a rmb() here then? The CPU could speculate reads
 (more occurrences) 
   

OK.

 +if (unlikely(vcpu->evtchn_upcall_pending))
 +force_evtchn_callback();
 +preempt_enable();
 +} else
 +preempt_enable_no_resched();
 +}
 +
 +static void xen_irq_disable(void)
 +{
 +struct vcpu_info *vcpu;
 +preempt_disable();
 +vcpu = x86_read_percpu(xen_vcpu);
 +vcpu->evtchn_upcall_mask = 1;
 +preempt_enable_no_resched();
 

 First with the new per cpu the preempt disable shouldn't be needed
 anymore because the thing is atomic. In the worst case you do 
 the change on the previous CPU, but that can happen anyways after
 preempt_enable
   

No, there's a one instruction preempt window there.  If I do:

mov %fs:xen_vcpu, %eax
movb $1,1(%eax)

and a preempt happens in between, then the interrupt will be disabled on
the wrong cpu.

Once we can put the vcpu structure into the percpu area directly, then I
can do:

movb $1,%fs:xen_vcpu+1

which is preempt-safe, of course.

 And then when you have enabled who transfers the irq off state to the
 new CPU? 
   

I don't follow you.

 +static void xen_halt(void)
 +{
 +#if 0
 +if (irqs_disabled())
 +HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
 +#endif
 +}
 

 Who halts then?
   

I fix this up in the xen-machine-ops.patch.

 +static void xen_load_gdt(const struct Xgt_desc_struct *dtr)
 +{
 +unsigned long *frames;
 +unsigned long va = dtr->address;
 +unsigned int size = dtr->size + 1;
 +int f;
 +struct multicall_space mcs;
 +
 +BUG_ON(size > 16*PAGE_SIZE);
 

 Why 16?
   

I'll make it more explicit.  It's 64k of GDT entries == 16 pages.

 +count = desc->size / 8;
 +BUG_ON(count > 256);
 

 should be >= ?
   

I think 256 idt entries is OK, but it should be (desc->size+1) / 8.
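
The off-by-one is easy to check in isolation. The sketch below is plain user-space C with invented helper names, assuming 8-byte descriptors and 4 KiB pages; it treats the descriptor-table size field as a limit (table bytes minus one), which is why the +1 matters.

```c
#include <assert.h>

#define DESC_BYTES 8u    /* each GDT/IDT entry is 8 bytes on i386 */
#define PAGE_BYTES 4096u /* assuming 4 KiB pages */

/* The 16-bit size field is a limit: number of table bytes minus one. */
static unsigned table_bytes(unsigned short limit)
{
    return (unsigned)limit + 1;
}

static unsigned table_entries(unsigned short limit)
{
    return table_bytes(limit) / DESC_BYTES;
}

static unsigned table_pages(unsigned short limit)
{
    return (table_bytes(limit) + PAGE_BYTES - 1) / PAGE_BYTES;
}

/* Mirrors the intent of the BUG_ON in xen_load_gdt: a GDT never
 * spans more than 16 pages (64k of entries). */
static void check_gdt_limit(unsigned short limit)
{
    assert(table_pages(limit) <= 16);
}
```

With a full 256-entry IDT the limit is 256*8 - 1 = 2047, so dividing the raw limit by 8 gives 255 while (limit + 1) / 8 gives the expected 256.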

 +static void xen_set_iopl_mask(unsigned mask)
 +{
 +#if 0
 +struct physdev_set_iopl set_iopl;
 +
 +/* Force the change at ring 0. */
 +set_iopl.iopl = (mask == 0) ? 1 : (mask >> 12) & 3;
 +HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
 +#endif
 

 And who does iopl then?
   

Nobody at the moment.  I don't think there's much need for it in an
unprivileged Xen domU.  I could just nop it out for now.

 + * Page-directory addresses above 4GB do not fit into architectural 

Re: [00/17] Large Blocksize Support V3

2007-04-24 Thread William Lee Irwin III
On Tue, Apr 24, 2007 at 03:21:05PM -0700, [EMAIL PROTECTED] wrote:
 V2-V3
 - More restructuring
 - It actually works!
 - Add XFS support
 - Fix up UP support
 - Work out the direct I/O issues
 - Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
   back to constants. Disabled for 32bit and HIGHMEM configurations.
   This also allows a gradual migration to the new page cache
   inline functions. LARGE_BLOCKSIZE capabilities can be
   added gradually and if there is a problem then we can disable
   a subsystem.

Excellent, I'll do some testing here at the very least.


-- wli


[PATCH] drivers/ata: remove the wildcard from sata_nv driver

2007-04-24 Thread Peer Chen
NVIDIA SATA controllers from now on are based on AHCI, so the wildcard
entries in the sata_nv driver are unnecessary.
The wildcard also sometimes causes the sata_nv driver to be loaded for AHCI
controllers, which is not intended.

Signed-off-by: Peer Chen [EMAIL PROTECTED]

=

--- linux-2.6.21-rc7/drivers/ata/sata_nv.c.orig
+++ linux-2.6.21-rc7/drivers/ata/sata_nv.c
@@ -285,12 +285,6 @@ static const struct pci_device_id nv_pci
{ PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP61_SATA),
GENERIC },
{ PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP61_SATA2),
GENERIC },
{ PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP61_SATA3),
GENERIC },
-   { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
-   PCI_ANY_ID, PCI_ANY_ID,
-   PCI_CLASS_STORAGE_IDE<<8, 0x00, GENERIC },
-   { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
-   PCI_ANY_ID, PCI_ANY_ID,
-   PCI_CLASS_STORAGE_RAID<<8, 0x00, GENERIC },
 
{ } /* terminate list */
 };


[RFC][PATCH] syctl for selecting global zonelist[] order

2007-04-24 Thread KAMEZAWA Hiroyuki
Make zonelist policy selectable from sysctl.

Assume a 2-node NUMA system where only node(0) has ZONE_DMA (ZONE_DMA32).

In this case, the default (node0's) zonelist order is

Node(0)'s NORMAL -> Node(0)'s DMA -> Node(1)'s NORMAL.

This means Node(0)'s DMA is used before Node(1)'s NORMAL.

On some servers, applications make large memory allocations, which
exhaust memory in the above order. An OOM kill will then sometimes occur
when a 32-bit device requires memory.

This patch adds a sysctl for rebuilding the zonelist after boot and doesn't
change the default zonelist order.

command:
% echo 0 > /proc/sys/vm/better_locality

will rebuild the zonelist in the following order:

Node(0)'s NORMAL -> Node(1)'s NORMAL -> Node(0)'s DMA.

If better_locality == 1 (the default), the zonelist is
Node(0)'s NORMAL -> Node(0)'s DMA -> Node(1)'s NORMAL.

This may be useful for users with heavy memory pressure and mlocks.

Tested on an ia64 2-node NUMA system against 2.6.21-rc7; works well.
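
The two orderings can be reproduced with a small user-space toy. This is only a sketch of the idea, not the code in the patch: the toy_* names are invented, there are just two zone types, and the caller supplies the nearest-first node order directly instead of computing it from node distances.

```c
#include <assert.h>
#include <stddef.h>

enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_NR_ZONES };

struct toy_zoneref {
    int node;
    enum toy_zone zone;
};

/* has_zone[node][zone] is nonzero if that node has memory in that zone.
 * node_order lists nodes nearest-first for the node being built.
 * Returns the number of entries written to out. */
static size_t toy_build_zonelist(int has_zone[][TOY_NR_ZONES],
                                 const int *node_order, size_t nnodes,
                                 int better_locality,
                                 struct toy_zoneref *out)
{
    size_t n = 0;

    assert(out != NULL && node_order != NULL);

    if (better_locality) {
        /* Node first: use up every zone of a near node before
         * falling back to the next node. */
        for (size_t i = 0; i < nnodes; i++)
            for (int z = TOY_NR_ZONES - 1; z >= 0; z--)
                if (has_zone[node_order[i]][z]) {
                    out[n].node = node_order[i];
                    out[n].zone = (enum toy_zone)z;
                    n++;
                }
    } else {
        /* Zone first: use the highest zone on all nodes before
         * touching any lower zone, which keeps DMA for last. */
        for (int z = TOY_NR_ZONES - 1; z >= 0; z--)
            for (size_t i = 0; i < nnodes; i++)
                if (has_zone[node_order[i]][z]) {
                    out[n].node = node_order[i];
                    out[n].zone = (enum toy_zone)z;
                    n++;
                }
    }
    return n;
}
```

For the 2-node case described above (DMA only on node 0), the node-first loop yields NORMAL(0), DMA(0), NORMAL(1) and the zone-first loop yields NORMAL(0), NORMAL(1), DMA(0).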

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]

Index: linux-2.6.21-rc7/kernel/sysctl.c
===
--- linux-2.6.21-rc7.orig/kernel/sysctl.c
+++ linux-2.6.21-rc7/kernel/sysctl.c
@@ -76,6 +76,9 @@ extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
 extern int percpu_pagelist_fraction;
 extern int compat_log;
+#ifdef CONFIG_NUMA
+extern int sysctl_better_locality;
+#endif
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -845,6 +848,15 @@ static ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+   {
+   .ctl_name   = VM_BETTER_LOCALITY,
+   .procname   = "better_locality",
+   .data   = &sysctl_better_locality,
+   .maxlen = sizeof(sysctl_better_locality),
+   .mode   = 0644,
+   .proc_handler   = &sysctl_better_locality_handler,
+   .strategy   = &sysctl_intvec,
+   },
 #endif
 #if defined(CONFIG_X86_32) || \
(defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
Index: linux-2.6.21-rc7/mm/page_alloc.c
===
--- linux-2.6.21-rc7.orig/mm/page_alloc.c
+++ linux-2.6.21-rc7/mm/page_alloc.c
@@ -1670,7 +1670,7 @@ static int __meminit build_zonelists_nod
 
 #ifdef CONFIG_NUMA
 #define MAX_NODE_LOAD (num_online_nodes())
-static int __meminitdata node_load[MAX_NUMNODES];
+static int node_load[MAX_NUMNODES];
 /**
  * find_next_best_node - find the next node that should appear in a given 
node's fallback list
  * @node: node whose fallback list we're appending
@@ -1685,7 +1685,7 @@ static int __meminitdata node_load[MAX_N
  * on them otherwise.
  * It returns -1 if no node is found.
  */
-static int __meminit find_next_best_node(int node, nodemask_t *used_node_mask)
+static int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
int n, val;
int min_val = INT_MAX;
@@ -1731,7 +1731,10 @@ static int __meminit find_next_best_node
return best_node;
 }
 
-static void __meminit build_zonelists(pg_data_t *pgdat)
+/*
+ * Build zonelists based on node locality.
+ */
+static void build_zonelists_locality_aware(pg_data_t *pgdat)
 {
int j, node, local_node;
enum zone_type i;
@@ -1780,6 +1783,78 @@ static void __meminit build_zonelists(pg
}
 }
 
+/*
+ * Build zonelist based on zone priority.
+ */
+static int node_order[MAX_NUMNODES];
+static void build_zonelists_zone_aware(pg_data_t *pgdat)
+{
+   int i, j, pos, zone_type, node, load;
+   nodemask_t used_mask;
+   int local_node, prev_node;
+   struct zone *z;
+   struct zonelist *zonelist;
+
+   for (i = 0; i < MAX_NR_ZONES; i++) {
+   zonelist = pgdat->node_zonelists + i;
+   zonelist->zones[0] = NULL;
+   }
+   memset(node_order, 0, sizeof(node_order));
+   local_node = pgdat->node_id;
+   load = num_online_nodes();
+   prev_node = local_node;
+   nodes_clear(used_mask);
+   j = 0;
+   while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+   int distance = node_distance(local_node, node);
+   if (distance > RECLAIM_DISTANCE)
+   zone_reclaim_mode = 1;
+   if (distance != node_distance(local_node, prev_node))
+   node_load[node] = load;
+   node_order[j++] = node;
+   prev_node = node;
+   load--;
+   }
+   /* calculate node order */
+   for (i = 0; i < MAX_NR_ZONES; i++) {
+   zonelist = pgdat->node_zonelists + i;
+   pos = 0;
+   for (zone_type = i; zone_type >= 0; zone_type--) {
+   for (j = 0; j < num_online_nodes(); j++) {
+   node = node_order[j];
+   z = &NODE_DATA(node)->node_zones[zone_type];

Re: regression with gammu on 2.6.21-rc7

2007-04-24 Thread Greg KH
On Tue, Apr 24, 2007 at 06:12:33PM -0700, Ray Lee wrote:
  On 4/24/07, Greg KH [EMAIL PROTECTED] wrote:
  Also, if you know how to use git, doing a 'git bisect' to try to track
  down the problem commit would be very helpful.
 
  Has to do with SIGIO, see this blog post:
 
  http://blog.cihar.com/archives/2007/04/24/kernel_2_6_21_hits_gammu/

Ah, thank you very much, that makes more sense as nothing changed in
that usb-serial driver and I was starting to get a bit worried...

thanks,

greg k-h


Ramdisk Vs NFS

2007-04-24 Thread Siva Prasad

Hi,

What is the primary difference between Ramdisk and NFS with respect to
the wait_queue's?

If I use ramdisk, everything works fine, but with NFS (or you may read
that as 'no ramdisk') the kernel/sched.c:__wake_up_common() routine has a
problem. Basically the value of q->task_list->next is outside our
memory range (not between 0xc000 and 0xF000), and this causes
trouble by accessing non-existent memory. Why would this happen?

Interesting thing is, this happens much before we even load the ramdisk
drivers.

Appreciate if any one has some insight into this. At least a pointer to
where to start looking would be great.

Thanks
Siva



Re: [PATCH] IPROUTE: Modify tc for new PRIO multiqueue behavior

2007-04-24 Thread Stephen Hemminger

Peter P Waskiewicz Jr wrote:

From: Peter P Waskiewicz Jr [EMAIL PROTECTED]

Modified tc so PRIO can now have a multiqueue parameter passed to it.  This
will turn on multiqueue behavior if a device has more than 1 queue.  Also,
running tc qdisc ls dev <dev> will display if multiqueue is on or off.

Signed-off-by: Peter P. Waskiewicz Jr [EMAIL PROTECTED]
---

 include/linux/pkt_sched.h |1 +
 tc/q_prio.c   |9 ++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..bab0b9e 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -99,6 +99,7 @@ struct tc_prio_qopt
 {
 	int bands;			/* Number of bands */
 	__u8 priomap[TC_PRIO_MAX+1];	/* Map: logical priority -> PRIO band */
+	unsigned short multiqueue;	/* 0 for no mq, 1 for mq */
 };
 
 /* TBF section */

diff --git a/tc/q_prio.c b/tc/q_prio.c
index d696e1b..55cb207 100644
--- a/tc/q_prio.c
+++ b/tc/q_prio.c
@@ -29,7 +29,7 @@
 
 static void explain(void)
 {
-	fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...\n");
+	fprintf(stderr, "Usage: ... prio [multiqueue] bands NUMBER priomap P1 P2...\n");
 }
 
 #define usage() return(-1)
@@ -39,7 +39,7 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
 	int ok=0;
 	int pmap_mode = 0;
 	int idx = 0;
-	struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }};
+	struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 },0};
 
 	while (argc > 0) {
 		if (strcmp(*argv, "bands") == 0) {
@@ -57,7 +57,9 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
 				return -1;
 			}
 			pmap_mode = 1;
-		} else if (strcmp(*argv, "help") == 0) {
+		} else if (strcmp(*argv, "multiqueue") == 0)
+			opt.multiqueue = 1;
+		else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
 		} else {
@@ -105,6 +107,7 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	if (RTA_PAYLOAD(opt)  < sizeof(*qopt))
 		return -1;
 	qopt = RTA_DATA(opt);
+	fprintf(f, "multiqueue %s  ", qopt->multiqueue ? "on" : "off");
 	fprintf(f, "bands %u priomap ", qopt->bands);
 	for (i=0; i<=TC_PRIO_MAX; i++)
 		fprintf(f, " %d", qopt->priomap[i]);

  

Only if this is binary compatible with older kernels.
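
The compatibility concern can be illustrated with a small userspace sketch.
Everything here is an assumption for illustration: old_prio_qopt and
new_prio_qopt are hypothetical copies of struct tc_prio_qopt before and
after the patch, PRIO_MAX is hard-coded to 15 to match the 16-entry
initializer above, and reader_accepts() only mirrors the shape of the
RTA_PAYLOAD(opt) < sizeof(*qopt) check quoted in prio_print_opt(). It is
not the kernel's or tc's actual code path.

```c
#include <stddef.h>
#include <stdint.h>

#define PRIO_MAX 15 /* assumed to match TC_PRIO_MAX */

/* Hypothetical copy of the layout before the patch. */
struct old_prio_qopt {
	int bands;
	uint8_t priomap[PRIO_MAX + 1];
};

/* Hypothetical copy of the layout after the patch. */
struct new_prio_qopt {
	int bands;
	uint8_t priomap[PRIO_MAX + 1];
	unsigned short multiqueue;
};

/*
 * Same shape as the length check in the quoted prio_print_opt():
 * a payload shorter than the reader's own struct is rejected.
 */
int reader_accepts(size_t payload_len, size_t reader_struct_size)
{
	return payload_len >= reader_struct_size;
}

/* Payload produced with the old layout, parsed by a patched reader. */
int new_reader_accepts_old_payload(void)
{
	return reader_accepts(sizeof(struct old_prio_qopt),
			      sizeof(struct new_prio_qopt));
}

/* Payload produced with the new layout, parsed by an unpatched reader. */
int old_reader_accepts_new_payload(void)
{
	return reader_accepts(sizeof(struct new_prio_qopt),
			      sizeof(struct old_prio_qopt));
}
```

Under these assumptions the patched reader rejects a payload written with
the old layout, because the added field makes its struct larger. That is
the kind of mismatch a binary-compatibility check would need to rule out.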


Re: [PATCH 2/7] Containers (V8): Cpusets hooked into containers

2007-04-24 Thread Paul Menage

On 4/23/07, Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:
> >   config CONTAINERS
> > - bool "Container support"
> > - help
> > -   This option will let you create and manage process containers,
> > -   which can be used to aggregate multiple processes, e.g. for
> > -   the purposes of resource tracking.
> > -
> > -   Say N if unsure
> > + bool
>
> Hi Paul,
>
> This looks like some patch generation error.  Description for
> containers should not be removed after applying this patch.


No, this is intentional - in the first patch in the series,
CONFIG_CONTAINER was a user-selectable option so it had a description;
in the second it becomes an option that's only selected if other
selected systems (e.g. cpusets) depend on it. So it no longer needs
help text.

Cheers,

Paul


Re: [ckrm-tech] [PATCH 0/7] Containers (V8): Generic Process Containers

2007-04-24 Thread Paul Menage

On 4/23/07, Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:
> Hi Paul,
>
> In [patch 3/7] Containers (V8): Add generic multi-subsystem API to
> containers, you have forcefully enabled interrupt in
> container_init_subsys() with spin_unlock_irq() which breaks on PPC64.
>
> > +static void container_init_subsys(struct container_subsys *ss) {
> > +   int retval;
> > +   struct list_head *l;
> > +   printk(KERN_ERR "Initializing container subsys %s\n", ss->name);
> > +
> > +   /* Create the top container state for this subsystem */
> > +   ss->root = &rootnode;
> > +   retval = ss->create(ss, dummytop);
> > +   BUG_ON(retval);
> > +   init_container_css(ss, dummytop);
> > +
> > +   /* Update all container groups to contain a subsys
> > +    * pointer to this state - since the subsystem is
> > +    * newly registered, all tasks and hence all container
> > +    * groups are in the subsystem's top container. */
> > +   spin_lock_irq(&container_group_lock);
> > +   l = &init_container_group.list;
> > +   do {
> > +       struct container_group *cg =
> > +           list_entry(l, struct container_group, list);
> > +       cg->subsys[ss->subsys_id] = dummytop->subsys[ss->subsys_id];
> > +       l = l->next;
> > +   } while (l != &init_container_group.list);
> > +   spin_unlock_irq(&container_group_lock);
>
> Interrupt gets enabled here and on PPC64, the kernel takes a pending
> decrementer and crashes because it is too early to handle them.
>
> Use of irqsave and restore routines would fix the problem.


OK, thanks. I'll add that change.

Paul
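
The difference between the two unlock variants can be modelled in plain
userspace C. This is only a toy sketch: irqs_enabled stands in for the
CPU's interrupt-enable state, and every mock_* function is a hypothetical
stand-in for the corresponding spin_lock_irq(), spin_unlock_irq(),
spin_lock_irqsave() and spin_unlock_irqrestore() calls. The real kernel
primitives take a lock argument and an unsigned long flags word.

```c
#include <stdbool.h>

/* Toy stand-in for the CPU's interrupt-enable flag; not kernel code. */
static bool irqs_enabled;

/* Models spin_lock_irq(): disable interrupts unconditionally. */
void mock_lock_irq(void) { irqs_enabled = false; }

/* Models spin_unlock_irq(): enable interrupts unconditionally. */
void mock_unlock_irq(void) { irqs_enabled = true; }

/* Models spin_lock_irqsave(): remember the previous state, then disable. */
void mock_lock_irqsave(bool *flags)
{
	*flags = irqs_enabled;
	irqs_enabled = false;
}

/* Models spin_unlock_irqrestore(): put back the remembered state. */
void mock_unlock_irqrestore(bool flags) { irqs_enabled = flags; }

/* Early-boot caller: interrupts are off on entry and should stay off. */
bool early_init_with_irq(void)
{
	irqs_enabled = false;
	mock_lock_irq();
	mock_unlock_irq();
	return irqs_enabled;	/* interrupts were switched on by the unlock */
}

/* Same caller using the save/restore pair. */
bool early_init_with_irqsave(void)
{
	bool flags;

	irqs_enabled = false;
	mock_lock_irqsave(&flags);
	mock_unlock_irqrestore(flags);
	return irqs_enabled;	/* entry state is preserved */
}
```

In this model the plain unlock leaves interrupts enabled even though the
caller entered with them disabled, which matches the early-boot failure
described above, while the save/restore pair returns to the entry state.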


SOME STUFF ABOUT REISER4 To Mr Hopper

2007-04-24 Thread lkml777
Seems I did not answer the correct thread.

On Sun, 22 Apr 2007 19:00:46 -0700, "Eric Hopper"
<[EMAIL PROTECTED]> said:

> I did.  That whole thread is some guy spouting off a ludicrous Bonnie++
> benchmark showing that compressing long strings of 0s results in things
> taking up very little space and being very fast.

I think you are deliberately being stupid here.

You are claiming that REISER4's good speed results when using
compression actually has a simple explanation and THEREFORE all good
results for the filesystem, even those results that have nothing to do
with compression, are negated.

NOTHING COULD BE FURTHER FROM THE TRUTH.

Your conclusion is a total travesty of logic.

As I understand it, the default Reiser4 DOES NOT USE any compression at
all, not even tail compression, but saves space by eliminating block
alignment wastage (tail compression is an option).

So let's LOSE the statistics that involve compression. The results now
look like this:

.-------------------------.
| FILESYSTEM | TIME |DISK |
| TYPE       |(secs)|USAGE|
.-------------------------.
|REISER4     | 3462 | 692 |
|EXT2        | 4092 | 816 |
|JFS         | 4225 | 806 |
|EXT4        | 4408 | 816 |
|EXT3        | 4421 | 816 |
|XFS         | 4625 | 779 |
|REISER3     | 6178 | 793 |
|FAT32       |12342 | 988 |
|NTFS-3g     |10414 | 772 |
.-------------------------.


These results are still EXTREMELY GOOD for REISER4.

These results still say that Reiser4 is a truly remarkable filesystem,
as stated in:

http://linuxhelp.150m.com/resources/fs-benchmarks.htm
http://m.domaindlx.com/LinuxHelp/fs-benchmarks.htm

So why do I see an anti-Reiser religion in all that you people say?

You concentrate on the fact that bonnie++'s use of files that are
mainly zeroes, will make the results using compression less good than
they are.

I can't see anywhere where this has been denied.

In fact the other set of statistics that you just ignore, states that in
more realistic situations, the compression speedup is slightly negative.

What is wrong here, is:

You say that the Bonnie++ tests using compression are subject to
interpretation. No argument here.
You ignore the tests that confirm your statement. You are clearly not
interested in the actual results or their interpretation.
You, by some incredibly twisted logic then state that Reiser4 is
therefore not good, even though it is clearly the best filesystem when
NOT using compression.

This of course is completely deceitful logic.

That the speed advantage from compression would be small is clear from
the OTHER data that you ignore, namely:

.-------------------------------------------------.
|File         |Disk |Copy |Copy |Tar  |Unzip| Del |
|System       |Usage|655MB|655MB|Gzip |UnTar| 2.5 |
|Type         | (MB)| (1) | (2) |655MB|655MB| Gig |
.-------------------------------------------------.
|REISER4 lzo  | 278 | 138 |  56 |  80 |  34 |  84 |
|REISER4 gzip | 213 | 148 |  68 |  83 |  48 |  70 |
|REISER4      | 692 | 148 |  55 |  67 |  25 |  56 |
|EXT4         | 816 | 174 |  70 |  74 |  42 |  50 |
.-------------------------------------------------.


>
> > On Sun, 22 Apr 2007 19:00:46 -0700, "Eric Hopper"
> > <[EMAIL PROTECTED]> said:
> >
> > > I know that this whole effort has been put in disarray by the
> > > prosecution of Hans Reiser, but I'm curious as to its status. Is
> > > Reiser4 going to be going into the Linus kernel anytime soon? Is there
> > > somewhere I should be looking to find this out without wasting bandwidth
> > > here?
> >
> > There was a thread the other day, that talked about Reiser4.
> >
> > It took a while but I have found it (actually two)
> >
> > http://lkml.org/lkml/2007/4/5/360
> > http://lkml.org/lkml/2007/4/9/4
> >
> > You may want to check them out.
>
> I did.  That whole thread is some guy spouting off a ludicrous Bonnie++
> benchmark showing that compressing long strings of 0s results in things
> taking up very little space and being very fast.
>
> Such things will produce lots of flames and no useful information
> whatsoever as is evinced by the half conspiracy theory, half truth the
> thread degenerated into in the second message you linked to.
>
-- 
  
  [EMAIL PROTECTED]



Re: [PATCH] use mutex instead of semaphore in RocketPort driver

2007-04-24 Thread Satyam Sharma

Hi Matthias,

On 4/25/07, Robert Hancock <[EMAIL PROTECTED]> wrote:
> Matthias Kaehlcke wrote:
> > On Tue, Apr 24, 2007 at 07:53:04PM +0200, Oliver Neukum said:
> >
> >> On Tuesday, 24 April 2007 19:49, Matthias Kaehlcke wrote:
> >>> @@ -1706,7 +1706,7 @@ static int rp_write(struct tty_struct *tty,
> >>>  if (count <= 0 || rocket_paranoia_check(info, "rp_write"))
> >>>  return 0;
> >>>
> >>> -   down_interruptible(&info->write_sem);
> >>> +   mutex_lock_interruptible(&info->write_mtx);
> >> This is a bug. It is also present in the current code, but nevertheless
> >> it is a bug. If you use an interruptible lock, you must be ready to deal
> >> with interrupts, which are ignored by this code.
> > [...]
> > i'm a bit confused now about the interruptible locks, i thought using
> > them means that the process will be waked up when receiving a
> > signal. what role are playing interrupts when using interruptible locks?
>
> You are correct, interrupts aren't involved. However if the wait is
> interrupted by a signal, mutex_lock_interruptible will return a nonzero
> return code which needs to be checked for (and likely -ERESTARTSYS or
> -EINTR returned), otherwise the code will blindly continue as though it
> has locked the mutex even though it has not.


Think I'll elaborate Robert's explanation for your benefit :-) Unlike
mutex_lock() and down() that put the task to TASK_UNINTERRUPTIBLE
sleep if the lock can't be acquired immediately,
mutex_lock_interruptible() and down_interruptible() sleep in
TASK_INTERRUPTIBLE state. So the task _can_ be woken up (without even
acquiring the lock) by incoming signals. When that happens, we can't
just blindly go on ... so the return values of the _interruptible()
versions of the locking functions *must* be checked for success and if
not, the task should return with error.

Use -ERESTARTSYS if an intermediate caller further up the call chain
checks this return value and restarts the whole operation. If no such
caller exists (and/or introducing one would involve a change in kernel
behaviour as seen from userspace), you can safely use -EINTR. The goal
is that userspace must never get to see -ERESTARTSYS.
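
The check described above can be sketched in plain userspace C. The names
are assumptions for illustration only: mock_lock_interruptible() is a
hypothetical stand-in for mutex_lock_interruptible() that fails when a
signal is pending, and checked_write() is an invented caller, not the
RocketPort rp_write(). The sketch returns -EINTR because -ERESTARTSYS is
a kernel-internal value that userspace headers do not define.

```c
#include <errno.h>

/* Toy lock state; in the kernel this would be a struct mutex. */
static int lock_held;

/*
 * Hypothetical stand-in for mutex_lock_interruptible(): returns 0 when
 * the lock was taken, or -EINTR when a signal ended the wait first.
 */
int mock_lock_interruptible(int signal_pending)
{
	if (signal_pending)
		return -EINTR;
	lock_held = 1;
	return 0;
}

void mock_unlock(void) { lock_held = 0; }

/* Write path that checks the lock result before doing protected work. */
int checked_write(int signal_pending, int count)
{
	int ret = mock_lock_interruptible(signal_pending);

	if (ret)
		return ret;	/* lock not held: bail out, write nothing */

	/* ... work protected by the lock would go here ... */

	mock_unlock();
	return count;
}
```

The point of the pattern is that the protected section is only reached
when the lock call returned 0. Any other value is passed straight back to
the caller.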


2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- INFO: possible recursive locking detected

2007-04-24 Thread Miles Lane

[   59.677312] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[   59.688633] NFSD: starting 90-second grace period
[   60.221454]
[   60.221456] =============================================
[   60.221461] [ INFO: possible recursive locking detected ]
[   60.221464] 2.6.21-rc7-mm1 #53
[   60.221466] ---------------------------------------------
[   60.221469] S20powernowd/3584 is trying to acquire lock:
[   60.221472]  (sd->s_active){}, at: [<c01a2436>] sysfs_hash_and_remove+0x91/0x10e
[   60.221486]
[   60.221487] but task is already holding lock:
[   60.221489]  (sd->s_active){}, at: [<c01a2a20>] sysfs_write_file+0xb9/0x14a
[   60.221496]
[   60.221497] other info that might help us debug this:
[   60.221499] 4 locks held by S20powernowd/3584:
[   60.221501]  #0:  (sd->s_active){}, at: [<c01a2a20>] sysfs_write_file+0xb9/0x14a
[   60.221508]  #1:  (sd->s_active){}, at: [<c01a2a32>] sysfs_write_file+0xcb/0x14a
[   60.221515]  #2:  (per_cpu(cpu_policy_rwsem, cpu)){--..}, at: [<c024081b>] lock_policy_rwsem_write+0x20/0x37
[   60.221524]  #3:  (userspace_mutex){--..}, at: [<c0299dfe>] mutex_lock+0x1f/0x23
[   60.221534]
[   60.221535] stack backtrace:
[   60.221538]  [<c0104e0f>] show_trace_log_lvl+0x1a/0x30
[   60.221543]  [<c0105a26>] show_trace+0x12/0x14
[   60.221547]  [<c0105ab3>] dump_stack+0x16/0x18
[   60.221551]  [<c0134d63>] __lock_acquire+0x12e/0xb4c
[   60.221557]  [<c01357e9>] lock_acquire+0x68/0x82
[   60.221561]  [<c012ddda>] down_write+0x3a/0x53
[   60.221567]  [<c01a2436>] sysfs_hash_and_remove+0x91/0x10e
[   60.221571]  [<c01a2bb0>] sysfs_remove_file+0x10/0x12
[   60.221575]  [<c0241756>] cpufreq_governor_userspace+0x10c/0x1dc
[   60.221579]  [<c023fd2b>] __cpufreq_governor+0x9c/0xd0
[   60.221583]  [<c023fed0>] __cpufreq_set_policy+0x171/0x209
[   60.221587]  [<c02400b5>] store_scaling_governor+0x14d/0x184
[   60.221591]  [<c0240bee>] store+0x3e/0x60
[   60.221594]  [<c01a2a85>] sysfs_write_file+0x11e/0x14a
[   60.221599]  [<c01699fb>] vfs_write+0x90/0x119
[   60.221605]  [<c0169eef>] sys_write+0x3d/0x61
[   60.221609]  [<c0103e66>] sysenter_past_esp+0x5f/0x99
[   60.221613]  =======================
[   60.763809] Clocksource tsc unstable (delta = -75646443 ns)


Re: 2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- INFO: possible recursive locking detected

2007-04-24 Thread Tejun Heo
Miles Lane wrote:
> [   59.677312] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> [   59.688633] NFSD: starting 90-second grace period
> [   60.221454]
> [   60.221456] =============================================
> [   60.221461] [ INFO: possible recursive locking detected ]
> [   60.221464] 2.6.21-rc7-mm1 #53
> [   60.221466] ---------------------------------------------
> [   60.221469] S20powernowd/3584 is trying to acquire lock:
> [   60.221472]  (sd->s_active){}, at: [<c01a2436>] sysfs_hash_and_remove+0x91/0x10e
> [   60.221486]
> [   60.221487] but task is already holding lock:
> [   60.221489]  (sd->s_active){}, at: [<c01a2a20>] sysfs_write_file+0xb9/0x14a
> [   60.221496]
> [   60.221497] other info that might help us debug this:
> [   60.221499] 4 locks held by S20powernowd/3584:
> [   60.221501]  #0:  (sd->s_active){}, at: [<c01a2a20>] sysfs_write_file+0xb9/0x14a
> [   60.221508]  #1:  (sd->s_active){}, at: [<c01a2a32>] sysfs_write_file+0xcb/0x14a
> [   60.221515]  #2:  (per_cpu(cpu_policy_rwsem, cpu)){--..}, at: [<c024081b>] lock_policy_rwsem_write+0x20/0x37
> [   60.221524]  #3:  (userspace_mutex){--..}, at: [<c0299dfe>] mutex_lock+0x1f/0x23

Thanks for reporting.  We need to separate s_active users into two
classes - one for r/w the other for deleting for nodes which delete
other nodes when written to.  Will post a patch soon.

-- 
tejun


2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- software suspend failed (1 tasks refusing to freeze)

2007-04-24 Thread Miles Lane

[ 1251.506964] PM: Preparing system for mem sleep
[ 1251.514790] Stopping tasks ...
[ 1271.456065] Stopping user space processes timed out after 20 seconds (1 tasks refusing to freeze):
[ 1271.456243]  multiload-apple
[ 1271.456291] Restarting tasks ... done.

This isn't happening under earlier builds I've tested.  How can I debug this?

Thanks,
 Miles


[PATCH] drivers/net: move the nvidia forcedeth driver from 100M group to 1000M group

2007-04-24 Thread Peer Chen
nForce Ethernet is a Gigabit NIC, not a 100M one; move it to the 1000M
group to avoid confusion.

Signed-off-by: Peer Chen [EMAIL PROTECTED]



--- linux-2.6.21-rc7/drivers/net/Kconfig.orig
+++ linux-2.6.21-rc7/drivers/net/Kconfig
@@ -1399,35 +1399,6 @@ config B44
 	  <file:Documentation/networking/net-modules.txt>.  The module will be
 	  called b44.
 
-config FORCEDETH
-	tristate "nForce Ethernet support"
-	depends on NET_PCI && PCI
-	help
-	  If you have a network (Ethernet) controller of this type, say Y and
-	  read the Ethernet-HOWTO, available from
-	  <http://www.tldp.org/docs.html#howto>.
-
-	  To compile this driver as a module, choose M here and read
-	  <file:Documentation/networking/net-modules.txt>.  The module will be
-	  called forcedeth.
-
-config FORCEDETH_NAPI
-	bool "Use Rx and Tx Polling (NAPI) (EXPERIMENTAL)"
-	depends on FORCEDETH && EXPERIMENTAL
-	help
-	  NAPI is a new driver API designed to reduce CPU and interrupt load
-	  when the driver is receiving lots of packets from the card. It is
-	  still somewhat experimental and thus not yet enabled by default.
-
-	  If your estimated Rx load is 10kpps or more, or if the card will be
-	  deployed on potentially unfriendly networks (e.g. in a firewall),
-	  then say Y here.
-
-	  See <file:Documentation/networking/NAPI_HOWTO.txt> for more
-	  information.
-
-	  If in doubt, say N.
-
 config CS89x0
 	tristate "CS89x0 support"
 	depends on NET_PCI && (ISA || MACH_IXDP2351 || ARCH_IXDP2X01 || ARCH_PNX010X)
@@ -1999,6 +1970,35 @@ config MYRI_SBUS
 	  To compile this driver as a module, choose M here: the module
 	  will be called myri_sbus.  This is recommended.
 
+config FORCEDETH
+	tristate "nForce Ethernet support"
+	depends on NET_PCI && PCI
+	help
+	  If you have a network (Ethernet) controller of this type, say Y and
+	  read the Ethernet-HOWTO, available from
+	  <http://www.tldp.org/docs.html#howto>.
+
+	  To compile this driver as a module, choose M here and read
+	  <file:Documentation/networking/net-modules.txt>.  The module will be
+	  called forcedeth.
+
+config FORCEDETH_NAPI
+	bool "Use Rx and Tx Polling (NAPI) (EXPERIMENTAL)"
+	depends on FORCEDETH && EXPERIMENTAL
+	help
+	  NAPI is a new driver API designed to reduce CPU and interrupt load
+	  when the driver is receiving lots of packets from the card. It is
+	  still somewhat experimental and thus not yet enabled by default.
+
+	  If your estimated Rx load is 10kpps or more, or if the card will be
+	  deployed on potentially unfriendly networks (e.g. in a firewall),
+	  then say Y here.
+
+	  See <file:Documentation/networking/NAPI_HOWTO.txt> for more
+	  information.
+
+	  If in doubt, say N.
+
 config NS83820
 	tristate "National Semiconductor DP83820 support"
 	depends on PCI


Re: 2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- software suspend failed (1 tasks refusing to freeze)

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 22:27:44 -0700 Miles Lane <[EMAIL PROTECTED]> wrote:

> [ 1251.506964] PM: Preparing system for mem sleep
> [ 1251.514790] Stopping tasks ...
> [ 1271.456065] Stopping user space processes timed out after 20 seconds (1 tasks refusing to freeze):
> [ 1271.456243]  multiload-apple
> [ 1271.456291] Restarting tasks ... done.
>
> This isn't happening under earlier builds I've tested.  How can I debug this?
>

hm, that's multiload-applet, some gnome thing.

sysrq-T, perhaps?  Perhaps the process is sleeping in the kernel somewhere.



Re: 2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- software suspend failed (1 tasks refusing to freeze)

2007-04-24 Thread Miles Lane

On 4/24/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Tue, 24 Apr 2007 22:27:44 -0700 Miles Lane <[EMAIL PROTECTED]> wrote:
>
> > [ 1251.506964] PM: Preparing system for mem sleep
> > [ 1251.514790] Stopping tasks ...
> > [ 1271.456065] Stopping user space processes timed out after 20 seconds (1 tasks refusing to freeze):
> > [ 1271.456243]  multiload-apple
> > [ 1271.456291] Restarting tasks ... done.
> >
> > This isn't happening under earlier builds I've tested.  How can I debug this?
>
> hm, that's multiload-applet, some gnome thing.
>
> sysrq-T, perhaps?  Perhaps the process is sleeping in the kernel somewhere.


Should I wait for the next patch from Tejun before retesting?  Perhaps
this suspend problem is a side effect of the locking problem he
mentioned.

Miles


Re: 2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- software suspend failed (1 tasks refusing to freeze)

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 22:49:48 -0700 Miles Lane <[EMAIL PROTECTED]> wrote:

> On 4/24/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > On Tue, 24 Apr 2007 22:27:44 -0700 Miles Lane <[EMAIL PROTECTED]> wrote:
> >
> > > [ 1251.506964] PM: Preparing system for mem sleep
> > > [ 1251.514790] Stopping tasks ...
> > > [ 1271.456065] Stopping user space processes timed out after 20 seconds (1 tasks refusing to freeze):
> > > [ 1271.456243]  multiload-apple
> > > [ 1271.456291] Restarting tasks ... done.
> > >
> > > This isn't happening under earlier builds I've tested.  How can I debug this?
> >
> > hm, that's multiload-applet, some gnome thing.
> >
> > sysrq-T, perhaps?  Perhaps the process is sleeping in the kernel somewhere.
>
> Should I wait for the next patch from Tejun before retesting?  Perhaps
> this suspend problem is a side effect of the locking problem he
> mentioned.

It's unlikely to be related to Tejun's sysfs changes.


Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-24 Thread Benjamin Herrenschmidt
> Since we need to have some way to track them having an explicit data
> structure that the callers manage seems to make sense.

Oh sure, I wasn't arguing against that at all...

It might be handy to have a release() callback (optional) that gets
called after the kthread stops/exits, once we know the data structure
isn't going to be used anymore (if practical to implement, depends on
your approach).

Ben.




Re: [patch] oom: kill all threads that share mm with killed task

2007-04-24 Thread David Rientjes
On Mon, 23 Apr 2007, Christoph Lameter wrote:

> Obvious fix. It was broken by
>  
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584
> Dec 7. So its in 2.6.20 and later. Candiate for stable?
> 

I agree it's obvious enough that it should be included in stable.  
Otherwise the entire iteration becomes a big no-op and it won't alleviate 
the OOM condition in one call to out_of_memory() because there may be 
outstanding tasks with the shared ->mm.

David


Re: [PATCH] Transparently handle <.symbol> lookup for kprobes

2007-04-24 Thread Paul Mackerras
Srinivasa Ds writes:

> + } else {\
> + char dot_name[KSYM_NAME_LEN+1]; \
> + dot_name[0] = '.';  \
> + dot_name[1] = '\0'; \
> + strncat(dot_name, name, KSYM_NAME_LEN); \

Assuming the kernel strncat works like the userspace one does, there
is a possibility that dot_name[] won't be properly null-terminated
here.  If strlen(name) >= KSYM_NAME_LEN-1, then strncat will set
dot_name[KSYM_NAME_LEN-1] to something non-null and won't touch
dot_name[KSYM_NAME_LEN].

Paul.
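
One way to sidestep the question is to build the dotted name with a call
that is bounded by the destination size. For reference, in standard C
strncat() always writes a terminating null after the bytes it appends, so
the destination needs room for strlen(dest) + n + 1 bytes; with the "."
already in place and n equal to KSYM_NAME_LEN, a long enough name would
run one byte past a KSYM_NAME_LEN+1 buffer. The sketch below is a
userspace illustration only: make_dot_name() is a hypothetical helper and
NAME_LEN is an assumed stand-in for KSYM_NAME_LEN, not the patch's code.

```c
#include <stdio.h>
#include <string.h>

#define NAME_LEN 127 /* assumed stand-in for KSYM_NAME_LEN */

/*
 * Write "." followed by name into dst, which holds dst_size bytes.
 * snprintf() writes at most dst_size bytes including the terminator,
 * so an over-long name is truncated instead of overrunning dst.
 * Returns 1 if the whole name fit, 0 if it was truncated.
 */
int make_dot_name(char *dst, size_t dst_size, const char *name)
{
	int n = snprintf(dst, dst_size, ".%s", name);

	return n >= 0 && (size_t)n < dst_size;
}
```

A caller would pass a buffer of NAME_LEN + 1 bytes and treat a return
value of 0 as "name too long" rather than looking up a truncated symbol.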


Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-24 Thread Paul Mackerras
Christoph Hellwig writes:

> The first question is obviously, is this really something we want?
> spawning kernel thread on demand without reaping them properly seems
> quite dangerous.

What specifically has to be done to reap a kernel thread?  Are you
concerned about the number of threads, or about having zombies hanging
around?

Paul.


SOME STUFF ABOUT REISER4

2007-04-24 Thread lkml777
On Sun, 22 Apr 2007 19:00:46 -0700, "Eric Hopper"
<[EMAIL PROTECTED]> said:

> I know that this whole effort has been put in disarray by the
> prosecution of Hans Reiser, but I'm curious as to its status. Is
> Reiser4 going to be going into the Linus kernel anytime soon? Is there
> somewhere I should be looking to find this out without wasting bandwidth
> here?

There was a thread the other day, that talked about Reiser4.

It took a while but I have found it (actually two)

http://lkml.org/lkml/2007/4/5/360
http://lkml.org/lkml/2007/4/9/4

You may want to check them out.
-- 
  
  [EMAIL PROTECTED]



Re: [PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes

2007-04-24 Thread Jeremy Fitzhardinge
Roland McGrath wrote:
>> I have to admit I still don't really understand all this.  Is it
>> documented somewhere?
>> 
>
> I have explained it in public more than once, but I don't know off hand
> anywhere that was helpfully recorded.
>   

Thanks very much.  I'd been poking about, but the closest I came to an
actual description was various patches fixing bugs, so it was a little
incomplete.

> For example, a Xen-enabled kernel can use a single vDSO image (or a single
> pair of int80/sysenter images), containing the "nosegneg" hwcap note.  When
> there is no need for it (native or hvm or 64-bit hv or whatever), it just
> clears the mask word.  If you actually do this, you'll want to modify the
> NOTE_KERNELCAP_BEGIN macro to define a global label you can use with VDSO_SYM.
>   

Thanks for the pointer.  I'd been getting a bit of heat for enabling the
nosegneg flag unconditionally.  If I can make it Xen-specific then that
will be one less source of complaints.

J


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Peter Williams

Arjan van de Ven wrote:
>> Within reason, it's not the number of clients that X has that causes its
>> CPU bandwidth use to sky rocket and cause problems.  It's more to do
>> with what type of clients they are.  Most GUIs (even ones that are
>> constantly updating visual data (e.g. gkrellm -- I can open quite a
>> large number of these without increasing X's CPU usage very much)) cause
>> very little load on the X server.  The exceptions to this are the
>
> there is actually 2 and not just 1 "X server", and they are VERY VERY
> different in behavior.
>
> Case 1: Accelerated driver
>
> If X talks to a decent enough card it supports well with acceleration,
> it will be very rare for X itself to spend any kind of significant
> amount of CPU time, all the really heavy stuff is done in hardware, and
> asynchronously at that. A bit of batching will greatly improve system
> performance in this case.
>
> Case 2: Unaccelerated VESA
>
> Some drivers in X, especially the VESA and NV drivers (which are quite
> common, vesa is used on all hardware without a special driver nowadays),
> have no or not enough acceleration to matter for modern desktops. This
> means the CPU is doing all the heavy lifting, in the X program. In this
> case even a simple "move the window a bit" becomes quite a bit of a CPU
> hog already.

Mine's a:

SiS 661/741/760 PCI/AGP or 662/761Gx PCIE VGA Display adapter according
to X's display settings tool.  Which category does that fall into?

It's not a special adapter and is just the one that came with the
motherboard. It doesn't use much CPU unless I grab a window and wiggle
it all over the screen or do something like "ls -lR /" in an xterm.

> The cases are fundamentally different in behavior, because in the first
> case, X hardly consumes the time it would get in any scheme, while in
> the second case X really is CPU bound and will happily consume any CPU
> time it can get.

Which still doesn't justify an elaborate "points" sharing scheme.
Whichever way you look at it, that's just another way of giving X more
CPU bandwidth and there are simpler ways to give X more CPU if it needs
it.  However, I think there's something seriously wrong if it needs the
-19 nice that I've heard mentioned.  You might as well just run it as a
real time process.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


NonExecutable Bit in 32Bit

2007-04-24 Thread Cestonaro, Thilo (external)
Hey,

is it right that the NX bit is not used on the i386 arch but is used on the x86_64 arch?
If so, is there a particular reason for it not being used?

Ciao Thilo


Re: [PATCH 1/2] x86_64: Reflect the relocatability of the kernel in the ELF header.

2007-04-24 Thread Vivek Goyal
On Sun, Apr 22, 2007 at 11:12:13PM -0600, Eric W. Biederman wrote:
> 
> Currently because vmlinux does not reflect that the kernel is relocatable
> we still have to support CONFIG_PHYSICAL_START.  So this patch adds a small
> c program to do what we cannot do with a linker script, set the elf header
> type to ET_DYN.
> 
> This should remove the last obstacle to removing CONFIG_PHYSICAL_START
> on x86_64.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>

[Dropping fastboot mailing list from CC as kexec mailing list is new list
 for this discussion]

[..]
> +void file_open(const char *name)
> +{
> + if ((fd = open(name, O_RDWR, 0)) < 0)
> + die("Unable to open `%s': %m", name);
> +}
> +
> +static void mketrel(void)
> +{
> + unsigned char e_type[2];
> + if (read(fd, &e_ident, sizeof(e_ident)) != sizeof(e_ident))
> + die("Cannot read ELF header: %s\n", strerror(errno));
> +
> + if (memcmp(e_ident, ELFMAG, 4) != 0)
> + die("No ELF magic\n");
> +
> + if ((e_ident[EI_CLASS] != ELFCLASS64) &&
> + (e_ident[EI_CLASS] != ELFCLASS32))
> + die("Unrecognized ELF class: %x\n", e_ident[EI_CLASS]);
> + 
> + if ((e_ident[EI_DATA] != ELFDATA2LSB) &&
> + (e_ident[EI_DATA] != ELFDATA2MSB))
> + die("Unrecognized ELF data encoding: %x\n", e_ident[EI_DATA]);
> +
> + if (e_ident[EI_VERSION] != EV_CURRENT)
> + die("Unknown ELF version: %d\n", e_ident[EI_VERSION]);
> +
> + if (e_ident[EI_DATA] == ELFDATA2LSB) {
> + e_type[0] = ET_REL & 0xff;
> + e_type[1] = ET_REL >> 8;
> + } else {
> + e_type[1] = ET_REL & 0xff;
> + e_type[0] = ET_REL >> 8;
> + }

Hi Eric,

Should this be ET_REL or ET_DYN? kexec refuses to load this vmlinux
as it does not find it to be executable type.

I am not well versed with various conventions but if I go through "Executable
and Linking Format" document, this is what it says about various file types.

• A relocatable file holds code and data suitable for linking with other
  object files to create an executable or a shared object file.

• An executable file holds a program suitable for execution.

• A shared object file holds code and data suitable for linking in two
  contexts. First, the link editor may process it with other relocatable and
  shared object files to create another object file. Second, the dynamic
  linker combines it with an executable file and other shared objects
  to create a process image.

So above does not seem to fit in the ET_REL type. We can't relink this
vmlinux? And it does not seem to fit in ET_DYN definition too. We are
not relinking this vmlinux with another executable or other relocatable
files.

I remember once you mentioned the term dynamic executable which can be
loaded at a non-compiled address and let run without requiring any
relocation processing. This vmlinux will fall in that category but can't 
relate it to standard elf file definitions.
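
To make the question concrete, here is a minimal userspace sketch of how a
loader could read the type field being discussed. It assumes only what the
ELF format defines: e_type is the 16-bit field that directly follows the
16-byte e_ident array in both the 32-bit and 64-bit headers, stored in the
byte order named by e_ident[EI_DATA]. The helper names elf_read_type() and
elf_type_name() are made up for illustration and are not part of the patch
or of kexec.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode e_type from the first bytes of an ELF image.
 * Returns 0 on success, -1 if the buffer is too short, has no ELF
 * magic, or names an unknown data encoding.
 */
static int elf_read_type(const unsigned char *hdr, size_t len, uint16_t *type)
{
	assert(hdr != NULL && type != NULL);

	if (len < EI_NIDENT + 2 || memcmp(hdr, ELFMAG, SELFMAG) != 0)
		return -1;

	if (hdr[EI_DATA] == ELFDATA2LSB)
		*type = (uint16_t)(hdr[EI_NIDENT] | (hdr[EI_NIDENT + 1] << 8));
	else if (hdr[EI_DATA] == ELFDATA2MSB)
		*type = (uint16_t)((hdr[EI_NIDENT] << 8) | hdr[EI_NIDENT + 1]);
	else
		return -1;

	return 0;
}

/* Map the three file types quoted above to readable names. */
static const char *elf_type_name(uint16_t type)
{
	switch (type) {
	case ET_REL:
		return "ET_REL (relocatable file)";
	case ET_EXEC:
		return "ET_EXEC (executable file)";
	case ET_DYN:
		return "ET_DYN (shared object file)";
	default:
		return "other";
	}
}
```

A loader that only accepts ET_EXEC would reject both other values, which
would be consistent with the refusal described above.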

Thanks
Vivek


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Peter Williams <[EMAIL PROTECTED]> wrote:

> > The cases are fundamentally different in behavior, because in the 
> > first case, X hardly consumes the time it would get in any scheme, 
> > while in the second case X really is CPU bound and will happily 
> > consume any CPU time it can get.
> 
> Which still doesn't justify an elaborate "points" sharing scheme. 
> Whichever way you look at that that's just another way of giving X 
> more CPU bandwidth and there are simpler ways to give X more CPU if it 
> needs it.  However, I think there's something seriously wrong if it 
> needs the -19 nice that I've heard mentioned.

Gene has done some testing under CFS with X reniced to +10 and the 
desktop still worked smoothly for him. So CFS does not 'need' a reniced 
X. There are simply advantages to negative nice levels: for example 
screen refreshes are smoother on any scheduler i tried. BUT, there is a 
caveat: on non-CFS schedulers i tried X is much more prone to get into 
'overscheduling' scenarios that visibly hurt X's performance, while on 
CFS there's a max of 1000-1500 context switches a second at nice -10. 
(which, considering the cost of a context switch is well under 1% 
overhead.)

So, my point is, the nice level of X for desktop users should not be set 
lower than a low limit suggested by that particular scheduler's author. 
That limit is scheduler-specific. Con i think recommends a nice level of 
-1 for X when using SD [Con, can you confirm?], while my tests show that 
if you want you can go as low as -10 under CFS, without any bad 
side-effects. (-19 was a bit too much)

> [...]  You might as well just run it as a real time process.

hm, that would be a bad idea under any scheduler (including CFS), 
because real time processes can starve other processes indefinitely.

Ingo


Re: NonExecutable Bit in 32Bit

2007-04-24 Thread William Heimbigner

On Tue, 24 Apr 2007, Cestonaro, Thilo (external) wrote:

> Hey,
>
> is it right that the NX bit is not used on the i386 arch but is used on x86_64?
> If so, is there a particular reason for it not being used?
>
> Ciao Thilo

I don't think so - some i386 cpus definitely have support for the NX bit.

Would having this be supported in i386 help debugging (and security) 
significantly?


William Heimbigner
[EMAIL PROTECTED]


Re: [patch v2] Fixes and cleanups for earlyprintk aka boot console.

2007-04-24 Thread Andrew Morton
On Thu, 15 Mar 2007 16:46:39 +0100 Gerd Hoffmann <[EMAIL PROTECTED]> wrote:

> The console subsystem already has an idea of a boot console, using the
> CON_BOOT flag.  The implementation has some flaws though.  The major
> problem is that presence of a boot console makes register_console()
> ignore any other console devices (unless explicitly specified on the
> kernel command line).
> 
> This patch fixes the console selection code to *not* consider a boot
> console a full-featured one, so the first non-boot console registering
> will become the default console instead.  This way the unregister call
> for the boot console in the register_console() function actually
> triggers and the handover from the boot console to the real console
> device works smoothly.  Added a printk for the handover, so you know
> which console device the output goes to when the boot console stops
> printing messages.
> 
> The disable_early_printk() call is obsolete with that patch, explicitly
> disabling the early console isn't needed any more as it works
> automagically with that patch.
> 
> I've walked through the tree, dropped all disable_early_printk()
> instances found below arch/ and tagged the consoles with CON_BOOT if
> needed.  The code is tested on x86, sh (thanks to Paul) and mips
> (thanks to Ralf).
> 
> Changes to last version: Rediffed against -rc3, adapted to mips
> cleanups by Ralf, fixed "udbg-immortal" cmd line arg on powerpc.

I get this, across netconsole:

[17179569.184000] console handover: boot [earlyvga_f_0] -> real [tty0]

wanna take a look at why there's cruft in bootconsole->name please?

in grub.conf I have

kernel /boot/bzImage-2.6.21-rc7-mm1 ro root=LABEL=/ rhgb vga=0x263 
[EMAIL PROTECTED]/eth0,[EMAIL PROTECTED]/00:0D:56:C6:C6:CC profile=1 
earlyprintk=vga resume=8:5 time

and I'm using

http://userweb.kernel.org/~akpm/config-sony.txt

Thanks.


Re: [PATCH] Transparently handle <.symbol> lookup for kprobes

2007-04-24 Thread Srinivasa Ds
Paul Mackerras wrote:
> Srinivasa Ds writes:
> 
>> +} else {\
>> +char dot_name[KSYM_NAME_LEN+1]; \
>> +dot_name[0] = '.';  \
>> +dot_name[1] = '\0'; \
>> +strncat(dot_name, name, KSYM_NAME_LEN); \
> 
> Assuming the kernel strncat works like the userspace one does, there
> is a possibility that dot_name[] won't be properly null-terminated
> here.  If strlen(name) >= KSYM_NAME_LEN-1, then strncat will set
> dot_name[KSYM_NAME_LEN-1] to something non-null and won't touch
> dot_name[KSYM_NAME_LEN].

Irrespective of length of the string, kernel implementation of
strncat(lib/string.c) ensures that last character of string is set to
null. So dot_name[] is always null terminated.


char *strncat(char *dest, const char *src, size_t count)
{
	char *tmp = dest;

	if (count) {
		while (*dest)
			dest++;
		while ((*dest++ = *src++) != 0) {
			if (--count == 0) {
				*dest = '\0';
				break;
			}
		}
	}
	return tmp;
}
EXPORT_SYMBOL(strncat);
===

Is this OK then ??
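
One thing that may still be worth checking is the buffer size rather than
the termination. The routine quoted above copies up to count characters
and then stores a NUL after them, so the destination needs room for the
existing string, count characters and one more byte. The sketch below is
userspace C, not kernel code; NAME_LEN is an illustrative stand-in for
KSYM_NAME_LEN (the value 127 is an assumption), and both helper names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for KSYM_NAME_LEN; the value is an assumption. */
#define NAME_LEN 127

/*
 * Bytes that strncat(dest, src, count) needs in dest, including the
 * terminating NUL: existing string + at most count characters + 1.
 */
static size_t strncat_bytes_needed(size_t dest_len, size_t src_len, size_t count)
{
	size_t copied = src_len < count ? src_len : count;

	return dest_len + copied + 1;
}

/*
 * Build ".<name>" in out without writing past out_size bytes.
 * Returns 0 on success, -1 if the result had to be truncated.
 */
static int make_dot_name(char *out, size_t out_size, const char *name)
{
	int n;

	assert(out != NULL && name != NULL && out_size > 0);

	n = snprintf(out, out_size, ".%s", name);
	return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With a one-character prefix already in place and count equal to NAME_LEN,
a name of NAME_LEN characters needs NAME_LEN + 2 bytes, one more than a
NAME_LEN + 1 buffer provides. A bounded formatter such as snprintf()
truncates instead, at the cost of having to check its return value.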


Thanks
 Srinivasa DS

> 
> Paul.



Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Andrew Morton
On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> 
wrote:

> The softlockup watchdog is currently a nuisance in a virtual machine,
> since the whole system could have the CPU stolen from it for a long
> period of time.  While it would be unlikely for a guest domain to be
> denied timer interrupts for over 10s, it could happen and any softlockup
> message would be completely spurious.
> 
> Earlier I proposed that sched_clock() return time in unstolen
> nanoseconds, which is how Xen and VMI currently implement it.  If the
> softlockup watchdog uses sched_clock() to measure time, it would
> automatically ignore stolen time, and therefore only report when the
> guest itself locked up.  When running native, sched_clock() returns
> real-time nanoseconds, so the behaviour would be unchanged.
> 
> Note that sched_clock() used this way is inherently per-cpu, so this
> patch makes sure that the per-processor watchdog thread initialized
> its own timestamp.

This patch
(ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch)
causes six failures in the locking self-tests, which I must say is rather
clever of it.


Here's the first one:

[17179569.184000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[17179569.184000] ... MAX_LOCKDEP_SUBCLASSES:8
[17179569.184000] ... MAX_LOCK_DEPTH:  30
[17179569.184000] ... MAX_LOCKDEP_KEYS:2048
[17179569.184000] ... CLASSHASH_SIZE:   1024
[17179569.184000] ... MAX_LOCKDEP_ENTRIES: 8192
[17179569.184000] ... MAX_LOCKDEP_CHAINS:  16384
[17179569.184000] ... CHAINHASH_SIZE:  8192
[17179569.184000]  memory used by lock dependency info: 992 kB
[17179569.184000]  per task-struct memory footprint: 1200 bytes
[17179569.184000] ------------------------
[17179569.184000] | Locking API testsuite:
[17179569.184000] ----------------------------------------------------------------------------
[17179569.184000]                                  | spin |wlock |rlock |mutex | wsem | rsem |
[17179569.184000]   --------------------------------------------------------------------------
[17179569.184000]                      A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184000]                  A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184000]              A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184001]              A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184002]          A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184003]          A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184004]          A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184005]                     double unlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184006]                   initialize held:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184006]                  bad unlock order:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[17179569.184006]   --------------------------------------------------------------------------
[17179569.184006]               recursive read-lock:             |  ok  |             |  ok  |
[17179569.184006]            recursive read-lock #2:             |  ok  |             |  ok  |
[17179569.184007]             mixed read-write-lock:             |  ok  |             |  ok  |
[17179569.184007]             mixed write-read-lock:             |  ok  |             |  ok  |
[17179569.184007]   --------------------------------------------------------------------------
[17179569.184007]      hard-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
[17179569.184007]      soft-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
[17179569.184007]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[17179569.184007]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[17179569.184007]        sirq-safe-A => hirqs-on/12:  ok  |  ok  |irq event stamp: 458
[17179569.184007] hardirqs last  enabled at (458): [] irqsafe2A_rlock_12+0x96/0xa3
[17179569.184007] hardirqs last disabled at (457): [] sched_clock+0x5e/0xe9
[17179569.184007] softirqs last  enabled at (454): [] irqsafe2A_rlock_12+0x81/0xa3
[17179569.184007] softirqs last disabled at (450): [] irqsafe2A_rlock_12+0xb/0xa3
[17179569.184007] FAILED| [] dump_trace+0x63/0x1ec
[17179569.184007]  [] show_trace_log_lvl+0x1a/0x30
[17179569.184007]  [] show_trace+0x12/0x14
[17179569.184007]  [] dump_stack+0x16/0x18
[17179569.184007]  [] dotest+0x6b/0x3d0
[17179569.184007]  [] locking_selftest+0x915/0x1a58
[17179569.184007]  [] start_kernel+0x1d0/0x2a2
[17179569.184007]  ===
[17179569.184007] 
[17179569.184007]        sirq-safe-A => hirqs-on/21:irq event stamp: 462
[17179569.184007] hardirqs last  enabled at (462): [] 

Re: [REPORT] First "glitch1" results, 2.6.21-rc7-git6-CFSv5 + SD 0.46

2007-04-24 Thread Ingo Molnar

* Ed Tomlinson <[EMAIL PROTECTED]> wrote:

> > SD 0.46 1-2 FPS
> > cfs v5 nice -19 219-233 FPS
> > cfs v5 nice 0   1000-1996
>cfs v5 nice -10  60-65 FPS

the problem is, the glxgears portion of this test is an _inverse_ 
testcase.

The reason? glxgears on true 3D hardware will _not_ use X, it will 
directly use the 3D driver of the kernel. So by renicing X to -19 you 
give the xterms more chance to show stuff - the performance of the 
glxgears will 'degrade' - but that is what you asked for: glxgears is 
'just another CPU hog' that competes with X, it's not a "true" X client.

if you are after glxgears performance in this test then you'll get the 
best performance out of this by renicing X to +19 or even SCHED_BATCH.

Ingo


Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
> On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> 
> wrote:
>
>   
>> The softlockup watchdog is currently a nuisance in a virtual machine,
>> since the whole system could have the CPU stolen from it for a long
>> period of time.  While it would be unlikely for a guest domain to be
>> denied timer interrupts for over 10s, it could happen and any softlockup
>> message would be completely spurious.
>>
>> Earlier I proposed that sched_clock() return time in unstolen
>> nanoseconds, which is how Xen and VMI currently implement it.  If the
>> softlockup watchdog uses sched_clock() to measure time, it would
>> automatically ignore stolen time, and therefore only report when the
>> guest itself locked up.  When running native, sched_clock() returns
>> real-time nanoseconds, so the behaviour would be unchanged.
>>
>> Note that sched_clock() used this way is inherently per-cpu, so this
>> patch makes sure that the per-processor watchdog thread initialized
>> its own timestamp.
>> 
>
> This patch
> (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch)
> causes six failures in the locking self-tests, which I must say is rather
> clever of it.
>   

Interesting.  Which variation of sched_clock do you have in your tree at
the moment?

J


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Peter Williams <[EMAIL PROTECTED]> wrote:
>> > The cases are fundamentally different in behavior, because in the
>> > first case, X hardly consumes the time it would get in any scheme,
>> > while in the second case X really is CPU bound and will happily
>> > consume any CPU time it can get.
>>
>> Which still doesn't justify an elaborate "points" sharing scheme.
>> Whichever way you look at that that's just another way of giving X
>> more CPU bandwidth and there are simpler ways to give X more CPU if it
>> needs it.  However, I think there's something seriously wrong if it
>> needs the -19 nice that I've heard mentioned.
>
>Gene has done some testing under CFS with X reniced to +10 and the
>desktop still worked smoothly for him.

As a data point here, and probably nothing to do with X, but I did manage to 
lock it up, solid, reset button time tonight, by wanting 'smart' to get done 
with an update session after amanda had started.  I took both smart processes 
I could see in htop all the way to -19, but when it was about done about 3 
minutes later, everything came to an instant, frozen, reset button required 
lockup.  I should have stopped at -17 I guess. :(

>So CFS does not 'need' a reniced 
>X. There are simply advantages to negative nice levels: for example
>screen refreshes are smoother on any scheduler i tried. BUT, there is a
>caveat: on non-CFS schedulers i tried X is much more prone to get into
>'overscheduling' scenarios that visibly hurt X's performance, while on
>CFS there's a max of 1000-1500 context switches a second at nice -10.
>(which, considering the cost of a context switch is well under 1%
>overhead.)
>
>So, my point is, the nice level of X for desktop users should not be set
>lower than a low limit suggested by that particular scheduler's author.
>That limit is scheduler-specific. Con i think recommends a nice level of
>-1 for X when using SD [Con, can you confirm?], while my tests show that
>if you want you can go as low as -10 under CFS, without any bad
>side-effects. (-19 was a bit too much)
>
>> [...]  You might as well just run it as a real time process.
>
>hm, that would be a bad idea under any scheduler (including CFS),
>because real time processes can starve other processes indefinitely.
>
>   Ingo



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I have discovered that all human evil comes from this, man's being unable
to sit still in a room.
-- Blaise Pascal


RE: NonExecutable Bit in 32Bit

2007-04-24 Thread Cestonaro, Thilo (external)
 
> I don't think so - some i386 cpus definitely have support for the NX bit.
Ok, the cpu's do support it, but the kernel doesn't use it if it is active in 
the bios.

> Would having this be supported in i386 help debugging (and security) 
> significantly?
@William: I don't understand this question :(
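
For anyone wanting to check their own machine: on 32-bit x86 the NX bit is
bit 63 of a page-table entry, so it only exists with PAE paging, and an
i386 kernel has to be built with PAE support before it can use it. A rough
userspace sketch follows, assuming the flags string is the text after
"flags :" in /proc/cpuinfo; has_cpu_flag() is a hypothetical helper, not a
kernel or libc function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if `flag` appears as a whole whitespace-separated word in
 * `flags`, 0 otherwise. A plain substring match is not enough, since
 * one flag name can be the prefix of another.
 */
static int has_cpu_flag(const char *flags, const char *flag)
{
	const char *p;
	size_t n;

	assert(flags != NULL && flag != NULL);

	n = strlen(flag);
	if (n == 0)
		return 0;

	for (p = strstr(flags, flag); p != NULL; p = strstr(p + 1, flag)) {
		int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';

		if (starts && ends)
			return 1;
	}
	return 0;
}
```

A CPU that lists both "pae" and "nx" can offer NX to a 32-bit kernel;
whether the running kernel actually enables it depends on how it was
configured.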


Ciao Thilo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Rogan Dawes

Ingo Molnar wrote:


> static void
> yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to)
> {
> 	struct rb_node *curr, *next, *first;
> 	struct task_struct *p_next;
>
> 	/*
> 	 * yield-to support: if we are on the same runqueue then
> 	 * give half of our wait_runtime (if it's positive) to the other task:
> 	 */
> 	if (p_to && p->wait_runtime > 0) {
> 		p->wait_runtime >>= 1;
> 		p_to->wait_runtime += p->wait_runtime;
> 	}
>
> the above is the basic expression of: "charge a positive bank balance".
>
> [..]
>
> [note, due to the nanoseconds unit there's no rounding loss to worry
> about.]


Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?


> Ingo


Rogan


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Gene Heskett <[EMAIL PROTECTED]> wrote:

> > Gene has done some testing under CFS with X reniced to +10 and the 
> > desktop still worked smoothly for him.
> 
> As a data point here, and probably nothing to do with X, but I did 
> manage to lock it up, solid, reset button time tonight, by wanting 
> 'smart' to get done with an update session after amanda had started.  
> I took both smart processes I could see in htop all the way to -19, 
> but when it was about done about 3 minutes later, everything came to 
> an instant, frozen, reset button required lockup.  I should have 
> stopped at -17 I guess. :(

yeah, i guess this has little to do with X. I think in your scenario it 
might have been smarter to either stop, or to renice the workloads that 
took away CPU power from others to _positive_ nice levels. Negative nice 
levels can indeed be dangerous.

(Btw., to protect against such mishaps in the future i have changed the 
SysRq-N [SysRq-Nice] implementation in my tree to not only change 
real-time tasks to SCHED_OTHER, but to also renice negative nice levels 
back to 0 - this will show up in -v6. That way you'd only have had to 
hit SysRq-N to get the system out of the wedge.)

Ingo


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 12:58 +1000, Neil Brown wrote:
> On Friday April 20, [EMAIL PROTECTED] wrote:
> > Scale writeback cache per backing device, proportional to its writeout 
> > speed.
> 
> So it works like this:
> 
>  We account for writeout in full pages.
>  When a page has the Writeback flag cleared, we account that as a
>  successfully retired write for the relevant bdi.
>  By using floating averages we keep track of how many writes each bdi
>  has retired 'recently' where the unit of time in which we understand
>  'recently' is a single page written.

That is actually that period I keep referring to. So recently is the
last 'period' number of writeout completions.

>  We keep a floating average for each bdi, and a floating average for
>  the total writeouts (that 'average' is, of course, 1.)

1 in the sense of unity, yes :-)

>  Using these numbers we can calculate what fraction of 'recently'
>  retired writes were retired by each bdi (get_writeout_scale).
> 
>  Multiplying this fraction by the system-wide number of pages that are
>  allowed to be dirty before write-throttling, we get the number of
>  pages that the bdi can have dirty before write-throttling the bdi.
> 
>  I note that the same fraction is *not* applied to background_thresh.
>  Should it be?  I guess not - there would be interesting starting
>  transients, as a bdi which had done no writeout would not be allowed
>  any dirty pages, so background writeout would start immediately,
>  which isn't what you want... or is it?

This is something I have not been able to reach a conclusive answer on
yet...

>  For each bdi we also track the number of (dirty, writeback, unstable)
>  pages and do not allow this to exceed the limit set for this bdi.
> 
>  The calculations involving 'reserve' in get_dirty_limits are a little
>  confusing.  It looks like you are calculating how much total head-room
>  there is for the bdi (pages that the system can still dirty - pages
>  this bdi has dirty) and making sure the number returned in pbdi_dirty
>  doesn't allow more than that to be used.  

Yes, it limits the earned share of the total dirty limit to the possible
share, ensuring that the total dirty limit is never exceeded.

This is especially relevant when the proportions change faster than the
pages get written out, ie. when the period << total dirty limit.
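
As a rough illustration of the two steps discussed so far (scale the
system-wide limit by the bdi's share of recent writeouts, then clip it to
the remaining head-room), here is a userspace sketch. The function name,
the parameters and the plain integer fraction are all assumptions made for
the example; the actual patch uses floating averages and per-cpu counters,
which are not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical per-bdi dirty limit, in pages.
 *
 * dirty_thresh    - system-wide number of pages allowed to be dirty
 * total_dirty     - pages currently dirty system-wide (includes bdi_dirty)
 * bdi_dirty       - pages currently dirty against this bdi
 * bdi_writeouts   - recently retired writes for this bdi
 * total_writeouts - recently retired writes for all bdis (must be > 0)
 */
static uint64_t bdi_dirty_limit(uint64_t dirty_thresh, uint64_t total_dirty,
				uint64_t bdi_dirty, uint64_t bdi_writeouts,
				uint64_t total_writeouts)
{
	uint64_t share, headroom;

	assert(total_writeouts > 0);
	assert(bdi_writeouts <= total_writeouts);
	assert(bdi_dirty <= total_dirty);

	/* Earned share: proportional to the bdi's recent writeout fraction. */
	share = dirty_thresh * bdi_writeouts / total_writeouts;

	/* Possible share: what the system can still dirty, plus what this
	 * bdi already holds. */
	headroom = (total_dirty < dirty_thresh ? dirty_thresh - total_dirty : 0)
		   + bdi_dirty;

	return share < headroom ? share : headroom;
}
```

With a system limit of 1000 pages, 900 already dirty (100 of them on this
bdi) and a 3/4 writeout share, the earned share is 750 but only 200 can be
granted without exceeding the total.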

> This is probably a
>  reasonable thing to do but it doesn't feel like the right place.  I
>  think get_dirty_limits should return the raw threshold, and
>  balance_dirty_pages should do both tests - the bdi-local test and the
>  system-wide test.

Ok, that makes sense I guess.

>  Currently you have a rather odd situation where
> + if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> + break;
>  might include numbers obtained with bdi_stat_sum being compared with
>  numbers obtained with bdi_stat.

Yes, I was aware of that. The bdi_thresh is based on bdi_stat() numbers,
whereas the others could be bdi_stat_sum(). I think this is ok, since
the threshold is a 'guess' anyway, we just _need_ to ensure we do not
get trapped by writeouts not arriving (due to getting stuck in the per
cpu deltas).  -- I have all this commented in the new version.

>  With these patches, the VM still (I think) assumes that each BDI has
>  a reasonable queue limit, so that writeback_inodes will block on a
>  full queue.  If a BDI has a very large queue, balance_dirty_pages
>  will simply turn lots of DIRTY pages into WRITEBACK pages and then
>  think "We've done our duty" without actually blocking at all.

It will block once we exceed the total number of dirty pages allowed for
that BDI. But yes, this does not take away the need for queue limits.

This work was primarily aimed at allowing multiple queues to not
interfere as much, so they all can make progress and not get starved.

>  With the extra accounting that we now have, I would like to see
>  balance_dirty_pages dirty pages wait until RECLAIMABLE+WRITEBACK is
>  actually less than 'threshold'.  This would probably mean that we
>  would need to support per-bdi background_writeout to smooth things
>  out.  Maybe that is fodder for another patch-set.

Indeed, I still have to wrap my mind around the background thing. Your
input is appreciated.

>  You set:
> + vm_cycle_shift = 1 + ilog2(vm_total_pages);
> 
>  Can you explain that?

You found the one random knob I hid :-)

>   My experience is that scaling dirty limits
>  with main memory isn't what we really want.  When you get machines
>  with very large memory, the amount that you want to be dirty is more
>  a function of the speed of your IO devices, rather than the amount
>  of memory, otherwise you can sometimes see large filesystem lags
>  ('sync' taking minutes?)
> 
>  I wonder if it makes sense to try to limit the dirty data for a bdi
>  to the amount that it can write out in some period of time - maybe 3
>  seconds.  Probably configurable.  You seem to 

Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Andrew Morton
On Mon, 23 Apr 2007 23:58:20 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> 
wrote:

> Andrew Morton wrote:
> > On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> 
> > wrote:
> >
> >   
> >> The softlockup watchdog is currently a nuisance in a virtual machine,
> >> since the whole system could have the CPU stolen from it for a long
> >> period of time.  While it would be unlikely for a guest domain to be
> >> denied timer interrupts for over 10s, it could happen and any softlockup
> >> message would be completely spurious.
> >>
> >> Earlier I proposed that sched_clock() return time in unstolen
> >> nanoseconds, which is how Xen and VMI currently implement it.  If the
> >> softlockup watchdog uses sched_clock() to measure time, it would
> >> automatically ignore stolen time, and therefore only report when the
> >> guest itself locked up.  When running native, sched_clock() returns
> >> real-time nanoseconds, so the behaviour would be unchanged.
> >>
> >> Note that sched_clock() used this way is inherently per-cpu, so this
> >> patch makes sure that the per-processor watchdog thread initialized
> >> its own timestamp.
> >> 
> >
> > This patch
> > (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch)
> > causes six failures in the locking self-tests, which I must say is rather
> > clever of it.
> >   
> 
> Interesting.

I'll say.

>  Which variation of sched_clock do you have in your tree at
> the moment?

Andi's, plus the below fix.

Sigh.  I thought I was only two more bugs away from a release, then...


[18014389.347124] BUG: unable to handle kernel paging request at virtual 
address 6b6b7193
[18014389.347142]  printing eip:
[18014389.347149] c029a80c
[18014389.347156] *pde = 
[18014389.347166] Oops:  [#1]
[18014389.347174] Modules linked in: i915 drm ipw2200 sonypi ipv6 autofs4 hidp 
l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 
xt_state nf_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables 
cpufreq_ondemand video sbs button battery asus_acpi ac nvram ohci1394 ieee1394 
ehci_hcd uhci_hcd sg joydev snd_hda_intel snd_seq_dummy snd_seq_oss 
snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm 
sr_mod cdrom snd_timer ieee80211 i2c_i801 piix ieee80211_crypt i2c_core generic 
snd soundcore snd_page_alloc ext3 jbd ide_disk ide_core
[18014389.347520] CPU:0
[18014389.347521] EIP:0060:[]Tainted: G  D VLI
[18014389.347522] EFLAGS: 00010296   (2.6.21-rc7-mm1 #35)
[18014389.347547] EIP is at input_release_device+0x8/0x4e
[18014389.347555] eax: c99709a8   ebx: 6b6b6b6b   ecx: 0286   edx: 
[18014389.347563] esi: 6b6b6b6b   edi: c99709cc   ebp: c21e3d40   esp: c21e3d38
[18014389.347571] ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
[18014389.347580] Process khubd (pid: 159, ti=c21e2000 task=c20a62f0 
task.ti=c21e2000)
[18014389.347588] Stack: 6b6b6b6b c99709a8 c21e3d60 c029b489 c2014ec8 c9182000 
c96b167c c9970954 
[18014389.347655]c9970954 c99709cc c21e3d80 c029d401 c9977a6c c96b1000 
c21e3d90 c9970954 
[18014389.347708]c99709a8 c9164000 c21e3d90 c029d4b5 c96b1000 c9970564 
c21e3db0 c029c50b 
[18014389.347771] Call Trace:
[18014389.347792]  [] input_close_device+0x13/0x51
[18014389.347810]  [] mousedev_destroy+0x29/0x7e
[18014389.347827]  [] mousedev_disconnect+0x5f/0x63
[18014389.347842]  [] input_unregister_device+0x6a/0x100
[18014389.347858]  [] hidinput_disconnect+0x24/0x41
[18014389.347874]  [] hid_disconnect+0x79/0xc9
[18014389.347889]  [] usb_unbind_interface+0x47/0x8f
[18014389.347916]  [] __device_release_driver+0x74/0x90
[18014389.347933]  [] device_release_driver+0x37/0x4e
[18014389.347957]  [] bus_remove_device+0x73/0x82
[18014389.347977]  [] device_del+0x214/0x28c
[18014389.348132]  [] usb_disable_device+0x62/0xc2
[18014389.348148]  [] usb_disconnect+0x99/0x126
[18014389.348163]  [] hub_thread+0x3a5/0xb07
[18014389.348178]  [] kthread+0x6e/0x79
[18014389.348194]  [] kernel_thread_helper+0x7/0x10
[18014389.348210]  ===
[18014389.348218] INFO: lockdep is turned off.
[18014389.348224] Code: 5b 5d c3 55 b9 f0 ff ff ff 8b 50 0c 89 e5 83 ba 28 06 
00 00 00 75 08 89 82 28 06 00 00 31 c9 5d 89 c8 c3 55 89 e5 56 53 8b 70 0c <39> 
86 28 06 00 00 75 3a 8b 9e e4 08 00 00 c7 86 28 06 00 00 00 

I dunno.  I'll keep plugging for another couple hours then I'll shove
out what I have as a -mm snapshot whatsit.

Things are just ridiculous.  I'm thinking of having a hard-disk crash and
accidentally losing everything.



From: Andrew Morton <[EMAIL PROTECTED]>

WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to 
.init.text:sc_cpu_event from .data between 'sc_cpu_notifier' (at offset 0x2110) 
and 'mcelog'

Use hotcpu_notifier().  This takes care of making sure that the unused code
disappears from vmlinux if !CONFIG_HOTPLUG_CPU, too.


How do you send a reply to an email you have deleted.

2007-04-24 Thread lkml777
How do you send a reply to an email you have deleted?
-- 
  
  [EMAIL PROTECTED]

-- 
http://www.fastmail.fm - I mean, what is it about a decent email service?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Gene Heskett <[EMAIL PROTECTED]> wrote:
>> > Gene has done some testing under CFS with X reniced to +10 and the
>> > desktop still worked smoothly for him.
>>
>> As a data point here, and probably nothing to do with X, but I did
>> manage to lock it up, solid, reset button time tonight, by wanting
>> 'smart' to get done with an update session after amanda had started.
>> I took both smart processes I could see in htop all the way to -19,
>> but when it was about done about 3 minutes later, everything came to
>> an instant, frozen, reset button required lockup.  I should have
>> stopped at -17 I guess. :(
>
>yeah, i guess this has little to do with X. I think in your scenario it
>might have been smarter to either stop, or to renice the workloads that
>took away CPU power from others to _positive_ nice levels. Negative nice
>levels can indeed be dangerous.
>
>(Btw., to protect against such mishaps in the future i have changed the
>SysRq-N [SysRq-Nice] implementation in my tree to not only change
>real-time tasks to SCHED_OTHER, but to also renice negative nice levels
>back to 0 - this will show up in -v6. That way you'd only have had to
>hit SysRq-N to get the system out of the wedge.)
>
>   Ingo

That sounds handy, particularly with idiots like me at the wheel...


-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
When a Banker jumps out of a window, jump after him--that's where the money 
is.
-- Robespierre


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Gene Heskett <[EMAIL PROTECTED]> wrote:

> > (Btw., to protect against such mishaps in the future i have changed 
> > the SysRq-N [SysRq-Nice] implementation in my tree to not only 
> > change real-time tasks to SCHED_OTHER, but to also renice negative 
> > nice levels back to 0 - this will show up in -v6. That way you'd 
> > only have had to hit SysRq-N to get the system out of the wedge.)
> 
> That sounds handy, particularly with idiots like me at the wheel...

by that standard i guess we tinkerers are all idiots ;)

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Ingo Molnar wrote:

> * Gene Heskett <[EMAIL PROTECTED]> wrote:
>
>>> Gene has done some testing under CFS with X reniced to +10 and the
>>> desktop still worked smoothly for him.
>>
>> As a data point here, and probably nothing to do with X, but I did
>> manage to lock it up, solid, reset button time tonight, by wanting
>> 'smart' to get done with an update session after amanda had started.
>> I took both smart processes I could see in htop all the way to -19,
>> but when it was about done about 3 minutes later, everything came to
>> an instant, frozen, reset button required lockup.  I should have
>> stopped at -17 I guess. :(
>
> yeah, i guess this has little to do with X. I think in your scenario it
> might have been smarter to either stop, or to renice the workloads that
> took away CPU power from others to _positive_ nice levels. Negative nice
> levels can indeed be dangerous.
>
> (Btw., to protect against such mishaps in the future i have changed the
> SysRq-N [SysRq-Nice] implementation in my tree to not only change
> real-time tasks to SCHED_OTHER, but to also renice negative nice levels
> back to 0 - this will show up in -v6. That way you'd only have had to
> hit SysRq-N to get the system out of the wedge.)


if you are trying to unwedge a system it may be a good idea to renice all tasks 
to 0, it could be that a task at +19 is holding a lock that something else is 
waiting for.
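
To make the difference between the two policies concrete, here is a minimal
user-space sketch. It is not the SysRq-N code from Ingo's tree; both function
names are hypothetical and only model the nice value each policy would leave
a task with.

```c
/* Policy described for -v6: only negative nice levels are reset to 0. */
int renice_negative_only(int nice)
{
	return nice < 0 ? 0 : nice;
}

/* Policy suggested in this mail: every task goes back to nice 0,
 * so a +19 task that holds a lock gets a normal share of CPU again. */
int renice_all(int nice)
{
	(void)nice;	/* the old level is deliberately ignored */
	return 0;
}
```

With the first policy a task at +19 stays at +19; with the second it ends up
at 0 like everything else.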


David Lang


Re: [PATCH 1/2] x86_64: Reflect the relocatability of the kernel in the ELF header.

2007-04-24 Thread Eric W. Biederman
Vivek Goyal <[EMAIL PROTECTED]> writes:

> On Sun, Apr 22, 2007 at 11:12:13PM -0600, Eric W. Biederman wrote:
>> 
>> Currently because vmlinux does not reflect that the kernel is relocatable
>> we still have to support CONFIG_PHYSICAL_START.  So this patch adds a small
>> c program to do what we cannot do with a linker script, set the elf header
>> type to ET_DYN.
>> 
>> This should remove the last obstacle to removing CONFIG_PHYSICAL_START
>> on x86_64.
>> 
>> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
>
> [Dropping fastboot mailing list from CC as kexec mailing list is new list
>  for this discussion]
>
> [..]
>> +void file_open(const char *name)
>> +{
>> +if ((fd = open(name, O_RDWR, 0)) < 0)
>> +die("Unable to open `%s': %m", name);
>> +}
>> +
>> +static void mketrel(void)
>> +{
>> +unsigned char e_type[2];
>> +if (read(fd, &e_ident, sizeof(e_ident)) != sizeof(e_ident))
>> +die("Cannot read ELF header: %s\n", strerror(errno));
>> +
>> +if (memcmp(e_ident, ELFMAG, 4) != 0)
>> +die("No ELF magic\n");
>> +
>> +if ((e_ident[EI_CLASS] != ELFCLASS64) &&
>> +(e_ident[EI_CLASS] != ELFCLASS32))
>> +die("Unrecognized ELF class: %x\n", e_ident[EI_CLASS]);
>> +
>> +if ((e_ident[EI_DATA] != ELFDATA2LSB) &&
>> +(e_ident[EI_DATA] != ELFDATA2MSB))
>> +die("Unrecognized ELF data encoding: %x\n", e_ident[EI_DATA]);
>> +
>> +if (e_ident[EI_VERSION] != EV_CURRENT)
>> +die("Unknown ELF version: %d\n", e_ident[EI_VERSION]);
>> +
>> +if (e_ident[EI_DATA] == ELFDATA2LSB) {
>> +e_type[0] = ET_REL & 0xff;
>> +e_type[1] = ET_REL >> 8;
>> +} else {
>> +e_type[1] = ET_REL & 0xff;
>> +e_type[0] = ET_REL >> 8;
>> +}
>
> Hi Eric,
>
> Should this be ET_REL or ET_DYN? kexec refuses to load this vmlinux
> as it does not find it to be executable type.

Doh.  It should be ET_DYN.  I had relocatable much too much on the brain,
and so I stuffed in the wrong type.

> I am not well versed with various conventions but if I go through "Executable
> and Linking Format" document, this is what it says about various file types.
>
> • A relocatable file holds code and data suitable for linking with other
>   object files to create an executable or a shared object file.
>
> • An executable file holds a program suitable for execution.
>
> • A shared object file holds code and data suitable for linking in two
>   contexts. First, the link editor may process it with other relocatable and
>   shared object files to create another object file. Second, the dynamic
>   linker combines it with an executable file and other shared objects
>   to create a process image.
>
> So above does not seem to fit in the ET_REL type. We can't relink this
> vmlinux? And it does not seem to fit in ET_DYN definition too. We are
> not relinking this vmlinux with another executable or other relocatable
> files.
>
> I remember once you mentioned the term dynamic executable which can be
> loaded at a non-compiled address and let run without requiring any
> relocation processing. This vmlinux will fall in that category but can't 
> relate it to standard elf file definitions.

Sorry about that.  

ET_DYN without a PT_DYNAMIC segment, without a PT_INTERP segment,
and with a valid entry point is exactly that.  Loaders never perform
relocation processing on an ET_DYN executable but they are allowed to
shift all of the addresses by a single delta so long as all of the
alignment restrictions are honored.

Relocation processing, when it happens, comes from the dynamic linker,
which is set in PT_INTERP, and the dynamic linker looks at PT_DYNAMIC
to figure out what relocations are available for processing.

The basic issue is that ld doesn't really comprehend what we are doing
since we are building a position independent executable in a way
that the normal tools don't allow, so we have to poke the header.

If we had compiled with -fPIC we could have specified -pie or
--pic-executable to ld and it would have done the right thing.
But as it is our executable only changes physical addresses and
not virtual addresses, something completely foreign to ld.
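
For illustration, "poking the header" boils down to something like the sketch
below. This is not the tool from the patch: it works on an in-memory copy of
the header rather than a file, and the function name and the SK_ constants are
made up for this example (their values are taken from the ELF specification).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets and values from the ELF specification (names are local). */
#define SK_EI_DATA      5	/* index of the data-encoding byte in e_ident */
#define SK_ELFDATA2LSB  1	/* little-endian file */
#define SK_ELFDATA2MSB  2	/* big-endian file */
#define SK_E_TYPE_OFF   16	/* e_type follows the 16-byte e_ident */
#define SK_ET_DYN       3

/*
 * Overwrite e_type in a buffer holding the start of an ELF file.
 * Returns 0 on success, -1 if the buffer is too short, has no ELF
 * magic, or uses an unknown data encoding.
 */
int set_elf_type(unsigned char *hdr, size_t len, uint16_t type)
{
	if (len < (size_t)SK_E_TYPE_OFF + 2 || memcmp(hdr, "\177ELF", 4) != 0)
		return -1;

	if (hdr[SK_EI_DATA] == SK_ELFDATA2LSB) {
		hdr[SK_E_TYPE_OFF]     = type & 0xff;
		hdr[SK_E_TYPE_OFF + 1] = type >> 8;
	} else if (hdr[SK_EI_DATA] == SK_ELFDATA2MSB) {
		hdr[SK_E_TYPE_OFF]     = type >> 8;
		hdr[SK_E_TYPE_OFF + 1] = type & 0xff;
	} else {
		return -1;
	}
	return 0;
}
```

Calling set_elf_type(buf, len, SK_ET_DYN) on the first bytes of vmlinux and
writing them back is the whole trick; e_type sits at the same offset in
32-bit and 64-bit ELF files, so the class does not matter here.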

Eric


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* David Lang <[EMAIL PROTECTED]> wrote:

> > (Btw., to protect against such mishaps in the future i have changed 
> > the SysRq-N [SysRq-Nice] implementation in my tree to not only 
> > change real-time tasks to SCHED_OTHER, but to also renice negative 
> > nice levels back to 0 - this will show up in -v6. That way you'd 
> > only have had to hit SysRq-N to get the system out of the wedge.)
> 
> if you are trying to unwedge a system it may be a good idea to renice 
> all tasks to 0, it could be that a task at +19 is holding a lock that 
> something else is waiting for.

Yeah, that's possible too, but +19 tasks are getting a small but 
guaranteed share of the CPU so eventually it ought to release it. It's 
still a possibility, but i think i'll wait for a specific incident to 
happen first, and then react to that incident :-)

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> yeah, i guess this has little to do with X. I think in your scenario 
> it might have been smarter to either stop, or to renice the workloads 
> that took away CPU power from others to _positive_ nice levels. 
> Negative nice levels can indeed be dangerous.

btw., was X itself at nice 0 or nice -10 when the lockup happened?

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Rogan Dawes <[EMAIL PROTECTED]> wrote:

> >if (p_to && p->wait_runtime > 0) {
> >p->wait_runtime >>= 1;
> >p_to->wait_runtime += p->wait_runtime;
> >}
> >
> >the above is the basic expression of: "charge a positive bank balance". 
> >
> 
> [..]
> 
> > [note, due to the nanoseconds unit there's no rounding loss to worry 
> > about.]
> 
> Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?

yes. But note that we'll only truly have to worry about that when we'll 
have context-switching performance in that range - currently it's at 
least 2-3 orders of magnitude above that. Microseconds seemed to me to 
be too coarse already, that's why i picked nanoseconds and 64-bit 
arithmetic for CFS.
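
To put a number on the rounding loss, here is a small user-space sketch of
the quoted snippet. The function name and the plain int64_t balances are
stand-ins made up for this example, not the CFS data structures.

```c
#include <stdint.h>

/*
 * Halve the donor's positive balance and credit the halved amount to
 * the receiver, as in the quoted snippet. Returns how many nanoseconds
 * disappeared due to the right shift (0 for even balances, 1 for odd).
 */
int64_t charge_half(int64_t *from, int64_t *to)
{
	int64_t before = *from;

	if (before <= 0)
		return 0;	/* only a positive balance is charged */

	*from >>= 1;
	*to += *from;
	return before - 2 * *from;
}
```

For a balance of 5 nsec both sides end up with 2 nsec and 1 nsec is lost, so
the loss is at most one nanosecond per operation.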

Ingo


Re: [PATCH]Fix parsing kernelcore boot option for ia64

2007-04-24 Thread Yasunori Goto
Mel-san.

I tested your patch (Thanks!). It worked. But..

> In my understanding, the reason ia64 doesn't use the early_param() macro
> for mem= et al. is that it has to use the mem= option in efi handling,
> which is called before parse_early_param().
> 
> Current ia64's boot path is
>  setup_arch()
> -> efi handling -> parse_early_param() -> numa handling -> pgdat/zone init
> 
> kernelcore= option is just used at pgdat/zone initialization. (no arch 
> dependent part...)
> 
> So I think just adding
> ==
> early_param("kernelcore", cmdline_parse_kernelcore)
> ==
> to ia64 is ok.

Then, it can be common code.
How about this patch? I confirmed that it works well too.



When the "kernelcore" boot option is specified, the kernel can't boot up
on ia64, because it gets stuck in an infinite loop.
In addition, the option parsing can be common code. This patch fixes both.
I tested this patch on my ia64 box.


Signed-off-by: Yasunori Goto <[EMAIL PROTECTED]>

-

 arch/i386/kernel/setup.c   |    1 -
 arch/ia64/kernel/efi.c     |    2 --
 arch/powerpc/kernel/prom.c |    1 -
 arch/ppc/mm/init.c         |    2 --
 arch/x86_64/kernel/e820.c  |    1 -
 include/linux/mm.h         |    1 -
 mm/page_alloc.c            |    3 +++
 7 files changed, 3 insertions(+), 8 deletions(-)

Index: kernelcore/arch/ia64/kernel/efi.c
===
--- kernelcore.orig/arch/ia64/kernel/efi.c  2007-04-24 15:09:37.0 +0900
+++ kernelcore/arch/ia64/kernel/efi.c   2007-04-24 15:25:22.0 +0900
@@ -423,8 +423,6 @@ efi_init (void)
 			mem_limit = memparse(cp + 4, &cp);
 		} else if (memcmp(cp, "max_addr=", 9) == 0) {
 			max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
-		} else if (memcmp(cp, "kernelcore=",11) == 0) {
-			cmdline_parse_kernelcore(cp+11);
 		} else if (memcmp(cp, "min_addr=", 9) == 0) {
 			min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
 		} else {
Index: kernelcore/arch/i386/kernel/setup.c
===
--- kernelcore.orig/arch/i386/kernel/setup.c    2007-04-24 15:29:20.0 +0900
+++ kernelcore/arch/i386/kernel/setup.c 2007-04-24 15:29:39.0 +0900
@@ -195,7 +195,6 @@ static int __init parse_mem(char *arg)
return 0;
 }
 early_param("mem", parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 #ifdef CONFIG_PROC_VMCORE
 /* elfcorehdr= specifies the location of elf core header
Index: kernelcore/arch/powerpc/kernel/prom.c
===
--- kernelcore.orig/arch/powerpc/kernel/prom.c  2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/powerpc/kernel/prom.c   2007-04-24 15:30:25.0 +0900
@@ -431,7 +431,6 @@ static int __init early_parse_mem(char *
return 0;
 }
 early_param("mem", early_parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 /*
  * The device tree may be allocated below our memory limit, or inside the
Index: kernelcore/arch/ppc/mm/init.c
===
--- kernelcore.orig/arch/ppc/mm/init.c  2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/ppc/mm/init.c   2007-04-24 15:30:56.0 +0900
@@ -214,8 +214,6 @@ void MMU_setup(void)
}
 }
 
-early_param("kernelcore", cmdline_parse_kernelcore);
-
 /*
  * MMU_init sets up the basic memory mappings for the kernel,
  * including both RAM and possibly some I/O regions,
Index: kernelcore/arch/x86_64/kernel/e820.c
===
--- kernelcore.orig/arch/x86_64/kernel/e820.c   2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/x86_64/kernel/e820.c    2007-04-24 15:34:02.0 +0900
@@ -604,7 +604,6 @@ static int __init parse_memopt(char *p)
return 0;
 } 
 early_param("mem", parse_memopt);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 static int userdef __initdata;
 
Index: kernelcore/include/linux/mm.h
===
--- kernelcore.orig/include/linux/mm.h  2007-04-24 15:09:37.0 +0900
+++ kernelcore/include/linux/mm.h   2007-04-24 15:35:52.0 +0900
@@ -1051,7 +1051,6 @@ extern unsigned long find_max_pfn_with_a
 extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
-extern int cmdline_parse_kernelcore(char *p);
 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
 extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
Index: kernelcore/mm/page_alloc.c
===
--- kernelcore.orig/mm/page_alloc.c 2007-04-24 15:09:37.0 +0900
+++ kernelcore/mm/page_alloc.c  

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> [...] That way you'd only have had to hit SysRq-N to get the system 
> out of the wedge.)

small correction: Alt-SysRq-N.

Ingo


[PATCH] i802.11: fixed memory leak on multicasts

2007-04-24 Thread Markus Pietrek

Hi,

socket buffers were not always freed when receiving multicasts

Bye,
--
Markus Pietrek
Lead Software Engineer
Phone: +49-7667-908-501, Fax: +49-7667-908-200
mailto:[EMAIL PROTECTED]

FS Forth-Systeme GmbH
"A Digi International Company"
Kueferstr. 8, 79206 Breisach, Germany
Tax: 07008/12000 / VAT: DE142208834 / Reg. Amtsgericht Freiburg HRB 290212
Directors: Klaus Flesch, Subramanian Krishnan, Dieter Vesper
http://www.digi.com
Index: net/ieee80211/ieee80211_rx.c
===
RCS file: /data/vcs/cvs/fsforth_products/LxNETES/linux/net/ieee80211/ieee80211_rx.c,v
retrieving revision 1.5
retrieving revision 1.6
diff -c -r1.5 -r1.6
*** net/ieee80211/ieee80211_rx.c  13 Apr 2007 12:39:38 -  1.5
--- net/ieee80211/ieee80211_rx.c  23 Apr 2007 15:51:28 -  1.6
***************
*** 860,868 ****
                  break;
          }
  
!         if (is_packet_for_us)
                  if (!ieee80211_rx(ieee, skb, stats))
                          dev_kfree_skb_irq(skb);
          return;
  
   drop_free:
--- 860,871 ----
                  break;
          }
  
!         if (is_packet_for_us) {
                  if (!ieee80211_rx(ieee, skb, stats))
                          dev_kfree_skb_irq(skb);
+         } else
+                 dev_kfree_skb_irq(skb);
+ 
          return;
  
   drop_free:
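
The ownership rule behind the patch can be modelled in user space. The sketch
below is only a mock: fake_skb, fake_rx and fake_free are hypothetical
stand-ins, and it assumes (as the patch implies) that a nonzero return from
the rx function means the buffer was taken over by the stack.

```c
#include <stdbool.h>

struct fake_skb {
	int freed;	/* how often the buffer was freed */
	int consumed;	/* how often the stack took ownership */
};

/* Stand-in for the rx function: returns 1 when it keeps the buffer. */
int fake_rx(struct fake_skb *skb, bool accept)
{
	if (accept) {
		skb->consumed++;
		return 1;
	}
	return 0;
}

void fake_free(struct fake_skb *skb)
{
	skb->freed++;
}

/* Old flow: a frame that is not for us is neither consumed nor freed. */
void deliver_old(struct fake_skb *skb, bool for_us, bool accept)
{
	if (for_us)
		if (!fake_rx(skb, accept))
			fake_free(skb);
}

/* Patched flow: every path either hands the buffer on or frees it. */
void deliver_fixed(struct fake_skb *skb, bool for_us, bool accept)
{
	if (for_us) {
		if (!fake_rx(skb, accept))
			fake_free(skb);
	} else {
		fake_free(skb);
	}
}
```

In this model a buffer is handled correctly when freed + consumed equals
exactly 1 after delivery; deliver_old() leaves both at 0 for frames that are
not for us, which is the leak.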


cfs works fine for me

2007-04-24 Thread Hemmann, Volker Armin
Hello,

I have tried the cfs patches with 2.6.20.7 in the last days.

I am using KDE 3.5.6, gentoo unstable and have a dual core AMD64 system with 
1GB ram and a nvidia card (using the closed source drivers, yes I suck, but I 
love playing 3d games once in a while).

I don't have interactivity problems with plain kernel.org kernels (except when 
swapping a lot, swapping really sucks)
My system works well and is stable.

With the cfs patches, my system continues to work well. I have not seen any 
regressions, desktop is snappy, emerge'ing stuff (niced to +19) does not 
hurt and unreal tournament 2004 is as fast (or slow, depends on the 
situation) as always. It even looks like FPS under heavy stress (like 
onslaught torlan when lots of bots and me are fighting at a powernode) does 
not go down as low as with the mainline scheduler. Not a big difference, but 
it is there (20-25 with plain kernel.org kernel in extreme situations, 
compared to >30 with the cfs patches). Maybe I did not hit the worst case, 
playing is a little bit restricted at the moment - my wrist and elbow hate 
me, but it looks promising. Apart from the worst case scenarios, FPS are more 
or less the same.

My usage consisted of surfing the web with konqueror, watching videos with 
xine and mplayer, using kmail (with tens of thousands of mails in different 
folders), looking at pictures with kuickshow, installing XFCE, assorted 
updates, typing lots and lots of stuff in kate and web forums, listening to 
mp3/ogg with amarok, playing pysol/kpat/lgeneral/wesnoth/ut2004/freecol, a 
lot of that parallel (not ut2004... I don't want to hurt my precious fps...).

Again, my system worked fine with the 'normal' scheduler, from the stuff I 
read in the lkml archives I must be some special kind of guy, so there was no 
improvement on the 'feels snappy or not' front, but there are also no 
regressions. So from my point of view, everything is fine with cfs and I 
would not mind having it as default scheduler. 

If you want specs of my hardware, my kernel config or any other information, 
just send me an email. I am not subscribed to lkml, nor can I read any of its 
archives in the next couple of days, which is one reason why I don't answer 
in one of the existing threads (I don't even know if there are some at the 
moment), so in case of an answer cc'ing me would be nice.

Glück Auf
Volker


[REPORT] cfs-v5 vs sd-0.46

2007-04-24 Thread Michael Gerdau
Hi list,

with cfs-v5 finally booting on my machine I have run my daily
numbercrunching jobs on both cfs-v5 and sd-0.46, 2.6.21-v7 on
top of a stock openSUSE 10.2 (X86_64). Config for both kernel
is the same except for the X boost option in cfs-v5 which on
my system didn't work (X still was @ -19; I understand this will
be fixed in -v6). HZ is 250 in both.

System is a Dell XPS M1710, Intel Core2 2.33GHz, 4GB,
NVIDIA GeForce Go 7950 GTX with proprietary driver 1.0-9755

I'm running three single threaded perl scripts that do double
precision floating point math with little i/o after initially
loading the data.

Both cfs and sd showed very similar behavior when monitored in top.
I'll show a more or less representative excerpt from a 10-minute
log, delay 3 sec.

sd-0.46
top - 00:14:24 up  1:17,  9 users,  load average: 4.79, 4.95, 4.80
Tasks:   3 total,   3 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.8%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.2%hi,  0.0%si,  0.0%st
Mem:   3348628k total,  1648560k used,  1700068k free,    64392k buffers
Swap:  2097144k total,        0k used,  2097144k free,   828204k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6671 mgd       33   0 95508  22m 3652 R  100  0.7  44:28.11 perl
 6669 mgd       31   0 95176  22m 3652 R   50  0.7  43:50.02 perl
 6674 mgd       31   0 95368  22m 3652 R   50  0.7  47:55.29 perl



cfs-v5
top - 08:07:50 up 21 min,  9 users,  load average: 4.13, 4.16, 3.23
Tasks:   3 total,   3 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.5%us,  0.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   3348624k total,  1193500k used,  2155124k free,    32516k buffers
Swap:  2097144k total,        0k used,  2097144k free,   545568k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6357 mgd       20   0 92024  19m 3652 R  100  0.6   8:54.21 perl
 6356 mgd       20   0 91652  18m 3652 R   50  0.6  10:35.52 perl
 6359 mgd       20   0 91700  18m 3652 R   50  0.6   8:47.32 perl

 

What did surprise me is that cpu utilization had been spread 100/50/50
(round robin) most of the time. I did expect 66/66/66 or so.

What I also don't understand is the difference in load average, sd
constantly had higher values, the above figures are representative
for the whole log. I don't know which is better though.


Here are excerpts from a concurrently run vmstat 3 200:

sd-0.46
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa
 5  0      0 1702928  63664 827876    0    0     0    67  458 1350 100  0  0  0
 3  0      0 1702928  63684 827876    0    0     0    89  468 1362 100  0  0  0
 5  0      0 1702680  63696 827876    0    0     0   132  461 1598  99  1  0  0
 8  0      0 1702680  63712 827892    0    0     0    80  465 1180  99  1  0  0
 3  0      0 1702712  63732 827884    0    0     0    67  453 1005 100  0  0  0
 4  0      0 1702792  63744 827920    0    0     0    41  461 1138 100  0  0  0
 3  0      0 1702792  63760 827916    0    0     0    57  456 1073 100  0  0  0
 3  0      0 1702808  63776 827928    0    0     0   111  473 1095 100  0  0  0
 3  0      0 1702808  63788 827928    0    0     0    81  461 1092  99  1  0  0
 3  0      0 1702188  63808 827928    0    0     0   160  463 1437  99  1  0  0
 3  0      0 1702064  63884 827900    0    0     0   229  479 1125  99  0  0  0
 4  0      0 1702064  63912 827972    0    0     1    77  460 1108 100  0  0  0
 7  0      0 1702032  63920 828000    0    0     0    40  463 1068 100  0  0  0
 4  0      0 1702048  63928 828008    0    0     0    68  454 1114 100  0  0  0
11  0      0 1702048  63928 828008    0    0     0     0  458 1001 100  0  0  0
 3  0  0 1701500  63960 82802000 0   

Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-24 Thread Cornelia Huck
On Tue, 24 Apr 2007 15:00:42 +1000,
Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote:

> Like anything else, modules should have separated the entrypoints for
> 
>  - Initiating a removal request
>  - Releasing the module
> 
> The former is when the user did "rmmod", can unregister things from
> subsystems, etc... (and can fail if the driver decides to refuse removal
> requests when it's busy doing things or whatever policy that module wants
> to implement).
> 
> The latter is called when all references to the module have been
> dropped, it's a bit like the kref "release" (and could be implemented as
> one).

That sounds quite similar to the problems we have with kobject
refcounting vs. module unloading. The patchset I posted at
http://marc.info/?l=linux-kernel&m=117679014404994&w=2 exposes the
refcount of the kobject embedded in the module. Maybe the kthread code
could use that reference as well?
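
A rough user-space sketch of the two entry points as I understand the
proposal. Nothing here is kernel API: fake_module and both functions are
hypothetical, and the "released" flag stands in for a kref-style release
callback.

```c
#include <stdbool.h>

struct fake_module {
	int refs;		/* load reference plus one per user */
	bool busy;		/* module policy: refuse removal while set */
	bool removal_requested;
	bool released;		/* set once the last reference is gone */
};

/* Drop one reference; the release step runs only on the last put. */
void module_put_ref(struct fake_module *m)
{
	if (--m->refs == 0)
		m->released = true;
}

/*
 * Entry point 1: removal request (the "rmmod" step). It may fail, and
 * on success it only gives up the reference taken at load time.
 */
int module_request_removal(struct fake_module *m)
{
	if (m->busy || m->removal_requested)
		return -1;
	m->removal_requested = true;
	module_put_ref(m);
	return 0;
}
```

A thread still running module code would hold its own reference, so the
release step cannot happen underneath it; it runs when that thread does its
final module_put_ref().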


Re: NonExecutable Bit in 32Bit

2007-04-24 Thread Tuncer Ayaz

On 4/24/07, William Heimbigner <[EMAIL PROTECTED]> wrote:

> On Tue, 24 Apr 2007, Cestonaro, Thilo (external) wrote:
>
> > Hey,
> >
> > is it right, that the NX Bit is not used under i386-Arch but
> > under x86_64-Arch?
> > When yes, is there a special argument for it not to be used?
> >
> > Ciao Thilo
> I don't think so - some i386 cpus definitely have support for
> the NX bit.



In detail:
1) if your CPU has NX support (some 32bit Xeons do)
2) it is not disabled in the BIOS
3) you see 'nx' in the 'flags' line in /proc/cpuinfo
4) and you have a kernel with the following config options
CONFIG_HIGHMEM64G=y
CONFIG_HIGHMEM=y
CONFIG_X86_PAE=y

NX should just work.
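
Step 3 can be checked programmatically. The helper below is a hypothetical
sketch, not an existing API: it takes the text of a 'flags' line (for example
one read from /proc/cpuinfo) and looks for a flag as a whole word.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `flag` appears in `line` as a whole word, i.e.
 * delimited by blanks, a tab, a newline or the ends of the string.
 */
bool cpu_flags_has(const char *line, const char *flag)
{
	size_t n = strlen(flag);
	const char *p = line;

	if (n == 0)
		return false;

	while ((p = strstr(p, flag)) != NULL) {
		bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[n] == '\0' || p[n] == ' ' ||
			    p[n] == '\t' || p[n] == '\n';

		if (starts && ends)
			return true;
		p++;	/* partial match (e.g. inside another flag), keep looking */
	}
	return false;
}
```

The whole-word check matters because a plain substring search for "nx" would
also match inside longer flag names.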

[snip]


Re: [ofa-general] [PATCH] eHCA: Add "Modify Port" verb

2007-04-24 Thread Christoph Raisch

Hi Hal,
you are correct,
with the current firmware version it will fail later.

Christoph R.

[EMAIL PROTECTED] wrote on 23.04.2007 18:55:59:

> Hi Joachim,
>
> On Mon, 2007-04-23 at 12:23, Joachim Fenkes wrote:
> > Add "Modify Port" verb support to eHCA driver.
> > ib_cm needs this to initialize properly.
>
> I didn't think IB_PORT_SM was allowed (as QP0 is not exposed) or does
> this just fail later when it is attempted to be actually set ?
>
> -- Hal



Re: [PATCH 0/9] Kconfig: cleanup s390 v2.

2007-04-24 Thread Martin Schwidefsky
On Mon, 2007-04-23 at 10:45 -0700, Andrew Morton wrote:
> > Andrew: I plan to add patches 1-5 to the for-andrew branch of the
> > git390 repository if that is fine with you. The only thing that will
> > be missing in the tree is the patch that disables wireless for s390.
> > The code does compile but without hardware it is moot to have the
> > config options. I'll wait until the git-wireless.patch is upstream.
> > Patches 7-9 depend on patches found in -mm.
> > 
> 
> umm, OK.  If it's Ok I think I'll duck it for now: -mm is full.
> 
> Over-full, really: I've been working basically continuously since Friday
> getting the current dungpile to compile and boot, and it's still miles away
> from that.

I understand. I'll wait until -mm is a little bit smaller again. It is
just that someday I want to finish with the Kconfig cleanup, it has been
sitting on my hard drive for ages now.

-- 
blue skies,              IBM Deutschland Entwicklung GmbH
   Martin                Vorsitzender des Aufsichtsrats: Johann Weihen
                         Geschäftsführung: Herbert Kircher
Martin Schwidefsky       Sitz der Gesellschaft: Böblingen
Linux on zSeries         Registergericht: Amtsgericht Stuttgart,
   Development           HRB 243294

"Reality continues to ruin my life." - Calvin.




Re: [REPORT] cfs-v5 vs sd-0.46

2007-04-24 Thread Ingo Molnar

* Michael Gerdau <[EMAIL PROTECTED]> wrote:

> I'm running three single threaded perl scripts that do double 
> precision floating point math with little i/o after initially loading 
> the data.

thanks for the testing!

> What I also don't understand is the difference in load average, sd 
> constantly had higher values, the above figures are representative for 
> the whole log. I don't know which is better though.

hm, it's hard from here to tell that. What load average does the vanilla 
kernel report? I'd take that as a reference.

> Here are excerpts from a concurrently run vmstat 3 200:
> 
> sd-0.46
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa
>  5  0      0 1702928  63664 827876    0    0     0    67  458 1350 100  0  0  0
>  3  0      0 1702928  63684 827876    0    0     0    89  468 1362 100  0  0  0
>  5  0      0 1702680  63696 827876    0    0     0   132  461 1598  99  1  0  0
>  8  0      0 1702680  63712 827892    0    0     0    80  465 1180  99  1  0  0

> cfs-v5
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa
>  6  0      0 2157728  31816 545236    0    0     0   103  543  748 100  0  0  0
>  4  0      0 2157780  31828 545256    0    0     0    63  435  752 100  0  0  0
>  4  0      0 2157928  31852 545256    0    0     0   105  424  770 100  0  0  0
>  4  0      0 2157928  31868 545268    0    0     0   261  457  763 100  0  0  0

interesting - CFS has half the context-switch rate of SD. That is 
probably because on your workload CFS defaults to longer 'timeslices' 
than SD. You can influence the 'timeslice length' under SD via 
/proc/sys/kernel/rr_interval (milliseconds units) and under CFS via 
/proc/sys/kernel/sched_granularity_ns. On CFS the value is not 
necessarily the timeslice length you will observe - for example in your 
workload above the granularity is set to 5 msec, but your rescheduling 
rate is 13 msecs. SD defaults to a rr_interval value of 8 msecs, which in 
your workload produces a timeslice length of 6-7 msecs.

so to be totally 'fair' and get the same rescheduling 'granularity' you 
should probably lower CFS's sched_granularity_ns to 2 msecs.
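
As a back-of-the-envelope check of that figure, assuming (and this is only an
assumption for the sketch, CFS does not promise it) that the observed
rescheduling interval scales linearly with the granularity setting. The
function name is made up for this example.

```c
#include <stdint.h>

/*
 * Scale a granularity setting so that an observed rescheduling interval
 * of observed_ns would become target_ns, assuming the two are simply
 * proportional. All values are in nanoseconds.
 */
uint64_t scale_granularity_ns(uint64_t current_ns, uint64_t observed_ns,
			      uint64_t target_ns)
{
	if (observed_ns == 0)
		return current_ns;	/* nothing measured, leave it alone */
	return current_ns * target_ns / observed_ns;
}
```

With the numbers above (5 msec setting, 13 msec observed, 6.5 msec wanted)
this gives 2.5 msec, which is in the same ballpark as the 2 msecs suggested.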

> Last but not least I'd like to add that at least on my system having X 
> niced to -19 does result in kind of "erratic" (for lack of a better 
> word) desktop behavior. I will reevaluate this with -v6 but for now 
> IMO nicing X to -19 is a regression at least on my machine despite the 
> claim that cfs doesn't suffer from it.

indeed with -19 the rescheduling limit is so high under CFS that it does 
not throttle X's scheduling rate enough and so it will make CFS behave 
as badly as other schedulers.

I retested this with -10 and it should work better with that. In -v6 i 
changed the default to -10 too.

> PS: Only learning how to test these things I'm happy to get pointed 
> out the shortcomings of what I tested above. Of course suggestions for 
> improvements are welcome.

your report was perfectly fine and useful. "no visible regressions" is 
valuable feedback too. [ In fact, such type of feedback is the one i 
find the easiest to resolve ;-) ]

Since you are running number-crunchers you might be able to give
performance feedback too: do you have any reliable 'performance metric'
available for your number cruncher jobs (ops per minute, runtime, etc.),
so that it would be possible to compare the number-crunching performance
of mainline to SD and to CFS as well? That is, if the value is easy to
get and reliable/stable enough to be meaningful. (And it would be nice
to also establish some ballpark figure about how much noise there is in
any performance metric, so that we can see whether any differences
between schedulers are systematic or not.)

Ingo


cpufreq default governor

2007-04-24 Thread William Heimbigner
Question: is there some reason that kconfig does not allow for default 
governors of conservative/ondemand/powersave?
I'm not aware of any reason why one of those governors could not be used 
as default.


William Heimbigner
[EMAIL PROTECTED]


Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

2007-04-24 Thread Jiri Kosina
On Tue, 24 Apr 2007, Herbert Xu wrote:

> > Hmm, *sigh*. I guess the patch below fixes the problem, but it is a 
> > masterpiece in the field of ugliness. And I am not sure whether it is 
> > completely correct either. Are there any immediate ideas for better 
> > solution with respect to how struct sock locking works?
> Please cc such patches to netdev.  Thanks.

Hi Herbert,

well it's pretty much bluetooth-specific, and bluez-devel was CCed, but 
OK.

> > diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
> > index 71f5cfb..c5c93cd 100644
> > --- a/net/bluetooth/hci_sock.c
> > +++ b/net/bluetooth/hci_sock.c
> > @@ -656,7 +656,10 @@ static int hci_sock_dev_event(struct notifier_block 
> > *this, unsigned long event,
> >/* Detach sockets from device */
> >read_lock(&hci_sk_list.lock);
> >sk_for_each(sk, node, &hci_sk_list.head) {
> > -   lock_sock(sk);
> > +   if (in_atomic())
> > +   bh_lock_sock(sk);
> > +   else
> > +   lock_sock(sk);
> 
> This doesn't do what you think it does.  bh_lock_sock can still succeed
> even with lock_sock held by someone else.

I know, this was precisely the reason why I converted the bh_lock_sock() 
to lock_sock() here some time ago (as it was racy with 
l2cap_connect_cfm()).

> Does this need to occur immediately when an event occurs? If not I'd
> suggest moving this into a workqueue.

Will have to check whether this will be processed properly in time when 
going to suspend.

Thanks,

-- 
Jiri Kosina


Re: [patch 1/7] libata: check for AN support

2007-04-24 Thread Tejun Heo
Hello,

Kristen Carlson Accardi wrote:
>  static unsigned int ata_print_id = 1;
> @@ -1744,6 +1745,23 @@ int ata_dev_configure(struct ata_device 
>   }
>   dev->cdb_len = (unsigned int) rc;
>  
> + /*
> +  * check to see if this ATAPI device supports
> +  * Asynchronous Notification
> +  */
> + if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id))
> + {
> + /* issue SET feature command to turn this on */
> + rc = ata_dev_set_AN(dev);

Please don't store err_mask into int rc.  Please store it to a separate
err_mask variable and report it when printing error message.

> + if (rc) {
> + ata_dev_printk(dev, KERN_ERR,
> + "unable to set AN\n");
> + rc = -EINVAL;

Wouldn't -EIO be more appropriate?
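
A minimal userspace sketch of both suggestions, assuming (as the comment
above implies) that ata_dev_set_AN() returns an err_mask rather than an
errno. set_an_stub() and configure_an() are hypothetical stand-ins, and
-EIO is only the value proposed above, not a settled choice:

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for ata_dev_set_AN(): error mask, 0 on success. */
static unsigned int set_an_stub(unsigned int simulated_err_mask)
{
	return simulated_err_mask;
}

/*
 * Keep the error mask in its own variable, include it in the message,
 * and translate it to a negative errno only at the point of return.
 */
static int configure_an(unsigned int simulated_err_mask, char *msg, size_t len)
{
	unsigned int err_mask;

	if (len)
		msg[0] = '\0';

	err_mask = set_an_stub(simulated_err_mask);
	if (err_mask) {
		snprintf(msg, len, "unable to set AN (err_mask=0x%x)", err_mask);
		return -EIO;
	}
	return 0;
}
```

This way rc never holds a value from the err_mask domain, and the log
line says which error bits were actually set.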

> + goto err_out_nosup;
> + }
> + dev->flags |= ATA_DFLAG_AN;
> + }
> +

Not NACKing.  Just notes for future improvements.  We need to be more
careful here.  The ATA/ATAPI world is filled with braindamaged devices
and I bet there are devices which advertise that they can do AN but
choke when AN is enabled.

This should be handled similarly to ACPI failure.  Currently ACPI does
the following.

1. try once, if fail, record that ACPI failed.  return error to trigger
retry.
2. try again, if fail again, ignore error if possible (!FROZEN) and turn
off ACPI.

This fallback mechanism for optional features can probably be
generalized and used for both ACPI and AN.
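
The two steps above could be sketched roughly as follows. This is plain
userspace C for illustration only, not libata code: struct opt_feature,
opt_feature_try() and the two hooks are hypothetical names, and
can_ignore stands in for the "if possible (!FROZEN)" condition.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device bookkeeping for one optional feature. */
struct opt_feature {
	bool failed_once;	/* an earlier attempt failed */
	bool disabled;		/* given up: feature is turned off */
};

/* Example hooks standing in for the real "enable" operation. */
static int hook_ok(void *arg)   { (void)arg; return 0; }
static int hook_fail(void *arg) { (void)arg; return -1; }

/*
 * One configuration pass. Returns 0 to carry on, -1 to make the caller
 * retry. can_ignore models the "if possible (!FROZEN)" condition.
 */
static int opt_feature_try(struct opt_feature *f, int (*enable)(void *),
			   void *arg, bool can_ignore)
{
	if (f->disabled)
		return 0;		/* already turned off, skip it */
	if (enable(arg) == 0)
		return 0;		/* feature is on */
	if (!f->failed_once) {
		f->failed_once = true;	/* step 1: record, trigger retry */
		return -1;
	}
	if (!can_ignore)
		return -1;		/* cannot paper over it here */
	f->disabled = true;		/* step 2: ignore error, turn off */
	return 0;
}
```

A device that always fails gets -1 on the first pass, 0 with disabled
set on the second, and is left alone after that. The same helper could
then serve any optional feature that has an "enable" operation.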

-- 
tejun


Re: [patch 1/7] libata: check for AN support

2007-04-24 Thread Alan Cox
> + /*
> +  * check to see if this ATAPI device supports
> +  * Asynchronous Notification
> +  */
> + if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id))
> + {

Bracketing police ^^^

> + /* issue SET feature command to turn this on */
> + rc = ata_dev_set_AN(dev);
> + if (rc) {
> + ata_dev_printk(dev, KERN_ERR,
> + "unable to set AN\n");
> + rc = -EINVAL;
> + goto err_out_nosup;

How fatal is this - do we need to ignore the device at this point or
should we just pretend (possibly correctly) that the device itself does
not support notification?

> @@ -299,6 +305,8 @@ struct ata_taskfile {
>  #define ata_id_queue_depth(id)   (((id)[75] & 0x1f) + 1)
>  #define ata_id_removeable(id)((id)[0] & (1 << 7))
>  #define ata_id_has_dword_io(id)  ((id)[50] & (1 << 0))
> +#define ata_id_has_AN(id)\
> + ((id[76] && (~id[76])) & ((id)[78] & (1 << 5)))

Might be nice to check ATA version as well to be paranoid but this all
looks ok as it's a reserved field since way back when.
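
One thing that may be worth double-checking in the macro as quoted: the
left-hand operand is a logical expression (0 or 1) and the right-hand
one is either 0 or 0x20, so a bitwise & between them can never be
non-zero. Also, if id is an array of 16-bit words, ~id[76] is evaluated
after integer promotion, so it is non-zero even for 0xffff. The
standalone sketch below reproduces the posted macro next to one possible
rewrite. The _posted/_sketch names and the rewrite itself are
assumptions about the intent (word 76 neither 0x0000 nor 0xffff, bit 5
of word 78 set), not code from the patch:

```c
#include <stdint.h>

/* The macro as posted in the patch, reproduced for illustration. */
#define ata_id_has_AN_posted(id) \
	((id[76] && (~id[76])) & ((id)[78] & (1 << 5)))

/*
 * One possible rewrite (an assumption about the intent): word 76 must
 * be neither 0x0000 nor 0xffff, and bit 5 of word 78 must be set.
 * Uses && throughout, so the result is 0 or 1.
 */
#define ata_id_has_AN_sketch(id) \
	(((id)[76] != 0x0000 && (id)[76] != 0xffff) && \
	 ((id)[78] & (1 << 5)))
```

For an identify buffer of uint16_t words with id[76] = 0x0100 and bit 5
of id[78] set, the posted form evaluates to 1 & 0x20, which is 0, while
the sketch evaluates to 1.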



Re: [patch 2/7] genhd: expose AN to user space

2007-04-24 Thread Tejun Heo
Kristen Carlson Accardi wrote:
> +static struct disk_attribute disk_attr_capability = {
> + .attr = {.name = "capability_flags", .mode = S_IRUGO },
> + .show   = disk_capability_read
> +};

How about just "capability"?  I think that would be more consistent with
other attributes.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 7/7] libata: send event when AN received

2007-04-24 Thread Alan Cox
> + /* check the 'N' bit in word 0 of the FIS */
> + if (f[0] & (1 << 15)) {
> + int port_addr =  ((f[0] & 0x0f00) >> 8);
> + struct ata_device *adev = >device[port_addr];

You can't be sure that the port_addr returned will be in range if a
device is malfunctioning...
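
A range check along these lines would cover it. This is a userspace
sketch under assumptions: an_port_addr() is a hypothetical helper, and
n_devices stands for however many entries the device array really has,
which the quoted hunk does not show.

```c
#include <stdint.h>

/*
 * Return the device index encoded in bits 11:8 of FIS word 0 when the
 * 'N' bit (bit 15) is set and the index is below n_devices; otherwise
 * return -1 so the caller can drop the notification.
 */
static int an_port_addr(uint32_t fis0, unsigned int n_devices)
{
	unsigned int port_addr;

	if (!(fis0 & (1u << 15)))
		return -1;			/* 'N' bit not set */

	port_addr = (fis0 & 0x0f00) >> 8;	/* 4-bit field: 0..15 */
	if (port_addr >= n_devices)
		return -1;			/* out of range: ignore */

	return (int)port_addr;
}
```

The field is four bits wide, so a misbehaving device can report any
index up to 15 regardless of how many devices the port actually has.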
