Re: Determine version of kernel that produced vmcore

2007-07-23 Thread Ken'ichi Ohmichi

Hi Bernhard,

2007/07/19 01:10:50 +0200, Bernhard Walle <[EMAIL PROTECTED]> wrote:
>[1] didn't we agree to vmcoreinfo?

I agree to vmcoreinfo.
I'll rename makedumpfile's config file to "vmcoreinfo".


Thanks
Ken'ichi Ohmichi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] compat_ioctl requires CONFIG_BLOCK

2007-07-23 Thread Jens Axboe
On Mon, Jul 23 2007, Andrew Morton wrote:
> On Sat, 21 Jul 2007 01:08:57 +0200
> Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> 
> > On Saturday 21 July 2007, Sebastian Siewior wrote:
> > > 
> > > Got with randconfig
> > > include/linux/loop.h:66: error: expected specifier-qualifier-list before
> > > 'request_queue_t'
> > > make[1]: *** [fs/compat_ioctl.o] Error 1
> > > 
> > > parts of compat ioctl require CONFIG_BLOCK to be set.
> > > 
> > > Signed-off-by: Sebastian Siewior <[EMAIL PROTECTED]>
> > > Index: b/fs/compat_ioctl.c
> > > ===
> > > --- a/fs/compat_ioctl.c
> > > +++ b/fs/compat_ioctl.c
> > > @@ -63,7 +63,9 @@
> > > __#include 
> > > __#include 
> > > __#include 
> > > +#ifdef CONFIG_BLOCK
> > > __#include 
> > > +#endif
> > 
> > Adding #ifdef around an #include is considered bad style. Better just
> > make loop.h compile without any conditionals. Does the below
> > patch work for you?
> 
> This is the classic why-typedefs-are-bad.  AFAIK there is no way of fixing
> this build error apart from adding otherwise-unneeded nested inclusions
> (very bad), or:
> 
> > Arnd <><
> > 
> > --- a/include/linux/loop.h
> > +++ b/include/linux/loop.h
> > @@ -63,7 +63,7 @@ struct loop_device {
> > struct task_struct  *lo_thread;
> > wait_queue_head_t   lo_event;
> >  
> > -   request_queue_t *lo_queue;
> > +   struct request_queue*lo_queue;
> > struct gendisk  *lo_disk;
> > struct list_headlo_list;
> >  };
> 
> Good start.  Now can we do the rest of the kernel?  ;)

Yep indeed, it's been my plan to kill that ugly typedef for some time...

-- 
Jens Axboe
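
A minimal sketch of why the "struct request_queue *" form sidesteps the
build error. A pointer to a struct type needs only a forward declaration,
whereas a typedef name is unknown unless the header that defines it has been
included. The names demo_loop_device and demo_has_queue are hypothetical and
exist only for illustration.

```c
#include <stddef.h>

/* Forward declaration: enough to declare pointers to the type,
 * without including the header that defines the struct body. */
struct request_queue;

/* Hypothetical stand-in for struct loop_device. */
struct demo_loop_device {
	struct request_queue *lo_queue;	/* compiles with the incomplete type */
};

/* Returns 1 if the device has a queue attached, 0 otherwise. */
static int demo_has_queue(const struct demo_loop_device *lo)
{
	return lo->lo_queue != NULL;
}
```

With a typedef such as request_queue_t there is no equivalent of the forward
declaration unless the typedef itself is repeated or the defining header is
pulled in.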



Re: [PATCH] add __GFP_ZERO to GFP_LEVEL_MASK

2007-07-23 Thread Peter Zijlstra
On Tue, 2007-07-24 at 08:01 +0200, Peter Zijlstra wrote:

> Then we can either fixup the slab allocators to mask out __GFP_ZERO, or
> do something like the below.
> 
> Personally I like the consistency of adding __GFP_ZERO here (removes
> this odd exception) and just masking it in the sl[aou]b thingies.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/gfp.h |2 +-
 mm/slab.c   |4 +++-
 mm/slob.c   |2 ++
 mm/slub.c   |4 +++-
 4 files changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6-2/include/linux/gfp.h
===
--- linux-2.6-2.orig/include/linux/gfp.h
+++ linux-2.6-2/include/linux/gfp.h
@@ -56,7 +56,7 @@ struct vm_area_struct;
 /* if you forget to add the bitmask here kernel will crash, period */
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
-   __GFP_NOFAIL|__GFP_NORETRY|__GFP_COMP| \
+   __GFP_NOFAIL|__GFP_NORETRY|__GFP_COMP|__GFP_ZERO| \
__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE| \
__GFP_MOVABLE)
 
Index: linux-2.6-2/mm/slab.c
===
--- linux-2.6-2.orig/mm/slab.c
+++ linux-2.6-2/mm/slab.c
@@ -2739,11 +2739,13 @@ static int cache_grow(struct kmem_cache 
gfp_t local_flags;
struct kmem_list3 *l3;
 
+   flags &= ~__GFP_ZERO; /* slab has its own object zeroing */
+
/*
 * Be lazy and only check for valid flags here,  keeping it out of the
 * critical path in kmem_cache_alloc().
 */
-   BUG_ON(flags & ~(GFP_DMA | __GFP_ZERO | GFP_LEVEL_MASK));
+   BUG_ON(flags & ~(GFP_DMA | GFP_LEVEL_MASK));
 
local_flags = (flags & GFP_LEVEL_MASK);
/* Take the l3 list lock to change the colour_next on this node */
Index: linux-2.6-2/mm/slob.c
===
--- linux-2.6-2.orig/mm/slob.c
+++ linux-2.6-2/mm/slob.c
@@ -223,6 +223,8 @@ static void *slob_new_page(gfp_t gfp, in
 {
void *page;
 
+   gfp &= ~__GFP_ZERO; /* slob has its own object zeroing */
+
 #ifdef CONFIG_NUMA
if (node != -1)
page = alloc_pages_node(node, gfp, order);
Index: linux-2.6-2/mm/slub.c
===
--- linux-2.6-2.orig/mm/slub.c
+++ linux-2.6-2/mm/slub.c
@@ -1078,7 +1078,9 @@ static struct page *new_slab(struct kmem
void *last;
void *p;
 
-   BUG_ON(flags & ~(GFP_DMA | __GFP_ZERO | GFP_LEVEL_MASK));
+   flags &= ~__GFP_ZERO; /* slab has its own object zeroing */
+
+   BUG_ON(flags & ~(GFP_DMA | GFP_LEVEL_MASK));
 
if (flags & __GFP_WAIT)
local_irq_enable();
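
The masking pattern used in the patch above can be sketched in isolation.
This is an illustration only: the DEMO_* flag values and helper names are
hypothetical, and the real definitions live in include/linux/gfp.h.

```c
/* Hypothetical flag values, chosen only for this sketch. */
#define DEMO_GFP_WAIT   0x01u
#define DEMO_GFP_ZERO   0x02u
#define DEMO_GFP_DMA    0x04u
/* The level mask now includes the zeroing flag, as the patch proposes. */
#define DEMO_LEVEL_MASK (DEMO_GFP_WAIT | DEMO_GFP_ZERO)

/* Drop the zeroing request before the flags reach the page allocator;
 * the slab-style allocator zeroes objects itself. */
static unsigned int demo_strip_zero(unsigned int flags)
{
	return flags & ~DEMO_GFP_ZERO;
}

/* Nonzero if flags carry a bit outside the allowed set
 * (the condition the BUG_ON() in the patch checks). */
static int demo_has_invalid_bits(unsigned int flags)
{
	return (flags & ~(DEMO_GFP_DMA | DEMO_LEVEL_MASK)) != 0;
}
```

Because the flag is part of the mask, the validity check no longer needs to
name it as a special case.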





Re: Documentation for sysfs, hotplug, and firmware loading.

2007-07-23 Thread Greg KH
On Sat, Jul 21, 2007 at 02:14:41PM +0200, Bodo Eggert wrote:
> Greg KH <[EMAIL PROTECTED]> wrote:
> > On Fri, Jul 20, 2007 at 08:21:39PM -0400, Rob Landley wrote:
> 
> >> I'm not trying to document /sys/devices.  I'm trying to document hotplug,
> >> populating /dev, and things like firmware loading that fall out of that.
> >> This requires use of sysfs, and I'm only trying to document as much of 
> >> sysfs
> >> as you need to do that.
> > 
> > Like I stated before, you do not need to even have sysfs mounted to have
> > a dynamic /dev.
> > 
> > And why do you need to document populating /dev dynamically?  udev
> > already solves this problem for you, it's not like people who are going off
> > and reinventing udev for their own enjoyment would not at least look at
> > how it solves this problem first.
> 
> Turning your words around, you get: "Whatever one of these programs does
> documents how dynamic devices should be handled." If this is true, any
> change that makes one of these programs break is a kernel bug.

Not at all.  The kernel changed things numerous times that showed up as
bugs in udev, and I fully admit that (and have in the past, numerous
times.)

> Besides that: How am I supposed to be able to correctly change udev if
> there is no document telling me what would work and what happens to
> work by accident?

Um, the same way you change any codebase?  :)

> > To do otherwise would be foolish :)
> 
> Some people like to fool around and create even smaller wheels.
> E.g. I'm changing the ACPI button driver to just call Ctrl_alt_del
> in order not to have an extra process running and free 0.2 % of my RAM.

That's great, and I have nothing against that, and encourage you to do
so.

But I don't suppose you are trying to complain to the ACPI developers
about this whole thing now are you?  Are you hassling them and
demanding that they fully document their interfaces that you need to use
so that you can hook into their code differently than they wish you to
do so?

> > Firmware loading is fine to document if you wish to do so.  But again,
> > why?  We already have multiple userspace programs that provide this
> > feature for them.  Perhaps you want to document how to add firmware to a
> > system in order for these different programs to pick them up?
> 
> I once tried to install a firmware for hotplug. Even finding the place where
> I'm supposed to put it was harder than rewriting that *beep* from start,
> but I could not rewrite it because I didn't have any documentation.

The firmware layer has never been fully documented, and the maintainer
of the code died a few years ago.  It has been well known that this is
one area of the kernel that needs a lot of attention and help.  Please
feel free to chip in if you can do so.

> Even digging in that pile of wrapper scripts in order to debug that thing
> was a nightmare. (Having a number of places where the firmware will be
> expected in one of many versions and formats stored using one of many
> filenames can drive you nuts.)

I fully agree.

> > Or perhaps you want to document how to add this kind of functionality to
> > your kernel driver so that it can handle firmware loading by using the
> > firmware interface that the kernel provides?
> 
> I suppose that's missing, too. Or scattered in a number of contradicting
> and mostly outdated howtos across the internet.

It probably is, hence my suggestion on something that would be very
valuable to have documented.

> > If you just want to document the hotplug/uevent api, then do just that.
> > However I think you are overreaching with your scope here and getting
> > mighty confused in the process.
> 
> In other words: Grasping sysfs is not a feasible task? If this is true,
> how can anybody reliably use sysfs?

Huh, I never stated that at all.  If you wish to fully document sysfs
and how it works, then great, do that.  But that was not the stated
intent of this document, and is why I think the author got confused as
he was attempting to put a narrow portion of how sysfs works as a
reflection on how the whole of the body works.

thanks,

greg k-h


Re: [kvm-devel] [RFC 0/8]KVM: swap out guest pages

2007-07-23 Thread Rusty Russell
On Tue, 2007-07-24 at 09:21 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > Actually, get_user_pages() does that for you.  You have to make R/O any
> > writable pte where the guest doesn't set the dirty bit (so you can trap
> > it later) but last I put a printk in there, Linux doesn't do that.
> 
> Don't understand.  You mean Linux always sets the dirty bit when it
> makes a page writable?  Surely some mistake.
> 
> It probably does do so on demand write faults, but I'm sure the dirty
> bit can get cleaned out by the swapper.

Yeah, me dumb.  I should put that printk back and try doing a kernel
compile.

> > If not, it does get harder.  A callback in the mm struct to say "I want
> > to swap your page out" is required if we don't take a reference to the
> > page.  Dirty bit handling would be an interesting issue (maybe the
> > callback can say "No!" and dirty the page again?).
> 
> Since we have rmap, I don't see that as an issue.  Given a page, we can
> easily drop all refs.  Though lguest doesn't do that, right?

Yeah, rmap might maul some puppies.  I could do poor man's rmap tho with
one backref and a bit to say "there are more".  Then if that bit is set,
I just drop all 4 shadows 8)

> I'm also concerned with picking the correct page, but there's no good
> solution here.

But since you have rmap, if there was a cb when the page was
undirtied, you could undirty the ptes.  When the "I want to kick this
page out" cb comes along, see if one of the ptes is now dirty, dirty the
page and return "no".

Maybe it's too simplistic, but it might work.
Rusty.
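
A rough sketch of the two callbacks described above, under heavy
assumptions: demo_pte, demo_page and both function names are hypothetical
simplifications, not kernel structures or an existing interface.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a shadow pte and a page. */
struct demo_pte  { int dirty; };
struct demo_page { int dirty; struct demo_pte *ptes; size_t nr_ptes; };

/* Callback for "the page was undirtied": clear the dirty bit in
 * every pte that maps the page. */
static void demo_page_undirtied(struct demo_page *page)
{
	page->dirty = 0;
	for (size_t i = 0; i < page->nr_ptes; i++)
		page->ptes[i].dirty = 0;
}

/* Callback for "I want to kick this page out":
 * returns 1 to allow it, 0 to refuse. */
static int demo_may_swap_out(struct demo_page *page)
{
	for (size_t i = 0; i < page->nr_ptes; i++) {
		if (page->ptes[i].dirty) {
			page->dirty = 1;	/* transfer dirtiness to the page */
			return 0;		/* say "no" */
		}
	}
	return 1;
}
```

The sketch ignores locking and races between the check and the eviction,
which a real implementation would have to handle.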



Re: Determine version of kernel that produced vmcore

2007-07-23 Thread Ken'ichi Ohmichi

Hi Dan,

2007/07/23 16:02:39 +0300, Dan Aloni <[EMAIL PROTECTED]> wrote:
>> Dan Aloni, I'd like to cooperate with you for implementing this feature.
>> If you have some patches other than 2007/07/10 patches, could you send
>> me them ?  I will update them.
>
>Sure, though I haven't made new patches (was busy with other things). 
>Feel free to post new versions of these patches, I'll take a look and 
>cooperate with you on this.

OK. Now, I am testing a new makedumpfile including some corrections
and some features to release it in July. I'll start making new patches
after the release.


Thanks
Ken'ichi Ohmichi


sysfs/udev broken in latest git?

2007-07-23 Thread Simon Arlott
The following commit appears to break some of my udev rules (I don't 
have the time to finish the bisect right now, but there's only four 
changes showing in "git bisect visualize" - this one is tagged 
bisect/bad, and the other three are docs/docs/unrelated).

Neither of these symlinks get created by udev on kernels marked bad 
(see bisect log below):

ACTION=="add", \
KERNEL=="event*", \
SUBSYSTEM=="input", \
SYSFS{description}=="i8042 KBD port", \
NAME="input/%k", \
SYMLINK="input/i8042-kbd", \
MODE="0640", \
GROUP="event"

ACTION=="add", \
KERNEL=="event*", \
SUBSYSTEM=="input", \
SYSFS{manufacturer}=="Logitech", \
SYSFS{product}=="USB-PS/2 Optical Mouse", \
NAME="input/%k", \
SYMLINK="input/logitech-mouse", \
MODE="0640", \
GROUP="event"

Author: Cornelia Huck <[EMAIL PROTECTED]>  2007-07-18 09:43:47
Committer: Greg Kroah-Hartman <[EMAIL PROTECTED]>  2007-07-18 23:49:50
Parent: be3884943674f8ee7656b1d8b71c087ec900c836 (HOWTO: Add the 
knwon_regression URI to the documentation)

Driver core: check return code of sysfs_create_link()

Check for return value of sysfs_create_link() in device_add() and
device_rename().  Add helper functions device_add_class_symlinks() and
device_remove_class_symlinks() to make the code easier to read.

[EMAIL PROTECTED]: fix unused var warnings]

Signed-off-by: Cornelia Huck <[EMAIL PROTECTED]>
Acked-by: Jeff Garzik <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>

git-bisect start
# good: [2f493789ddc636ff19156b2752763d76ba563b39] Merge branch 'master' of 
git://git.kernel.org/pub/scm/linux$
git-bisect good 2f493789ddc636ff19156b2752763d76ba563b39
# bad: [a80ef3844fd9f87461578d98a992db6cc64369f8] IPv6: Don't update ADVMSS on 
routes where the MTU is not als$
git-bisect bad a80ef3844fd9f87461578d98a992db6cc64369f8
# bad: [febe3375ea690a6cf544c33fa0fea1a06ff451ee] [ALSA] hda-codec - Add HP 
Pavillion quirk to Realtek code
git-bisect bad febe3375ea690a6cf544c33fa0fea1a06ff451ee
# bad: [83c54070ee1a2d05c89793884bea1a03f2851ed4] mm: fault feedback #2
git-bisect bad 83c54070ee1a2d05c89793884bea1a03f2851ed4
# good: [a267c0a887064720dfab5775a4f09b20b4f8ec37] Merge branch 'master' of 
ssh://master.kernel.org/pub/scm/li$
git-bisect good a267c0a887064720dfab5775a4f09b20b4f8ec37
# good: [fc15bc817eecd5c13581adab2a182c07edededa0] Merge 
master.kernel.org:/pub/scm/linux/kernel/git/gregkh/ui$
git-bisect good fc15bc817eecd5c13581adab2a182c07edededa0
# bad: [789c56b7f73218141b8004cb4f775eed8c514212] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/$
git-bisect bad 789c56b7f73218141b8004cb4f775eed8c514212
# bad: [70b315b0dd3879cb3ab8aadffb14f10b2d19b9c3] [CIFS] merge conflict in 
fs/cifs/export.c
git-bisect bad 70b315b0dd3879cb3ab8aadffb14f10b2d19b9c3
# good: [3870253efb65e1960421ca74f5d336218c28fc5b] [CIFS] more whitespace fixes
git-bisect good 3870253efb65e1960421ca74f5d336218c28fc5b
# good: [4a379e6657ae2dd910f9f06d46bd7c05fbe9ed5c] [CIFS] Fix build break - 
inet.h not included when experimen$
git-bisect good 4a379e6657ae2dd910f9f06d46bd7c05fbe9ed5c
# good: [7e42ca886b0282679c2721dc4853163cc89b8a34] [CIFS] Typo in previous patch
git-bisect good 7e42ca886b0282679c2721dc4853163cc89b8a34
# good: [c18c842b1fdf527717303a4e173cbece7ab2deb8] [CIFS] Allow disabling CIFS 
Unix Extensions as mount option
git-bisect good c18c842b1fdf527717303a4e173cbece7ab2deb8
# good: [fc15bc817eecd5c13581adab2a182c07edededa0] Merge 
master.kernel.org:/pub/scm/linux/kernel/git/gregkh/ui$
git-bisect good fc15bc817eecd5c13581adab2a182c07edededa0

Something went wrong here and hit a single CIFS patch, so I started from 
second-last bad and the good before it.

# bad: [789c56b7f73218141b8004cb4f775eed8c514212] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/$
git-bisect bad 789c56b7f73218141b8004cb4f775eed8c514212
# bad: [2ee97caf0a6602f749ddbfdb1449e383e1212707] Driver core: check return 
code of sysfs_create_link()
git-bisect bad 2ee97caf0a6602f749ddbfdb1449e383e1212707
# good: [2c19c49a59ccf2162c0eb999de1ec60c0e07a533] Documentation fix 
devres.txt: lib/iomap.c -> lib/devres.c
git-bisect good 2c19c49a59ccf2162c0eb999de1ec60c0e07a533
# good: [aebdc3b450a3febf7d7d00cd2235509055ec7082] dev_vdbg(), available with 
-DVERBOSE_DEBUG
git-bisect good aebdc3b450a3febf7d7d00cd2235509055ec7082

-- 
Simon Arlott


config.gz
Description: GNU Zip compressed data


Re: [PATCH 03/10] readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

2007-07-23 Thread Fengguang Wu
On Mon, Jul 23, 2007 at 09:53:10PM -0700, Andrew Morton wrote:
> On Tue, 24 Jul 2007 12:32:15 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> 
> > On Mon, Jul 23, 2007 at 08:55:35PM -0700, Andrew Morton wrote:
> > > On Tue, 24 Jul 2007 10:00:12 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > 
> > > > @@ -342,11 +342,9 @@ ondemand_readahead(struct address_space 
> > > >bool hit_readahead_marker, pgoff_t offset,
> > > >unsigned long req_size)
> > > >  {
> > > > -   int max;/* max readahead pages */
> > > > -   int sequential;
> > > > -
> > > > -   max = ra->ra_pages;
> > > > -   sequential = (offset - ra->prev_index <= 1UL) || (req_size > 
> > > > max);
> > > > +   int max = ra->ra_pages; /* max readahead pages */
> > > > +   pgoff_t prev_offset;
> > > > +   int sequential;
> > > >  
> > > > /*
> > > >  * It's the expected callback offset, assume sequential access.
> > > > @@ -360,6 +358,9 @@ ondemand_readahead(struct address_space 
> > > > goto readit;
> > > > }
> > > >  
> > > > +   prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;
> > > > +   sequential = offset - prev_offset <= 1UL || req_size > max;
> > > 
> > > It's a bit pointless using an opaque type for prev_offset here, and then
> > > encoding the knowledge that it is implemented as "unsigned long".
> > > 
> > > It's a minor thing, but perhaps just "<= 1" would make more sense here.
> > 
> > Yeah, "<= 1" is OK.  But the expression still requires pgoff_t to be
> > 'unsigned' to work correctly.
> > 
> > So what about "<= 1U"?
> 
> umm, if one really cared one could do
> 
>== 1 ||  == 0

Yeah, I'd prefer this if we are to change it.

> or something.  But whatever - let's leave it as-is.
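
The point about unsignedness can be illustrated with a small sketch.
demo_pgoff_t is a hypothetical stand-in for pgoff_t, and both helpers are
illustrations rather than the code under review.

```c
/* Hypothetical stand-in for pgoff_t, assumed to be an unsigned type. */
typedef unsigned long demo_pgoff_t;

/* Relies on unsigned wraparound: when offset < prev, the difference
 * becomes a very large value and the test fails as intended. */
static int demo_seq_unsigned(demo_pgoff_t offset, demo_pgoff_t prev)
{
	return offset - prev <= 1;
}

/* The same test spelled out: same page, or the page right after it. */
static int demo_seq_explicit(demo_pgoff_t offset, demo_pgoff_t prev)
{
	return offset == prev || offset == prev + 1;
}
```

With a signed type the first form would also accept every offset below
prev, which is why the expression depends on the type being unsigned
regardless of how the constant is written.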



Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily

2007-07-23 Thread Satyam Sharma
On Tue, 24 Jul 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> > From: Satyam Sharma <[EMAIL PROTECTED]>
> > 
> > [6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
> > 
> > The goal is to let gcc generate good, beautiful, optimized code.
> > 
> > But test_and_set_bit, test_and_clear_bit, __test_and_change_bit,
> > and test_and_change_bit unnecessarily mark all of memory as clobbered,
> > thereby preventing gcc from doing perfectly valid optimizations.
> > 
> > The case of __test_and_change_bit() is particularly surprising, given
> > that it's a variant where we don't make any guarantees at all.
> 
> __test_and_change_bit is one that you could remove the memory clobber
> from.

Yes, for the atomic versions we don't care if we're asking gcc to
generate trashy code (even though I'd have wanted to only disallow
problematic optimizations -- ones involving the passed bit-string
address -- there, and allow other memory references to be optimized
as and how the compiler feels like it) because the atomic variants
are slow anyway and we probably want to be extra-safe there.

But for the non-atomic variants, it does make sense to remove the
memory clobber (and the unneeded __asm__ __volatile__ that another
patch did -- for the non-atomic variants, again).

OTOH, as per Linus' review it seems we can drop the "memory" clobber
and specify the output operand for the extended asm as "+m". But I
must admit I didn't quite understand that at all.

[ I should probably start reading gcc sources, the docs are said to
  be insufficient/out-of-date, as per the reviews of the patches. ]


Satyam
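
A hedged sketch of the "+m" idea mentioned above, not the kernel's actual
bitops code. The function name demo_test_and_set_bit is hypothetical, the
operation is non-atomic, and nr is assumed to be in the range 0..31. On
compilers or architectures without GCC-style x86 inline asm, a plain C
fallback is used.

```c
/* Set bit nr (0..31) in *addr and return its previous value (0 or 1). */
static int demo_test_and_set_bit(int nr, unsigned int *addr)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	int oldbit;

	/* "+m" tells the compiler that *addr is both read and written,
	 * so no blanket "memory" clobber is listed here. */
	__asm__("btsl %2,%1\n\t"
		"sbbl %0,%0"
		: "=r" (oldbit), "+m" (*addr)
		: "Ir" (nr)
		: "cc");
	return oldbit != 0;
#else
	/* Portable fallback for other compilers and architectures. */
	unsigned int mask = 1u << nr;
	int oldbit = (*addr & mask) != 0;

	*addr |= mask;
	return oldbit;
#endif
}
```

A "memory" clobber would instead tell the compiler that any memory may have
changed, forcing it to reload values it had cached in registers. The "+m"
operand narrows that to the one word actually touched.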


Re: Oops while modprobing phy fixed module

2007-07-23 Thread Tejun Heo
Christoph Lameter wrote:
> On Wed, 18 Jul 2007 14:51:14 +0900
> Tejun Heo <[EMAIL PROTECTED]> wrote:
> 
>> Okay, successfully reproduced here.  Will hunt down.
> 
> Next time simply boot with "slub_debug". It will save a lot of time.

Alright, thanks.

-- 
tejun


Re: [kvm-devel] [RFC 0/8]KVM: swap out guest pages

2007-07-23 Thread Avi Kivity
Rusty Russell wrote:
> On Tue, 2007-07-24 at 08:30 +0300, Avi Kivity wrote:
>   
>> Rusty Russell wrote:
>> 
>>> On Mon, 2007-07-23 at 13:27 +0300, Avi Kivity wrote:
>>>   
>>>   
>>>> Having an address_space (like your patch does) is remarkably simple, and 
>>>> requires few hooks from the current vm.  However using existing vmas 
>>>> mapped by the user has many advantages:
>>>>
>>>> - compatible with s390 requirements
>>>> - allows the user to use hugetlbfs pages, which have a performance 
>>>> advantage using ept/npt (but which are unswappable)
>>>> - allows the user to map a file (which can be regarded as way to specify 
>>>> the swap device)
>>>> - better integration with the rest of the vm
 
 
>>> You don't need to expose the vmas.  You just have userspace point out
>>> the start+len of each region of memory it wants the guest to be able to
>>> access, and the address it wants it to appear in the guest.
>>>
>>> This is a slight superset of what lguest does in two ways:
>>>
>>> 1) my guest address == user address, but I'm looking at adding an offset
>>> so I don't have to link the launcher binary specially.
>>> 2) I have only one contiguous region of guest-physical memory, since I
>>> can place device memory immediately above "normal" mem.
>>>
>>>   
>>>   
>> My intent was to allow userspace to establish assign a virtual address
>> range into a memory slot.
>>
>> So long as you don't do swapping, all is simple, since you can do a
>> get_user_pages() on initialization or when installing a shadow pte.  But
>> if you want to swap, you need:
>>
>> - a way to transfer the dirty bit from the shadow ptes to the struct page
>> 
>
> Actually, get_user_pages() does that for you.  You have to make R/O any
> writable pte where the guest doesn't set the dirty bit (so you can trap
> it later) but last I put a printk in there, Linux doesn't do that.
>
>   

Don't understand.  You mean Linux always sets the dirty bit when it
makes a page writable?  Surely some mistake.

It probably does do so on demand write faults, but I'm sure the dirty
bit can get cleaned out by the swapper.

>> - a way to let the vm rmap know that there are shadow ptes that point to
>> the page in addition to Linux ptes.  These shadow ptes may be in a
>> different format than Linux ptes.
>> - a different tlb invalidation method with ASIDs
>> 
>
> Well first I was just going to see how well hooking into the shrinker
> works.  That might be sufficient: just throw out shadow refs to pages
> when there's pressure.
>   

Ah, interesting.  Yes, you trim the shadow page table cache which unrefs
pages for you.

Maybe that's a good way to get things started.

> If not, it does get harder.  A callback in the mm struct to say "I want
> to swap your page out" is required if we don't take a reference to the
> page.  Dirty bit handling would be an interesting issue (maybe the
> callback can say "No!" and dirty the page again?).
>   

Since we have rmap, I don't see that as an issue.  Given a page, we can
easily drop all refs.  Though lguest doesn't do that, right?

I'm also concerned with picking the correct page, but there's no good
solution here.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses

2007-07-23 Thread Satyam Sharma
On Tue, 24 Jul 2007, Nick Piggin wrote:

> Linus Torvalds wrote:
> > 
> > On Mon, 23 Jul 2007, Satyam Sharma wrote:
> > 
> > 
> > > [4/8] i386: bitops: Kill volatile-casting of memory addresses
> > 
> > 
> > This is wrong.
> > 
> > The "const volatile" is so that you can pass an arbitrary pointer. The only
> > kind of arbitrary pointer is "const volatile".
> > 
> > In other words, the "volatile" has nothing at all to do with whether the
> > memory is volatile or not (the same way "const" has nothing to do with it):
> > it's purely a C type *safety* issue, exactly the same way "const" is a type
> > safety issue.
> > 
> > A "const" on a pointer doesn't mean that the thing it points to cannot
> > change. When you pass a source pointer to "strlen()", it doesn't have to be
> > constant. But "strlen()" takes a "const" pointer, because it works on
> > constant pointers *too*.
> > 
> > Same deal here.
> > 
> > Admittedly this may be mostly historic, but regardless - the "volatiles" are
> > right.
> > 
> > Using volatile on *data* is generally considered incorrect and bad taste,
> > but using it in situations like this potentially makes sense.
> > 
> > Of course, if we remove all "volatiles" in data in the kernel (with the
> > possible exception of "jiffies"), we can then remove them from function
> > declarations too, but it should be done in that order.
> 
> Well, regardless, it still forces the function to treat the pointer
> target as volatile, won't it? It definitely prevents valid optimisations
> that would be useful for me in mm/page_alloc.c where page flags are
> being set up or torn down or checked with non-atomic bitops.

Yes, and yes. But I think what he meant there is that we'd need to
audit the kernel for all users of set_bit and friends and see if callers
actually pass in any _data_ that _is_ volatile. So we have to kill them
there first, and then in the function declarations here. I think I'll put
that on my long-term todo list, but see below.

> Anyway by type safety, do you mean it will stop the compiler from
> warning if a pointer to a volatile is passed to the bitop?

The compiler would start warning for all those cases (passing in
a pointer to volatile data, when the bitops have lost the volatile
casting from their function declarations), actually. Something like
"passing argument discards qualifiers from pointer type" ... but
considering I didn't see *any* of those warnings after these patches,
I'm confused as to what exactly Linus meant here ... and what exactly
do we need to do to "kill the volatiles".

> If so, then
> why don't we just kill all the volatiles out of here and fix any
> warnings that come up? I doubt there would be many, and of those, some
> might show up real synchronisation problems.

Yes, but see above.

> OK, not the i386 functions as much because they are all in asm anyway,
> but in general (btw. why does i386 or any architecture define their own
> non-atomic bitops? If the version in asm-generic/bitops/non-atomic.h
> is not good enough then surely it is a bug in gcc or that file?)



Satyam
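
The type-safety argument quoted above can be sketched as follows. Both
function names are hypothetical. The first accepts a pointer with any mix
of qualifiers, the second only a pointer to unqualified data.

```c
/* Accepts plain, const, volatile, or const volatile data alike,
 * because qualifiers may be added implicitly in a pointer conversion. */
static int demo_test_bit(int nr, const volatile unsigned long *addr)
{
	return (int)((*addr >> nr) & 1UL);
}

/* Accepts only pointers to unqualified data. */
static int demo_test_bit_plain(int nr, unsigned long *addr)
{
	return (int)((*addr >> nr) & 1UL);
}
```

Passing the address of a const or volatile object to demo_test_bit_plain()
draws a diagnostic along the lines of "passing argument discards qualifiers
from pointer type", while demo_test_bit() takes it silently. The cost is
that inside demo_test_bit() the access through addr is treated as volatile,
which is the optimisation concern raised in the thread.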


Re: 2.6.23-rc1: known regressions with patches

2007-07-23 Thread Tejun Heo
Michal Piotrowski wrote:
> Subject : Oops while modprobing phy fixed module
> References  : http://lkml.org/lkml/2007/7/14/63
> Last known good : ?
> Submitter   : Gabriel C <[EMAIL PROTECTED]>
> Caused-By   : Tejun Heo <[EMAIL PROTECTED]>
>   commit 3007e997de91ec59af39a3f9c91595b31ae6e08b
> Handled-By  : Satyam Sharma <[EMAIL PROTECTED]>
>   Tejun Heo <[EMAIL PROTECTED]>
>   Vitaly Bordug <[EMAIL PROTECTED]>
> Patch1  : http://lkml.org/lkml/2007/7/18/506
> Status  : patch available

Patch is in mainline.  Commit a1da4dfe35bc36c3bc9716d995c85b7983c38a76.

Thanks.

-- 
tejun


Re: [kvm-devel] [RFC 0/8]KVM: swap out guest pages

2007-07-23 Thread Rusty Russell
On Tue, 2007-07-24 at 08:30 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > On Mon, 2007-07-23 at 13:27 +0300, Avi Kivity wrote:
> >   
> >> Having an address_space (like your patch does) is remarkably simple, and 
> >> requires few hooks from the current vm.  However using existing vmas 
> >> mapped by the user has many advantages:
> >>
> >> - compatible with s390 requirements
> >> - allows the user to use hugetlbfs pages, which have a performance 
> >> advantage using ept/npt (but which are unswappable)
> >> - allows the user to map a file (which can be regarded as way to specify 
> >> the swap device)
> >> - better integration with the rest of the vm
> >> 
> >
> > You don't need to expose the vmas.  You just have userspace point out
> > the start+len of each region of memory it wants the guest to be able to
> > access, and the address it wants it to appear in the guest.
> >
> > This is a slight superset of what lguest does in two ways:
> >
> > 1) my guest address == user address, but I'm looking at adding an offset
> > so I don't have to link the launcher binary specially.
> > 2) I have only one contiguous region of guest-physical memory, since I
> > can place device memory immediately above "normal" mem.
> >
> >   
> 
> My intent was to allow userspace to establish assign a virtual address
> range into a memory slot.
> 
> So long as you don't do swapping, all is simple, since you can do a
> get_user_pages() on initialization or when installing a shadow pte.  But
> if you want to swap, you need:
> 
> - a way to transfer the dirty bit from the shadow ptes to the struct page

Actually, get_user_pages() does that for you.  You have to make R/O any
writable pte where the guest doesn't set the dirty bit (so you can trap
it later) but last I put a printk in there, Linux doesn't do that.

> - a way to let the vm rmap know that there are shadow ptes that point to
> the page in addition to Linux ptes.  These shadow ptes may be in a
> different format than Linux ptes.
> - a different tlb invalidation method with ASIDs

Well first I was just going to see how well hooking into the shrinker
works.  That might be sufficient: just throw out shadow refs to pages
when there's pressure.

If not, it does get harder.  A callback in the mm struct to say "I want
to swap your page out" is required if we don't take a reference to the
page.  Dirty bit handling would be an interesting issue (maybe the
callback can say "No!" and dirty the page again?).

I fear mm code.

> It's not going to be simple.

Yeah, but it's one thing stopping lguest from being non-root usable, so
I want it there, too.

Cheers,
Rusty.




Re: -mm merge plans for 2.6.23

2007-07-23 Thread Andrew Morton
On Mon, 23 Jul 2007 23:01:41 -0700 "Ray Lee" <[EMAIL PROTECTED]> wrote:

> So, what do I measure to make this an objective problem report?

Ideal would be to find a reproducible-by-others testcase which does what you
believe to be the wrong thing.

> And if
> I do that (and it shows a positive result), will that be good enough
> to argue for inclusion?

That depends upon whether there are more suitable ways of fixing "the
wrong thing".

There may not be - it could well be that present behaviour
is correct for the testcase, but it leaves the system in the wrong
state for your large workload shift.  In that case, prefetching (ie:
restoring system state approximately to that which prevailed prior to
"testcase") might well be a suitable fix.


Re: 2.6.23-rc1: i386 section mismatch warnings

2007-07-23 Thread Sam Ravnborg
On Mon, Jul 23, 2007 at 09:18:38PM -0400, Jeff Garzik wrote:
> make allmodconfig on i386:
> 
> WARNING: vmlinux(.text+0xc0101183): Section mismatch: reference to 
> .init.text:start_kernel (between 'is386' and 'check_x87')
> WARNING: vmlinux(.text+0xc02cfcdb): Section mismatch: reference to 
> .init.text:kernel_init (between 'rest_init' and 'kthreadd_setup')
> WARNING: vmlinux(.text+0xc02d5ed2): Section mismatch: reference to 
> .init.text: (between 'iret_exc' and '_etext')
> WARNING: vmlinux(.text+0xc02d5ede): Section mismatch: reference to 
> .init.text: (between 'iret_exc' and '_etext')
> WARNING: vmlinux(.text+0xc02d5eea): Section mismatch: reference to 
> .init.text: (between 'iret_exc' and '_etext')
> WARNING: vmlinux(.text+0xc02d5ef6): Section mismatch: reference to 
> .init.text: (between 'iret_exc' and '_etext')
> WARNING: vmlinux(.text+0xc02cfda4): Section mismatch: reference to 
> .init.text:__alloc_bootmem_node (between 'alloc_node_mem_map' and 
> 'zone_wait_table_init')
> WARNING: vmlinux(.text+0xc02cfe4e): Section mismatch: reference to 
> .init.text:__alloc_bootmem_node (between 'zone_wait_table_init' and 
> 'vgacon_scrollback_startup')
> WARNING: vmlinux(.text+0xc02d64c6): Section mismatch: reference to 
> .init.text: (between 'iret_exc' and '_etext')
> WARNING: vmlinux(.text+0xc02cfea7): Section mismatch: reference to 
> .init.text:__alloc_bootmem (between 'vgacon_scrollback_startup' and 
> 'fb_find_logo')
> WARNING: vmlinux(.text+0xc02cfecb): Section mismatch: reference to 
> .init.data:logo_linux_mono (between 'fb_find_logo' and 'schedule')
> WARNING: vmlinux(.text+0xc02cfed5): Section mismatch: reference to 
> .init.data:logo_linux_clut224 (between 'fb_find_logo' and 'schedule')
> WARNING: vmlinux(.text+0xc02cfeda): Section mismatch: reference to 
> .init.data:logo_linux_vga16 (between 'fb_find_logo' and 'schedule')
> WARNING: vmlinux(.text+0xc02d6612): Section mismatch: reference to 
> .init.text: (between 'iret_exc' and '_etext')

The above warnings happen because during the final link we stuff
a lot of different sections into a smaller number of sections.
So modpost does not know that the references are actually OK.

The fix is simply to avoid doing section mismatch check on vmlinux
when processing the modules and I have this in kbuild.git.
I expect this to be merged during -rc1 (was sent in the merge window
but did not get applied).

As a general rule, just ignore all section mismatch warnings that originate
from vmlinux. The others need to be fixed, and there seem to be enough
to deal with :-(

Sam


Re: [PATCH] [2/11] x86: Fix alternatives and kprobes to remap write-protected kernel text

2007-07-23 Thread Rusty Russell
On Fri, 2007-07-20 at 17:32 +0200, Andi Kleen wrote:
> Reenable kprobes and alternative patching when the kernel text is write
> protected by DEBUG_RODATA
> Add a general utility function to change write protected text.
> The new function remaps the code using vmap to write it and takes
> care of CPU synchronization. It also does CLFLUSH to make icache recovery
> faster. 
> 
> There are some limitations on when the function can be used, see
> the comment.
> 
> This is a newer version that also changes the paravirt_ops code.
> text_poke also supports multi byte patching now.

Hmm, I wrote this code originally, and would have liked to catch this
change before it went in.

It's broken on i386 w/ paravirt_ops.

Problem: lookup_address() contains paravirt_ops.  So we call
paravirt_ops.patch which patches part of the instructions, then calls
nop_out to patch the end of the instructions.  This calls text_poke
which calls lookup_address, which is what we've half-patched.  Mainly,
we get lucky at the moment (I just hit this in lguest).

We could use a simpler nop_out at boot, or fix this a little more
robustly by making everyone do their patching via a single call to
text_poke.
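
As a rough userspace illustration of the "single call to text_poke" idea
(not the kernel code itself): the replacement bytes and the NOP padding are
assembled in a private buffer, and only then is the live site written, once.
All model_* names are hypothetical, and the sketch pads with the single-byte
x86 NOP (0x90) instead of the kernel's multi-byte NOP table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_PATCH_LEN 128
#define MODEL_NOP 0x90 /* single-byte x86 NOP; the kernel picks longer NOPs */

/* Counts how often the "live" code is written (hypothetical bookkeeping). */
static unsigned int poke_calls;

/* Pad a private buffer with NOPs. No live code is touched here. */
static void model_add_nops(unsigned char *buf, size_t len)
{
	memset(buf, MODEL_NOP, len);
}

/* Stand-in for text_poke(): the only function that writes to the site. */
static void model_text_poke(unsigned char *addr, const unsigned char *opcode,
			    size_t len)
{
	poke_calls++;
	memcpy(addr, opcode, len);
}

/* Build replacement + padding in a buffer, then patch the site in one go. */
static void model_apply_patch(unsigned char *site, size_t site_len,
			      const unsigned char *repl, size_t repl_len)
{
	unsigned char insnbuf[MODEL_MAX_PATCH_LEN];

	assert(repl_len <= site_len);
	assert(site_len <= sizeof(insnbuf));

	memcpy(insnbuf, repl, repl_len);
	model_add_nops(insnbuf + repl_len, site_len - repl_len);
	model_text_poke(site, insnbuf, site_len);
}
```

Because the site is written exactly once, no helper that the write path
depends on is ever observed in a half-patched state, which is the property
the two-stage scheme lacked.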

Needs testing for VMI and Xen (lguest and native booted here):
===
Make patching more robust, fix paravirt issue

Commit 19d36ccdc34f5ed444f8a6af0cbfdb6790eb1177 "x86: Fix alternatives
and kprobes to remap write-protected kernel text" uses code which is
being patched for patching.

In particular, paravirt_ops does patching in two stages: first it
calls paravirt_ops.patch, then it fills any remaining instructions
with nop_out().  nop_out calls text_poke() which calls
lookup_address() which calls pgd_val() (aka paravirt_ops.pgd_val):
that call site is one of the places we patch.

If we always do patching as one single call to text_poke(), we only
need make sure we're not patching the memcpy in text_poke itself.
This means the prototype to paravirt_ops.patch needs to change, to
marshal the new code into a buffer rather than patching in place as it
does now.  It also means all patching goes through text_poke(), which
is known to be safe (apply_alternatives is also changed to make a
single patch).

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>

diff -r b81206bbf749 arch/i386/kernel/alternative.c
--- a/arch/i386/kernel/alternative.c	Tue Jul 24 13:05:36 2007 +1000
+++ b/arch/i386/kernel/alternative.c	Tue Jul 24 14:50:54 2007 +1000
@@ -10,6 +10,8 @@
 #include 
 #include 
 #include 
+
+#define MAX_PATCH_LEN 128
 
 #ifdef CONFIG_HOTPLUG_CPU
 static int smp_alt_once;
@@ -148,7 +150,8 @@ static unsigned char** find_nop_table(vo
 
 #endif /* CONFIG_X86_64 */
 
-static void nop_out(void *insns, unsigned int len)
+/* Use this to add nops to a buffer, then text_poke the whole buffer. */
+static void add_nops(void *insns, unsigned int len)
 {
unsigned char **noptable = find_nop_table();
 
@@ -156,7 +159,7 @@ static void nop_out(void *insns, unsigne
unsigned int noplen = len;
if (noplen > ASM_NOP_MAX)
noplen = ASM_NOP_MAX;
-   text_poke(insns, noptable[noplen], noplen);
+   memcpy(insns, noptable[noplen], noplen);
insns += noplen;
len -= noplen;
}
@@ -174,15 +177,14 @@ void apply_alternatives(struct alt_instr
 void apply_alternatives(struct alt_instr *start, struct alt_instr *end)
 {
struct alt_instr *a;
-   u8 *instr;
-   int diff;
+   char insnbuf[MAX_PATCH_LEN];
 
DPRINTK("%s: alt table %p -> %p\n", __FUNCTION__, start, end);
for (a = start; a < end; a++) {
BUG_ON(a->replacementlen > a->instrlen);
+   BUG_ON(a->instrlen > sizeof(insnbuf));
if (!boot_cpu_has(a->cpuid))
continue;
-   instr = a->instr;
 #ifdef CONFIG_X86_64
/* vsyscall code is not mapped yet. resolve it manually. */
if (instr >= (u8 *)VSYSCALL_START && instr < (u8*)VSYSCALL_END) 
{
@@ -191,9 +193,10 @@ void apply_alternatives(struct alt_instr
__FUNCTION__, a->instr, instr);
}
 #endif
-   memcpy(instr, a->replacement, a->replacementlen);
-   diff = a->instrlen - a->replacementlen;
-   nop_out(instr + a->replacementlen, diff);
+   memcpy(insnbuf, a->replacement, a->replacementlen);
+   add_nops(insnbuf + a->replacementlen,
+a->instrlen - a->replacementlen);
+   text_poke(a->instr, insnbuf, a->instrlen);
}
 }
 
@@ -215,16 +218,18 @@ static void alternatives_smp_unlock(u8 *
 static void alternatives_smp_unlock(u8 **start, u8 **end, u8 *text, u8 
*text_end)
 {
u8 **ptr;
+   char insn[1];
 
if (noreplace_smp)
return;
 
+   add_nops(insn, 1);
for (ptr = start; ptr < end; ptr++) {
if (*ptr < text)
  

Linux TCP modifications

2007-07-23 Thread Pallab Dutta
Hi,

I want to change the TCP implementation in Linux (for a specific application,
with many modifications to the basic TCP) and integrate it with the Linux
kernel. Since TCP comes as a part of the Linux kernel, and the TCP and IP
implementations in Linux are tightly coupled, please suggest how to go about
integrating this modified TCP into the Linux kernel. Kindly give as much
detail as possible.

regards,
Pallab.


Re: [PATCH] nommu: vmalloc_32_user()/vm_insert_page() and symbol exports.

2007-07-23 Thread Greg Ungerer

Hi Paul,

Paul Mundt wrote:

Trying to survive an allmodconfig on a nommu platform results in
many screen lengths of module unhappiness. Many of the mmap
related things that binfmt_flat hooks in to are never exported
despite being global, and there are also missing definitions for
vmalloc_32_user() and vm_insert_page().

I've implemented vmalloc_32_user() trying to stick as close to
the mm/vmalloc.c implementation as possible, though we don't
have any need for VM_USERMAP, so groveling for the VMA can be
skipped. vm_insert_page() has been stubbed for now in order to
keep the build happy.


Looks good to me.
You can add my acked by if you want:

Acked-by: Greg Ungerer <[EMAIL PROTECTED]>



Signed-off-by: Paul Mundt <[EMAIL PROTECTED]>

--

 mm/nommu.c |   45 +
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index 1b105d2..9eef6a3 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -54,12 +54,6 @@ DECLARE_RWSEM(nommu_vma_sem);
 struct vm_operations_struct generic_file_vm_ops = {
 };
 
-EXPORT_SYMBOL(vfree);

-EXPORT_SYMBOL(vmalloc_to_page);
-EXPORT_SYMBOL(vmalloc_32);
-EXPORT_SYMBOL(vmap);
-EXPORT_SYMBOL(vunmap);
-
 /*
  * Handle all mappings that got truncated by a "truncate()"
  * system call.
@@ -168,7 +162,6 @@ int get_user_pages(struct task_struct *tsk, struct 
mm_struct *mm,
 finish_or_fault:
return i ? : -EFAULT;
 }
-
 EXPORT_SYMBOL(get_user_pages);
 
 DEFINE_RWLOCK(vmlist_lock);

@@ -178,6 +171,7 @@ void vfree(void *addr)
 {
kfree(addr);
 }
+EXPORT_SYMBOL(vfree);
 
 void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot)

 {
@@ -186,17 +180,19 @@ void *__vmalloc(unsigned long size, gfp_t gfp_mask, 
pgprot_t prot)
 */
return kmalloc(size, (gfp_mask | __GFP_COMP) & ~__GFP_HIGHMEM);
 }
+EXPORT_SYMBOL(__vmalloc);
 
 struct page * vmalloc_to_page(void *addr)

 {
return virt_to_page(addr);
 }
+EXPORT_SYMBOL(vmalloc_to_page);
 
 unsigned long vmalloc_to_pfn(void *addr)

 {
return page_to_pfn(virt_to_page(addr));
 }
-
+EXPORT_SYMBOL(vmalloc_to_pfn);
 
 long vread(char *buf, char *addr, unsigned long count)

 {
@@ -237,9 +233,8 @@ void *vmalloc_node(unsigned long size, int node)
 }
 EXPORT_SYMBOL(vmalloc_node);
 
-/*

- * vmalloc_32  -  allocate virtually continguos memory (32bit addressable)
- *
+/**
+ * vmalloc_32  -  allocate virtually contiguous memory (32bit addressable)
  * @size:  allocation size
  *
  * Allocate enough 32bit PA addressable pages to cover @size from the
@@ -249,17 +244,33 @@ void *vmalloc_32(unsigned long size)
 {
return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
 }
+EXPORT_SYMBOL(vmalloc_32);
+
+/**
+ * vmalloc_32_user - allocate zeroed virtually contiguous 32bit memory
+ * @size:  allocation size
+ *
+ * The resulting memory area is 32bit addressable and zeroed so it can be
+ * mapped to userspace without leaking data.
+ */
+void *vmalloc_32_user(unsigned long size)
+{
+   return __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
+}
+EXPORT_SYMBOL(vmalloc_32_user);
 
 void *vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot)

 {
BUG();
return NULL;
 }
+EXPORT_SYMBOL(vmap);
 
 void vunmap(void *addr)

 {
BUG();
 }
+EXPORT_SYMBOL(vunmap);
 
 /*

  * Implement a stub for vmalloc_sync_all() if the architecture chose not to
@@ -269,6 +280,13 @@ void  __attribute__((weak)) vmalloc_sync_all(void)
 {
 }
 
+int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,

+  struct page *page)
+{
+   return -EINVAL;
+}
+EXPORT_SYMBOL(vm_insert_page);
+
 /*
  *  sys_brk() for the most part doesn't need the global kernel
  *  lock, except when an application is doing something nasty
@@ -994,6 +1012,7 @@ unsigned long do_mmap_pgoff(struct file *file,
show_free_areas();
return -ENOMEM;
 }
+EXPORT_SYMBOL(do_mmap_pgoff);
 
 /*

  * handle mapping disposal for uClinux
@@ -1074,6 +1093,7 @@ int do_munmap(struct mm_struct *mm, unsigned long addr, 
size_t len)
 
 	return 0;

 }
+EXPORT_SYMBOL(do_munmap);
 
 asmlinkage long sys_munmap(unsigned long addr, size_t len)

 {
@@ -1164,6 +1184,7 @@ unsigned long do_mremap(unsigned long addr,
 
 	return vma->vm_start;

 }
+EXPORT_SYMBOL(do_mremap);
 
 asmlinkage unsigned long sys_mremap(unsigned long addr,

unsigned long old_len, unsigned long new_len,
@@ -1231,7 +1252,6 @@ unsigned long get_unmapped_area(struct file *file, 
unsigned long addr,
 
 	return get_area(file, addr, len, pgoff, flags);

 }
-
 EXPORT_SYMBOL(get_unmapped_area);
 
 /*

@@ -1346,6 +1366,7 @@ int filemap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf)
BUG();
return 0;
 }
+EXPORT_SYMBOL(filemap_fault);
 
 /*

  * Access another process' address space.



--

Greg Ungerer  --  Chief Software Dude   EMAIL: [EMA

Re: [PATCH] add __GFP_ZERO to GFP_LEVEL_MASK

2007-07-23 Thread Peter Zijlstra
On Mon, 2007-07-23 at 16:17 -0700, Christoph Lameter wrote:
> On Mon, 23 Jul 2007, Peter Zijlstra wrote:
> 
> > ---
> > Daniel recently spotted that __GFP_ZERO is not (and has never been)
> > part of GFP_LEVEL_MASK. I could not find a reason for this in the
> > original patch: 3977971c7f09ce08ed1b8d7a67b2098eb732e4cd in the -bk
> > tree.
> > 
> > This of course is in stark contradiction with the comment accompanying
> > GFP_LEVEL_MASK.
> 
> NACK.
> 
> The effect that this patch will have is that __GFP_ZERO is passed through 
> to the page allocator which will needlessly zero pages. GFP_LEVEL_MASK is 
> used to filter out the flags that are to be passed to the page allocator. 
> __GFP_ZERO is not passed on but handled by the slab allocators.

Then we can either fixup the slab allocators to mask out __GFP_ZERO, or
do something like the below.

Personally I like the consistency of adding __GFP_ZERO here (removes
this odd exception) and just masking it in the sl[aou]b thingies.
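
To make the two layers concrete, here is a small userspace model (not kernel
code) of a slab-like layer that strips a "zero" flag before calling a
page-like allocator and then zeroes only the object it hands out. The flag
values, MODEL_SLAB_BYTES and all model_* names are assumptions made up for
this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int model_gfp_t;

/* Hypothetical bit values, chosen for the sketch only. */
#define MODEL_GFP_WAIT 0x01u
#define MODEL_GFP_IO   0x02u
#define MODEL_GFP_ZERO 0x04u

/* Flags the lower layer may see; MODEL_GFP_ZERO is deliberately absent. */
#define MODEL_LEVEL_MASK (MODEL_GFP_WAIT | MODEL_GFP_IO)

#define MODEL_SLAB_BYTES 4096

/* Records what the lower layer was asked for, so callers can inspect it. */
static model_gfp_t last_page_flags;

/* Page-like allocator: zeroes the whole area only if asked to. */
static void *model_page_alloc(size_t size, model_gfp_t flags)
{
	void *p = malloc(size);

	last_page_flags = flags;
	if (p)
		memset(p, (flags & MODEL_GFP_ZERO) ? 0 : 0xAA, size);
	return p;
}

/* Slab-like allocator: filters the flags, then zeroes just the object. */
static void *model_slab_alloc(size_t size, model_gfp_t flags)
{
	void *obj;

	assert(size <= MODEL_SLAB_BYTES);
	obj = model_page_alloc(MODEL_SLAB_BYTES, flags & MODEL_LEVEL_MASK);
	if (obj && (flags & MODEL_GFP_ZERO))
		memset(obj, 0, size);
	return obj;
}
```

In this model the backing area keeps its 0xAA fill outside the object, so the
cost of zeroing is paid only for the bytes the caller actually requested.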

Anybody else got a preference?

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/gfp.h |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: linux-2.6-2/include/linux/gfp.h
===
--- linux-2.6-2.orig/include/linux/gfp.h
+++ linux-2.6-2/include/linux/gfp.h
@@ -53,7 +53,14 @@ struct vm_area_struct;
 #define __GFP_BITS_SHIFT 20/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
-/* if you forget to add the bitmask here kernel will crash, period */
+/*
+ * If you forget to add the bitmask here kernel will crash, period!
+ *
+ * GFP_LEVEL_MASK is used to filter out the flags that are to be passed to the
+ * page allocator.
+ *
+ * __GFP_ZERO is not passed on but handled by the slab allocators.
+ */
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_COMP| \




Re: -mm merge plans for 2.6.23

2007-07-23 Thread Ray Lee

On 7/23/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> Let it just be noted that Con is not the only one who has expended effort
> on this patch.  It's been in -mm for nearly two years and it has meant
> ongoing effort for me and, to a lesser extent, other MM developers to keep
> it alive.


 Yes, keeping patches from crufting up and stepping on other
patches' toes is hard work; I did it for a bit, and it was one of the
more thankless tasks I've tried a hand at.

So, thanks.


> Criteria are different for each patch, but it usually comes down to a
> cost/benefit judgement.  Does the benefit of the patch exceed its
> maintenance cost over the lifetime of the kernel (whatever that is)?


Well, I suspect it's 'lifetime of the feature,' in this case as it's
no more user visible than the page replacement algorithm in the first
place.


> The other consideration here is, as Nick points out, are the problems which
> people see this patch solving for them solvable in other, better ways?
> IOW, is this patch fixing up preexisting deficiencies post-facto?


In some cases, it almost certainly is. It also has the troubling
aspect of mitigating future regressions without anyone terribly
noticing, due to it being able to paper over those hypothetical future
deficiencies when they're introduced.


> To attack the second question we could start out with bug reports: system A
> with workload B produces result C.  I think result C is wrong for
> and would prefer to see result D.


I spend a lot of time each day watching my computer fault my
workingset back in when I switch contexts. I'd rather I didn't have to
do that. Unfortunately, that's a pretty subjective problem report. For
whatever it's worth, we have pretty subjective solution reports
pointing to swap prefetch as providing a fix for them.

My concern is that a subjective problem report may not be good enough.
So, what do I measure to make this an objective problem report? And if
I do that (and it shows a positive result), will that be good enough
to argue for inclusion?


Re: [RFC/RFT 1/5] Input: implement proper locking in input core

2007-07-23 Thread Dmitry Torokhov
Hi Jeff, 

On Tuesday 24 July 2007 01:35, Jeff Garzik wrote:
> 
> spin_lock_irq() should generally be avoided.
> 
> In cases like the first case -- input_repeat_key() -- you are making 
> incorrect assumptions about the state of interrupts.  The other cases 
> are probably ok, but in general spin_lock_irq() has a long history of 
> being very fragile and quite often wrong.
> 
> Use spin_lock_irqsave() to be safe.  Definitely in input_repeat_key(), 
> but I strongly recommend removing spin_lock_irq() from all your patches 
> here.
>

Thank you for looking at the patches. Actually I went back and forth
between spin_lock_irq and spin_lock_irqsave. I will change back to the
irqsave version; it is indeed safer.

-- 
Dmitry


Re: [PATCH] Kconfig: Remove top level menu "Code maturity level options"

2007-07-23 Thread Sam Ravnborg
On Mon, Jul 23, 2007 at 04:58:13PM -0700, Andrew Morton wrote:
> On Sat, 21 Jul 2007 06:58:28 +0300
> Al Boldi <[EMAIL PROTECTED]> wrote:
> 
> > 
> > This patch removes the top level menu "Code maturity level options", and 
> > moves its options into menu "General setup".
> > 
> > This makes Kconfig less cluttered and easier to setup.
> > 
> > 
> > Cc: Andrew Morton <[EMAIL PROTECTED]>
> > Signed-off-by: Al Boldi <[EMAIL PROTECTED]>
> > 
> > ---
> > --- a/init/Kconfig  2007-07-09 06:38:47.0 +0300
> > +++ b/init/Kconfig  2007-07-21 06:42:06.0 +0300
> > @@ -7,7 +7,7 @@ config DEFCONFIG_LIST
> > default "/boot/config-$UNAME_RELEASE"
> > default "arch/$ARCH/defconfig"
> >  
> > -menu "Code maturity level options"
> > +menu "General setup"
> >  
> >  config EXPERIMENTAL
> > bool "Prompt for development and/or incomplete code/drivers"
> > @@ -61,9 +61,6 @@ config INIT_ENV_ARG_LIMIT
> >   Maximum of each of the number of arguments and environment
> >   variables passed to init from the kernel command line.
> >  
> > -endmenu
> > -
> > -menu "General setup"
> >  
> >  config LOCALVERSION
> > string "Local version - append to kernel release"
> 
> OK (IMO).  The "Code maturity level options" menu only has a single
> entry, and I doubt if it will grow more entries any time soon.  It
> makes sense to kill it and to move its sole entry into "General setup".

Agreed. There should be a reason for a menu, and a single entry is not enough.
Acked-by: Sam Ravnborg <[EMAIL PROTECTED]>

Sam


Re: [RFC 0/8]KVM: swap out guest pages

2007-07-23 Thread Avi Kivity
Shaohua Li wrote:
> On Mon, 2007-07-23 at 18:27 +0800, Avi Kivity wrote:
>   
>> Shaohua Li wrote:
>> 
>>> This patch series make kvm guest pages be able to be swapped out and
>>> dynamically allocated. Without it, all guest memory is allocated at
>>> guest start time.
>>>
>>> patches are against latest git, and you need first patch Avi's
>>>   
>> kvm-sch
>> 
>>> integration patch
>>>
>>>   
>> (http://sourceforge.net/mailarchive/forum.php?thread_name=11841693332609-git-send-email-avi%40qumranet.com&forum_name=kvm-devel
>>  ).
>> 
>>> Patch is quite stable in my test. With the patch, I can run a 256M
>>> memory guest in a 300M memory host.
>>>   
>> What about the opposite?
>>
>> 
>>> If guest is idle, the memory it used
>>> can be less than 10M. I did a simple performance test (measure
>>>   
>> kernel
>> 
>>> build time in guest); if there is little swapping, the performance
>>> difference with and without the patch isn't significant. If you have a
>>> better measurement approach, please let me try it.
>>>
>>> Unresolved issue:
>>> 1. swapoff doesn't work, we need a hook.
>>> 2. SMP guest might not work, as kvm doesn't support smp till now.
>>> 3. better algorithm to select swapped-out guest pages according to
>>> guest's memory usage.
>>> Maybe more.
>>>
>>> Any suggestions and comments are appreciated.
>>>  
>>>   
>> The big question is whether to have kvm's own address_space or not.
>>
>> Having an address_space (like your patch does) is remarkably simple,
>> and
>> requires few hooks from the current vm.  However using existing vmas
>> mapped by the user has many advantages:
>>
>> - compatible with s390 requirements
>> - allows the user to use hugetlbfs pages, which have a performance
>> advantage using ept/npt (but which are unswappable)
>> - allows the user to map a file (which can be regarded as way to
>> specify
>> the swap device)
>> - better integration with the rest of the vm
>>
>> I am quite torn between the simplicity of your approach and the
>> advantages of using generic vmas.  However, s390 pretty much forces
>> our
>> hand.
>>
>> What is your opinion of extending generic vmas to back kvm guest
>> memory?
>> 
> several issues:
> 1. vma is to manage userspace address, kvm guest uses full address
> space.
> 2. qemu itself must use some address space.
>   

My idea is to keep the current slot concept, but instead of having kvm
allocate pages for a slot, it would call get_user_pages() for a virtual
address range.  Userspace doesn't directly talk about vmas, just virtual
address ranges.
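
A minimal sketch of what such a slot could look like from the translation
side, assuming each slot simply records a guest-physical start, a length and
the userspace virtual address backing it. The struct layout and the model_*
names are hypothetical and are not taken from the kvm sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot: a guest-physical range backed by a user virtual range. */
struct model_mem_slot {
	uint64_t guest_phys_start; /* where the range appears in the guest */
	uint64_t len;              /* length of the range in bytes */
	uint64_t user_virt_start;  /* start of the backing userspace range */
};

/*
 * Translate a guest-physical address to the userspace virtual address that
 * backs it. Returns 0 on success, -1 if no slot covers the address.
 */
static int model_gpa_to_hva(const struct model_mem_slot *slots, size_t nslots,
			    uint64_t gpa, uint64_t *hva)
{
	size_t i;

	assert(hva != NULL);
	for (i = 0; i < nslots; i++) {
		const struct model_mem_slot *s = &slots[i];

		/* The subtraction is safe: gpa >= start is checked first. */
		if (gpa >= s->guest_phys_start &&
		    gpa - s->guest_phys_start < s->len) {
			*hva = s->user_virt_start +
			       (gpa - s->guest_phys_start);
			return 0;
		}
	}
	return -1;
}
```

With the user virtual address in hand, a real implementation would presumably
go on to call get_user_pages() on it; the sketch stops at the address
arithmetic.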


> 3. kvm need special page fault for shadow page table. generic page table
> operations can't be directly used for guest.
> I have no idea if your idea is feasible. The s390 guys said their shadow
> page table is the same as host, this is why they can easily implement
> swap, x86 is hard.
>   

No question that it is hard.  I'd like to explore just how hard it is.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [RFC 7/8]KVM: swap out guest pages

2007-07-23 Thread Avi Kivity
Shaohua Li wrote:
>>>  
>>>   
>> You're not removing any shadows of the page, in case that page is a
>> guest page table.  But I don't see anything wrong with it -- the page
>> won't change while it's in swap.
>> 
> You are right. Should we?
>   

I don't think so.  It's just strange to have shadows for a guest page
that is swapped out, so I pointed that out.  But as the page cannot
change in swap, everything is safe.  I guess that kernel page tables
could be swapped out after a short while in a guest that doesn't swap
its own kernel pages.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [RFC 1/8]KVM: fix bugs in kvm sched integration patch

2007-07-23 Thread Avi Kivity
Shaohua Li wrote:
>>
>>> 1. vmcs_readl/vmcs_writel are called with preempt enabled
>>>  
>>>   
>> Why is that bad?
>> 
> 1. raw_smp_processor_id()
> 2. migrate to other cpu
> 3. current->kvm_vcpu->cpu != the cpu id of step 1.
> you will see the warning.
>   

Ah, that code is gone from preempt-hooks, hence I didn't understand you.

The current version of the patchset does not change
vmcs_readl/vmcs_writel.  So I think everything is safe.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [RFC/RFT 1/5] Input: implement proper locking in input core

2007-07-23 Thread Jeff Garzik

Dmitry Torokhov wrote:

+static void input_repeat_key(unsigned long data)
+{
+   struct input_dev *dev = (void *) data;
 
-			change_bit(code, dev->key);

+   spin_lock_irq(&dev->event_lock);

[...]

+void input_inject_event(struct input_handle *handle,
+   unsigned int type, unsigned int code, int value)
 {
-   struct input_dev *dev = (void *) data;
+   struct input_dev *dev = handle->dev;
+   struct input_handle *grab;
 
-	if (!test_bit(dev->repeat_key, dev->key))

-   return;
+   if (is_event_supported(type, dev->evbit, EV_MAX)) {
+   spin_lock_irq(&dev->event_lock);
 
-	input_event(dev, EV_KEY, dev->repeat_key, 2);

-   input_sync(dev);
+   grab = rcu_dereference(dev->grab);
+   if (!grab || grab == handle)
+   input_handle_event(dev, type, code, value);
 
-	if (dev->rep[REP_PERIOD])

-   mod_timer(&dev->timer, jiffies + 
msecs_to_jiffies(dev->rep[REP_PERIOD]));
+   spin_unlock_irq(&dev->event_lock);
+   }
 }
+EXPORT_SYMBOL(input_inject_event);

[...]

+   spin_lock_irq(&dev->event_lock);
+
+   /*
+* Simulate keyup events for all pressed keys so that handlers
+* are not left with "stuck" keys. The driver may continue
+* generate events even after we done here but they will not
+* reach any handlers.
+*/
+   if (is_event_supported(EV_KEY, dev->evbit, EV_MAX)) {
+   for (code = 0; code <= KEY_MAX; code++) {
+   if (is_event_supported(code, dev->keybit, KEY_MAX) &&
+   test_bit(code, dev->key)) {
+   input_pass_event(dev, EV_KEY, code, 0);
+   }
+   }
+   input_pass_event(dev, EV_SYN, SYN_REPORT, 1);
+   }
+
+   list_for_each_entry(handle, &dev->h_list, d_node)
+   handle->open = 0;
+
+   spin_unlock_irq(&dev->event_lock);



spin_lock_irq() should generally be avoided.

In cases like the first case -- input_repeat_key() -- you are making 
incorrect assumptions about the state of interrupts.  The other cases 
are probably ok, but in general spin_lock_irq() has a long history of 
being very fragile and quite often wrong.


Use spin_lock_irqsave() to be safe.  Definitely in input_repeat_key(), 
but I strongly recommend removing spin_lock_irq() from all your patches 
here.
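
The general hazard can be shown with a userspace model in which one flag
stands in for the CPU's interrupt-enable state. This is only a sketch of the
save/restore semantics, not kernel code, and every model_* name is made up
for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model: one flag stands in for the interrupt-enable state. */
static bool irqs_enabled = true;

/* Models the irqsave side: remember the current state, then disable. */
static bool model_irq_save(void)
{
	bool was_enabled = irqs_enabled;

	irqs_enabled = false;
	return was_enabled;
}

/* Models the irqrestore side: put back whatever the caller had. */
static void model_irq_restore(bool was_enabled)
{
	assert(!irqs_enabled); /* still inside the critical section */
	irqs_enabled = was_enabled;
}

/* Models spin_lock_irq(): disable unconditionally. */
static void model_lock_irq(void)
{
	irqs_enabled = false;
}

/* Models spin_unlock_irq(): enable unconditionally. */
static void model_unlock_irq(void)
{
	assert(!irqs_enabled);
	irqs_enabled = true;
}

static void critical_section_irq(void)
{
	model_lock_irq();
	/* ... touch shared state ... */
	model_unlock_irq();
}

static void critical_section_irqsave(void)
{
	bool flags = model_irq_save();
	/* ... touch shared state ... */
	model_irq_restore(flags);
}

/* Run fn with interrupts already off; report the state it leaves behind. */
static bool state_after_call_with_irqs_off(void (*fn)(void))
{
	bool result;

	irqs_enabled = false;
	fn();
	result = irqs_enabled;
	irqs_enabled = true; /* reset the model for the next experiment */
	return result;
}
```

In the model, a caller that enters critical_section_irq() with interrupts off
gets them back on when it returns, while critical_section_irqsave() leaves the
caller's state as it found it.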


Jeff




Re: [kvm-devel] [RFC 0/8]KVM: swap out guest pages

2007-07-23 Thread Avi Kivity
Rusty Russell wrote:
> On Mon, 2007-07-23 at 13:27 +0300, Avi Kivity wrote:
>   
>> Having an address_space (like your patch does) is remarkably simple, and 
>> requires few hooks from the current vm.  However using existing vmas 
>> mapped by the user has many advantages:
>>
>> - compatible with s390 requirements
>> - allows the user to use hugetlbfs pages, which have a performance 
>> advantage using ept/npt (but which are unswappable)
>> - allows the user to map a file (which can be regarded as way to specify 
>> the swap device)
>> - better integration with the rest of the vm
>> 
>
> You don't need to expose the vmas.  You just have userspace point out
> the start+len of each region of memory it wants the guest to be able to
> access, and the address it wants it to appear in the guest.
>
> This is a slight superset of what lguest does in two ways:
>
> 1) my guest address == user address, but I'm looking at adding an offset
> so I don't have to link the launcher binary specially.
> 2) I have only one contiguous region of guest-physical memory, since I
> can place device memory immediately above "normal" mem.
>
>   

My intent was to allow userspace to assign a virtual address
range to a memory slot.

So long as you don't do swapping, all is simple, since you can do a
get_user_pages() on initialization or when installing a shadow pte.  But
if you want to swap, you need:

- a way to transfer the dirty bit from the shadow ptes to the struct page
- a way to let the vm rmap know that there are shadow ptes that point to
the page in addition to Linux ptes.  These shadow ptes may be in a
different format than Linux ptes.
- a different tlb invalidation method with ASIDs

It's not going to be simple.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [RFC 0/8]KVM: swap out guest pages

2007-07-23 Thread Avi Kivity
Jeff Dike wrote:
> On Mon, Jul 23, 2007 at 01:27:40PM +0300, Avi Kivity wrote:
>   
>> Having an address_space (like your patch does) is remarkably simple, and 
>> requires few hooks from the current vm.  However using existing vmas 
>> mapped by the user has many advantages:
>> 
>
> It's also needed for a SKAS-like UML client, where the host side will
> need to make system calls on behalf of the guest.
>
>   

Even in the current model, guest physical memory is mmap()ed into host
userspace.  The kernel cannot enforce this, however.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: -mm merge plans for 2.6.23

2007-07-23 Thread Andrew Morton
On Mon, 23 Jul 2007 21:53:38 -0700 "Ray Lee" <[EMAIL PROTECTED]> wrote:

> 
> Since this merge period has appeared particularly frazzling for
> Andrew, I've been keeping silent and waiting for him to get to a point
> where there's a breather. I didn't feel it would be polite to request
> yet more work out of him while he had a mess on his hands.

Let it just be noted that Con is not the only one who has expended effort
on this patch.  It's been in -mm for nearly two years and it has meant
ongoing effort for me and, to a lesser extent, other MM developers to keep
it alive.

> But, given this has come to a head, I'm asking now.
> 
> Andrew? You've always given the impression that you want this run more
> as an engineering effort than an artistic endeavour, so help us out
> here. What are your concerns with swap prefetch? What sort of
> comparative data would you like to see to justify its inclusion, or to
> prove that it's not needed?

Criteria are different for each patch, but it usually comes down to a
cost/benefit judgement.  Does the benefit of the patch exceed its
maintenance cost over the lifetime of the kernel (whatever that is).

In this case the answer to that has never been clear to me.  The (much
older) fs-aio patches were (are) in a similar situation.

The other consideration here is, as Nick points out, are the problems which
people see this patch solving for them solvable in other, better ways?
IOW, is this patch fixing up preexisting deficiencies post-facto?

To attack the second question we could start out with bug reports: system A
with workload B produces result C.  I think result C is wrong for <reasons>
and would prefer to see result D.

> Or are we reading too much into the fact that it isn't merged? In
> short, communicate please, it will help.

Well.  The above, plus there's always a lot of stuff happening in MM land,
and I haven't seen much in the way of enthusiasm from the usual MM
developers.



Re: 2.6.23-rc1: BUG_ON in kmap_atomic_prot()

2007-07-23 Thread Alexey Dobriyan
On Mon, Jul 23, 2007 at 03:27:12PM -0700, Andrew Morton wrote:
> On Tue, 24 Jul 2007 02:04:46 +0400
> Alexey Dobriyan <[EMAIL PROTECTED]> wrote:
> 
> > On Mon, Jul 23, 2007 at 02:11:37PM -0700, Andrew Morton wrote:
> > > On Tue, 24 Jul 2007 01:01:53 +0400
> > > Alexey Dobriyan <[EMAIL PROTECTED]> wrote:
> > > 
> > > > On Tue, Jul 24, 2007 at 12:40:45AM +0400, Alexey Dobriyan wrote:
> > > > > > I had more complete info: 
> > > > > > http://article.gmane.org/gmane.linux.network/66966
> > > > > > 
> > > > > > You're using DEBUG_PAGEALLOC, but I was not, so I think we can rule 
> > > > > > that out.
> > > > > > 
> > > > > > I haven't worked out where that kmap_atomic() call is coming from yet.
> > > > > > Both traces point up into the page allocator, but I _think_ that's stack
> > > > > > gunk.
> > > > > 
> > > > > Ahh, you suspect networking.
> > > > > 
> > > > > Here, setup is 2 cheap-ass 100Mb realtek 8139 NICs, one to campus network
> > > > > receiving ~20 junk packets per second, one gathering netconsole output
> > > > > and ssh to it, no conntracks and fancy stuff.
> > > > > 
> > > > > [reboots with cables physically unplugged]
> > > > 
> > > > OK, I run gdb recompile, cat(1) every file in /usr/portage (shitload of
> > > > small files) with both cables unplugged. It all went fine for ~5 minutes
> > > > after that it crashed exactly same way after 10 secs after plugging one
> > > > of them.
> > > 
> > > It'd be nice to get a clean trace.  Are you able to obtain the full
> > > trace with CONFIG_FRAME_POINTER=y?
> > 
> > Sorry, no camera shot, finding camera requires wakening up M. :)
> > 
> > It took longer than usual, but here it is
> > 
> > kmap_atomic
> > get_page_from_freelist
> > __alloc_pages
> > cache_alloc_refill
> > __alloc_pages
> > cache_alloc_refill
> > kmem_cache_alloc
> > dst_alloc
> > ip_route_input
> > ip_rcv
> > netif_receive_skb
> > rtl8139_poll
> > net_rx_action
> > __do_softirq
> > do_softirq
> > irq_exit
> > do_IRQ
> > common_interrupt
> > handle_mm_fault
> > do_page_fault
> > error_core
> > 
> > much more loaded x86_64 box near also running 2.6.23-rc1 with debugging
> > turned on, using atl1 driver doesn't experience any crashes.
> > 
> > And I found 2.6.22-b91cba52e9b7b3f1c0037908a192d93a869ca9e5-x entry on
> > top of grub config which means b91cba52e9b7b3f1c0037908a192d93a869ca9e5
> > _without_ any debugging was OK.
> 
> I worked out that the crash I saw was in
> 
> BUG_ON(!pte_none(*(kmap_pte-idx)));
> 
> in the read of kmap_pte[idx].  Which would be weird as the caller is using 
> a literal KM_USER0.
> 
> So maybe I goofed, and that BUG_ON is triggering (it scrolled off, and I am
> unable to reproduce it now).
> 
> If that BUG_ON _is_ triggering then it might indicate that someone is doing
> a __GFP_HIGHMEM|__GFP_ZERO allocation while holding KM_USER0.
> 
> If they're holding an atomic kmap then they'll be running in_atomic so it
> is unlikely that they accidentally added __GFP_WAIT because lots of people
> would be getting lots of might_sleep() warnings.
> 
> Hence that first VM_BUG_ON in prep_zero_page() _should_ be triggering.
> 
> Do you have CONFIG_DEBUG_VM enabled?

Yes.

> Also, it might be useful to apply -mm's kmap_atomic-debugging.patch.  it
> will detect lots of abuse.

I hit it only once with this patch applied, but there were no additional
warnings.
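
For anyone following along, the condition behind that BUG_ON can be modelled
in user space. This is only a minimal sketch, not the kernel code:
slot_map(), slot_unmap() and the slots[] array are hypothetical stand-ins for
kmap_atomic(), kunmap_atomic() and the kmap_pte entries, and the sketch
returns an error where the kernel would BUG.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the atomic kmap slot types (KM_USER0, ...). */
enum slot_type { SLOT_USER0, SLOT_USER1, SLOT_NR };

/* One entry per slot; NULL plays the role of pte_none(). */
static const void *slots[SLOT_NR];

/*
 * Map a page into a slot. Returns 0 on success and -1 if the slot is
 * already occupied, which is the situation the BUG_ON() complains about.
 */
static int slot_map(enum slot_type type, const void *page)
{
	if (slots[type] != NULL)
		return -1;
	slots[type] = page;
	return 0;
}

/* Release a slot so that it can be reused. */
static void slot_unmap(enum slot_type type)
{
	slots[type] = NULL;
}
```

Mapping SLOT_USER0 twice without an unmap in between fails on the second
call. That is the nested-use pattern suspected above: a path that needs
KM_USER0 runs while KM_USER0 is still held.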



[RFC/RFT 3/5] Input: tsdev - implement proper locking

2007-07-23 Thread Dmitry Torokhov
Input: tsdev - implement proper locking

Signed-off-by: Dmitry Torokhov <[EMAIL PROTECTED]>
---

 drivers/input/tsdev.c |  392 +++---
 1 files changed, 278 insertions(+), 114 deletions(-)

Index: work/drivers/input/tsdev.c
===
--- work.orig/drivers/input/tsdev.c
+++ work/drivers/input/tsdev.c
@@ -112,6 +112,8 @@ struct tsdev {
struct input_handle handle;
wait_queue_head_t wait;
struct list_head client_list;
+   spinlock_t client_lock; /* protects client_list */
+   struct mutex mutex;
struct device dev;
 
int x, y, pressure;
@@ -122,8 +124,9 @@ struct tsdev_client {
struct fasync_struct *fasync;
struct list_head node;
struct tsdev *tsdev;
+   struct ts_event buffer[TSDEV_BUFFER_SIZE];
int head, tail;
-   struct ts_event event[TSDEV_BUFFER_SIZE];
+   spinlock_t buffer_lock; /* protects access to buffer, head and tail */
int raw;
 };
 
@@ -137,6 +140,7 @@ struct tsdev_client {
 #define TS_SET_CAL _IOW(IOC_H3600_TS_MAGIC, 11, struct ts_calibration)
 
 static struct tsdev *tsdev_table[TSDEV_MINORS/2];
+static DEFINE_MUTEX(tsdev_table_mutex);
 
 static int tsdev_fasync(int fd, struct file *file, int on)
 {
@@ -144,9 +148,91 @@ static int tsdev_fasync(int fd, struct f
int retval;
 
retval = fasync_helper(fd, file, on, &client->fasync);
+
return retval < 0 ? retval : 0;
 }
 
+static void tsdev_free(struct device *dev)
+{
+   struct tsdev *tsdev = container_of(dev, struct tsdev, dev);
+
+   kfree(tsdev);
+}
+
+static void tsdev_attach_client(struct tsdev *tsdev, struct tsdev_client *client)
+{
+   spin_lock(&tsdev->client_lock);
+   list_add_tail_rcu(&client->node, &tsdev->client_list);
+   spin_unlock(&tsdev->client_lock);
+   synchronize_sched();
+}
+
+static void tsdev_detach_client(struct tsdev *tsdev, struct tsdev_client *client)
+{
+   spin_lock(&tsdev->client_lock);
+   list_del_rcu(&client->node);
+   spin_unlock(&tsdev->client_lock);
+   synchronize_sched();
+}
+
+static int tsdev_open_device(struct tsdev *tsdev)
+{
+   int retval;
+
+   retval = mutex_lock_interruptible(&tsdev->mutex);
+   if (retval)
+   return retval;
+
+   if (!tsdev->exist)
+   retval = -ENODEV;
+   else if (!tsdev->open++)
+   retval = input_open_device(&tsdev->handle);
+
+   mutex_unlock(&tsdev->mutex);
+   return retval;
+}
+
+static void tsdev_close_device(struct tsdev *tsdev)
+{
+   mutex_lock(&tsdev->mutex);
+
+   if (tsdev->exist && !--tsdev->open)
+   input_close_device(&tsdev->handle);
+
+   mutex_unlock(&tsdev->mutex);
+}
+
+/*
+ * Wake up users waiting for IO so they can disconnect from
+ * dead device.
+ */
+static void tsdev_hangup(struct tsdev *tsdev)
+{
+   struct tsdev_client *client;
+
+   spin_lock(&tsdev->client_lock);
+   list_for_each_entry(client, &tsdev->client_list, node)
+   kill_fasync(&client->fasync, SIGIO, POLL_HUP);
+   spin_unlock(&tsdev->client_lock);
+
+   wake_up_interruptible(&tsdev->wait);
+}
+
+static int tsdev_release(struct inode *inode, struct file *file)
+{
+   struct tsdev_client *client = file->private_data;
+   struct tsdev *tsdev = client->tsdev;
+
+   tsdev_fasync(-1, file, 0);
+   tsdev_detach_client(tsdev, client);
+   kfree(client);
+
+   tsdev_close_device(tsdev);
+   put_device(&tsdev->dev);
+
+   return 0;
+}
+
 static int tsdev_open(struct inode *inode, struct file *file)
 {
int i = iminor(inode) - TSDEV_MINOR_BASE;
@@ -161,11 +247,16 @@ static int tsdev_open(struct inode *inod
if (i >= TSDEV_MINORS)
return -ENODEV;
 
+   error = mutex_lock_interruptible(&tsdev_table_mutex);
+   if (error)
+   return error;
tsdev = tsdev_table[i & TSDEV_MINOR_MASK];
-   if (!tsdev || !tsdev->exist)
-   return -ENODEV;
+   if (tsdev)
+   get_device(&tsdev->dev);
+   mutex_unlock(&tsdev_table_mutex);
 
-   get_device(&tsdev->dev);
+   if (!tsdev)
+   return -ENODEV;
 
client = kzalloc(sizeof(struct tsdev_client), GFP_KERNEL);
if (!client) {
@@ -173,51 +264,42 @@ static int tsdev_open(struct inode *inod
goto err_put_tsdev;
}
 
+   spin_lock_init(&client->buffer_lock);
client->tsdev = tsdev;
-   client->raw = (i >= TSDEV_MINORS / 2) ? 1 : 0;
-   list_add_tail(&client->node, &tsdev->client_list);
+   client->raw = i >= TSDEV_MINORS / 2;
+   tsdev_attach_client(tsdev, client);
 
-   if (!tsdev->open++ && tsdev->exist) {
-   error = input_open_device(&tsdev->handle);
-   if (error)
-   goto err_free_client;
-   }
+   error = tsdev_open_device(tsdev
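
(The archived copy of this patch is cut off at this point.)

The open counting done by tsdev_open_device() and tsdev_close_device() can
be tried out in user space. What follows is a minimal sketch under
assumptions: struct model_dev and the model_*() helpers are hypothetical, the
hw_opens/hw_closes counters stand in for input_open_device() and
input_close_device(), and the mutex the patch takes around each body is left
out because the model is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model of the per-device state. */
struct model_dev {
	bool exist;      /* cleared once the underlying device is gone */
	int open;        /* number of clients holding the device open */
	int hw_opens;    /* how often the underlying device was opened */
	int hw_closes;   /* how often the underlying device was closed */
};

/* The patch holds tsdev->mutex around the equivalent of this body. */
static int model_open_device(struct model_dev *dev)
{
	if (!dev->exist)
		return -ENODEV;
	if (!dev->open++)                 /* 0 -> 1: first opener */
		dev->hw_opens++;
	return 0;
}

static void model_close_device(struct model_dev *dev)
{
	if (dev->exist && !--dev->open)   /* 1 -> 0: last closer */
		dev->hw_closes++;
}
```

Two opens followed by two closes touch the underlying device exactly once in
each direction, and an open after the device has disappeared fails with
-ENODEV.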

[RFC/RFT 5/5] Input: joydev - implement proper locking

2007-07-23 Thread Dmitry Torokhov
Input: joydev - implement proper locking

Signed-off-by: Dmitry Torokhov <[EMAIL PROTECTED]>
---

 drivers/input/joydev.c |  745 -
 1 files changed, 493 insertions(+), 252 deletions(-)

Index: work/drivers/input/joydev.c
===
--- work.orig/drivers/input/joydev.c
+++ work/drivers/input/joydev.c
@@ -43,6 +43,8 @@ struct joydev {
struct input_handle handle;
wait_queue_head_t wait;
struct list_head client_list;
+   spinlock_t client_lock; /* protects client_list */
+   struct mutex mutex;
struct device dev;
 
struct js_corr corr[ABS_MAX + 1];
@@ -61,31 +63,61 @@ struct joydev_client {
int head;
int tail;
int startup;
+   spinlock_t buffer_lock; /* protects access to buffer, head and tail */
struct fasync_struct *fasync;
struct joydev *joydev;
struct list_head node;
 };
 
 static struct joydev *joydev_table[JOYDEV_MINORS];
+static DEFINE_MUTEX(joydev_table_mutex);
 
 static int joydev_correct(int value, struct js_corr *corr)
 {
switch (corr->type) {
-   case JS_CORR_NONE:
-   break;
-   case JS_CORR_BROKEN:
-   value = value > corr->coef[0] ? (value < corr->coef[1] ? 0 :
-   ((corr->coef[3] * (value - corr->coef[1])) >> 14)) :
-   ((corr->coef[2] * (value - corr->coef[0])) >> 14);
-   break;
-   default:
-   return 0;
+
+   case JS_CORR_NONE:
+   break;
+
+   case JS_CORR_BROKEN:
+   value = value > corr->coef[0] ? (value < corr->coef[1] ? 0 :
+   ((corr->coef[3] * (value - corr->coef[1])) >> 14)) :
+   ((corr->coef[2] * (value - corr->coef[0])) >> 14);
+   break;
+
+   default:
+   return 0;
}
 
return value < -32767 ? -32767 : (value > 32767 ? 32767 : value);
 }
 
-static void joydev_event(struct input_handle *handle, unsigned int type, unsigned int code, int value)
+static void joydev_pass_event(struct joydev_client *client,
+ struct js_event *event)
+{
+   struct joydev *joydev = client->joydev;
+
+   /*
+* IRQs already disabled, just acquire the lock
+*/
+   spin_lock(&client->buffer_lock);
+
+   client->buffer[client->head] = *event;
+
+   if (client->startup == joydev->nabs + joydev->nkey) {
+   client->head++;
+   client->head &= JOYDEV_BUFFER_SIZE - 1;
+   if (client->tail == client->head)
+   client->startup = 0;
+   }
+
+   spin_unlock(&client->buffer_lock);
+
+   kill_fasync(&client->fasync, SIGIO, POLL_IN);
+}
+
+static void joydev_event(struct input_handle *handle,
+unsigned int type, unsigned int code, int value)
 {
struct joydev *joydev = handle->private;
struct joydev_client *client;
@@ -93,39 +125,32 @@ static void joydev_event(struct input_ha
 
switch (type) {
 
-   case EV_KEY:
-   if (code < BTN_MISC || value == 2)
-   return;
-   event.type = JS_EVENT_BUTTON;
-   event.number = joydev->keymap[code - BTN_MISC];
-   event.value = value;
-   break;
-
-   case EV_ABS:
-   event.type = JS_EVENT_AXIS;
-   event.number = joydev->absmap[code];
-   event.value = joydev_correct(value, joydev->corr + event.number);
-   if (event.value == joydev->abs[event.number])
-   return;
-   joydev->abs[event.number] = event.value;
-   break;
+   case EV_KEY:
+   if (code < BTN_MISC || value == 2)
+   return;
+   event.type = JS_EVENT_BUTTON;
+   event.number = joydev->keymap[code - BTN_MISC];
+   event.value = value;
+   break;
 
-   default:
+   case EV_ABS:
+   event.type = JS_EVENT_AXIS;
+   event.number = joydev->absmap[code];
+   event.value = joydev_correct(value,
+   &joydev->corr[event.number]);
+   if (event.value == joydev->abs[event.number])
return;
+   joydev->abs[event.number] = event.value;
+   break;
+
+   default:
+   return;
}
 
event.time = jiffies_to_msecs(jiffies);
 
-   list_for_each_entry(client, &joydev->client_list, node) {
-
-   memcpy(client->buffer + client->head, &event, sizeof(struct js_event));
-
-   if (client->startup =

[RFC/RFT 4/5] Input: mousedev - implement proper locking

2007-07-23 Thread Dmitry Torokhov
Input: mousedev - implement proper locking

Signed-off-by: Dmitry Torokhov <[EMAIL PROTECTED]>
---

 drivers/input/mousedev.c |  736 +--
 1 files changed, 464 insertions(+), 272 deletions(-)

Index: work/drivers/input/mousedev.c
===
--- work.orig/drivers/input/mousedev.c
+++ work/drivers/input/mousedev.c
@@ -61,9 +61,11 @@ struct mousedev {
int open;
int minor;
char name[16];
+   struct input_handle handle;
wait_queue_head_t wait;
struct list_head client_list;
-   struct input_handle handle;
+   spinlock_t client_lock; /* protects client_list */
+   struct mutex mutex;
struct device dev;
 
struct list_head mixdev_node;
@@ -113,108 +115,137 @@ static unsigned char mousedev_imex_seq[]
 static struct input_handler mousedev_handler;
 
 static struct mousedev *mousedev_table[MOUSEDEV_MINORS];
+static DEFINE_MUTEX(mousedev_table_mutex);
 static struct mousedev *mousedev_mix;
 static LIST_HEAD(mousedev_mix_list);
 
+static void mixdev_open_devices(void);
+static void mixdev_close_devices(void);
+
 #define fx(i)  (mousedev->old_x[(mousedev->pkt_count - (i)) & 03])
 #define fy(i)  (mousedev->old_y[(mousedev->pkt_count - (i)) & 03])
 
-static void mousedev_touchpad_event(struct input_dev *dev, struct mousedev *mousedev, unsigned int code, int value)
+static void mousedev_touchpad_event(struct input_dev *dev,
+   struct mousedev *mousedev,
+   unsigned int code, int value)
 {
int size, tmp;
enum { FRACTION_DENOM = 128 };
 
switch (code) {
-   case ABS_X:
-   fx(0) = value;
-   if (mousedev->touch && mousedev->pkt_count >= 2) {
-   size = dev->absmax[ABS_X] - dev->absmin[ABS_X];
-   if (size == 0)
-   size = 256 * 2;
-   tmp = ((value - fx(2)) * (256 * FRACTION_DENOM)) / size;
-   tmp += mousedev->frac_dx;
-   mousedev->packet.dx = tmp / FRACTION_DENOM;
-   mousedev->frac_dx = tmp - mousedev->packet.dx * FRACTION_DENOM;
-   }
-   break;
 
-   case ABS_Y:
-   fy(0) = value;
-   if (mousedev->touch && mousedev->pkt_count >= 2) {
-   /* use X size to keep the same scale */
-   size = dev->absmax[ABS_X] - dev->absmin[ABS_X];
-   if (size == 0)
-   size = 256 * 2;
-   tmp = -((value - fy(2)) * (256 * FRACTION_DENOM)) / size;
-   tmp += mousedev->frac_dy;
-   mousedev->packet.dy = tmp / FRACTION_DENOM;
-   mousedev->frac_dy = tmp - mousedev->packet.dy * FRACTION_DENOM;
-   }
-   break;
+   case ABS_X:
+   fx(0) = value;
+   if (mousedev->touch && mousedev->pkt_count >= 2) {
+   size = dev->absmax[ABS_X] - dev->absmin[ABS_X];
+   if (size == 0)
+   size = 256 * 2;
+   tmp = ((value - fx(2)) * 256 * FRACTION_DENOM) / size;
+   tmp += mousedev->frac_dx;
+   mousedev->packet.dx = tmp / FRACTION_DENOM;
+   mousedev->frac_dx =
+   tmp - mousedev->packet.dx * FRACTION_DENOM;
+   }
+   break;
+
+   case ABS_Y:
+   fy(0) = value;
+   if (mousedev->touch && mousedev->pkt_count >= 2) {
+   /* use X size to keep the same scale */
+   size = dev->absmax[ABS_X] - dev->absmin[ABS_X];
+   if (size == 0)
+   size = 256 * 2;
+   tmp = -((value - fy(2)) * 256 * FRACTION_DENOM) / size;
+   tmp += mousedev->frac_dy;
+   mousedev->packet.dy = tmp / FRACTION_DENOM;
+   mousedev->frac_dy = tmp -
+   mousedev->packet.dy * FRACTION_DENOM;
+   }
+   break;
}
 }
 
-static void mousedev_abs_event(struct input_dev *dev, struct mousedev *mousedev, unsigned int code, int value)
+static void mousedev_abs_event(struct input_dev *dev, struct mousedev *mousedev,
+   unsigned int code, int value)
 {
int size;
 
switch (code) {
-   case ABS_X:
-   size = dev->absmax[ABS_X] - dev->absmin[ABS_X];
-   if (size == 0)
-

[RFC/RFT 1/5] Input: implement proper locking in input core

2007-07-23 Thread Dmitry Torokhov
Input: implement proper locking in input core

Also add some kerneldoc documentation to input.h

Signed-off-by: Dmitry Torokhov <[EMAIL PROTECTED]>
---

 drivers/input/input.c |  656 --
 include/linux/input.h |  112 +++-
 2 files changed, 585 insertions(+), 183 deletions(-)

Index: work/drivers/input/input.c
===
--- work.orig/drivers/input/input.c
+++ work/drivers/input/input.c
@@ -17,10 +17,10 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
+#include 
 
 MODULE_AUTHOR("Vojtech Pavlik <[EMAIL PROTECTED]>");
 MODULE_DESCRIPTION("Input core");
@@ -31,167 +31,242 @@ MODULE_LICENSE("GPL");
 static LIST_HEAD(input_dev_list);
 static LIST_HEAD(input_handler_list);
 
+/*
+ * input_mutex protects access to both input_dev_list and input_handler_list.
+ * This also causes input_[un]register_device and input_[un]register_handler
+ * to be mutually exclusive, which simplifies locking in drivers implementing
+ * input handlers.
+ */
+static DEFINE_MUTEX(input_mutex);
+
 static struct input_handler *input_table[8];
 
-/**
- * input_event() - report new input event
- * @dev: device that generated the event
- * @type: type of the event
- * @code: event code
- * @value: value of the event
- *
- * This function should be used by drivers implementing various input devices
- * See also input_inject_event()
- */
-void input_event(struct input_dev *dev, unsigned int type, unsigned int code, int value)
+static inline int is_event_supported(unsigned int code,
+unsigned long *bm, unsigned int max)
 {
-   struct input_handle *handle;
+   return code <= max && test_bit(code, bm);
+}
 
-   if (type > EV_MAX || !test_bit(type, dev->evbit))
-   return;
+static int input_defuzz_abs_event(int value, int old_val, int fuzz)
+{
+   if (fuzz) {
+   if (value > old_val - fuzz / 2 && value < old_val + fuzz / 2)
+   return old_val;
 
-   add_input_randomness(type, code, value);
+   if (value > old_val - fuzz && value < old_val + fuzz)
+   return (old_val * 3 + value) / 4;
 
-   switch (type) {
+   if (value > old_val - fuzz * 2 && value < old_val + fuzz * 2)
+   return (old_val + value) / 2;
+   }
 
-   case EV_SYN:
-   switch (code) {
-   case SYN_CONFIG:
-   if (dev->event)
-   dev->event(dev, type, code, value);
-   break;
-
-   case SYN_REPORT:
-   if (dev->sync)
-   return;
-   dev->sync = 1;
-   break;
-   }
-   break;
+   return value;
+}
 
-   case EV_KEY:
+/*
+ * Pass event through all open handles. This function is called with
+ * dev->event_lock held and interrupts disabled. Because of that we
+ * do not need to use rcu_read_lock() here although we are using RCU
+ * to access handle list.
+ */
+static void input_pass_event(struct input_dev *dev,
+unsigned int type, unsigned int code, int value)
+{
+   struct input_handle *handle = rcu_dereference(dev->grab);
 
-   if (code > KEY_MAX || !test_bit(code, dev->keybit) || !!test_bit(code, dev->key) == value)
-   return;
+   if (handle)
+   handle->handler->event(handle, type, code, value);
+   else
+   list_for_each_entry_rcu(handle, &dev->h_list, d_node)
+   if (handle->open)
+   handle->handler->event(handle,
+   type, code, value);
+}
 
-   if (value == 2)
-   break;
+/*
+ * Generate software autorepeat event. Note that we take
+ * dev->event_lock here to avoid racing with input_event
+ * which may cause keys to get "stuck".
+ */
+static void input_repeat_key(unsigned long data)
+{
+   struct input_dev *dev = (void *) data;
 
-   change_bit(code, dev->key);
+   spin_lock_irq(&dev->event_lock);
 
-   if (test_bit(EV_REP, dev->evbit) && dev->rep[REP_PERIOD] && dev->rep[REP_DELAY] && dev->timer.data && value) {
-   dev->repeat_key = code;
-   mod_timer(&dev->timer, jiffies + msecs_to_jiffies(dev->rep[REP_DELAY]));
-   }
+   if (test_bit(dev->repeat_key, dev->key) &&
+   is_event_supported(dev->repeat_key, dev->keybit, KEY_MAX)) {
 
-   break;
+   input_pa
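
(The archived copy of this patch is cut off at this point.)

The banded filter in input_defuzz_abs_event() can be tried out in user space.
This is a minimal sketch rather than the driver code: defuzz() is a
hypothetical name, and it assumes the innermost band (closer than fuzz / 2)
keeps the previous value so that jitter is suppressed, which is how I
understand the intent of the filter.

```c
#include <assert.h>

/*
 * Three-band defuzz filter: the closer the new sample is to the old
 * one, the more weight the old value gets. A fuzz of 0 disables it.
 */
static int defuzz(int value, int old_val, int fuzz)
{
	if (fuzz) {
		/* Within fuzz/2: treat as noise, keep the old value (assumed). */
		if (value > old_val - fuzz / 2 && value < old_val + fuzz / 2)
			return old_val;

		/* Within fuzz: move a quarter of the way to the new sample. */
		if (value > old_val - fuzz && value < old_val + fuzz)
			return (old_val * 3 + value) / 4;

		/* Within 2 * fuzz: move halfway. */
		if (value > old_val - fuzz * 2 && value < old_val + fuzz * 2)
			return (old_val + value) / 2;
	}

	/* Far away, or filtering disabled: take the sample as is. */
	return value;
}
```

With old_val = 100 and fuzz = 8, a sample of 102 is held at 100, 106 becomes
101, 112 becomes 106, and 130 passes through unchanged.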

[RFC/RFT 2/5] evdev - implement proper locking

2007-07-23 Thread Dmitry Torokhov
Input: evdev - implement proper locking

Signed-off-by: Dmitry Torokhov <[EMAIL PROTECTED]>
---

 drivers/input/evdev.c |  719 +-
 1 files changed, 476 insertions(+), 243 deletions(-)

Index: work/drivers/input/evdev.c
===
--- work.orig/drivers/input/evdev.c
+++ work/drivers/input/evdev.c
@@ -30,6 +30,8 @@ struct evdev {
wait_queue_head_t wait;
struct evdev_client *grab;
struct list_head client_list;
+   spinlock_t client_lock; /* protects client_list */
+   struct mutex mutex;
struct device dev;
 };
 
@@ -37,39 +39,53 @@ struct evdev_client {
struct input_event buffer[EVDEV_BUFFER_SIZE];
int head;
int tail;
+   spinlock_t buffer_lock; /* protects access to buffer, head and tail */
struct fasync_struct *fasync;
struct evdev *evdev;
struct list_head node;
 };
 
 static struct evdev *evdev_table[EVDEV_MINORS];
+static DEFINE_MUTEX(evdev_table_mutex);
 
-static void evdev_event(struct input_handle *handle, unsigned int type, unsigned int code, int value)
+static void evdev_pass_event(struct evdev_client *client,
+struct input_event *event)
+{
+   /*
+* Interrupts are disabled, just acquire the lock
+*/
+   spin_lock(&client->buffer_lock);
+   client->buffer[client->head++] = *event;
+   client->head &= EVDEV_BUFFER_SIZE - 1;
+   spin_unlock(&client->buffer_lock);
+
+   kill_fasync(&client->fasync, SIGIO, POLL_IN);
+}
+
+/*
+ * Pass incoming event to all connected clients. Note that we are
+ * called under a spinlock with interrupts off so we don't need
+ * to use rcu_read_lock() here. Writers will be using synchronize_sched()
+ * instead of synchronize_rcu().
+ */
+static void evdev_event(struct input_handle *handle,
+   unsigned int type, unsigned int code, int value)
 {
struct evdev *evdev = handle->private;
struct evdev_client *client;
+   struct input_event event;
 
-   if (evdev->grab) {
-   client = evdev->grab;
-
-   do_gettimeofday(&client->buffer[client->head].time);
-   client->buffer[client->head].type = type;
-   client->buffer[client->head].code = code;
-   client->buffer[client->head].value = value;
-   client->head = (client->head + 1) & (EVDEV_BUFFER_SIZE - 1);
-
-   kill_fasync(&client->fasync, SIGIO, POLL_IN);
-   } else
-   list_for_each_entry(client, &evdev->client_list, node) {
-
-   do_gettimeofday(&client->buffer[client->head].time);
-   client->buffer[client->head].type = type;
-   client->buffer[client->head].code = code;
-   client->buffer[client->head].value = value;
-   client->head = (client->head + 1) & (EVDEV_BUFFER_SIZE - 1);
-
-   kill_fasync(&client->fasync, SIGIO, POLL_IN);
-   }
+   do_gettimeofday(&event.time);
+   event.type = type;
+   event.code = code;
+   event.value = value;
+
+   client = rcu_dereference(evdev->grab);
+   if (client)
+   evdev_pass_event(client, &event);
+   else
+   list_for_each_entry_rcu(client, &evdev->client_list, node)
+   evdev_pass_event(client, &event);
 
wake_up_interruptible(&evdev->wait);
 }
@@ -88,38 +104,142 @@ static int evdev_flush(struct file *file
 {
struct evdev_client *client = file->private_data;
struct evdev *evdev = client->evdev;
+   int retval;
+
+   retval = mutex_lock_interruptible(&evdev->mutex);
+   if (retval)
+   return retval;
 
if (!evdev->exist)
-   return -ENODEV;
+   retval = -ENODEV;
+   else
+   retval = input_flush_device(&evdev->handle, file);
 
-   return input_flush_device(&evdev->handle, file);
+   mutex_unlock(&evdev->mutex);
+   return retval;
 }
 
 static void evdev_free(struct device *dev)
 {
struct evdev *evdev = container_of(dev, struct evdev, dev);
 
-   evdev_table[evdev->minor] = NULL;
kfree(evdev);
 }
 
+/*
+ * Grabs an event device (along with underlying input device).
+ * This function is called with evdev->mutex taken.
+ */
+static int evdev_grab(struct evdev *evdev, struct evdev_client *client)
+{
+   int error;
+
+   if (evdev->grab)
+   return -EBUSY;
+
+   error = input_grab_device(&evdev->handle);
+   if (error)
+   return error;
+
+   rcu_assign_pointer(evdev->grab, client);
+   /*
+* We don't use synchronize_rcu() here because read-side
+* critical section is protected by a spinlock instead
+* of rcu_read_lock().
+*/
+   synchronize_sched();
+
+   return 0;
+}
+
+sta
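
(The archived copy of this patch is cut off at this point.)

The per-client buffer handling in evdev_pass_event() is a power-of-two ring
with a lock around the index update. A minimal user-space sketch of that
pattern follows, under assumptions: struct ring and the ring_*() helpers are
hypothetical, a C11 atomic_flag stands in for buffer_lock, events are plain
ints, and the reader side is my guess at the matching consumer since it is
not part of the text shown here.

```c
#include <assert.h>
#include <stdatomic.h>

#define RING_SIZE 8   /* must be a power of two for the mask to work */

struct ring {
	int buffer[RING_SIZE];
	unsigned int head, tail;
	atomic_flag lock;   /* stand-in for the buffer_lock spinlock */
};

static void ring_init(struct ring *r)
{
	r->head = r->tail = 0;
	atomic_flag_clear(&r->lock);
}

static void ring_lock(struct ring *r)
{
	while (atomic_flag_test_and_set(&r->lock))
		;   /* spin until the holder clears the flag */
}

static void ring_unlock(struct ring *r)
{
	atomic_flag_clear(&r->lock);
}

/* Producer: store at head, then wrap head with the size mask. */
static void ring_push(struct ring *r, int ev)
{
	ring_lock(r);
	r->buffer[r->head++] = ev;
	r->head &= RING_SIZE - 1;
	ring_unlock(r);
}

/* Consumer: returns 1 and fills *ev if an event was queued, else 0. */
static int ring_pop(struct ring *r, int *ev)
{
	int have;

	ring_lock(r);
	have = r->head != r->tail;
	if (have) {
		*ev = r->buffer[r->tail++];
		r->tail &= RING_SIZE - 1;
	}
	ring_unlock(r);

	return have;
}
```

As in the patch, the producer does not check for overrun. Pushing RING_SIZE
events without a pop makes head catch up with tail, and the ring then looks
empty to the reader.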

Re: -mm merge plans for 2.6.23

2007-07-23 Thread Ray Lee

On 7/23/07, Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
> Ray Lee wrote:
> > That said, I'm willing to run my day to day life through both a swap
> > prefetch kernel and a normal one. *However*, before I go through all
> > the work of instrumenting the damn thing, I'd really like Andrew (or
> > Linus) to lay out his acceptance criteria on the feature. Exactly what
> > *should* I be paying attention to? I've suggested keeping track of
> > process swapin delay total time, and comparing with and without. Is
> > that reasonable? Is it incomplete?
>
> Um, isn't it up to you?


Huh? I'm not Linus or Andrew, with the power to merge a patch to the
2.6 kernel, so I think that the answer to that is a really clear 'No.'


> 4. Does it make anything worse?  A lot or a little?  Rare corner
> cases, or a real world usage?  Again, numbers make the case most
> strongly.
>
> I can't say I've been following this particular feature very closely,
> but these are the fundamental questions that need to be dealt with in
> merging any significant change.  And as Nick says, historically point 4
> is very important in VM tuning changes, because "obvious" improvements
> have often ended up giving pathologically bad results on unexpected
> workloads.


Dude. My whole question was *what* numbers. Please go back and read it
all again. Maybe I was unclear, but I really don't think so.


[RFC/RFT 0/5] Input locking patches

2007-07-23 Thread Dmitry Torokhov
Hi everyone,

I finally managed to put together some patches implementing
locking in input core and main input handlers. Please look
over them and give them a spin.

--
Dmitry


Re: -mm merge plans for 2.6.23

2007-07-23 Thread Nick Piggin

Ray Lee wrote:
> On 7/23/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> That said, I'm willing to run my day to day life through both a swap
> prefetch kernel and a normal one. *However*, before I go through all
> the work of instrumenting the damn thing, I'd really like Andrew (or
> Linus) to lay out his acceptance criteria on the feature. Exactly what
> *should* I be paying attention to? I've suggested keeping track of
> process swapin delay total time, and comparing with and without. Is
> that reasonable? Is it incomplete?


I don't feel it is so useful without more context. For example, in
most situations where pages get pushed to swap, there will *also* be
useful file backed pages being thrown out. Swap prefetch might
improve the total swapin delay time very significantly but that may
be just a tiny portion of the real problem.

Also, a random day at the desktop is quite a broad scope and pretty
well impossible to analyse. It would help if we could first look at
some specific problems that are easily identified.

Looking at your past email, you have a 1GB desktop system and your
overnight updatedb run is causing stuff to get swapped out such that
swap prefetch makes it significantly better. This is really
intriguing to me, and I would hope we can start by making this
particular workload "not suck" without swap prefetch (and hopefully
make it even better than it currently is with swap prefetch because
we'll try not to evict useful file backed pages as well).

After that we can look at other problems that swap prefetch helps
with, or think of some ways to measure your "whole day" scenario.

So when/if you have time, I can cook up a list of things to monitor
and possibly a patch to add some instrumentation over this updatedb
run.

Anyway, I realise swap prefetching has some situations where it will
fundamentally outperform even the page replacement oracle. This is
why I haven't asked for it to be dropped: it isn't a bad idea at all.

However, if we can improve basic page reclaim where it is obviously
lacking, that is always preferable. eg: being a highly speculative
operation, swap prefetch is not great for power efficiency -- but we
still want laptop users to have a good experience as well, right?

--
SUSE Labs, Novell Inc.


Re: -mm merge plans for 2.6.23

2007-07-23 Thread Jeremy Fitzhardinge
Ray Lee wrote:
> That said, I'm willing to run my day to day life through both a swap
> prefetch kernel and a normal one. *However*, before I go through all
> the work of instrumenting the damn thing, I'd really like Andrew (or
> Linus) to lay out his acceptance criteria on the feature. Exactly what
> *should* I be paying attention to? I've suggested keeping track of
> process swapin delay total time, and comparing with and without. Is
> that reasonable? Is it incomplete?

Um, isn't it up to you?  The questions that need to be answered are:

   1. What are you trying to achieve?  Presumably you have some intended
      or desired effect you're trying to get.  What's the intended
      audience?  Who would be expected to see a benefit?  Who suffers?
   2. How does the code achieve that end?  Is it nasty or nice?  Has
      everyone who's interested in the affected areas at least looked at
      the changes, or ideally given them a good review?  Does it need
      lots of tunables, or is it set-and-forget?
   3. Does it achieve the intended end?  Numbers are helpful here.
   4. Does it make anything worse?  A lot or a little?  Rare corner
      cases, or a real world usage?  Again, numbers make the case most
      strongly.


I can't say I've been following this particular feature very closely,
but these are the fundamental questions that need to be dealt with in
merging any significant change.  And as Nick says, historically point 4
is very important in VM tuning changes, because "obvious" improvements
have often ended up giving pathologically bad results on unexpected
workloads.

J


Re: [PATCH 03/10] readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

2007-07-23 Thread Andrew Morton
On Tue, 24 Jul 2007 12:32:15 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Mon, Jul 23, 2007 at 08:55:35PM -0700, Andrew Morton wrote:
> > On Tue, 24 Jul 2007 10:00:12 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > 
> > > @@ -342,11 +342,9 @@ ondemand_readahead(struct address_space 
> > >  bool hit_readahead_marker, pgoff_t offset,
> > >  unsigned long req_size)
> > >  {
> > > - int max;/* max readahead pages */
> > > - int sequential;
> > > -
> > > - max = ra->ra_pages;
> > > - sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
> > > + int max = ra->ra_pages; /* max readahead pages */
> > > + pgoff_t prev_offset;
> > > + int sequential;
> > >  
> > >   /*
> > >* It's the expected callback offset, assume sequential access.
> > > @@ -360,6 +358,9 @@ ondemand_readahead(struct address_space 
> > >   goto readit;
> > >   }
> > >  
> > > + prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;
> > > + sequential = offset - prev_offset <= 1UL || req_size > max;
> > 
> > It's a bit pointless using an opaque type for prev_offset here, and then
> > encoding the knowledge that it is implemented as "unsigned long".
> > 
> > It's a minor thing, but perhaps just "<= 1" would make more sense here.
> 
> Yeah, "<= 1" is OK.  But the expression still requires pgoff_t to be
> 'unsigned' to work correctly.
> 
> So what about "<= 1U"?

umm, if one really cared one could do

 == 1 ||  == 0

or something.  But whatever - let's leave it as-is.


Re: -mm merge plans for 2.6.23

2007-07-23 Thread Ray Lee

On 7/23/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> Not talking about swap prefetch itself, but every time I have asked
> anyone to instrument or produce some workload where swap prefetch
> helps, they never do.

[...]

> so for all the people who are whining about merging this and don't want
> to actually work on the code -- post some numbers for where it helps
> you!!


 You sound frustrated. Perhaps we could be
communicating better. I'll start.

Unlike others on the cc: line, I don't get paid to hack on the kernel,
not even indirectly. So if you find that my lack of providing numbers
is giving you heartache, I can only apologize and point at my paying
work that requires my attention.

That said, I'm willing to run my day to day life through both a swap
prefetch kernel and a normal one. *However*, before I go through all
the work of instrumenting the damn thing, I'd really like Andrew (or
Linus) to lay out his acceptance criteria on the feature. Exactly what
*should* I be paying attention to? I've suggested keeping track of
process swapin delay total time, and comparing with and without. Is
that reasonable? Is it incomplete?

Without Andrew's criteria, we're back to where we've been for a long
time: lots of work, no forward motion. Perhaps it's a character flaw
of mine, but I'd really like to know what would constitute proof here
before I invest the effort. Especially given that Con has already
written a test case that shows that swap prefetch works, and that I've
given you a clear argument for why better (or even perfect) page
reclaim can't provide full coverage to all the situations that swap
prefetch helps. (Also, it's not like I've got tons of free time, y'know?
Just like all the rest of you all, I have to pick and choose my
battles if I'm going to be effective.)

Since this merge period has appeared particularly frazzling for
Andrew, I've been keeping silent and waiting for him to get to a point
where there's a breather. I didn't feel it would be polite to request
yet more work out of him while he had a mess on his hands.

But, given this has come to a head, I'm asking now.

Andrew? You've always given the impression that you want this run more
as an engineering effort than an artistic endeavour, so help us out
here. What are your concerns with swap prefetch? What sort of
comparative data would you like to see to justify its inclusion, or to
prove that it's not needed?

Or are we reading too much into the fact that it isn't merged? In
short, communicate please, it will help.


Re: [PATCH 03/10] readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

2007-07-23 Thread Fengguang Wu
On Tue, Jul 24, 2007 at 12:32:15PM +0800, Fengguang Wu wrote:
> On Mon, Jul 23, 2007 at 08:55:35PM -0700, Andrew Morton wrote:
> > On Tue, 24 Jul 2007 10:00:12 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > 
> > > @@ -342,11 +342,9 @@ ondemand_readahead(struct address_space 
> > >  bool hit_readahead_marker, pgoff_t offset,
> > >  unsigned long req_size)
> > >  {
> > > - int max;/* max readahead pages */
> > > - int sequential;
> > > -
> > > - max = ra->ra_pages;
> > > - sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
> > > + int max = ra->ra_pages; /* max readahead pages */
> > > + pgoff_t prev_offset;
> > > + int sequential;
> > >  
> > >   /*
> > >* It's the expected callback offset, assume sequential access.
> > > @@ -360,6 +358,9 @@ ondemand_readahead(struct address_space 
> > >   goto readit;
> > >   }
> > >  
> > > + prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;
> > > + sequential = offset - prev_offset <= 1UL || req_size > max;
> > 
> > It's a bit pointless using an opaque type for prev_offset here, and then
> > encoding the knowledge that it is implemented as "unsigned long".
> > 
> > It's a minor thing, but perhaps just "<= 1" would make more sense here.
> 
> Yeah, "<= 1" is OK.  But the expression still requires pgoff_t to be
> 'unsigned' to work correctly.
> 
> So what about "<= 1U"?

I wrote a test case and found that if pgoff_t is 'signed long',
"<= 1U" still yields the wrong result. Only "<= 1UL" does the trick :(

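The effect reported above can be reproduced with a minimal sketch. It assumes an LP64 target, where "long" is 64 bits and "unsigned int" is 32 bits, and models that with fixed-width types so the result does not depend on the build machine. The names signed_pgoff_t, sequential_1u and sequential_1ul are hypothetical and are not kernel code; the real pgoff_t is unsigned long.

```c
#include <stdint.h>

/* Hypothetical signed stand-in for pgoff_t (assumption: 64-bit long). */
typedef int64_t signed_pgoff_t;

/*
 * Models "<= 1U" on LP64: the 32-bit unsigned constant is converted to
 * the wider signed type, so the comparison stays signed and a negative
 * delta (a backward seek) wrongly passes the test.
 */
int sequential_1u(signed_pgoff_t offset, signed_pgoff_t prev_offset)
{
	return offset - prev_offset <= (uint32_t)1;
}

/*
 * Models "<= 1UL" on LP64: the signed delta is converted to the 64-bit
 * unsigned type, so a negative delta becomes a huge value and fails.
 */
int sequential_1ul(signed_pgoff_t offset, signed_pgoff_t prev_offset)
{
	return offset - prev_offset <= (uint64_t)1;
}
```

With offset 10 and prev_offset 15, sequential_1u() reports the access as sequential while sequential_1ul() does not; both agree for deltas of 0 and 1.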


Re: [PATCH 03/10] readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

2007-07-23 Thread Fengguang Wu
On Mon, Jul 23, 2007 at 08:55:35PM -0700, Andrew Morton wrote:
> On Tue, 24 Jul 2007 10:00:12 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> 
> > @@ -342,11 +342,9 @@ ondemand_readahead(struct address_space 
> >bool hit_readahead_marker, pgoff_t offset,
> >unsigned long req_size)
> >  {
> > -   int max;/* max readahead pages */
> > -   int sequential;
> > -
> > -   max = ra->ra_pages;
> > -   sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
> > +   int max = ra->ra_pages; /* max readahead pages */
> > +   pgoff_t prev_offset;
> > +   int sequential;
> >  
> > /*
> >  * It's the expected callback offset, assume sequential access.
> > @@ -360,6 +358,9 @@ ondemand_readahead(struct address_space 
> > goto readit;
> > }
> >  
> > +   prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;
> > +   sequential = offset - prev_offset <= 1UL || req_size > max;
> 
> It's a bit pointless using an opaque type for prev_offset here, and then
> encoding the knowledge that it is implemented as "unsigned long".
> 
> It's a minor thing, but perhaps just "<= 1" would make more sense here.

Yeah, "<= 1" is OK.  But the expression still requires pgoff_t to be
'unsigned' to work correctly.

So what about "<= 1U"?



[PATCH] kernel-doc fixes for PCI and drivers/base/

2007-07-23 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

Fix undocumented function parameters in PCI and drivers/base.

Warning(linux-2.6.23-rc1//drivers/pci/pci.c:1526): No description found for parameter 'rq'
Warning(linux-2.6.23-rc1//drivers/base/firmware_class.c:245): No description found for parameter 'bin_attr'

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 drivers/base/firmware_class.c |1 +
 drivers/pci/pci.c |2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

--- linux-2.6.23-rc1.orig/drivers/base/firmware_class.c
+++ linux-2.6.23-rc1/drivers/base/firmware_class.c
@@ -232,6 +232,7 @@ fw_realloc_buffer(struct firmware_priv *
 /**
  * firmware_data_write - write method for firmware
  * @kobj: kobject for the device
+ * @bin_attr: bin_attr structure
  * @buffer: buffer being written
  * @offset: buffer offset for write in total data store area
  * @count: buffer size
--- linux-2.6.23-rc1.orig/drivers/pci/pci.c
+++ linux-2.6.23-rc1/drivers/pci/pci.c
@@ -1517,7 +1517,7 @@ EXPORT_SYMBOL(pcie_get_readrq);
 /**
  * pcie_set_readrq - set PCI Express maximum memory read request
  * @dev: PCI device to query
- * @count: maximum memory read count in bytes
+ * @rq: maximum memory read count in bytes
  *valid values are 128, 256, 512, 1024, 2048, 4096
  *
  * If possible sets maximum read byte count


[PATCH] kernel-doc fix for kmod.c

2007-07-23 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

Fix kmod.c:
Warning(linux-2.6.23-rc1//kernel/kmod.c:364): No description found for parameter 'envp'

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 kernel/kmod.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- linux-2.6.23-rc1.orig/kernel/kmod.c
+++ linux-2.6.23-rc1/kernel/kmod.c
@@ -351,11 +351,11 @@ static inline void register_pm_notifier_
 
 /**
  * call_usermodehelper_setup - prepare to call a usermode helper
- * @path - path to usermode executable
- * @argv - arg vector for process
- * @envp - environment for process
+ * @path: path to usermode executable
+ * @argv: arg vector for process
+ * @envp: environment for process
  *
- * Returns either NULL on allocation failure, or a subprocess_info
+ * Returns either %NULL on allocation failure, or a subprocess_info
  * structure.  This should be passed to call_usermodehelper_exec to
  * exec the process and free the structure.
  */
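
For reference, the kernel-doc conventions this patch applies ("@name:" with a colon for parameters, "%NAME" for constants) look like this on a small standalone function. This is only an illustrative sketch; demo_count_args is a hypothetical helper and does not exist in kernel/kmod.c.

```c
#include <stddef.h>

/**
 * demo_count_args - count entries in a NULL-terminated argument vector
 * @argv: argument vector, terminated by a %NULL pointer
 *
 * Returns the number of entries before the terminating %NULL, or 0 if
 * @argv itself is %NULL.
 */
size_t demo_count_args(char *const argv[])
{
	size_t n = 0;

	if (argv == NULL)
		return 0;

	/* Walk the vector until the terminating NULL pointer. */
	while (argv[n] != NULL)
		n++;

	return n;
}
```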


Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-23 Thread Nick Piggin

Fengguang Wu wrote:
> On Mon, Jul 23, 2007 at 12:40:09PM -0700, Andrew Morton wrote:
> >
> > This is all fun stuff, but how do we find out that changes like this are
> > good ones, apart from shipping it and seeing who gets hurt 12 months later?
>
> One thing I can imagine now is that the first pages may get more life
> because of the conservative initial readahead size.

Yeah I don't think it is really worth the complexity and corner cases
it would introduce at this stage.

People are still complaining about their nightly cron job *swapping*
stuff out of their 1-2GB desktop systems. There must still be some low
hanging fruit or obvious problems with page reclaim so I'd like to try
working out exactly what is going wrong and fix it in page reclaim
rather than adding more complexity in the hope that it might help.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


pxa timer code broken for non-CONFIG_GENERIC_CLOCKEVENTS

2007-07-23 Thread Robert Schwebel
Bill,

7bbb18c9f4783b6fb3bf27af71625b590cf4f00b aka 4507/1 intensively uses
stuff from include/linux/clockchips.h which are defined under

#ifdef CONFIG_GENERIC_CLOCKEVENTS

Unless I'm missing something, all PXA platforms which have
CONFIG_GENERIC_CLOCKEVENTS not set are currently broken and don't
compile any more.

Robert
-- 
 Dipl.-Ing. Robert Schwebel | http://www.pengutronix.de
 Pengutronix - Linux Solutions for Science and Industry
   Handelsregister:  Amtsgericht Hildesheim, HRA 2686
 Hannoversche Str. 2, 31134 Hildesheim, Germany
   Phone: +49-5121-206917-0 |  Fax: +49-5121-206917-9



Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses

2007-07-23 Thread Nick Piggin

Linus Torvalds wrote:
> On Mon, 23 Jul 2007, Satyam Sharma wrote:
> >
> > [4/8] i386: bitops: Kill volatile-casting of memory addresses
>
> This is wrong.
>
> The "const volatile" is so that you can pass an arbitrary pointer. The
> only kind of arbitrary pointer is "const volatile".
>
> In other words, the "volatile" has nothing at all to do with whether the
> memory is volatile or not (the same way "const" has nothing to do with it):
> it's purely a C type *safety* issue, exactly the same way "const" is a
> type safety issue.
>
> A "const" on a pointer doesn't mean that the thing it points to cannot
> change. When you pass a source pointer to "strlen()", it doesn't have to
> be constant. But "strlen()" takes a "const" pointer, because it works on
> constant pointers *too*.
>
> Same deal here.
>
> Admittedly this may be mostly historic, but regardless - the "volatiles"
> are right.
>
> Using volatile on *data* is generally considered incorrect and bad taste,
> but using it in situations like this potentially makes sense.
>
> Of course, if we remove all "volatiles" in data in the kernel (with the
> possible exception of "jiffies"), we can then remove them from function
> declarations too, but it should be done in that order.


Well, regardless, it still forces the function to treat the pointer
target as volatile, won't it? It definitely prevents valid optimisations
that would be useful for me in mm/page_alloc.c where page flags are
being set up or torn down or checked with non-atomic bitops.

OK, not the i386 functions as much because they are all in asm anyway,
but in general (btw. why does i386 or any architecture define their own
non-atomic bitops? If the version in asm-generic/bitops/non-atomic.h
is not good enough then surely it is a bug in gcc or that file?)

Anyway by type safety, do you mean it will stop the compiler from
warning if a pointer to a volatile is passed to the bitop? If so, then
why don't we just kill all the volatiles out of here and fix any
warnings that come up? I doubt there would be many, and of those, some
might show up real synchronisation problems.
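
The type-safety point quoted above can be seen in a small sketch: a parameter declared "const volatile unsigned long *" accepts pointers to plain, const, volatile and const volatile objects without a cast or a warning, much as strlen() accepts non-const strings. The helper names demo_test_bit and demo_all_qualifiers are hypothetical and this is not the i386 implementation; as noted in the reply, the qualifier also makes the compiler treat the access inside the function as volatile.

```c
#include <limits.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical non-atomic bit test that accepts any qualifier mix. */
int demo_test_bit(unsigned int nr, const volatile unsigned long *addr)
{
	return (addr[nr / DEMO_BITS_PER_LONG] >> (nr % DEMO_BITS_PER_LONG)) & 1UL;
}

/* Passes all four qualifier combinations to the same function. */
int demo_all_qualifiers(void)
{
	unsigned long plain = 0x5UL;            /* bits 0 and 2 set */
	const unsigned long ro = 0x5UL;
	volatile unsigned long vol = 0x5UL;
	const volatile unsigned long cv = 0x5UL;

	/* Bits 0, 2 and 0 are set; bit 1 is clear, so the sum is 3. */
	return demo_test_bit(0, &plain) + demo_test_bit(2, &ro) +
	       demo_test_bit(0, &vol) + demo_test_bit(1, &cv);
}
```

Dropping "volatile" from the parameter would make the &vol and &cv calls draw a discarded-qualifier diagnostic, which is the warning asked about above.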

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFH] Partition table recovery

2007-07-23 Thread Rene Herman

On 07/23/2007 03:58 PM, Theodore Tso wrote:

> Well, I'm considering this to be a MBR backup scheme, so Minix and BSD
> slices are legacy systems which are out of scope.  If they are busted in
> the same way as MBR in terms of not having redundant backups of critical
> data, then they have a lot fewer excuses than MBR, and they can address
> that issue in their own way.  The number of Linux users that also have
> Minix and BSD partitions is a vanishingly small number in any case.


I'd in fact expect quite a few people to have a FreeBSD partition around. 
And MINIX if they are in university and in an operating systems course...


But "they should take whatever precautions they want themselves" is a valid 
argument.


[ blkid ]


> Yeah, good point, I'd have to add that support into blkid.  It's been
> on my todo list, but I just haven't gotten around to it yet.


I'll for now stop updating the partbackup thingy as posted. Given that Linux
only follows the first extended in the list of extendeds (which sort of
destroys the nice recursion anyway) it might want to be iterative instead of 
recursive if the thing has a future -- not very important though.
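
To make the "first extended" remark concrete, here is a minimal userspace sketch that scans the four primary entries of an MBR sector and returns the index of the first extended one. It assumes the classic layout (0x55 0xAA signature at bytes 510-511, four 16-byte entries starting at byte 446, type byte at offset 4 of each entry) and treats types 0x05, 0x0f and 0x85 as extended. All demo_ names are hypothetical; this is neither the partbackup code nor the kernel's partition parser.

```c
#include <stddef.h>

#define DEMO_TABLE_OFFSET 446	/* start of the four primary entries */
#define DEMO_ENTRY_SIZE   16	/* bytes per partition entry */
#define DEMO_TYPE_OFFSET  4	/* partition type byte within an entry */

/* Assumed set of "extended" type codes: DOS, Win95 LBA, Linux. */
static int demo_is_extended(unsigned char type)
{
	return type == 0x05 || type == 0x0f || type == 0x85;
}

/*
 * Returns the index (0..3) of the first extended entry, or -1 if the
 * sector has no valid signature or no extended entry. A plain loop is
 * enough here; no recursion is needed to pick the first match.
 */
int demo_first_extended(const unsigned char *sector)
{
	int i;

	if (sector == NULL || sector[510] != 0x55 || sector[511] != 0xaa)
		return -1;

	for (i = 0; i < 4; i++) {
		const unsigned char *entry =
			sector + DEMO_TABLE_OFFSET + i * DEMO_ENTRY_SIZE;

		if (demo_is_extended(entry[DEMO_TYPE_OFFSET]))
			return i;
	}
	return -1;
}
```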



> My concern of sysfs is that #1, it won't work on older kernels since
> you would need to add new fields to backup what we want,


I'd be okay with that.


> and #2, I'm still fundamentally distrustful of sysfs because there isn't
> a bright line between what is an exported interface that will never
> change, and something which is considered an "internal implementation
> detail" that can change whenever some kernel hacker feels like it.  (Or
> when some kernel hacker is careless...)  So as far as I'm concerned sysfs
> is a terrible, TERRIBLE way to export a published interface where we
> promise stability to userspace.


Oh come on, that's going overboard a bit, it's not all _that_ bad! Finding
say "sda" will be possible without breaking too many times. Admittedly, the
kernel's partition scanning order is also not likely to change as it would
certainly break userspace, but code duplication, with the possibility of bugs
slipping in at least userspace-ways (considering the kernel the reference no
matter what it does) is a concern. Somewhat. A little.



> So I'd just as soon do this in userspace; after all, the entire partition
> manager (and there are multiple ones; fdisk, sfdisk, gpart, etc.) is all in
> userspace, and that needs to be in synch with the kernel partition
> reading code anyway.  So one more userspace implementation is in my mind
> much cleaner than trying to push the needed functionality into sysfs, and
> then hoping against hope that it doesn't accidentally change in the
> future.


* rene envisions /lib/libpart.so...

Not to mention my Grand Visions of a totally new Linux native partitioning 
scheme probably modelled after BSD slices (as also mentioned in a previous 
message just now). Or perhaps LVM already fills that role comfortably. It's 
certainly what I hear everyone talking about these days.


Rene.


[GIT PULL] ext[234] bugfixes/cleanup

2007-07-23 Thread Theodore Ts'o
Hi Linus,

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git for_linus

Many thanks!!

- Ted

Eric Sandeen (2):
  Fix fencepost error in ext[234]_check_descriptors
  Remove unused bh in calls to ext[234]_get_group_desc

Mingming Cao (1):
  Some cleanups in ext4/JBD2 to follow the naming rules

 b/fs/ext2/ialloc.c  |   24 
 b/fs/ext2/inode.c   |2 +-
 b/fs/ext2/super.c   |2 +-
 b/fs/ext3/ialloc.c  |   17 +++--
 b/fs/ext3/super.c   |2 +-
 b/fs/ext4/extents.c |2 +-
 b/fs/ext4/ialloc.c  |   17 +++--
 b/fs/ext4/super.c   |2 +-
 b/fs/jbd2/commit.c  |2 +-
 b/fs/jbd2/journal.c |   26 +-
 b/fs/jbd2/recovery.c|2 +-
 b/fs/jbd2/revoke.c  |4 ++--
 b/include/linux/ext4_jbd2.h |6 +++---
 b/include/linux/jbd2.h  |   30 +++---
 b/include/linux/poison.h|3 ++-
 fs/ext4/super.c |2 +-
 16 files changed, 65 insertions(+), 78 deletions(-)


Re: -mm merge plans for 2.6.23

2007-07-23 Thread Nick Piggin

Jesper Juhl wrote:
> On 10/07/07, Con Kolivas <[EMAIL PROTECTED]> wrote:
> > On Tuesday 10 July 2007 18:31, Andrew Morton wrote:
> > > When replying, please rewrite the subject suitably and try to Cc: the
> > > appropriate developer(s).
> >
> > ~swap prefetch
> >
> > Nick's only remaining issue which I could remotely identify was to make it
> > cpuset aware:
> > http://marc.info/?l=linux-mm&m=117875557014098&w=2
> > as discussed with Paul Jackson it was cpuset aware:
> > http://marc.info/?l=linux-mm&m=117895463120843&w=2
> >
> > I fixed all bugs I could find and improved it as much as I could last kernel
> > cycle.
> >
> > Put me and the users out of our misery and merge it now or delete it forever
> > please. And if the meaningless handwaving that I 100% expect as a response
> > begins again, then that's fine. I'll take that as a no and you can dump it.
>
> For what it's worth; put me down as supporting the merger of swap
> prefetch. I've found it useful in the past, Con has maintained it
> nicely and cleaned up everything that people have pointed out - it's
> mature, does no harm - let's just get it merged.  It's too late for
> 2.6.23-rc1 now, but let's try and get this in by -rc2 - it's long
> overdue...



Not talking about swap prefetch itself, but every time I have asked
anyone to instrument or produce some workload where swap prefetch
helps, they never do.

Fair enough if swap prefetch helps them, but I also want to look at
why that is the case and try to improve page reclaim in some of
these situations (for example standard overnight cron jobs shouldn't
need swap prefetch on a 1 or 2GB system, I would hope).

Anyway, back to swap prefetch, I don't know why I've been singled out
as the bad guy here. I'm one of the only people who has had a look at
the damn thing and tried to point out areas where it could be improved
to the point of being included, and outlining things that are needed
for it to be merged (ie. numbers). If anyone thinks that makes me the
bad guy then they have an utterly inverted understanding of what peer
review is for.

Finally, everyone who has ever hacked on these heuristicy parts of the
VM has heaps of patches that help some workload or some silly test
case or (real or perceived) shortfall but have not been merged. It
really isn't anything personal.

If something really works, then it should be possible to get real
numbers in real situations where it helps (OK, swap prefetching won't
be as easy as a straight line performance improvement, but still much
easier than trying to measure something like scheduler interactivity).

Numbers are the best way to add weight to the pro-merge argument, so
for all the people who are whining about merging this and don't want
to actually work on the code -- post some numbers for where it helps
you!!

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Doc/sysfs-rules typos

2007-07-23 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

Fix typos only (spelling, grammar, duplicate words, etc.).

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 Documentation/sysfs-rules.txt |   72 --
 1 file changed, 35 insertions(+), 37 deletions(-)

--- linux-2.6.23-rc1.orig/Documentation/sysfs-rules.txt
+++ linux-2.6.23-rc1/Documentation/sysfs-rules.txt
@@ -1,19 +1,18 @@
 Rules on how to access information in the Linux kernel sysfs
 
-The kernel exported sysfs exports internal kernel implementation-details
+The kernel-exported sysfs exports internal kernel implementation details
 and depends on internal kernel structures and layout. It is agreed upon
 by the kernel developers that the Linux kernel does not provide a stable
 internal API. As sysfs is a direct export of kernel internal
-structures, the sysfs interface can not provide a stable interface eighter,
+structures, the sysfs interface cannot provide a stable interface either;
 it may always change along with internal kernel changes.
 
 To minimize the risk of breaking users of sysfs, which are in most cases
 low-level userspace applications, with a new kernel release, the users
-of sysfs must follow some rules to use an as abstract-as-possible way to
+of sysfs must follow some rules to use an as-abstract-as-possible way to
 access this filesystem. The current udev and HAL programs already
 implement this and users are encouraged to plug, if possible, into the
-abstractions these programs provide instead of accessing sysfs
-directly.
+abstractions these programs provide instead of accessing sysfs directly.
 
 But if you really do want or need to access sysfs directly, please follow
 the following rules and then your programs should work with future
@@ -25,22 +24,22 @@ versions of the sysfs interface.
   implementation details in its own API. Therefore it is not better than
   reading directories and opening the files yourself.
   Also, it is not actively maintained, in the sense of reflecting the
-  current kernel-development. The goal of providing a stable interface
-  to sysfs has failed, it causes more problems, than it solves. It
+  current kernel development. The goal of providing a stable interface
+  to sysfs has failed; it causes more problems than it solves. It
   violates many of the rules in this document.
 
 - sysfs is always at /sys
   Parsing /proc/mounts is a waste of time. Other mount points are a
   system configuration bug you should not try to solve. For test cases,
   possibly support a SYSFS_PATH environment variable to overwrite the
-  applications behavior, but never try to search for sysfs. Never try
+  application's behavior, but never try to search for sysfs. Never try
   to mount it, if you are not an early boot script.
 
 - devices are only "devices"
   There is no such thing like class-, bus-, physical devices,
   interfaces, and such that you can rely on in userspace. Everything is
   just simply a "device". Class-, bus-, physical, ... types are just
-  kernel implementation details, which should not be expected by
+  kernel implementation details which should not be expected by
   applications that look for devices in sysfs.
 
   The properties of a device are:
@@ -48,11 +47,11 @@ versions of the sysfs interface.
   - identical to the DEVPATH value in the event sent from the kernel
 at device creation and removal
   - the unique key to the device at that point in time
-  - the kernels path to the device-directory without the leading
+  - the kernel's path to the device directory without the leading
 /sys, and always starting with with a slash
   - all elements of a devpath must be real directories. Symlinks
 pointing to /sys/devices must always be resolved to their real
-target, and the target path must be used to access the device.
+target and the target path must be used to access the device.
 That way the devpath to the device matches the devpath of the
 kernel used at event time.
   - using or exposing symlink values as elements in a devpath string
@@ -73,17 +72,17 @@ versions of the sysfs interface.
 link
   - it is retrieved by reading the "driver"-link and using only the
 last element of the target path
-  - devices which do not have "driver"-link, just do not have a
-driver; copying the driver value in a child device context, is a
+  - devices which do not have "driver"-link just do not have a
+driver; copying the driver value in a child device context is a
 bug in the application
 
 o attributes
-  - the files in the device directory or files below a subdirectories
+  - the files in the device directory or files below subdirectories
 of the same device directory
   - accessing attributes reached by a symlink pointing to another device,
 like the "device"-link, is a bug in the application
 
-  Everything else is jus

Re: [PATCH 03/10] readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

2007-07-23 Thread Fengguang Wu
On Mon, Jul 23, 2007 at 08:52:41PM -0700, Andrew Morton wrote:
> On Tue, 24 Jul 2007 10:00:12 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> 
> > -   ra->prev_index = page->index;
> > +   ra->prev_pos = page->index << PAGE_CACHE_SHIFT;
> 
> bug!  The rhs will get truncated before it gets assigned to
> the lhs.  Need to cast page->index to loff_t.
> 
> I'll fix this one up.  Please review the other patches for this?

Thank you. Be sure to update *all* the lines:
ra->prev_pos = page->index << PAGE_CACHE_SHIFT

Other places should have been taken care of.
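
The truncation discussed above can be sketched in isolation. The sketch assumes a 32-bit page index (as "unsigned long" is on a 32-bit build), a 64-bit file position, and a page shift of 12; the demo_ names are hypothetical stand-ins for pgoff_t, loff_t and PAGE_CACHE_SHIFT and are not the kernel definitions.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12		/* assumption: 4 KiB pages */

typedef uint32_t demo_pgoff_t;		/* models a 32-bit unsigned long */
typedef int64_t demo_loff_t;		/* models a 64-bit loff_t */

/*
 * Buggy form: the shift is evaluated in 32 bits, so the high bits are
 * lost before the result is widened for the assignment.
 */
demo_loff_t demo_pos_truncated(demo_pgoff_t index)
{
	return index << DEMO_PAGE_SHIFT;
}

/* Fixed form: widen first, then shift in 64 bits. */
demo_loff_t demo_pos_widened(demo_pgoff_t index)
{
	return (demo_loff_t)index << DEMO_PAGE_SHIFT;
}
```

For page index 0x100000 (a 4 GiB offset) the first function yields 0 and the second yields 4294967296; for small indexes the two agree.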



Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily

2007-07-23 Thread Nick Piggin

Satyam Sharma wrote:

From: Satyam Sharma <[EMAIL PROTECTED]>

[6/8] i386: bitops: Don't mark memory as clobbered unnecessarily

The goal is to let gcc generate good, beautiful, optimized code.

But test_and_set_bit, test_and_clear_bit, __test_and_change_bit,
and test_and_change_bit unnecessarily mark all of memory as clobbered,
thereby preventing gcc from doing perfectly valid optimizations.

The case of __test_and_change_bit() is particularly surprising, given
that it's a variant where we don't make any guarantees at all.


__test_and_change_bit is one that you could remove the memory clobber
from.


---

 include/asm-i386/bitops.h |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 0f5634b..f37b8a2 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -254,7 +254,7 @@ static inline int __test_and_change_bit(int nr, unsigned long *addr)
__asm__ __volatile__(
"btcl %2,%1\n\tsbbl %0,%0"
:"=r" (oldbit),"=m" (*addr)
-   :"r" (nr) : "memory");
+   :"r" (nr));
return oldbit;
 }
 



--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 03/10] readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

2007-07-23 Thread Andrew Morton
On Tue, 24 Jul 2007 10:00:12 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:

> @@ -342,11 +342,9 @@ ondemand_readahead(struct address_space 
>  bool hit_readahead_marker, pgoff_t offset,
>  unsigned long req_size)
>  {
> - int max;/* max readahead pages */
> - int sequential;
> -
> - max = ra->ra_pages;
> - sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
> + int max = ra->ra_pages; /* max readahead pages */
> + pgoff_t prev_offset;
> + int sequential;
>  
>   /*
>* It's the expected callback offset, assume sequential access.
> @@ -360,6 +358,9 @@ ondemand_readahead(struct address_space 
>   goto readit;
>   }
>  
> + prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;
> + sequential = offset - prev_offset <= 1UL || req_size > max;

It's a bit pointless using an opaque type for prev_offset here, and then
encoding the knowledge that it is implemented as "unsigned long".

It's a minor thing, but perhaps just "<= 1" would make more sense here.
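To illustrate the point, here is a minimal sketch of the check written with a plain "1". It assumes pgoff_t is an unsigned long, as the comment above implies; the helper name is_sequential is made up for this example and does not appear in the patch.

```c
#include <stdbool.h>

/* Stand-in for the kernel's pgoff_t, assumed here to be unsigned long. */
typedef unsigned long pgoff_t;

/*
 * True when offset is the same page as prev_offset or the page right
 * after it.  The literal 1 is converted to pgoff_t by the usual
 * arithmetic conversions, so no "UL" suffix is needed.  If offset is
 * smaller than prev_offset, the unsigned subtraction wraps around to a
 * huge value and the test fails, so backward seeks are rejected too.
 */
bool is_sequential(pgoff_t offset, pgoff_t prev_offset)
{
	return offset - prev_offset <= 1;
}
```

The comparison behaves the same with or without the suffix; dropping it just avoids restating the underlying type at the call site.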


Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions

2007-07-23 Thread Nick Piggin

Satyam Sharma wrote:

From: Satyam Sharma <[EMAIL PROTECTED]>

[8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions


From Documentation/atomic_ops.txt, those archs that require explicit

memory barriers around clear_bit() must also implement these two interfaces.
However, for i386, clear_bit() is a strict, locked, atomic and
un-reorderable operation and includes an implicit memory barrier already.

But these two functions have been wrongly defined as "barrier()" which is
a pointless _compiler optimization_ barrier, and only serves to make gcc
not do legitimate optimizations that it could have otherwise done.

So let's make these proper no-ops, because that's exactly what we require
these to be on the i386 platform.


No. clear_bit is not a compiler barrier on i386, thus smp_mb__before/after
must be.
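For readers following along: barrier() constrains only the compiler, not the CPU. A minimal sketch, assuming GCC-style inline asm; the names data, flag and publish are invented for this example and are not part of the patch.

```c
/* Compiler-only barrier: emits no instruction, but the "memory" clobber
 * forbids the compiler from moving memory accesses across it or keeping
 * memory values cached in registers over it. */
#define barrier() __asm__ __volatile__("" ::: "memory")

int data;
int flag;

/* Store data, then set flag; the barrier keeps the compiler from
 * reordering the two stores.  It says nothing about CPU ordering. */
int publish(int value)
{
	data = value;
	barrier();
	flag = 1;
	return data;
}
```

Replacing barrier() with an empty statement would leave the compiler free to reorder the two stores, which is the situation the objection above is about.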




Signed-off-by: Satyam Sharma <[EMAIL PROTECTED]>
Cc: David Howells <[EMAIL PROTECTED]>
Cc: Nick Piggin <[EMAIL PROTECTED]>

---

[ A similar optimization needs to be done in the atomic.h also.
  Will submit that patch shortly. ]

 include/asm-i386/bitops.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 4f1fda5..42999eb 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -106,8 +106,8 @@ static inline void __clear_bit(int nr, unsigned long *addr)
  * Bit operations are already serializing on x86.
  * These must still be defined here for API completeness.
  */
-#define smp_mb__before_clear_bit() barrier()
-#define smp_mb__after_clear_bit()  barrier()
+#define smp_mb__before_clear_bit() do {} while (0)
+#define smp_mb__after_clear_bit()  do {} while (0)
 
 /**

  * __change_bit - Toggle a bit in memory




--
SUSE Labs, Novell Inc.


Re: [PATCH 03/10] readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

2007-07-23 Thread Andrew Morton
On Tue, 24 Jul 2007 10:00:12 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:

> - ra->prev_index = page->index;
> + ra->prev_pos = page->index << PAGE_CACHE_SHIFT;

bug!  The rhs will get truncated before it gets assigned to
the lhs.  Need to cast page->index to loff_t.

I'll fix this one up.  Please review the other patches for this?
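A minimal sketch of the truncation, assuming a 32-bit index type and 4 KiB pages. The names DEMO_PAGE_SHIFT, pos_truncated and pos_widened are hypothetical; uint32_t stands in for a 32-bit unsigned long and int64_t for loff_t.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* The shift is evaluated in 32 bits, so bits shifted past bit 31 are
 * lost before the result is widened to the 64-bit return type. */
int64_t pos_truncated(uint32_t index)
{
	return index << DEMO_PAGE_SHIFT;
}

/* Casting first makes the shift happen in 64 bits. */
int64_t pos_widened(uint32_t index)
{
	return (int64_t)index << DEMO_PAGE_SHIFT;
}
```

With index = 0x100000 (the page at the 4 GiB mark) the first variant yields 0 while the second yields 4294967296.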


I decided to merge this ahead of that great pile of Nick's patches
(pagecache write deadlocks) and got a number of easy-to-fix rejects as a
result.  Hopefully it all landed OK.




Re: [RFH] Partion table recovery

2007-07-23 Thread Rene Herman

On 07/23/2007 10:08 PM, Bill Davidsen wrote:


Al Boldi wrote:


As always, a good friend of mine managed to scratch my partition table 
by cat'ing /dev/full into /dev/sda.  I was able to push him out of the 
way, but at least the first 100MB are gone.  I can probably live 
without the first partition, but there are many partitions after that, 
which I hope should easily be recoverable.


I tried parted, but it's not working out for me.  Does anybody know of 
a simple partition recovery tool, that would just scan the disk for 
lost partitions?
  
You have gotten a bunch of thoughts on this, I will just say that plain 
old "fdisk -l" saved somewhere safe is probably all you need, in human 
readable format. Doesn't do you any good now, but all the complicated 
schemes discussed don't thrill me, I want to be able to see this, and 
recovery by partition table manual rebuild is so rare I would rather do 
it by hand than trust some software I rarely use.


ACK. Or NNAK (Non-NAK) at least...

Rene.



Re: 2.6.22-git17 boot failure

2007-07-23 Thread Jeremy Fitzhardinge
Tilman Schmidt wrote:
> So yes, all of ahci, pata_marvell, aic7xxx, jbd, dm_mod, ext3 are in
> fact modules in initrd. Would it help to try a kernel with some or all
> of these built in?
>   

Yes, that would be useful.  It would help tell whether it's a module
loading problem or a basic pci probing problem.

J


Re: [RFH] Partition table recovery

2007-07-23 Thread Rene Herman

On 07/23/2007 03:48 PM, Theodore Tso wrote:


On Mon, Jul 23, 2007 at 09:34:25AM +0200, Rene Herman wrote:


That's not quite correct. Logicals have a start field relative to the 
encompassing extended (ie, for me always 1, for others often always 63) and 
the encompassing extended are relative not to the previous extended but to 
the level 0 extended (the one in the MBR). 


This assumes that the extended partition is at the beginning of the
disk, yes?


Err, well, no, that's not what I meant. The "start" field for the extended 
partition that sits in the primary partitition table (the one in the MBR) is 
absolute, or "relative to the start of the disk", but the "start" field for 
the empty extended partitions that together form the logical partition list 
are relative not to the previous one in the list, but all to this outermost 
extended partition.



Why would anyone do that?  I normally have /dev/hda1 at the beginning of
the disk, and I normally make /dev/hda4 my extended, and place it *after*
partitions at /dev/hda2, /dev/hda3, etc.


... but having said that, I do actually have an extended partition as my 
/dev/hda1 at the beginning of the disk. This is the current layout on my 
main system:


   Device Boot      Start        End   #sectors  Id  System
/dev/sda1               1  231733119  231733119  85  Linux extended
/dev/sda2   *   231733120  240121727    8388608   c  W95 FAT32 (LBA)
/dev/sda3               0          -          0   0  Empty
/dev/sda4               0          -          0   0  Empty
/dev/sda5               2    2097153    2097152  82  Linux swap
/dev/sda6         2097155   18874370   16777216  83  Linux
/dev/sda7        18874372   35651587   16777216  83  Linux
/dev/sda8        35651589  231733119  196081531  83  Linux

As you can see, everything neatly non-cylinder-aligned, with not a single 
sector wasted ;-) Table sectors at 0 (MBR), 1 (outer extended), 2097154, 
18874371 and 35651588 (list-extendeds).
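To make the two kinds of relative "start" fields concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical, and the relative link values used in the note below (2097153, 18874370, 35651587) are back-calculated from the table sectors listed above rather than read off the disk.

```c
#include <stdint.h>

/* Absolute sector of a list-extended's table.  Its "start" field is
 * relative to the outermost extended partition, not to the previous
 * table in the list. */
uint32_t ebr_sector(uint32_t outer_start, uint32_t link_rel)
{
	return outer_start + link_rel;
}

/* Absolute first sector of a logical partition.  Its "start" field is
 * relative to the table sector that describes it. */
uint32_t logical_start(uint32_t table_sector, uint32_t start_rel)
{
	return table_sector + start_rel;
}
```

With the outer extended at sector 1 and every logical starting 1 sector after its table, ebr_sector(1, 2097153) gives 2097154 and logical_start(2097154, 1) gives 2097155, which matches the start of /dev/sda6 in the listing.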


/dev/sda2 used to be a FreeBSD install (partition type 0xa5), /dev/sda3 a 
MINIX install (type 0x81) and /dev/sda4 the still present FAT32 Windows 98 
partition at the very end of the disk. I removed FreeBSD and MINIX due to 
space shortage...


The reason that I use the first entry for an extended is that I view the 
type "Linux Extended" simply as "Linux": That is, I see 0x85 simply as the 
one and only Linux type with all my Linux data partitions on the logicals 
inside -- very much like 0xa5 is the one FreeBSD type with all its data 
partitions on the slices inside, and 0x81 the one MINIX partition with its 
data partitions on the subpartitions inside.


That is, I've been using a "Linux native partitioning scheme" where the 
Linux native layout just happens to coincide with a DOS/Windows native layout.


My Linux partition is at the start of the disk since it's the system I use. 
The others are/were there just to boot perhaps a few times a year to check 
some things -- and the start of the disk is the fastest bit, so I certainly 
want my main system to use that.


Anyone find my "Native Linux Partitioning Scheme" interesting? Designing and 
using a better way than regular logicals to carve up the space inside (such 
as something designed after BSD slices) would work for me as well ;-)



It would be interesting to see how badly modern Windows systems break
on this.  If Windows 2000 and above works, and Linux works, then if
other things break it might be quite sufficient to consider them
"broken software" that we don't need to worry about.


Googling for it, the 2TB limit of DOS partitioning is widely known and there 
would be no point worrying even about the possible single-bit overflow of 
the list of extendeds...


With 32-bit values (and 512-byte sectors) you can service 2TB -- anything 
above that requires something better than MS-DOS partition tables. 
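The 2TB figure follows directly from the field width. A minimal sketch of the arithmetic; the function name max_bytes is made up for this example.

```c
#include <stdint.h>

/* Capacity addressable with a 32-bit sector number: 2^32 sectors times
 * the sector size in bytes. */
uint64_t max_bytes(uint32_t sector_size)
{
	return ((uint64_t)1 << 32) * sector_size;
}
```

For 512-byte sectors this gives 2^41 bytes, that is 2 TiB; the same 32-bit fields with 4096-byte sectors would reach 2^44 bytes, or 16 TiB.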


Well, in about 2-3 years or so we'll be seeing singleton disks
bigger than 2TB.  I'm not terribly sanguine about BIOS vendors and OS
providers migrating to something better by then, alas.  Life is sure
going to be interesting.  :-)


And sectors probably larger than 512 bytes. I hope they'll not do _that_ 
until plain old partitions are truly abandoned since before you know it 
someone is going to view it as an excuse to keep using this fragile mess ;-)


Rene.



Re: [Cbe-oss-dev] [PATCH][36/37] Clean up duplicate includes in sound/ppc/

2007-07-23 Thread Masakazu Mokuno

On Sat, 21 Jul 2007 17:04:07 +0200
Jesper Juhl <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> This patch cleans up duplicate includes in
>   sound/ppc/
> 
> 
> Signed-off-by: Jesper Juhl <[EMAIL PROTECTED]>
> ---
> 
> diff --git a/sound/ppc/snd_ps3.c b/sound/ppc/snd_ps3.c
> index 1aa0b46..27b6189 100644
> --- a/sound/ppc/snd_ps3.c
> +++ b/sound/ppc/snd_ps3.c
> @@ -33,7 +33,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 

Acked-by: Masakazu Mokuno <[EMAIL PROTECTED]>

--
Masakazu MOKUNO



Re: [PATCH 1/7] lguest: documentation pt I: Preparation

2007-07-23 Thread Rusty Russell
On Mon, 2007-07-23 at 20:06 -0700, Randy Dunlap wrote:
> On Mon, 23 Jul 2007 19:21:13 -0700 Randy Dunlap wrote:
> > It's great that Rusty took the time to produce all of this documentation.
> > Few people do that today.

Thanks Randy, it was something of an experiment.  We'll see if it has
the desired effect (ie. encouraging new hackers).

> Neat as that is, I'm concerned that it will be difficult to maintain
> (the order numbers at least -- or are they just difficult to set up
> the first time?).

Setup was a pain, but maintenance hasn't been too bad.  Most changes
don't deeply alter the code structure.  After major surgery I diff the
documentation output to check I haven't broken anything major (eg.
removed a title or a section terminator).

> Advantage:  it does keep the source code + doc text together.
> Martin (former kernel-doc maintainer) was going to come up with
> some way to do this, but he abandoned it.

Yeah, code documentation like this belongs in comments IMHO, and even
there it can rot.

Thanks,
Rusty.



[PATCH] tiny signalfd cleanup

2007-07-23 Thread Ulrich Drepper
This is probably a leftover from a time when the return wasn't there yet.
Now the extra assignment is just irritating.

Signed-off-by: Ulrich Drepper <[EMAIL PROTECTED]>

--- fs/signalfd.c   2007-06-29 10:24:04.0 -0700
+++ fs/signalfd.c-new   2007-07-23 20:17:34.0 -0700
@@ -320,7 +320,7 @@
 
if (sizemask != sizeof(sigset_t) ||
copy_from_user(&sigmask, user_mask, sizeof(sigmask)))
-   return error = -EINVAL;
+   return -EINVAL;
sigdelsetmask(&sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
signotset(&sigmask);
 


Re: Incompatibilities of kmem_cache_create

2007-07-23 Thread Christoph Lameter
On Sat, 21 Jul 2007, Paul Mundt wrote:

> On Sat, Jul 21, 2007 at 02:50:01AM -0300, werner wrote:
> > Of the kernel 2.6.22-git15 of this night,  kmem_cache_create is not
> > compatible and causes compiling errors of some fundamental programs.
> > Before, this error didn't occur.
> > 
> Slab destructors haven't been supported in the kernel for ages, anything
> that's relying on them to work out-of-tree is fundamentally broken.

Slab destructors were supported and used in Linux kernels up to 
version 2.6.20. They were removed late in the 2.6.21 merge cycle.




Re: [PATCH 1/7] lguest: documentation pt I: Preparation

2007-07-23 Thread Randy Dunlap
On Mon, 23 Jul 2007 19:21:13 -0700 Randy Dunlap wrote:

> On Mon, 23 Jul 2007 17:12:38 -0700 Andrew Morton wrote:
> 
> > On Sat, 21 Jul 2007 11:17:58 +1000
> > Rusty Russell <[EMAIL PROTECTED]> wrote:
> > 
> > > The netfilter code had very good documentation: the Netfilter Hacking
> > > > HOWTO.  No one ever read it.
> > > 
> > > So this time I'm trying something different, using a bit of
> > > Knuthiness.  Start with drivers/lguest/README.
> > 
> > um.
> > 
> > I'm OK with merging patches and given lguest's newness, the timestamp on
> > these patches, the fact that they don't change code generation (right?) and
> > my reluctance to carry large do-nothing patches for two months, I'd be OK
> > with squeaking them into 2.6.23.
> > 
> > But I worry that you're proposing adding what appears to be new
> > Documentation-related machinery and infrastructure when there's already
> > increased activity in that area from other people and we might all be
> > headed in different directions and stuff.
> > 
> > So first I think we'd best form a kernel kommittee and mull this for a
> > while (preferably months) to screw you around as much as poss, OK?  ;)
> > 
> > Items for consideration would be:
> > 
> > - if this stuff is good, shouldn't other code be using it?  If so, is
> >   this new infrastructure in the correct place?
> 
> I wouldn't mind having a new doc infrastructure, but I don't see this as it.
> 
> > - if, otoh, this infrastructure is _not_ suitable for other code, well,
> >   what was wrong with it?
> 
> I think that we don't want to give up html/pdf/ps output formats in
> favor of just text or C source code.  If we do continue to have
> multiple "rich" output formats, we need even more rich syntax rules
> than we have right now.  OTOH, if we dump all of those rich output
> formats, we have less tool spice that is needed.
> 
> (I'm not ignoring Andrew's question here.  I'm just applying the
> 7 patches/series and looking at it more.)
> 
> > - if the requirement is good, perhaps alternative implementations should
> >   be explored (dunno what).
> 
> Yes, but I dunno what either.
> 
> 
> > IOW, I'd be interested in hearing Rob and Randy's opinions on it all,
> > please.
> 
> It's great that Rusty took the time to produce all of this documentation.
> Few people do that today.
> 
> Were current kernel-doc tools not sufficient?  If not, why not?

A:  Nope, kernel-doc won't weave the code + docs together based on a
prefix and order number (e.g., H:310).

Neat as that is, I'm concerned that it will be difficult to maintain
(the order numbers at least -- or are they just difficult to set up
the first time?).

Advantage:  it does keep the source code + doc text together.
Martin (former kernel-doc maintainer) was going to come up with
some way to do this, but he abandoned it.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


[PATCH] [2.6.22] Fix a potential NULL pointer dereference in mace_interrupt() in drivers/net/pcmcia/nmclan_cs.c

2007-07-23 Thread Micah Gruber
This patch fixes a potential null dereference bug where we dereference 
DEV before a null check. This patch simply moves the dereferencing after 
the null check.
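The pattern, reduced to a minimal sketch. struct demo_dev and demo_interrupt are hypothetical stand-ins for the driver's types, and the -1 return merely plays the role of the early "return IRQ_NONE" path.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct demo_dev {
	unsigned long base_addr;
};

/* Returns -1 for a NULL device, otherwise the device's base address. */
long demo_interrupt(struct demo_dev *dev)
{
	unsigned long ioaddr;	/* declared up front, but not loaded yet */

	if (dev == NULL)
		return -1;

	ioaddr = dev->base_addr;	/* safe: dev was checked above */
	return (long)ioaddr;
}
```

Initializing ioaddr from dev->base_addr in its declaration would read through the pointer before the check had a chance to run.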


Signed-off-by: Micah Gruber <[EMAIL PROTECTED]>

---

--- a/drivers/net/pcmcia/nmclan_cs.c
+++ b/drivers/net/pcmcia/nmclan_cs.c
@@ -996,7 +996,7 @@

{
  struct net_device *dev = (struct net_device *) dev_id;
  mace_private *lp = netdev_priv(dev);
-  kio_addr_t ioaddr = dev->base_addr;
+  kio_addr_t ioaddr;
  int status;
  int IntrCnt = MACE_MAX_IR_ITERATIONS;

@@ -1006,6 +1006,8 @@
return IRQ_NONE;
  }

+  ioaddr = dev->base_addr;
+
  if (lp->tx_irq_disabled) {
printk(
  (lp->tx_irq_disabled?



Re: [PATCH 1/7] lguest: documentation pt I: Preparation

2007-07-23 Thread Rene Herman

On 07/24/2007 03:18 AM, Linus Torvalds wrote:


PS. Nothing rhymes with Ballalaba.


There once was a woman from Ballalaba
who hid kernel bugs in her djellabah.
 When her husband found out,
  he objected loud,
and got her expelled from the casbah!

Rene.



[PATCH 10/10] filemap: convert some unsigned long to pgoff_t

2007-07-23 Thread Fengguang Wu
Convert some 'unsigned long' to pgoff_t.

Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 include/linux/pagemap.h |   23 ---
 mm/filemap.c|   32 
 2 files changed, 28 insertions(+), 27 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/include/linux/pagemap.h
+++ linux-2.6.22-rc6-mm1/include/linux/pagemap.h
@@ -83,11 +83,11 @@ static inline struct page *page_cache_al
 typedef int filler_t(void *, struct page *);
 
 extern struct page * find_get_page(struct address_space *mapping,
-   unsigned long index);
+   pgoff_t index);
 extern struct page * find_lock_page(struct address_space *mapping,
-   unsigned long index);
+   pgoff_t index);
 extern struct page * find_or_create_page(struct address_space *mapping,
-   unsigned long index, gfp_t gfp_mask);
+   pgoff_t index, gfp_t gfp_mask);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
@@ -100,41 +100,42 @@ struct page *__grab_cache_page(struct ad
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
-static inline struct page *grab_cache_page(struct address_space *mapping, unsigned long index)
+static inline struct page *grab_cache_page(struct address_space *mapping,
+   pgoff_t index)
 {
return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
 }
 
 extern struct page * grab_cache_page_nowait(struct address_space *mapping,
-   unsigned long index);
+   pgoff_t index);
 extern struct page * read_cache_page_async(struct address_space *mapping,
-   unsigned long index, filler_t *filler,
+   pgoff_t index, filler_t *filler,
void *data);
 extern struct page * read_cache_page(struct address_space *mapping,
-   unsigned long index, filler_t *filler,
+   pgoff_t index, filler_t *filler,
void *data);
 extern int read_cache_pages(struct address_space *mapping,
struct list_head *pages, filler_t *filler, void *data);
 
 static inline struct page *read_mapping_page_async(
struct address_space *mapping,
-unsigned long index, void *data)
+pgoff_t index, void *data)
 {
filler_t *filler = (filler_t *)mapping->a_ops->readpage;
return read_cache_page_async(mapping, index, filler, data);
 }
 
 static inline struct page *read_mapping_page(struct address_space *mapping,
-unsigned long index, void *data)
+pgoff_t index, void *data)
 {
filler_t *filler = (filler_t *)mapping->a_ops->readpage;
return read_cache_page(mapping, index, filler, data);
 }
 
 int add_to_page_cache(struct page *page, struct address_space *mapping,
-   unsigned long index, gfp_t gfp_mask);
+   pgoff_t index, gfp_t gfp_mask);
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-   unsigned long index, gfp_t gfp_mask);
+   pgoff_t index, gfp_t gfp_mask);
 extern void remove_from_page_cache(struct page *page);
 extern void __remove_from_page_cache(struct page *page);
 
--- linux-2.6.22-rc6-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc6-mm1/mm/filemap.c
@@ -594,7 +594,7 @@ void fastcall __lock_page_nosync(struct 
  * Is there a pagecache struct page at the given (mapping, offset) tuple?
  * If yes, increment its refcount and return it; if no, return NULL.
  */
-struct page * find_get_page(struct address_space *mapping, unsigned long offset)
+struct page * find_get_page(struct address_space *mapping, pgoff_t offset)
 {
struct page *page;
 
@@ -618,7 +618,7 @@ EXPORT_SYMBOL(find_get_page);
  * Returns zero if the page was not present. find_lock_page() may sleep.
  */
 struct page *find_lock_page(struct address_space *mapping,
-   unsigned long offset)
+   pgoff_t offset)
 {
struct page *page;
 
@@ -664,7 +664,7 @@ EXPORT_SYMBOL(find_lock_page);
  * memory exhaustion.
  */
 struct page *find_or_create_page(struct address_space *mapping,
-   unsigned long index, gfp_t gfp_mask)
+   pgoff_t index, gfp_t gfp_mask)
 {
struct page *page;
int err;
@@ -794,7 +794,7 @@ EXPOR

Re: [PATCH 1/7] lguest: documentation pt I: Preparation

2007-07-23 Thread Randy Dunlap
On Mon, 23 Jul 2007 17:12:38 -0700 Andrew Morton wrote:

> On Sat, 21 Jul 2007 11:17:58 +1000
> Rusty Russell <[EMAIL PROTECTED]> wrote:
> 
> > The netfilter code had very good documentation: the Netfilter Hacking
> > > HOWTO.  No one ever read it.
> > 
> > So this time I'm trying something different, using a bit of
> > Knuthiness.  Start with drivers/lguest/README.
> 
> um.
> 
> I'm OK with merging patches and given lguest's newness, the timestamp on
> these patches, the fact that they don't change code generation (right?) and
> my reluctance to carry large do-nothing patches for two months, I'd be OK
> with squeaking them into 2.6.23.
> 
> But I worry that you're proposing adding what appears to be new
> Documentation-related machinery and infrastructure when there's already
> increased activity in that area from other people and we might all be
> headed in different directions and stuff.
> 
> So first I think we'd best form a kernel kommittee and mull this for a
> while (preferably months) to screw you around as much as poss, OK?  ;)
> 
> Items for consideration would be:
> 
> - if this stuff is good, shouldn't other code be using it?  If so, is
>   this new infrastructure in the correct place?

I wouldn't mind having a new doc infrastructure, but I don't see this as it.

> - if, otoh, this infrastructure is _not_ suitable for other code, well,
>   what was wrong with it?

I think that we don't want to give up html/pdf/ps output formats in
favor of just text or C source code.  If we do continue to have
multiple "rich" output formats, we need even more rich syntax rules
than we have right now.  OTOH, if we dump all of those rich output
formats, we have less tool spice that is needed.

(I'm not ignoring Andrew's question here.  I'm just applying the
7 patches/series and looking at it more.)

> - if the requirement is good, perhaps alternative implementations should
>   be explored (dunno what).

Yes, but I dunno what either.


> IOW, I'd be interested in hearing Rob and Randy's opinions on it all,
> please.

It's great that Rusty took the time to produce all of this documentation.
Few people do that today.

Were current kernel-doc tools not sufficient?  If not, why not?

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


[PATCH 09/10] filemap: trivial code cleanups

2007-07-23 Thread Fengguang Wu
- remove unused local next_index in do_generic_mapping_read()
- wrap a long line
- remove a redundant page_cache_read() declaration

Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 mm/filemap.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc6-mm1/mm/filemap.c
@@ -873,13 +873,11 @@ void do_generic_mapping_read(struct addr
unsigned long index;
unsigned long offset;
unsigned long last_index;
-   unsigned long next_index;
unsigned long prev_index;
unsigned int prev_offset;
int error;
 
index = *ppos >> PAGE_CACHE_SHIFT;
-   next_index = index;
prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT;
prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1);
last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> 
PAGE_CACHE_SHIFT;
@@ -1214,7 +1212,8 @@ out:
 }
 EXPORT_SYMBOL(generic_file_aio_read);
 
-int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
+int file_send_actor(read_descriptor_t * desc, struct page *page,
+   unsigned long offset, unsigned long size)
 {
ssize_t written;
unsigned long count = desc->count;
@@ -1287,7 +1286,6 @@ asmlinkage ssize_t sys_readahead(int fd,
 }
 
 #ifdef CONFIG_MMU
-static int FASTCALL(page_cache_read(struct file * file, unsigned long offset));
 /**
  * page_cache_read - adds requested page to the page cache if not already there
  * @file:  file to read

--


[PATCH 03/10] readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

2007-07-23 Thread Fengguang Wu
Combine the file_ra_state members
unsigned long prev_index
unsigned int prev_offset
into
loff_t prev_pos

It is more consistent and better supports huge files.
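A minimal sketch of how the two old fields map onto the combined value, assuming 4 KiB pages. The DEMO_ macros, demo_loff_t and the three helper names are invented for this example; they are not kernel interfaces.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12			/* assumed 4 KiB pages */
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

typedef int64_t demo_loff_t;			/* stand-in for loff_t */

/* Combine a page index and an in-page byte offset into one position.
 * The cast comes before the shift so the shift is done in 64 bits. */
demo_loff_t pack_pos(unsigned long index, unsigned int offset)
{
	return ((demo_loff_t)index << DEMO_PAGE_SHIFT) | offset;
}

/* Recover the page index (the old prev_index). */
unsigned long pos_index(demo_loff_t pos)
{
	return pos >> DEMO_PAGE_SHIFT;
}

/* Recover the in-page offset (the old prev_offset). */
unsigned int pos_offset(demo_loff_t pos)
{
	return pos & (DEMO_PAGE_SIZE - 1);
}
```

For example, page index 3 with offset 100 packs to byte position 12388 and unpacks back to the same pair.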

Thanks to Peter for the nice proposal!

Cc: Peter Zijlstra <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 fs/ext3/dir.c  |2 +-
 fs/ext4/dir.c  |2 +-
 fs/splice.c|2 +-
 include/linux/fs.h |3 +--
 mm/filemap.c   |   11 ++-
 mm/readahead.c |   15 ---
 6 files changed, 18 insertions(+), 17 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/include/linux/fs.h
+++ linux-2.6.22-rc6-mm1/include/linux/fs.h
@@ -778,8 +778,7 @@ struct file_ra_state {
 
unsigned int ra_pages;  /* Maximum readahead window */
int mmap_miss;  /* Cache miss stat for mmap accesses */
-   unsigned long prev_index;   /* Cache last read() position */
-   unsigned int prev_offset;   /* Offset where last read() ended in a page */
+   loff_t prev_pos;/* Cache last read() position */
 };
 
 /*
--- linux-2.6.22-rc6-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc6-mm1/mm/filemap.c
@@ -881,8 +881,8 @@ void do_generic_mapping_read(struct addr
 
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
-   prev_index = ra.prev_index;
-   prev_offset = ra.prev_offset;
+   prev_index = ra.prev_pos >> PAGE_CACHE_SHIFT;
+   prev_offset = ra.prev_pos & (PAGE_CACHE_SIZE-1);
last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> 
PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;
 
@@ -968,7 +968,6 @@ page_ok:
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
prev_offset = offset;
-   ra.prev_offset = offset;
 
page_cache_release(page);
if (ret == nr && desc->count)
@@ -1055,7 +1054,9 @@ no_cached_page:
 
 out:
*_ra = ra;
-   _ra->prev_index = prev_index;
+   _ra->prev_pos = prev_index;
+   _ra->prev_pos <<= PAGE_CACHE_SHIFT;
+   _ra->prev_pos |= prev_offset;
 
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
if (filp)
@@ -1435,7 +1436,7 @@ retry_find:
 * Found the page and have a reference on it.
 */
mark_page_accessed(page);
-   ra->prev_index = page->index;
+   ra->prev_pos = page->index << PAGE_CACHE_SHIFT;
return page;
 
 outside_data_content:
--- linux-2.6.22-rc6-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc6-mm1/mm/readahead.c
@@ -45,7 +45,7 @@ void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
 {
ra->ra_pages = mapping->backing_dev_info->ra_pages;
-   ra->prev_index = -1;
+   ra->prev_pos = -1;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
 
@@ -318,7 +318,7 @@ static unsigned long get_next_ra_size(st
  * indicator. The flag won't be set on already cached pages, to avoid the
  * readahead-for-nothing fuss, saving pointless page cache lookups.
  *
- * prev_index tracks the last visited page in the _previous_ read request.
+ * prev_pos tracks the last visited byte in the _previous_ read request.
  * It should be maintained by the caller, and will be used for detecting
  * small random reads. Note that the readahead algorithm checks loosely
  * for sequential patterns. Hence interleaved reads might be served as
@@ -342,11 +342,9 @@ ondemand_readahead(struct address_space 
   bool hit_readahead_marker, pgoff_t offset,
   unsigned long req_size)
 {
-   int max;/* max readahead pages */
-   int sequential;
-
-   max = ra->ra_pages;
-   sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
+   int max = ra->ra_pages; /* max readahead pages */
+   pgoff_t prev_offset;
+   int sequential;
 
/*
 * It's the expected callback offset, assume sequential access.
@@ -360,6 +358,9 @@ ondemand_readahead(struct address_space 
goto readit;
}
 
+   prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;
+   sequential = offset - prev_offset <= 1UL || req_size > max;
+
/*
 * Standalone, small read.
 * Read as is, and do not pollute the readahead state.
--- linux-2.6.22-rc6-mm1.orig/fs/ext3/dir.c
+++ linux-2.6.22-rc6-mm1/fs/ext3/dir.c
@@ -143,7 +143,7 @@ static int ext3_readdir(struct file * fi
sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra, filp,
index, 1);
-   filp->f_ra.prev_index = index;
+   filp->f_ra.prev_pos = index << PAGE_CACHE_SHIFT;
bh = ext3_bread(NULL, inode, blk, 0, &err);
}
 
--- linux-2.6.22-rc6-mm1.orig/fs/ext4/dir.c
+++ linux-2

[PATCH 01/10] readahead: compacting file_ra_state

2007-07-23 Thread Fengguang Wu
Use 'unsigned int' instead of 'unsigned long' for readahead sizes.

This helps reduce memory consumption on 64bit CPU when
a lot of files are opened.
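A rough way to see the saving, as a sketch only: the two structs below are cut-down, hypothetical versions holding just the fields this patch touches, not the real file_ra_state layout.

```c
#include <stddef.h>

/* Before: every field is an unsigned long (8 bytes on LP64). */
struct ra_wide {
	unsigned long start;
	unsigned long size;
	unsigned long async_size;
	unsigned long ra_pages;
};

/* After: the three size fields shrink to unsigned int. */
struct ra_compact {
	unsigned long start;
	unsigned int size;
	unsigned int async_size;
	unsigned int ra_pages;
};

/* Bytes saved per instance; 0 where int and long have the same width. */
size_t bytes_saved(void)
{
	return sizeof(struct ra_wide) - sizeof(struct ra_compact);
}
```

On a typical LP64 target the compact struct is 24 bytes against 32; on 32-bit targets the two are the same size, so the change costs nothing there.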

CC: Andi Kleen <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 include/linux/fs.h |8 
 mm/filemap.c   |2 +-
 mm/readahead.c |2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/include/linux/fs.h
+++ linux-2.6.22-rc6-mm1/include/linux/fs.h
@@ -771,12 +771,12 @@ struct fown_struct {
  * Track a single file's readahead state
  */
 struct file_ra_state {
-   pgoff_t start;  /* where readahead started */
-   unsigned long size; /* # of readahead pages */
-   unsigned long async_size;   /* do asynchronous readahead when
+   pgoff_t start;  /* where readahead started */
+   unsigned int size;  /* # of readahead pages */
+   unsigned int async_size;/* do asynchronous readahead when
   there are only # of pages ahead */
 
-   unsigned long ra_pages; /* Maximum readahead window */
+   unsigned int ra_pages;  /* Maximum readahead window */
unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
unsigned long mmap_miss;/* Cache miss stat for mmap accesses */
unsigned long prev_index;   /* Cache last read() position */
--- linux-2.6.22-rc6-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc6-mm1/mm/filemap.c
@@ -840,7 +840,7 @@ static void shrink_readahead_size_eio(st
if (count > 5)
return;
count++;
-   printk(KERN_WARNING "Reducing readahead size to %luK\n",
+   printk(KERN_WARNING "Reducing readahead size to %dK\n",
ra->ra_pages << (PAGE_CACHE_SHIFT - 10));
 }
 
--- linux-2.6.22-rc6-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc6-mm1/mm/readahead.c
@@ -342,7 +342,7 @@ ondemand_readahead(struct address_space 
   bool hit_readahead_marker, pgoff_t offset,
   unsigned long req_size)
 {
-   unsigned long max;  /* max readahead pages */
+   int max;/* max readahead pages */
int sequential;
 
max = ra->ra_pages;



[PATCH 08/10] readahead: remove the limit max_sectors_kb imposed on max_readahead_kb

2007-07-23 Thread Fengguang Wu
Remove the size limit max_sectors_kb imposed on max_readahead_kb.

The size restriction is unreasonable, especially since max_sectors_kb cannot
grow larger than max_hw_sectors_kb, which can be rather small for some disk
drives.
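
For reference, the trimming being removed can be modelled in userspace as
below. This is only a sketch: the function name is made up, and
PAGE_SHIFT_ASSUMED hard-codes 4 KiB pages, which is an assumption rather than
something the patch states.

```c
#include <assert.h>

#define PAGE_SHIFT_ASSUMED 12	/* assumption: 4 KiB pages */

/*
 * Model of the removed code: clamp the readahead window (in pages) so
 * that it never exceeds max_sectors_kb.
 */
static unsigned int old_trimmed_ra_pages(unsigned int ra_pages,
					 unsigned int max_sectors_kb)
{
	unsigned int ra_kb;

	/* The shifts below assume pages of at least 1 KiB. */
	assert(PAGE_SHIFT_ASSUMED >= 10);

	ra_kb = ra_pages << (PAGE_SHIFT_ASSUMED - 10);
	if (ra_kb > max_sectors_kb)
		ra_pages = max_sectors_kb >> (PAGE_SHIFT_ASSUMED - 10);
	return ra_pages;
}
```

With these assumptions a 32-page (128 KiB) window was cut to 16 pages as soon
as max_sectors_kb was set to 64. After the patch it stays at 32.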

Cc: Jens Axboe <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
Acked-by: Jens Axboe <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c |9 -
 1 file changed, 9 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/block/ll_rw_blk.c
+++ linux-2.6.22-rc6-mm1/block/ll_rw_blk.c
@@ -3945,7 +3945,6 @@ queue_max_sectors_store(struct request_q
max_hw_sectors_kb = q->max_hw_sectors >> 1,
page_kb = 1 << (PAGE_CACHE_SHIFT - 10);
ssize_t ret = queue_var_store(&max_sectors_kb, page, count);
-   int ra_kb;
 
if (max_sectors_kb > max_hw_sectors_kb || max_sectors_kb < page_kb)
return -EINVAL;
@@ -3954,14 +3953,6 @@ queue_max_sectors_store(struct request_q
 * values synchronously:
 */
spin_lock_irq(q->queue_lock);
-   /*
-* Trim readahead window as well, if necessary:
-*/
-   ra_kb = q->backing_dev_info.ra_pages << (PAGE_CACHE_SHIFT - 10);
-   if (ra_kb > max_sectors_kb)
-   q->backing_dev_info.ra_pages =
-   max_sectors_kb >> (PAGE_CACHE_SHIFT - 10);
-
q->max_sectors = max_sectors_kb << 1;
spin_unlock_irq(q->queue_lock);
 



[PATCH 05/10] readahead: basic support of interleaved reads

2007-07-23 Thread Fengguang Wu
This is a simplified version of the pagecache context based readahead.
It handles the case of multiple threads reading on the same fd and invalidating
each other's readahead state. It does the trick by scanning the pagecache and
recovering the current read stream's readahead status.

The algorithm works in an opportunistic way, in that it does not try to detect
interleaved reads _actively_, which would require a probe into the page cache
(and thus a little more overhead for random reads). It only tries to handle a
previously started sequential readahead whose state was overwritten by
another concurrent stream, and it can do this job pretty well.

Negative and positive examples (or what you can expect from it):

1) it cannot detect and correctly serve perfect request-by-request interleaved
   reads:
        time    stream 1  stream 2
        0       1
        1                 1001
        2       2
        3                 1002
        4       3
        5                 1003
        6       4
        7                 1004
        8       5
        9                 1005
Here not a single readahead will be carried out.

2) However, if it's two concurrent reads by two threads, the chance of the
   initial sequential readahead being started is high. Once the first sequential
   readahead is started for a stream, this patch will ensure that the readahead
   window continues to ramp up and won't be disturbed by other streams.

        time    stream 1  stream 2
        0       1
        1       2
        2                 1001
        3       3
        4                 1002
        5                 1003
        6       4
        7       5
        8                 1004
        9       6
        10                1005
        11      7
        12                1006
        13                1007
Here stream 1 will start a readahead at page 2, and stream 2 will start its
first readahead at page 1003. From then on the two streams will be served
correctly.
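
The recovery step can be sketched in userspace as follows. This is an
illustration under assumptions, not the patch itself: the page cache is a
plain bool array, struct ra_state and all function names are made up, and
next_ra_size() encodes an assumed ramp-up rule (quadruple small windows,
double larger ones, cap at max).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ra_state {
	unsigned long start;		/* where the next readahead starts */
	unsigned long size;		/* pages in the next readahead */
	unsigned long async_size;
};

/* Toy page cache: cached[i] is true when page i is present. */
static unsigned long next_hole(const bool *cached, unsigned long npages,
			       unsigned long index, unsigned long max_scan)
{
	unsigned long i;

	for (i = 0; i < max_scan; i++) {
		if (index >= npages || !cached[index])
			break;
		index++;
	}
	return index;
}

/* Assumed ramp-up rule; the real helper may differ. */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	unsigned long newsize = (cur < max / 16) ? 4 * cur : 2 * cur;

	return newsize < max ? newsize : max;
}

/*
 * A read hit a readahead-marked page at 'offset', but the saved state
 * belongs to another stream. Rebuild the state from the cache contents.
 * Returns false when no plausible window can be recovered.
 */
static bool recover_ra_state(struct ra_state *ra, const bool *cached,
			     unsigned long npages, unsigned long offset,
			     unsigned long max)
{
	unsigned long start;

	assert(ra != NULL && cached != NULL);
	start = next_hole(cached, npages, offset, max + 1);
	if (!start || start - offset > max)
		return false;

	ra->start = start;
	/* start - offset is the old async_size; ramp it up. */
	ra->size = next_ra_size(start - offset, max);
	ra->async_size = ra->size;
	return true;
}
```

For example, with pages 10..17 cached, a hit at offset 10 and max = 32, the
first hole is page 18, the old window is taken to be 8 pages, and the next
readahead starts at page 18 with 16 pages.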

Cc: Rusty Russell <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 mm/readahead.c |   33 +++--
 1 file changed, 23 insertions(+), 10 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc6-mm1/mm/readahead.c
@@ -371,6 +371,29 @@ ondemand_readahead(struct address_space 
}
 
/*
+* Hit a marked page without valid readahead state.
+* E.g. interleaved reads.
+* Query the pagecache for async_size, which normally equals to
+* readahead size. Ramp it up and use it as the new readahead size.
+*/
+   if (hit_readahead_marker) {
+   pgoff_t start;
+
+   read_lock_irq(&mapping->tree_lock);
+   start = radix_tree_next_hole(&mapping->page_tree, offset, max+1);
+   read_unlock_irq(&mapping->tree_lock);
+
+   if (!start || start - offset > max)
+   return 0;
+
+   ra->start = start;
+   ra->size = start - offset;  /* old async_size */
+   ra->size = get_next_ra_size(ra, max);
+   ra->async_size = ra->size;
+   goto readit;
+   }
+
+   /*
 * It may be one of
 *  - first read on start of file
 *  - sequential cache miss
@@ -381,16 +404,6 @@ ondemand_readahead(struct address_space 
ra->size = get_init_ra_size(req_size, max);
ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
 
-   /*
-* Hit on a marked page without valid readahead state.
-* E.g. interleaved reads.
-* Not knowing its readahead pos/size, bet on the minimal possible one.
-*/
-   if (hit_readahead_marker) {
-   ra->start++;
-   ra->size = get_next_ra_size(ra, max);
-   }
-
 readit:
return ra_submit(ra, mapping, filp);
 }



[PATCH 06/10] readahead: remove the local copy of ra in do_generic_mapping_read()

2007-07-23 Thread Fengguang Wu
The local copy of ra in do_generic_mapping_read() can now go away.

It predates readahead(req_size), from a time when the readahead code was called
on *every* single page. Hence a local copy had to be made to reduce the chance
of the readahead state being overwritten by a concurrent reader. More details in:
Linux: Random File I/O Regressions In 2.6 

Cc: Nick Piggin <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 mm/filemap.c |   20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc6-mm1/mm/filemap.c
@@ -863,7 +863,7 @@ static void shrink_readahead_size_eio(st
  * It may be NULL.
  */
 void do_generic_mapping_read(struct address_space *mapping,
-struct file_ra_state *_ra,
+struct file_ra_state *ra,
 struct file *filp,
 loff_t *ppos,
 read_descriptor_t *desc,
@@ -877,12 +877,11 @@ void do_generic_mapping_read(struct addr
unsigned long prev_index;
unsigned int prev_offset;
int error;
-   struct file_ra_state ra = *_ra;
 
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
-   prev_index = ra.prev_pos >> PAGE_CACHE_SHIFT;
-   prev_offset = ra.prev_pos & (PAGE_CACHE_SIZE-1);
+   prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT;
+   prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1);
last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;
 
@@ -897,7 +896,7 @@ find_page:
page = find_get_page(mapping, index);
if (!page) {
page_cache_sync_readahead(mapping,
-   &ra, filp,
+   ra, filp,
index, last_index - index);
page = find_get_page(mapping, index);
if (unlikely(page == NULL))
@@ -905,7 +904,7 @@ find_page:
}
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
-   &ra, filp, page,
+   ra, filp, page,
index, last_index - index);
}
if (!PageUptodate(page))
@@ -1016,7 +1015,7 @@ readpage:
}
unlock_page(page);
error = -EIO;
-   shrink_readahead_size_eio(filp, &ra);
+   shrink_readahead_size_eio(filp, ra);
goto readpage_error;
}
unlock_page(page);
@@ -1053,10 +1052,9 @@ no_cached_page:
}
 
 out:
-   *_ra = ra;
-   _ra->prev_pos = prev_index;
-   _ra->prev_pos <<= PAGE_CACHE_SHIFT;
-   _ra->prev_pos |= prev_offset;
+   ra->prev_pos = prev_index;
+   ra->prev_pos <<= PAGE_CACHE_SHIFT;
+   ra->prev_pos |= prev_offset;
 
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
if (filp)



[PATCH 07/10] readahead: remove several readahead macros

2007-07-23 Thread Fengguang Wu
Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.
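
For readers wondering what the two page-count macros computed, here is a small
userspace sketch of the arithmetic. The function names are hypothetical. The
formulas mirror the removed MAX_RA_PAGES (rounded down) and MIN_RA_PAGES
(rounded up) definitions.

```c
#include <assert.h>

/* Round down: mirrors VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE. */
static unsigned long kb_to_pages_down(unsigned long kb, unsigned long page_size)
{
	assert(page_size != 0);
	return kb * 1024 / page_size;
}

/* Round up: mirrors DIV_ROUND_UP(VM_MIN_READAHEAD * 1024, PAGE_CACHE_SIZE). */
static unsigned long kb_to_pages_up(unsigned long kb, unsigned long page_size)
{
	assert(page_size != 0);
	return (kb * 1024 + page_size - 1) / page_size;
}
```

With 4 KiB pages, the 128 KiB maximum is 32 pages. With 64 KiB pages, the
16 KiB minimum rounds down to 0 pages but rounds up to 1, which is why the
minimum needed the round-up form.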

Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 include/linux/mm.h |2 --
 mm/readahead.c |   10 +-
 2 files changed, 1 insertion(+), 11 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/include/linux/mm.h
+++ linux-2.6.22-rc6-mm1/include/linux/mm.h
@@ -1148,8 +1148,6 @@ int write_one_page(struct page *page, in
 /* readahead.c */
 #define VM_MAX_READAHEAD   128 /* kbytes */
 #define VM_MIN_READAHEAD   16  /* kbytes (includes current page) */
-#define VM_MAX_CACHE_HIT   256 /* max pages in a row in cache before
-* turning readahead off */
 
 int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
--- linux-2.6.22-rc6-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc6-mm1/mm/readahead.c
@@ -21,16 +21,8 @@ void default_unplug_io_fn(struct backing
 }
 EXPORT_SYMBOL(default_unplug_io_fn);
 
-/*
- * Convienent macros for min/max read-ahead pages.
- * Note that MAX_RA_PAGES is rounded down, while MIN_RA_PAGES is rounded up.
- * The latter is necessary for systems with large page size(i.e. 64k).
- */
-#define MAX_RA_PAGES   (VM_MAX_READAHEAD*1024 / PAGE_CACHE_SIZE)
-#define MIN_RA_PAGES   DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
-
 struct backing_dev_info default_backing_dev_info = {
-   .ra_pages   = MAX_RA_PAGES,
+   .ra_pages   = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
.state  = 0,
.capabilities   = BDI_CAP_MAP_COPY,
.unplug_io_fn   = default_unplug_io_fn,



[PATCH 02/10] readahead: mmap read-around simplification

2007-07-23 Thread Fengguang Wu
Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss
and make it an int.
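
The folding works because "miss > hit + N" and "(miss - hit) > N" are the same
test as long as nothing overflows. A minimal userspace sketch of that
equivalence follows. The struct and function names are made up, the access
model is simplified to one hit or one miss per access, and LOTSAMISS = 100 is
an assumed stand-in for MMAP_LOTSAMISS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LOTSAMISS 100	/* assumed value, stands in for MMAP_LOTSAMISS */

/* Before the patch: separate hit and miss counters. */
struct stat_two {
	unsigned long hit;
	unsigned long miss;
};

/* After the patch: one signed counter holding misses minus hits. */
struct stat_one {
	int miss;
};

static bool skip_readaround_two(const struct stat_two *s)
{
	return s->miss > s->hit + LOTSAMISS;
}

static bool skip_readaround_one(const struct stat_one *s)
{
	return s->miss > LOTSAMISS;
}

/*
 * Replay a trace (true = cache hit, false = miss) through both schemes
 * and report whether they made the same decision after every access.
 */
static bool schemes_agree(const bool *trace, size_t n)
{
	struct stat_two two = { 0, 0 };
	struct stat_one one = { 0 };
	size_t i;

	assert(trace != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (trace[i]) {
			two.hit++;
			one.miss--;	/* a hit cancels one miss */
		} else {
			two.miss++;
			one.miss++;
		}
		if (skip_readaround_two(&two) != skip_readaround_one(&one))
			return false;
	}
	return true;
}
```

The single counter may go negative after a run of hits, which is why it is a
signed int rather than an unsigned type.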

Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 include/linux/fs.h |3 +--
 mm/filemap.c   |4 ++--
 2 files changed, 3 insertions(+), 4 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/include/linux/fs.h
+++ linux-2.6.22-rc6-mm1/include/linux/fs.h
@@ -777,8 +777,7 @@ struct file_ra_state {
   there are only # of pages ahead */
 
unsigned int ra_pages;  /* Maximum readahead window */
-   unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
-   unsigned long mmap_miss;/* Cache miss stat for mmap accesses */
+   int mmap_miss;  /* Cache miss stat for mmap accesses */
unsigned long prev_index;   /* Cache last read() position */
unsigned int prev_offset;   /* Offset where last read() ended in a page */
 };
--- linux-2.6.22-rc6-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc6-mm1/mm/filemap.c
@@ -1389,7 +1389,7 @@ retry_find:
 * Do we miss much more than hit in this file? If so,
 * stop bothering with read-ahead. It will only hurt.
 */
-   if (ra->mmap_miss > ra->mmap_hit + MMAP_LOTSAMISS)
+   if (ra->mmap_miss > MMAP_LOTSAMISS)
goto no_cached_page;
 
/*
@@ -1415,7 +1415,7 @@ retry_find:
}
 
if (!did_readaround)
-   ra->mmap_hit++;
+   ra->mmap_miss--;
 
/*
 * We have a locked page in the page cache, now we need to check



[PATCH 00/10] readahead cleanups and interleaved readahead take 4

2007-07-23 Thread Fengguang Wu
Andrew,

Here are some more readahead related cleanups and updates.

smaller file_ra_state:
[PATCH 01/10] readahead: compacting file_ra_state
[PATCH 02/10] readahead: mmap read-around simplification
[PATCH 03/10] readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

Interleaved readahead:
[PATCH 04/10] radixtree: introduce radix_tree_scan_hole()
[PATCH 05/10] readahead: basic support of interleaved reads

Readahead cleanups:
[PATCH 06/10] readahead: remove the local copy of ra in do_generic_mapping_read()
[PATCH 07/10] readahead: remove several readahead macros
[PATCH 08/10] readahead: remove the limit max_sectors_kb imposed on max_readahead_kb

Filemap cleanups:
[PATCH 09/10] filemap: trivial code cleanups
[PATCH 10/10] filemap: convert some unsigned long to pgoff_t

The diffstat is

 block/ll_rw_blk.c  |9 
 fs/ext3/dir.c  |2 -
 fs/ext4/dir.c  |2 -
 fs/splice.c|2 -
 include/linux/fs.h |   14 +++
 include/linux/mm.h |2 -
 include/linux/pagemap.h|   23 ++--
 include/linux/radix-tree.h |2 +
 lib/radix-tree.c   |   36 +++
 mm/filemap.c   |   65 ---
 mm/readahead.c |   58 +--
 11 files changed, 122 insertions(+), 93 deletions(-)

Regards,
Fengguang Wu


[PATCH 04/10] radixtree: introduce radix_tree_scan_hole()

2007-07-23 Thread Fengguang Wu
Introduce radix_tree_next_hole(root, index, max_scan) to scan the radix tree
for the first hole. It will be used in interleaved readahead.

The implementation is dumb and obviously correct.
It can help debug (and document) a possible smarter one in the future.
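
As a rough illustration of the return convention, here is a userspace model of
the same linear scan. struct toy_tree, toy_lookup() and hole_in_range() are
invented for this sketch and stand in for the radix tree and its lookup. Only
the loop shape follows the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy index set: present[i] != 0 means index i holds an entry. */
struct toy_tree {
	const unsigned char *present;
	unsigned long nr;	/* indexes >= nr are always holes */
};

static bool toy_lookup(const struct toy_tree *t, unsigned long index)
{
	return index < t->nr && t->present[index];
}

/* Same linear scan as the patch, over the toy set. */
static unsigned long toy_next_hole(const struct toy_tree *t,
				   unsigned long index, unsigned long max_scan)
{
	unsigned long i;

	assert(t != NULL);
	for (i = 0; i < max_scan; i++) {
		if (!toy_lookup(t, index))
			break;
		index++;
		if (index == 0)		/* wrapped past the largest index */
			break;
	}
	return index;
}

/* Caller-side check from the kernel-doc: was a hole found in range? */
static bool hole_in_range(unsigned long ret, unsigned long index,
			  unsigned long max_scan)
{
	return ret - index < max_scan;
}
```

When every scanned index is populated, the function returns index + max_scan,
so the caller sees 'return - index >= max_scan' and knows no hole was found.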

Cc: Nick Piggin <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---

 include/linux/radix-tree.h |2 +
 lib/radix-tree.c   |   36 +++
 2 files changed, 38 insertions(+)

--- linux-2.6.22-rc6-mm1.orig/include/linux/radix-tree.h
+++ linux-2.6.22-rc6-mm1/include/linux/radix-tree.h
@@ -155,6 +155,8 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned long radix_tree_next_hole(struct radix_tree_root *root,
+   unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
--- linux-2.6.22-rc6-mm1.orig/lib/radix-tree.c
+++ linux-2.6.22-rc6-mm1/lib/radix-tree.c
@@ -601,6 +601,42 @@ int radix_tree_tag_get(struct radix_tree
 EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
+/**
+ * radix_tree_next_hole - find the next hole (not-present entry)
+ * @root:  tree root
+ * @index: index key
+ * @max_scan:  maximum range to search
+ *
+ * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the lowest
+ * indexed hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'return - index >= max_scan'
+ * will be true).
+ *
+ * radix_tree_next_hole may be called under rcu_read_lock. However, like
+ * radix_tree_gang_lookup, this will not atomically search a snapshot of the
+ * tree at a single point in time. For example, if a hole is created at index
+ * 5, then subsequently a hole is created at index 10, radix_tree_next_hole
+ * covering both indexes may return 10 if called under rcu_read_lock.
+ */
+unsigned long radix_tree_next_hole(struct radix_tree_root *root,
+   unsigned long index, unsigned long max_scan)
+{
+   unsigned long i;
+
+   for (i = 0; i < max_scan; i++) {
+   if (!radix_tree_lookup(root, index))
+   break;
+   index++;
+   if (index == 0)
+   break;
+   }
+
+   return index;
+}
+EXPORT_SYMBOL(radix_tree_next_hole);
+
 static unsigned int
 __lookup(struct radix_tree_node *slot, void **results, unsigned long index,
unsigned int max_items, unsigned long *next_index)



Re: [PATCH] posix-timer: fix deletion race

2007-07-23 Thread Jeremy Katz

On Fri, 20 Jul 2007, Oleg Nesterov wrote:


On 07/18, Jeremy Katz wrote:


On Wed, 18 Jul 2007, Oleg Nesterov wrote:


Jeremy, I agree with Thomas that your patch should not be right, but it
does make a difference. Perhaps this is just the timing, but who knows.
Could you add some printk's to be sure that lock_timer() actually fails
while it never should?


Agreed.

Unfortunately, adding any significant output appears to alter the
situation to the point where the issue either does not occur, or takes
significantly longer to surface.


No, no, I didn't mean any significant output. You changed itimer_delete()

>  -   spin_lock_irqsave(&timer->it_lock, flags);
>  +   /* timer already deleted? */
>  +   if (lock_timer(timer->it_id, &flags) == NULL)
>  +   return;

This change should not help, lock_timer() should always succeed here.
But since it makes a difference, we can make something like

if (lock_timer(timer->it_id, &flags) == NULL) {
printk("Impossible! but it happened.\n");
return;
}

The same for posix_timer_fn().


Ahh, of course.  I did try that at some point, and remember seeing at 
least the occasional failure.  This time, taking the spinlock and then 
checking for a valid timer ID, I did not see the locking fail.  I did see 
the attempt to use a freed sigqueue, further suggesting my 'fix' merely 
altered the timing sufficiently to hide the issue.



I still can't believe we have a double-free problem, this looks impossible.
Do you see the

"idr_remove called for id=%d which is not allocated.\n"

in syslog?


No.  I also added some accounting with atomic counters, and don't see 
evidence of a second call to release_posix_timer.



Could you try the patch below? Perhaps we have some weird problem with
->sigq corruption.


Tried, with apparent effect.

To add to the body of data: Turning off hyperthreading in the hardware 
does not resolve the issue.  Limiting the system to one CPU does appear to 
work.



Jeremy


Re: [kvm-devel] [RFC 0/8]KVM: swap out guest pages

2007-07-23 Thread Shaohua Li
On Mon, 2007-07-23 at 20:34 +0800, Christoph Hellwig wrote:
> On Mon, Jul 23, 2007 at 03:29:36PM +0300, Avi Kivity wrote:
> > >Actually it requires lots of deep down VM internals symbols that'll never
> > >get exported.
> > >
> > > 
> >
> > What's "it" here?  kvm-specific address space or generic vmas.
> 
> The patches in this thread.
> 
> > Generic vmas will be more intrusive AFAICT.
> 
> People use intrusive differently.  Doing big changes to core code is not
> a problem if we actually get a proper interface.  Just exporting core
> function without other changes and then writing code in modules that
> pokes into internals is much much worse.
The patch swaps out pages the same way shm does. The only difference
is that kvm is a module while shm is not. Why can't kvm use the symbols shm
uses?

Sure, it's possible to write guest memory to a file and so avoid using the
symbols. If you really hate this, I'll consider that alternative.

Thanks,
Shaohua


Re: [RFC 7/8]KVM: swap out guest pages

2007-07-23 Thread Shaohua Li
On Mon, 2007-07-23 at 19:32 +0800, Avi Kivity wrote:
> Shaohua Li wrote:
> > Make KVM guest pages be allocated dynamically and able to be swapped out.
> >
> > One issue: all inodes returned from anon_inode_getfd are shared,
> > if one module changes field of the inode, other modules might break.
> > Should we introduce a new API to not share inode?
> >
> > Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
> > ---
> >  drivers/kvm/kvm.h  |8 +
> >  drivers/kvm/kvm_main.c |  220
> +
> >  2 files changed, 211 insertions(+), 17 deletions(-)
> >
> > +
> > + /*
> > +  * We just zap vcpu 0's page table. For a SMP guest, we should zap all
> > +  * vcpus'. It's better shadow page table is per-vm.
> > +  */
> > + if (PagePrivate(page))
> > + kvm_mmu_zap_pagetbl(&kvm->vcpus[0], page->index);
> > +
> >  
> 
> You're not removing any shadows of the page, in case that page is a
> guest page table.  But I don't see anything wrong with it -- the page
> won't change while it's in swap.
You are right. Should we?


Re: [RFC 6/8]KVM: introduce kvm_mmu_zap_pagetbl

2007-07-23 Thread Shaohua Li
On Mon, 2007-07-23 at 19:16 +0800, Avi Kivity wrote:
> Shaohua Li wrote:
> > add a routine to zap all shadow pgtble for a gfn. If kvm supports SMP,
> > the API should zap pgtble for all vcpus, but kvm shadow page table
> > really should be per-vm, instead of per-vcpu.
> >
> >  
> 
> kvm shadow page tables _are_ per-vm.  Current kvm.git even makes that
> more explicit where functions that remove stuff (like rmap_remove())
> don't require a vcpu.
Ok, I didn't check the latest kvm.git.
> 
> > Index: linux/drivers/kvm/kvm.h
> > ===
> > --- linux.orig/drivers/kvm/kvm.h  2007-07-20 14:19:15.0 +0800
> > +++ linux/drivers/kvm/kvm.h   2007-07-20 14:26:10.0 +0800
> > @@ -621,6 +621,7 @@ int kvm_mmu_unprotect_page_virt(struct k
> >  void kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
> >  int kvm_mmu_load(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> > +void kvm_mmu_zap_pagetbl(struct kvm_vcpu *vcpu, u64 gfn);
> > 
> >  int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run);
> > 
> > Index: linux/drivers/kvm/mmu.c
> > ===
> > --- linux.orig/drivers/kvm/mmu.c  2007-07-20 14:25:25.0 +0800
> > +++ linux/drivers/kvm/mmu.c   2007-07-20 14:26:10.0 +0800
> > @@ -1324,6 +1324,34 @@ void kvm_mmu_zap_all(struct kvm_vcpu *vc
> >   init_kvm_mmu(vcpu);
> >  }
> > 
> > +/* FIXME: this should zap all vcpu's shadow pgtbl for gfn */
> > +void kvm_mmu_zap_pagetbl(struct kvm_vcpu *vcpu, u64 gfn)
> > +{
> > + struct kvm *kvm = vcpu->kvm;
> > + struct kvm_rmap_desc *desc;
> > + struct page *page;
> > + u64 *spte;
> > +
> > + page = gfn_to_page(kvm, gfn);
> > + BUG_ON(!page);
> > +
> > + while (page_private(page)) {
> > + if (!(page_private(page) & 1))
> > + spte = (u64 *)page_private(page);
> > + else {
> > + desc = (struct kvm_rmap_desc *)(page_private(page) & ~1ul);
> > + spte = desc->shadow_ptes[0];
> > + }
> > + BUG_ON(!spte);
> > + BUG_ON((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT
> > + != page_to_pfn(page));
> > + BUG_ON(!(*spte & PT_PRESENT_MASK));
> > + rmap_remove(vcpu, spte);
> > + *spte = 0;
> > + }
> > + kvm_flush_remote_tlbs(vcpu->kvm);
> > +}
> > +
> 
> Suggest kvm_mmu_unmap_page() as a name for this.
ok.

Thanks,
Shaohua


Re: [PATCH 1/7] lguest: documentation pt I: Preparation

2007-07-23 Thread Rusty Russell
On Mon, 2007-07-23 at 18:18 -0700, Linus Torvalds wrote:
> 
> On Tue, 24 Jul 2007, Rusty Russell wrote:
> > 
> > Indeed, no code changes, and I feel strongly that it should go into
> > 2.6.23 because it's *fun*.   And (as often complained) there's not
> > enough poetry in the kernel.
> 
> There's a reason for that.
> 
>   There once was a lad from Braidwood
>   With a wife and a hatred for FUD
> He hacked kernels for fun,
> couldn't get them to run.
>   But he always felt that he should.
> 
> See?

There once was a virtualization coder,
Whose patches kept getting older,
  Each time upstream would drop,
  His documentation would slightly rot,
SO APPLY MY FUCKING PATCHES OR I'LL KEEP WRITING LIMERICKS.

Thanks!
Rusty.




Re: [RFC 1/8]KVM: fix bugs in kvm sched integration patch

2007-07-23 Thread Shaohua Li
On Mon, 2007-07-23 at 18:46 +0800, Avi Kivity wrote:
> Shaohua Li wrote:
> > fix some bugs in kvm-sch patch.
> >  
> 
> There is now a 'preempt-hooks' branch on kvm.git with the preempt-hooks
> work.  I'll continually update and rebase it against master.
> 
> > 1. vmcs_readl/vmcs_writel are called with preempt enabled
> >  
> 
> Why is that bad?
1. raw_smp_processor_id() returns the current cpu id
2. the task migrates to another cpu
3. current->kvm_vcpu->cpu != the cpu id from step 1.
Then you will see the warning.

Thanks,
Shaohua


Re: [RFC 0/8]KVM: swap out guest pages

2007-07-23 Thread Shaohua Li
On Mon, 2007-07-23 at 18:27 +0800, Avi Kivity wrote:
> Shaohua Li wrote:
> > This patch series make kvm guest pages be able to be swapped out and
> > dynamically allocated. Without it, all guest memory is allocated at
> > guest start time.
> >
> > patches are against latest git, and you need first patch Avi's kvm-sch
> > integration patch
> >
> (http://sourceforge.net/mailarchive/forum.php?thread_name=11841693332609-git-send-email-avi%40qumranet.com&forum_name=kvm-devel
>  ).
> >
> > Patch is quite stable in my test. With the patch, I can run a 256M
> > memory guest in a 300M memory host.
> 
> What about the opposite?
> 
> > If guest is idle, the memory it used
> > can be less than 10M. I did a simple performance test (measure kernel
> > build time in guest), if there is little swapping, the performance
> > difference w/wo the patch isn't significant. If you have a better
> > measurement approach, please let me try.
> >
> > Unresolved issue:
> > 1. swapoff doesn't work, we need a hook.
> > 2. SMP guest might not work, as kvm doesn't support smp till now.
> > 3. better algorithm to select swapped out guest pages according to
> > guest's memory usage.
> > Maybe more.
> >
> > Any suggestions and comments are appreciated.
> >  
> 
> The big question is whether to have kvm's own address_space or not.
> 
> Having an address_space (like your patch does) is remarkably simple, and
> requires few hooks from the current vm.  However using existing vmas
> mapped by the user has many advantages:
> 
> - compatible with s390 requirements
> - allows the user to use hugetlbfs pages, which have a performance
> advantage using ept/npt (but which are unswappable)
> - allows the user to map a file (which can be regarded as a way to specify
> the swap device)
> - better integration with the rest of the vm
> 
> I am quite torn between the simplicity of your approach and the
> advantages of using generic vmas.  However, s390 pretty much forces our
> hand.
> 
> What is your opinion of extending generic vmas to back kvm guest memory?
Several issues:
1. A vma manages userspace addresses, but a kvm guest uses the full address
space.
2. qemu itself must use some address space.
3. kvm needs special page fault handling for the shadow page table. Generic
page table operations can't be used directly for the guest.
I have no idea whether your idea is feasible. The s390 guys said their shadow
page table is the same as the host's, which is why they can easily implement
swap; x86 is hard.

Thanks,
Shaohua


Re: 2.6.23-rc1: i386 section mismatch warnings

2007-07-23 Thread Al Viro
On Mon, Jul 23, 2007 at 09:18:38PM -0400, Jeff Garzik wrote:
> make allmodconfig on i386:
> 
> WARNING: vmlinux(.text+0xc0101183): Section mismatch: reference to 

Ignore.  vmlinux.o ones are interesting; so are ones in modules.
vmlinux ones are either duplicates of vmlinux.o or false positives.


Re: [PATCH 1/7] lguest: documentation pt I: Preparation

2007-07-23 Thread Rusty Russell
On Mon, 2007-07-23 at 18:20 -0700, Andrew Morton wrote:
> On Tue, 24 Jul 2007 11:01:48 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:
> > But writing this documentation took *weeks*, to document 5000 lines of
> > code.  Perhaps this effort, if merged, will inspire others, but I've
> > seen little indication so far: we have enough trouble getting them
> > documenting a single public function.
> 
> Yeah, I suspect there will be insufficient interest and energy for anyone
> else to take this anywhere.
> 
> Could you please redo the changes after "link lguest example launcher
> non-static", which made a fairly big mess?

Indeed... which is why I was waiting to see if it got into Linus' tree
(I sent it before rc1, but too late now).

I'm waiting until Linus says he'll apply my documentation patches, then
I'll rejig that patch...

Thanks,
Rusty.



Re: [PATCH 1/7] lguest: documentation pt I: Preparation

2007-07-23 Thread Andrew Morton
On Tue, 24 Jul 2007 11:01:48 +1000
Rusty Russell <[EMAIL PROTECTED]> wrote:

> > But I worry that you're proposing adding what appears to be new
> > Documentation-related machinery and infrastructure when there's already
> > increased activity in that area from other people and we might all be
> > headed in different directions and stuff.
> 
> It does add an lguest-specific script: I aimed for the minimal
> documentation script required to weave the code and documentation
> (basically extracts and orders by comment prefix, because code order
> isn't the same as optimal documentation order).
> 
> This is great for lguest, where the entire codebase is small and
> self-contained enough to be woven into a narrative, and the maintainer
> is prepared to put in the cycles to keep it uptodate.
> 
> But writing this documentation took *weeks*, to document 5000 lines of
> code.  Perhaps this effort, if merged, will inspire others, but I've
> seen little indication so far: we have enough trouble getting them
> documenting a single public function.

Yeah, I suspect there will be insufficient interest and energy for anyone
else to take this anywhere.

Could you please redo the changes after "link lguest example launcher
non-static", which made a fairly big mess?




Re: [PATCH 1/7] lguest: documentation pt I: Preparation

2007-07-23 Thread Linus Torvalds


On Tue, 24 Jul 2007, Rusty Russell wrote:
> 
> Indeed, no code changes, and I feel strongly that it should go into
> 2.6.23 because it's *fun*.   And (as often complained) there's not
> enough poetry in the kernel.

There's a reason for that.

There once was a lad from Braidwood
With a wife and a hatred for FUD
  He hacked kernels for fun,
  couldn't get them to run.
But he always felt that he should.

See?

So when you say "there's not enough poetry", next time you'll know why. 
You *really* don't want poetry.

Linus

PS. Nothing rhymes with Ballalaba.


2.6.23-rc1: i386 section mismatch warnings

2007-07-23 Thread Jeff Garzik

make allmodconfig on i386:

WARNING: vmlinux(.text+0xc0101183): Section mismatch: reference to .init.text:start_kernel (between 'is386' and 'check_x87')
WARNING: vmlinux(.text+0xc02cfcdb): Section mismatch: reference to .init.text:kernel_init (between 'rest_init' and 'kthreadd_setup')
WARNING: vmlinux(.text+0xc02d5ed2): Section mismatch: reference to .init.text: (between 'iret_exc' and '_etext')
WARNING: vmlinux(.text+0xc02d5ede): Section mismatch: reference to .init.text: (between 'iret_exc' and '_etext')
WARNING: vmlinux(.text+0xc02d5eea): Section mismatch: reference to .init.text: (between 'iret_exc' and '_etext')
WARNING: vmlinux(.text+0xc02d5ef6): Section mismatch: reference to .init.text: (between 'iret_exc' and '_etext')
WARNING: vmlinux(.text+0xc02cfda4): Section mismatch: reference to .init.text:__alloc_bootmem_node (between 'alloc_node_mem_map' and 'zone_wait_table_init')
WARNING: vmlinux(.text+0xc02cfe4e): Section mismatch: reference to .init.text:__alloc_bootmem_node (between 'zone_wait_table_init' and 'vgacon_scrollback_startup')
WARNING: vmlinux(.text+0xc02d64c6): Section mismatch: reference to .init.text: (between 'iret_exc' and '_etext')
WARNING: vmlinux(.text+0xc02cfea7): Section mismatch: reference to .init.text:__alloc_bootmem (between 'vgacon_scrollback_startup' and 'fb_find_logo')
WARNING: vmlinux(.text+0xc02cfecb): Section mismatch: reference to .init.data:logo_linux_mono (between 'fb_find_logo' and 'schedule')
WARNING: vmlinux(.text+0xc02cfed5): Section mismatch: reference to .init.data:logo_linux_clut224 (between 'fb_find_logo' and 'schedule')
WARNING: vmlinux(.text+0xc02cfeda): Section mismatch: reference to .init.data:logo_linux_vga16 (between 'fb_find_logo' and 'schedule')
WARNING: vmlinux(.text+0xc02d6612): Section mismatch: reference to .init.text: (between 'iret_exc' and '_etext')



Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-23 Thread Andrew Morton
On Tue, 24 Jul 2007 08:47:28 +0800
Fengguang Wu <[EMAIL PROTECTED]> wrote:

> Subject: convert ill defined log2() to ilog2()
> 
> It's *wrong* to have
>   #define log2(n) ffz(~(n))
> It should be *reversed*:
>   #define log2(n) flz(~(n))
> or
>   #define log2(n) fls(n)
> or just use
>   ilog2(n) defined in linux/log2.h.
> 
> This patch follows the last solution, recommended by Andrew Morton.
> 
> // Or are they simply misnamed, and in fact correct in behavior?
> 
> Cc: [EMAIL PROTECTED]
> Cc: Mingming Cao <[EMAIL PROTECTED]>
> Cc: Bjorn Helgaas <[EMAIL PROTECTED]>
> Cc: Chris Ahna <[EMAIL PROTECTED]>
> Cc: David Mosberger-Tang <[EMAIL PROTECTED]>
> Cc: Kyle McMartin <[EMAIL PROTECTED]>
> Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> ---
>  drivers/char/agp/hp-agp.c |9 +++--
>  drivers/char/agp/i460-agp.c   |5 ++---
>  drivers/char/agp/parisc-agp.c |7 ++-
>  fs/ext4/super.c   |6 ++
>  4 files changed, 9 insertions(+), 18 deletions(-)

hm, yes, there is a risk that the code was accidentally correct.  Or the
code has only ever dealt with power-of-2 inputs, in which case it happens
to work either way.
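
To make the power-of-2 point concrete, here is a minimal userspace sketch. It does not use the kernel's ffz() or ilog2(); the two helpers below (lowest_set_bit and highest_set_bit, names invented for this example) are plain-C stand-ins for what ffz(~(n)) and ilog2(n) compute on a non-zero input.

```c
#include <assert.h>

/*
 * Index of the lowest set bit in n (n must be non-zero).
 * This is what ffz(~(n)) yields: the first zero bit of ~n is the
 * first one bit of n.
 */
static unsigned int lowest_set_bit(unsigned long n)
{
	unsigned int i = 0;

	assert(n != 0);
	while (!(n & 1UL)) {
		n >>= 1;
		i++;
	}
	return i;
}

/*
 * Index of the highest set bit in n (n must be non-zero), that is
 * floor(log2(n)), which is the value ilog2(n) is meant to return.
 */
static unsigned int highest_set_bit(unsigned long n)
{
	unsigned int i = 0;

	assert(n != 0);
	while ((n >>= 1) != 0)
		i++;
	return i;
}
```

For a power of two such as 4096 only one bit is set, so both helpers return 12. For 12 (binary 1100) the first returns 2 and the second 3, which is where the old macro and ilog2() would diverge. Note also that the kernel's fls() counts bits from 1, so fls(n) is one larger than ilog2(n) for non-zero n.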

David(s) and ext4-people: could we please have a close review of these
changes?

Thanks.

> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c
> @@ -14,15 +14,12 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
>  #include "agp.h"
>  
> -#ifndef log2
> -#define log2(x)  ffz(~(x))
> -#endif
> -
>  #define HP_ZX1_IOC_OFFSET0x1000  /* ACPI reports SBA, we want IOC */
>  
>  /* HP ZX1 IOC registers */
> @@ -256,7 +253,7 @@ hp_zx1_configure (void)
>   readl(hp->ioc_regs+HP_ZX1_IMASK);
>   writel(hp->iova_base|1, hp->ioc_regs+HP_ZX1_IBASE);
>   readl(hp->ioc_regs+HP_ZX1_IBASE);
> - writel(hp->iova_base|log2(HP_ZX1_IOVA_SIZE), 
> hp->ioc_regs+HP_ZX1_PCOM);
> + writel(hp->iova_base|ilog2(HP_ZX1_IOVA_SIZE), 
> hp->ioc_regs+HP_ZX1_PCOM);
>   readl(hp->ioc_regs+HP_ZX1_PCOM);
>   }
>  
> @@ -284,7 +281,7 @@ hp_zx1_tlbflush (struct agp_memory *mem)
>  {
>   struct _hp_private *hp = &hp_private;
>  
> - writeq(hp->gart_base | log2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
> + writeq(hp->gart_base | ilog2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
>   readq(hp->ioc_regs+HP_ZX1_PCOM);
>  }
>  
> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/i460-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/i460-agp.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "agp.h"
>  
> @@ -59,8 +60,6 @@
>   */
>  #define WR_FLUSH_GATT(index) RD_GATT(index)
>  
> -#define log2(x)  ffz(~(x))
> -
>  static struct {
>   void *gatt; /* ioremap'd GATT area */
>  
> @@ -148,7 +147,7 @@ static int i460_fetch_size (void)
>* values[i].size.
>*/
>   values[i].num_entries = (values[i].size << 8) >> 
> (I460_IO_PAGE_SHIFT - 12);
> - values[i].page_order = log2((sizeof(u32)*values[i].num_entries) 
> >> PAGE_SHIFT);
> + values[i].page_order = 
> ilog2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
>   }
>  
>   for (i = 0; i < agp_bridge->driver->num_aperture_sizes; i++) {
> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/parisc-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/parisc-agp.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -27,10 +28,6 @@
>  #define DRVNAME  "quicksilver"
>  #define DRVPFX   DRVNAME ": "
>  
> -#ifndef log2
> -#define log2(x)  ffz(~(x))
> -#endif
> -
>  #define AGP8X_MODE_BIT   3
>  #define AGP8X_MODE   (1 << AGP8X_MODE_BIT)
>  
> @@ -92,7 +89,7 @@ parisc_agp_tlbflush(struct agp_memory *m
>  {
>   struct _parisc_agp_info *info = &parisc_agp_info;
>  
> - writeq(info->gart_base | log2(info->gart_size), 
> info->ioc_regs+IOC_PCOM);
> + writeq(info->gart_base | ilog2(info->gart_size), 
> info->ioc_regs+IOC_PCOM);
>   readq(info->ioc_regs+IOC_PCOM); /* flush */
>  }
>  
> --- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c
> +++ linux-2.6.22-rc6-mm1/fs/ext4/super.c
> @@ -1433,8 +1433,6 @@ static void ext4_orphan_cleanup (struct 
>   sb->s_flags = s_flags; /* Restore MS_RDONLY status */
>  }
>  
> -#define log2(n) ffz(~(n))
> -
>  /*
>   * Maximal file size.  There is a direct, and {,double-,triple-}indirect
>   * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
> @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
>   sbi->s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb);
>   sbi->s_sbh = bh;
>   sbi->s_mount_state = le16_to_cpu(es->s_state);
> - sbi->s_addr_per_block_bits = l

[PATCH] sata_qstor, pdc_adma, sata_sx4: convert to new EH

2007-07-23 Thread Jeff Garzik

This is just a refresh of the existing libata-dev.git#new-eh patches
that convert all remaining old-EH drivers to new EH, against 2.6.23-rc1.

All three conversions are completely untested.  pdc_adma and sata_qstor
need reviewing by someone with docs, in addition to testing.

Even "it still works" or "this patch breaks stuff" feedback from users
is useful.

Jeff




commit 99ad0b4cd2d73815db6370fa6f89d7053ca21016
Author: Jeff Garzik <[EMAIL PROTECTED]>
Date:   Mon May 28 06:21:45 2007 -0400

[libata] sata_sx4: convert to new EH

Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>

commit 9b105869fb953485ff7e7a8f9bb9f3dcae8aa502
Author: Jeff Garzik <[EMAIL PROTECTED]>
Date:   Sat May 26 19:48:07 2007 -0400

[libata] sata_qstor: rough draft conversion to new libata EH

Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>

commit 1a9161f86ef2ae955267c78f1c16273cad4fdbb4
Author: Jeff Garzik <[EMAIL PROTECTED]>
Date:   Fri Jul 6 19:28:32 2007 -0400

[libata] pdc_adma: rough draft conversion to new-EH

Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>

 drivers/ata/pdc_adma.c   |   82 
 drivers/ata/sata_qstor.c |   65 +++
 drivers/ata/sata_sx4.c   |   96 +--
 3 files changed, 174 insertions(+), 69 deletions(-)

diff --git a/drivers/ata/pdc_adma.c b/drivers/ata/pdc_adma.c
index bec1de5..bf4dba0 100644
--- a/drivers/ata/pdc_adma.c
+++ b/drivers/ata/pdc_adma.c
@@ -92,6 +92,8 @@ enum {
 
/* CPB bits */
cDONE   = (1 << 0),
+   cATERR  = (1 << 3),
+
cVLD= (1 << 0),
cDAT= (1 << 2),
cIEN= (1 << 3),
@@ -131,14 +133,15 @@ static int adma_ata_init_one (struct pci_dev *pdev,
 static int adma_port_start(struct ata_port *ap);
 static void adma_host_stop(struct ata_host *host);
 static void adma_port_stop(struct ata_port *ap);
-static void adma_phy_reset(struct ata_port *ap);
 static void adma_qc_prep(struct ata_queued_cmd *qc);
 static unsigned int adma_qc_issue(struct ata_queued_cmd *qc);
 static int adma_check_atapi_dma(struct ata_queued_cmd *qc);
 static void adma_bmdma_stop(struct ata_queued_cmd *qc);
 static u8 adma_bmdma_status(struct ata_port *ap);
 static void adma_irq_clear(struct ata_port *ap);
-static void adma_eng_timeout(struct ata_port *ap);
+static void adma_freeze(struct ata_port *ap);
+static void adma_thaw(struct ata_port *ap);
+static void adma_error_handler(struct ata_port *ap);
 
 static struct scsi_host_template adma_ata_sht = {
.module = THIS_MODULE,
@@ -165,12 +168,13 @@ static const struct ata_port_operations adma_ata_ops = {
.exec_command   = ata_exec_command,
.check_status   = ata_check_status,
.dev_select = ata_std_dev_select,
-   .phy_reset  = adma_phy_reset,
.check_atapi_dma= adma_check_atapi_dma,
.data_xfer  = ata_data_xfer,
.qc_prep= adma_qc_prep,
.qc_issue   = adma_qc_issue,
-   .eng_timeout= adma_eng_timeout,
+   .freeze = adma_freeze,
+   .thaw   = adma_thaw,
+   .error_handler  = adma_error_handler,
.irq_clear  = adma_irq_clear,
.irq_on = ata_irq_on,
.irq_ack= ata_irq_ack,
@@ -184,7 +188,7 @@ static const struct ata_port_operations adma_ata_ops = {
 static struct ata_port_info adma_port_info[] = {
/* board_1841_idx */
{
-   .flags  = ATA_FLAG_SLAVE_POSS | ATA_FLAG_SRST |
+   .flags  = ATA_FLAG_SLAVE_POSS |
  ATA_FLAG_NO_LEGACY | ATA_FLAG_MMIO |
  ATA_FLAG_PIO_POLLING,
.pio_mask   = 0x10, /* pio4 */
@@ -273,24 +277,41 @@ static inline void adma_enter_reg_mode(struct ata_port 
*ap)
readb(chan + ADMA_STATUS);  /* flush */
 }
 
-static void adma_phy_reset(struct ata_port *ap)
+static void adma_freeze(struct ata_port *ap)
 {
-   struct adma_port_priv *pp = ap->private_data;
+   void __iomem *chan = ADMA_PORT_REGS(ap);
+
+   /* mask/clear ATA interrupts */
+   writeb(ATA_NIEN, ap->ioaddr.ctl_addr);
+   ata_check_status(ap);
+
+   /* reset ADMA to idle state */
+   writew(aPIOMD4 | aNIEN | aRSTADM, chan + ADMA_CONTROL);
+   udelay(2);
+   writew(aPIOMD4 | aNIEN, chan + ADMA_CONTROL);
+   udelay(2);
+}
 
-   pp->state = adma_state_idle;
+static void adma_thaw(struct ata_port *ap)
+{
adma_reinit_engine(ap);
-   ata_port_probe(ap);
-   ata_bus_reset(ap);
 }
 
-static void adma_eng_timeout(struct ata_port *ap)
+static int adma_prereset(struct ata_port *ap, unsigned long deadline)
 {
struct adma_port_priv *pp =

[PATCH] sata_mv: non-working NCQ support

2007-07-23 Thread Jeff Garzik

This patch adds NCQ support to the sata_mv driver.  Currently it does
not work:  FPDMA commands time out, and eventually EH falls back to
non-NCQ, which works.

My attention has turned to other things for the moment.  Anybody interested
in sata_mv NCQ is encouraged to pick up where this left off, as this
patch, buggy or not, includes the changes that will be required to
enable command queueing in sata_mv.


commit 6bef64243c68bf637b5594a0a363d8105efedfa6
Author: Jeff Garzik <[EMAIL PROTECTED]>
Date:   Wed Jul 11 18:56:46 2007 -0400

[libata mv-ncq] sata_mv: Add NCQ support

Currently not working.

Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>

 drivers/ata/sata_mv.c |   70 +++---
 1 file changed, 49 insertions(+), 21 deletions(-)

diff --git a/drivers/ata/sata_mv.c b/drivers/ata/sata_mv.c
index 8ec5208..228f71a 100644
--- a/drivers/ata/sata_mv.c
+++ b/drivers/ata/sata_mv.c
@@ -29,8 +29,6 @@
   I distinctly remember a couple workarounds (one related to PCI-X)
   are still needed.
 
-  4) Add NCQ support (easy to intermediate, once new-EH support appears)
-
   5) Investigate problems with PCI Message Signalled Interrupts (MSI).
 
   6) Add port multiplier support (intermediate)
@@ -417,6 +415,7 @@ static void mv_error_handler(struct ata_port *ap);
 static void mv_post_int_cmd(struct ata_queued_cmd *qc);
 static void mv_eh_freeze(struct ata_port *ap);
 static void mv_eh_thaw(struct ata_port *ap);
+static void mv6_dev_config(struct ata_device *dev);
 static int mv_init_one(struct pci_dev *pdev, const struct pci_device_id *ent);
 
 static void mv5_phy_errata(struct mv_host_priv *hpriv, void __iomem *mmio,
@@ -440,6 +439,8 @@ static void mv6_reset_flash(struct mv_host_priv *hpriv, 
void __iomem *mmio);
 static void mv_reset_pci_bus(struct pci_dev *pdev, void __iomem *mmio);
 static void mv_channel_reset(struct mv_host_priv *hpriv, void __iomem *mmio,
 unsigned int port_no);
+static void mv_edma_cfg(struct ata_port *ap, struct mv_host_priv *hpriv,
+   void __iomem *port_mmio);
 
 static struct scsi_host_template mv5_sht = {
.module = THIS_MODULE,
@@ -464,7 +465,8 @@ static struct scsi_host_template mv6_sht = {
.name   = DRV_NAME,
.ioctl  = ata_scsi_ioctl,
.queuecommand   = ata_scsi_queuecmd,
-   .can_queue  = ATA_DEF_QUEUE,
+   .change_queue_depth = ata_scsi_change_queue_depth,
+   .can_queue  = MV_MAX_Q_DEPTH - 1,
.this_id= ATA_SHT_THIS_ID,
.sg_tablesize   = MV_MAX_SG_CT,
.cmd_per_lun= ATA_SHT_CMD_PER_LUN,
@@ -510,6 +512,7 @@ static const struct ata_port_operations mv5_ops = {
 
 static const struct ata_port_operations mv6_ops = {
.port_disable   = ata_port_disable,
+   .dev_config = mv6_dev_config,
 
.tf_load= ata_tf_load,
.tf_read= ata_tf_read,
@@ -590,26 +593,29 @@ static const struct ata_port_info mv_port_info[] = {
.port_ops   = &mv5_ops,
},
{  /* chip_604x */
-   .flags  = MV_COMMON_FLAGS | MV_6XXX_FLAGS,
+   .flags  = MV_COMMON_FLAGS | MV_6XXX_FLAGS |
+ ATA_FLAG_NCQ,
.pio_mask   = 0x1f, /* pio0-4 */
.udma_mask  = ATA_UDMA6,
.port_ops   = &mv6_ops,
},
{  /* chip_608x */
.flags  = MV_COMMON_FLAGS | MV_6XXX_FLAGS |
- MV_FLAG_DUAL_HC,
+ MV_FLAG_DUAL_HC | ATA_FLAG_NCQ,
.pio_mask   = 0x1f, /* pio0-4 */
.udma_mask  = ATA_UDMA6,
.port_ops   = &mv6_ops,
},
{  /* chip_6042 */
-   .flags  = MV_COMMON_FLAGS | MV_6XXX_FLAGS,
+   .flags  = MV_COMMON_FLAGS | MV_6XXX_FLAGS |
+ ATA_FLAG_NCQ,
.pio_mask   = 0x1f, /* pio0-4 */
.udma_mask  = ATA_UDMA6,
.port_ops   = &mv_iie_ops,
},
{  /* chip_7042 */
-   .flags  = MV_COMMON_FLAGS | MV_6XXX_FLAGS,
+   .flags  = MV_COMMON_FLAGS | MV_6XXX_FLAGS |
+ ATA_FLAG_NCQ,
.pio_mask   = 0x1f, /* pio0-4 */
.udma_mask  = ATA_UDMA6,
.port_ops   = &mv_iie_ops,
@@ -808,18 +814,23 @@ static void mv_set_edma_ptrs(void __iomem *port_mmio,
  *  LOCKING:
  *  Inherited from caller.
  */
-static void mv_start_dma(void __iomem *base, struct mv_host_priv *hpriv,
+static void mv_start_dma(struct ata_port *ap, void __iomem *base,
 struct mv_port_priv *pp)
 {
if (!(pp->pp_flags &

Re: [PATCH 1/7] lguest: documentation pt I: Preparation

2007-07-23 Thread Rusty Russell
On Mon, 2007-07-23 at 17:12 -0700, Andrew Morton wrote:
> On Sat, 21 Jul 2007 11:17:58 +1000
> Rusty Russell <[EMAIL PROTECTED]> wrote:
> 
> > The netfilter code had very good documentation: the Netfilter Hacking
> > HOWTO.  No one ever read it.
> > 
> > So this time I'm trying something different, using a bit of
> > Knuthiness.  Start with drivers/lguest/README.
> 
> um.
> 
> I'm OK with merging patches and given lguest's newness, the timestamp on
> these patches, the fact that they don't change code generation (right?) and
> my reluctance to carry large do-nothing patches for two months, I'd be OK
> with squeaking them into 2.6.23.

Indeed, no code changes, and I feel strongly that it should go into
2.6.23 because it's *fun*.   And (as often complained) there's not
enough poetry in the kernel.

> But I worry that you're proposing adding what appears to be new
> Documentation-related machinery and infrastructure when there's already
> increased activity in that area from other people and we might all be
> headed in different directions and stuff.

It does add an lguest-specific script: I aimed for the minimal
documentation script required to weave the code and documentation
(basically extracts and orders by comment prefix, because code order
isn't the same as optimal documentation order).
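
A rough sketch of the "extract and order by comment prefix" step, written in C purely for illustration. The tag syntax (a comment opening with a chapter letter, a colon and a sequence number, such as "/*H:010"), the chapter order "PGDLHSM", and the names doc_tag, parse_tag and sort_tags are all assumptions made for this example; the actual lguest script may work differently.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed chapter order, one letter per chapter of the narrative. */
static const char chapter_order[] = "PGDLHSM";

struct doc_tag {
	char chapter;	/* chapter letter, e.g. 'H' */
	int seq;	/* position within the chapter, e.g. 10 */
};

/* Parse a comment opener of the assumed tagged form.  Returns 1 on success. */
static int parse_tag(const char *line, struct doc_tag *tag)
{
	char c;
	int n;

	if (sscanf(line, "/*%c:%d", &c, &n) != 2)
		return 0;
	if (c == '\0' || !strchr(chapter_order, c))
		return 0;
	tag->chapter = c;
	tag->seq = n;
	return 1;
}

/* Order by chapter first, then by sequence number within the chapter. */
static int cmp_tag(const void *a, const void *b)
{
	const struct doc_tag *x = a, *y = b;
	const char *px = strchr(chapter_order, x->chapter);
	const char *py = strchr(chapter_order, y->chapter);

	assert(px && py);
	if (px != py)
		return px < py ? -1 : 1;
	return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Put tags into documentation order, independent of where they sit in the code. */
static void sort_tags(struct doc_tag *tags, size_t n)
{
	qsort(tags, n, sizeof(*tags), cmp_tag);
}
```

With tags parsed from "/*H:010", "/*P:020" and "/*P:010", sort_tags() yields P:10, P:20, H:10, so the reading order follows the narrative rather than the file layout.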

This is great for lguest, where the entire codebase is small and
self-contained enough to be woven into a narrative, and the maintainer
is prepared to put in the cycles to keep it up to date.

But writing this documentation took *weeks*, to document 5000 lines of
code.  Perhaps this effort, if merged, will inspire others, but I've
seen little indication so far: we have enough trouble getting them
documenting a single public function.

> IOW, I'd be interested in hearing Rob and Randy's opinions on it all,
> please.

So they can see what we're talking about, here's an example of the
output:

http://lguest.ozlabs.org/lguest-journey.c.bz2

Cheers,
Rusty.



Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-23 Thread Fengguang Wu
On Mon, Jul 23, 2007 at 12:40:09PM -0700, Andrew Morton wrote:
> On Mon, 23 Jul 2007 22:24:57 +0800
> Fengguang Wu <[EMAIL PROTECTED]> wrote:
> 
> > On Mon, Jul 23, 2007 at 07:00:59PM +1000, Nick Piggin wrote:
> > > Rusty Russell wrote:
> > > >On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> > > 
> > > >>So I opt for it being made tunable, safe, and turned off by default.
> > > 
> > > I hate tunables :) Unless we have workload A that gets a reasonable
> > > benefit from something and workload B that gets a significant regression,
> > > and no clear way to reconcile them...
> > 
> > Me too ;)
> > 
> > But sometimes we really want to avoid flushing the cache.
> > Andrew's user space LD_PRELOAD+fadvise based tool fit nicely here.
> 
> It's the only way to go in some situations.  Sometimes the kernel just
> cannot predict the future sufficiently well, and the costs of making a
> mistake are terribly high.  We need human help.  And it should be
> administration-time help, not programming-time help.

Agreed. I feel that drop-behind is not universally applicable.
Cost based reclaim sounds better, but who knows before field use ;)

> > > >I'd like to see it turned on by default in -mm, and try to come up with
> > > >some server-like workload to measure the effect.  Should be easy to
> > > >simulate something (eg. apache server, where clients grab some files in
> > > >preference, and apache server where clients grab different files).
> > > 
> > > I don't like this kind of conditional information going from something
> > > like readahead into page reclaim. Unless it is for readahead _specific_
> > > data such as "I got these all wrong, so you can reclaim them" (which
> > > this isn't).
> > > 
> > > Possibly it makes sense to realise that the given pages are cheaper
> > > to read back in as they are apparently being read-ahead very nicely.
> > 
> > In fact I have talked to Jens about it in last year's kernel summit.
> > The patch below explains itself.
> > ---
> > Subject: cost based page reclaim
> > 
> > Cost based page reclaim - a minimalist implementation.
> > 
> > > Suppose we cached 32 small files, each with 1 page, and one 32-page chunk
> > > from a large file.  Should we first drop the 32 pages that are read in one
> > > I/O, or drop the 32 distinct pages, each of which costs one I/O? (Given
> > > that the files are of equal hotness.)
> > > 
> > > Page replacement algorithms should be designed to minimize the number of
> > > I/Os, instead of the number of page faults. Dividing the cost of an I/O by
> > > the number of pages it brings in, we get the cost of the page. The bigger
> > > the page cost, the more 'lives/bloods' the page should have.
> > 
> > This patch adds life to costly pages by pretending that they are
> > referenced more times. Possible downsides:
> > - burdens the pressure of vmscan
> > - active pages are no longer that 'active'
> > 
> 
> This is all fun stuff, but how do we find out that changes like this are
> good ones, apart from shipping it and seeing who gets hurt 12 months later?

One thing I can imagine now is that the first pages may get more life
because of the conservative initial readahead size.

Generally file servers use sendfile(wholefile), so not a problem.
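
The arithmetic behind the quoted "cost of the page" idea can be written out as a small sketch. This is not the code from the patch: COST_SCALE, page_cost and ios_to_refill are names invented here, and the fixed-point scale of 1024 units per I/O is an arbitrary choice for the example.

```c
#include <assert.h>

#define COST_SCALE 1024u	/* fixed point: one I/O == 1024 cost units */

/* Per-page cost when a single I/O brought in nr_pages pages. */
static unsigned int page_cost(unsigned int nr_pages)
{
	assert(nr_pages > 0);
	return COST_SCALE / nr_pages;
}

/*
 * Number of I/Os needed to read back nr_pages dropped pages that
 * originally arrived pages_per_io at a time.
 */
static unsigned int ios_to_refill(unsigned int nr_pages,
				  unsigned int pages_per_io)
{
	assert(pages_per_io > 0);
	return (nr_pages + pages_per_io - 1) / pages_per_io;
}
```

For the example in the patch description, a page from a one-page file costs 1024 units while a page from the 32-page chunk costs 32. Dropping the chunk costs one I/O to read back (ios_to_refill(32, 32)), whereas dropping the 32 single-page files costs 32 I/Os (ios_to_refill(32, 1)).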

> > +#define log2(n) fls(n)
> 
> 

Thank you, this comment led to another patch :)
---
Subject: convert ill defined log2() to ilog2()

It's *wrong* to have
  #define log2(n) ffz(~(n))
It should be *reversed*:
  #define log2(n) flz(~(n))
or
  #define log2(n) fls(n)
or just use
  ilog2(n) defined in linux/log2.h.

This patch follows the last solution, recommended by Andrew Morton.

// Or are they simply misnamed, and in fact correct in behavior?

Cc: [EMAIL PROTECTED]
Cc: Mingming Cao <[EMAIL PROTECTED]>
Cc: Bjorn Helgaas <[EMAIL PROTECTED]>
Cc: Chris Ahna <[EMAIL PROTECTED]>
Cc: David Mosberger-Tang <[EMAIL PROTECTED]>
Cc: Kyle McMartin <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 drivers/char/agp/hp-agp.c |9 +++--
 drivers/char/agp/i460-agp.c   |5 ++---
 drivers/char/agp/parisc-agp.c |7 ++-
 fs/ext4/super.c   |6 ++
 4 files changed, 9 insertions(+), 18 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c
+++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c
@@ -14,15 +14,12 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
 #include "agp.h"
 
-#ifndef log2
-#define log2(x)ffz(~(x))
-#endif
-
 #define HP_ZX1_IOC_OFFSET  0x1000  /* ACPI reports SBA, we want IOC */
 
 /* HP ZX1 IOC registers */
@@ -256,7 +253,7 @@ hp_zx1_configure (void)
readl(hp->ioc_regs+HP_ZX1_IMASK);
writel(hp->iova_base|1, hp->ioc_regs+HP_ZX1_IBASE);
readl(hp->ioc_regs+HP_ZX1_IBASE);
-   writel(hp->iova_base|log2(HP_ZX1_IOVA_SIZE), 
hp->ioc_regs+HP_ZX1_PCOM);
+   writel(hp->iova_b

Re: [DRIVER SUBMISSION] DRBD wants to go mainline

2007-07-23 Thread Kyle Moffett
For the guys on netdev, would you please look at the
tcp_recvmsg-threading and TCP_NAGLE_CORK issues below and give opinions
on the best way to proceed?


One thing to remember, you don't necessarily have to merge every  
feature right away.  As long as the new code is configured "off" by  
default with an "(EXPERIMENTAL)" warning, you can start getting the  
core parts and the cleanups upstream before you have to resolve all  
the issues with low-latency, dynamic-tracing-frameworks, etc.


On Jul 23, 2007, at 09:32:02, Lars Ellenberg wrote:
> On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:
>>> +/* I don't remember why XCPU ...
>>> + * This is used to wake the asender,
>>> + * and to interrupt sending the sending task
>>> + * on disconnect.
>>> + */
>>> +#define DRBD_SIG SIGXCPU
>>
>> Don't use signals between kernel threads, use proper primitives
>> like notifiers and waitqueues, which means you should also
>> probably switch away from kernel_thread() to the kthread_*()
>> APIs.  Also you should fix this FIXME or remove it if it no longer
>> applies:-D.
>
> right.
> but how do I tell a network thread in tcp_recvmsg to stop early,
> without using signals?


I'm not really a kernel-networking guy, so I can't answer this  
definitively, but I'm pretty sure the problem has been solved in many  
network filesystems and such, so I've added a netdev CC.  The way I'd  
do it in userspace is with nonblocking IO and epoll(), that way I  
don't actually have to "stop" or "signal" the thread, I can just add  
a socket to epoll fd when I want to pay attention to it, and remove  
it from my epoll fd when I'm done with it.  I'd assume there's some  
equivalent way in kernelspace based around the "struct kiocb *iocb"  
and "int nonblock" parameters to the tcp_recvmsg() kernel function.
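
As an illustration of the userspace approach described above (and only that; it says nothing about how tcp_recvmsg() should be handled in the kernel), here is a minimal Linux sketch. The reader waits on an epoll fd that watches both its data fd and a separate "stop" eventfd, so another thread can end the wait without signals. The helper names watch_fd, request_stop and wait_for_data_or_stop are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

enum wait_result { WAIT_DATA, WAIT_STOP, WAIT_ERROR };

/* Start paying attention to fd; returns 0 on success. */
static int watch_fd(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Ask the waiting thread to stop by making the eventfd readable. */
static int request_stop(int stop_fd)
{
	uint64_t one = 1;

	return write(stop_fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Block until a watched fd is readable.  A pending stop request wins
 * over pending data, so the caller can bail out early.
 */
static enum wait_result wait_for_data_or_stop(int epfd, int stop_fd)
{
	struct epoll_event ev[2];
	int i, n;

	do {
		n = epoll_wait(epfd, ev, 2, -1);
	} while (n < 0 && errno == EINTR);

	if (n < 0)
		return WAIT_ERROR;
	assert(n > 0);	/* infinite timeout: at least one event */

	for (i = 0; i < n; i++)
		if (ev[i].data.fd == stop_fd)
			return WAIT_STOP;
	return WAIT_DATA;
}
```

The reader would loop on wait_for_data_or_stop(), doing a nonblocking read on WAIT_DATA and cleaning up on WAIT_STOP; the thread that wants it gone only has to call request_stop().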



>>> +/* see kernel/printk.c:printk_ratelimit
>>> + * macro, so it is easy do have independend rate limits at different locations
>>> + * "initializer element not constant ..." with kernel 2.4 :(
>>> + * so I initialize toks to something large
>>> + */
>>> +#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \
>>
>> Any particular reason you can't just use printk_ratelimit for this?
>
> I want to be able to do a rate-limit per specific message/code
> fragment, without affecting other messages or execution paths.


Ok, so could you change your patch to modify __printk_ratelimit() to  
also accept a "struct printk_rate" datastructure and make  
printk_ratelimit() call "__printk_ratelimit(&global_printk_rate);"??
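
A userspace sketch of what such a per-call-site rate limit could look like, assuming a simple token-bucket scheme. The struct layout and the function name rate_allow are hypothetical (the mail only proposes a "struct printk_rate", it does not define one), and the caller passes the current tick explicitly instead of reading jiffies.

```c
#include <assert.h>

/* Hypothetical per-call-site state; not an existing kernel type. */
struct printk_rate {
	unsigned long interval;	/* ticks that must accumulate per message */
	unsigned long burst;	/* maximum number of back-to-back messages */
	unsigned long toks;	/* accumulated ticks, capped at interval * burst */
	unsigned long last;	/* tick of the previous call */
	unsigned long missed;	/* messages suppressed since the last allowed one */
};

/* Returns 1 if the caller may print at tick `now`, 0 if it should stay quiet. */
static int rate_allow(struct printk_rate *r, unsigned long now)
{
	unsigned long cap = r->interval * r->burst;

	assert(r->interval > 0);

	/* Earn tokens for the time elapsed, up to the burst limit. */
	r->toks += now - r->last;
	r->last = now;
	if (r->toks > cap)
		r->toks = cap;

	/* Spend one interval's worth of tokens per message. */
	if (r->toks >= r->interval) {
		r->toks -= r->interval;
		r->missed = 0;
		return 1;
	}
	r->missed++;
	return 0;
}
```

Each message site keeps its own struct printk_rate, so a noisy site exhausts only its own bucket. A global limiter is then just one shared instance passed by every caller, which is the shape of the wrapper suggested above.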


Typically if $KERNEL_FEATURE is insufficient for your needs you
should fix $KERNEL_FEATURE instead of duplicating a replacement in
your driver.  This applies to basically all of the things I'm talking
about, kernel-threads, workqueues (BTW: I believe you can make your
own custom workqueue thread(s) instead of using the default
"events/*" ones), debugging macros, fault-insertion, integer math,
lock-checking, dynamic tracing, etc.  If you find some reason that some
generic code won't work for you, please try to fix it first so we can
all benefit from it.


>> Umm, how about fixing this to actually use proper workqueues or
>> something instead of this open-coded mess?
>
> unlikely to happen "right now".  but it is on our todo list...


Unfortunately problems like these need to be fixed before a mainline  
merge.  Merging duplicated code is a big no-no, and historically  
there have been problems with people who merge code and never  
properly maintain it once it's in tree.  As a result the rule is your  
code has to be easily maintainable before anybody will even  
*consider* merging it.



>>> +/* I want the packet to fit within one page
>>> + * THINK maybe use a special bitmap header,
>>> + * including offset and compression scheme and whatnot
>>> + * Do not use PAGE_SIZE here! Use a architecture agnostic constant!
>>> + */
>>> +#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof(long))
>>
>> Yuck.  Definitely use PAGE_SIZE here, so at least if it's broken
>> on an arch with multiple page sizes, somebody can grep for
>> PAGE_SIZE to fix it.  It also means that on archs/configs with 8k
>> or 64k pages you won't waste a bunch of memory.
>
> No. This is not to allocate anything, but defines the chunk size
> with which we transmit the bitmap, when we have to.  We need to be
> able to talk from one arch (say i586) to some other (say s390, or
> sparc, or whatever).  The receiving side has a one-page buffer,
> from which it may or may not do endian-conversion.  The hardcoded
> 4096 is the minimal common denominator here.


Ahhh.  Please replace the constant "4096" with:
/* This is the maximum amount of bitmap we will send per packet */
# define MAX_BITMAP_CHUNK_SIZE 4096
# define BM_PACKET_WORDS \
((MAX_BITMAP_CHUNK_SIZE - sizeof(struct Drbd_Header))/sizeof(long))

It's more text but dramatically improves the readability by  
eliminating more magic numbers.  This is a much milder case than I've  
seen in the past, so it's not that big of a deal.




+/* Dynamic tracing f

Re: [PATCH][08/37] Clean up duplicate includes in drivers/input/

2007-07-23 Thread Dmitry Torokhov
On Saturday 21 July 2007 11:02, Jesper Juhl wrote:
> Hi,
> 
> This patch cleans up duplicate includes in
>   drivers/input/
>

Applied to for-linus branch of input tree, thank you.
 
-- 
Dmitry


Re: [patch 1/3] m68k/mac: Make mac_hid_mouse_emulate_buttons() declaration visible

2007-07-23 Thread Dmitry Torokhov
On Sunday 22 July 2007 08:51, Geert Uytterhoeven wrote:
> On Sun, 22 Jul 2007, Dmitry Torokhov wrote:
> > On Saturday 21 July 2007 04:27, Geert Uytterhoeven wrote:
> > > On Fri, 20 Jul 2007, Dmitry Torokhov wrote:
> > > > I am OK with adding a new header file. I was just saying that placing
> > > > that declaration in linux/hid.h makes about the same sense as putting
> > > > it into linux/scsi.h
> > > 
> > > At first I just wanted to move it. Then I thought about the angry
> > > comments I would get about not moving it to a header file ;-)
> > > 
> > >  looked like the best candidate.  is
> > > another option.
> > > 
> > 
> > linux/kbd_kern.h sounds much better.
> 
> And so it will be.

Applied to 'for-linus' branch of input tree, thank you.

-- 
Dmitry

