Re: refcounting drivers' data structures used in sysfs buffers

2007-03-12 Thread Alan Stern
On Mon, 12 Mar 2007, Dmitry Torokhov wrote:

  Do you think Linus would listen if all three of us (plus maybe Greg) tried
  to convince him?
 
 
 If we'd accompany the argument with the patch that changes scsi to use
 wq to perform deletion so we don't have deadlock regression in the
 kernel, he might be more receptive...

I wrote that patch over the weekend but forgot to bring it in to work.  
I'll post it tonight or tomorrow.

  He is right about lifetime
 issues but this is not strictly lifetime issue as you correctly point
 out. Plus, refcounting also bloats the kernel, so I don't really want to
 use a refcount for every integer I happen to export through sysfs if I
 can simply revoke access.

Agreed.

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread Mike Galbraith
On Tue, 2007-03-13 at 07:38 +1100, Con Kolivas wrote:
 On Tuesday 13 March 2007 07:11, Mike Galbraith wrote:
 
  Killing the known corner case starvation scenarios is wonderful, but
  let's not just pretend that interactive tasks don't have any special
  requirements.
 
 Now you're really making a stretch of things. Where on earth did I say that 
 interactive tasks don't have special requirements? It's a fundamental feature 
 of this scheduler that I go to great pains to get them as low latency as 
 possible and their fair share of cpu despite having a completely fair cpu 
 distribution.

As soon as your cpu is fully utilized, fairness loses or interactivity
loses.  Pick one.

-Mike



Slab corruption - file_free_rcu ?

2007-03-12 Thread Ian McDonald

Folks,

I'm getting this sort of message in my logs on occasion and my system
dies on me some time later.

Mar 13 08:52:02 localhost kernel: [  343.931624] Slab corruption: start=d2756f04, len=208
Mar 13 08:52:02 localhost kernel: [  343.932366] Redzone: 0x5a2cf071/0x5a2cf071.
Mar 13 08:52:02 localhost kernel: [  343.932797] Last user: [<c0155562>](file_free_rcu+0xf/0x11)
Mar 13 08:52:02 localhost kernel: [  343.933429] 090: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 75 6b
Mar 13 08:52:02 localhost kernel: [  343.934225] 0a0: 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b 00 6b
Mar 13 08:52:02 localhost kernel: [  343.934999] 0b0: 6b 6b 6b 6b 6b 6b ad 6b 6b 6b 6b 6b 6b 6b 6b 6b
Mar 13 08:52:02 localhost kernel: [  343.935995] Prev obj: start=d2756e28, len=208
Mar 13 08:52:02 localhost kernel: [  343.936740] Redzone: 0x5a2cf071/0x5a2cf071.
Mar 13 08:52:02 localhost kernel: [  343.937182] Last user: [<c0155562>](file_free_rcu+0xf/0x11)
Mar 13 08:52:02 localhost kernel: [  343.937682] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Mar 13 08:52:02 localhost kernel: [  343.938473] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Mar 13 09:06:37 localhost kernel: klogd 1.4.1#20, log source = /proc/kmsg started.

Kernel is Linus' git tree as of 3 days ago (i.e. post 2.6.21-rc3). I do
have some DCCP changes in there but those modules were not loaded at
the time.

I've had a quick look through lkml archives and can't find anything on
this in last few days. Apologies if I've missed something.

I'm not sure how long this has been occurring. I have been having slab
corruption on earlier kernels but they did not put an identifier on
last usage. This may have been issues with my Broadcom 4306 card as
this used to have lots of errors as well, until more recent kernels
where stability on that is much better.
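With slab debugging enabled, freed objects are filled with 0x6b (POISON_FREE in include/linux/poison.h), so any other byte in a dump like the one above marks a write after free. A small userspace sketch for scanning such a dump; report_poison_damage() is a hypothetical helper, and it deliberately ignores the POISON_END byte (0xa5) that terminates each real object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Byte the kernel's slab debugging writes into freed objects. */
#define POISON_FREE 0x6b

/* Print each offset whose byte is no longer the free poison and return
 * how many were found; every hit is a store into already-freed memory.
 * 'base' is the offset of obj[0] within the slab object. */
static size_t report_poison_damage(const unsigned char *obj, size_t base,
				   size_t len)
{
	size_t i, hits = 0;

	assert(obj != NULL || len == 0);
	for (i = 0; i < len; i++) {
		if (obj[i] != POISON_FREE) {
			printf("offset 0x%03zx: 0x%02x (expected 0x%02x)\n",
			       base + i, (unsigned)obj[i], (unsigned)POISON_FREE);
			hits++;
		}
	}
	return hits;
}
```

Fed the three damaged rows from the log (starting at offset 0x90), it reports four bytes at offsets 0x09e, 0x0a6, 0x0ae and 0x0b6. Since "Last user" points at file_free_rcu(), the freed object was probably a struct file that something kept writing to.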

My config is attached.

Please cc me on any queries as I'm not subscribed to lkml.

Regards,

Ian
--
Web: http://wand.net.nz/~iam4
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group


config.gz
Description: GNU Zip compressed data


Re: [PATCH] x86_64, i386: Add command line length to boot protocol

2007-03-12 Thread Dave Jones
On Mon, Mar 12, 2007 at 10:43:52AM +, Pavel Machek wrote:
  On Tue 2007-03-06 13:21:34, Dave Jones wrote:
   On Tue, Mar 06, 2007 at 07:14:30PM +0100, Bernhard Walle wrote:
   
 +cmdline_size:   .long   COMMAND_LINE_SIZE-1 #length of the command 
   line,
   
   Why a long? It's unlikely that someone is going to have a command line
   bigger than 0xffff.
  
  Well, I could imagine overflowing that. Describing your numa setup,
  excluding few bad bits of ram using memmap=exact, set up your boot
   over iscsi on cmdline these are likely to eat an insane amount of
  cmdline space.

65535 characters? Are you for real?
Stop and think about just how big that is. If you have to create
a boot command line that long, you have serious, serious issues.

Dave

-- 
http://www.codemonkey.org.uk


Re: [PATCH] Software Suspend: Fix suspend when console is in VT_AUTO+KD_GRAPHICS mode

2007-03-12 Thread Pavel Machek
Hi!

 When the console is in VT_AUTO+KD_GRAPHICS mode, switching to the
 SUSPEND_CONSOLE fails, resulting in vt_waitactive() waiting indefinitely
 or until the task is interrupted.  This patch tests if a console switch
 can occur in set_console() and returns early if a console switch is not
 possible.

 Signed-off-by: Andrew Johnson [EMAIL PROTECTED]

ACK. (I hope it still applies to latest mainline).
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCH] MPT FUSION: Delete unused header files.

2007-03-12 Thread David Miller
From: Moore, Eric [EMAIL PROTECTED]
Date: Mon, 12 Mar 2007 10:19:18 -0600

 Valdis.Kletnieks silly little rant: 
 
  Certainly appropriate content for something on your website, 
  and vendors who
  provide programs like dmidecode and parsemce are always 
  welcome. I could
  probably be convinced that such info should have at least a 
  pointer somewhere
  in Documentation/lsi_debug.txt or some such.  But quite 
  frankly, if I'm reduced
  to wading through *.h files to figure out what some 
  recalcitrant hardware is
  upset about, there's been a failure in documentation.  
  *ESPECIALLY* if I
  go look at drivers/whatever/source.c and it doesn't even 
  *reference* the *.h
  file in question.
 
 
 It's apparent to me that you don't have our hardware, nor have you
 actually waded through this driver source code.
 
 If you had, you would have noticed that the header you want to delete
 is actually referenced in the *.c source code. The file mpi_log_fc.h
 is indeed mentioned in mptbase.c, in the function called
 mpt_fc_log_info, in the documentation section above the function. This
 header file is very helpful to those supporting our hardware, and those
 using it.
 For SAS (mpi_log_sas.h), I have broken out each loginfo into the strings
 you will find defined in originator_str, iop_code_str, pl_code_str, etc.
 I probably do that with fibre.
 
 If it's that important to you to have the header files included, I will
 provide a patch that does that.

If you're going to include it just for the sake of including it, not
because the code in question actually uses types or function
declarations defined in there, don't bother, you're just using an
anti-social mechanism to keep this header file in the tree.

Please, let's kill this header file if it is unused.


Re: [PATCH 1/2] mm: move common segment checks to separate helper function (v6)

2007-03-12 Thread Dmitriy Monakhov
Nick Piggin [EMAIL PROTECTED] writes:

 On Mon, Mar 12, 2007 at 10:57:53AM +0300, Dmitriy Monakhov wrote:
 I really don't want to be annoying by sending this patchset over and over
 again. If anyone thinks this patch is really crappy, please comment on what
 exactly is bad. Thank you.

 Doesn't seem like a bad idea.

 
 Changes:
   - patch was split in two patches.

 +/*
 + * Performs necessary checks before doing a write
 + *
 + * Adjust number of segments and amount of bytes to write.
 + * Returns appropriate error code that caller should return or
 + * zero in case that write should be allowed.
 + */
 +inline int generic_segment_checks(const struct iovec *iov,
 +unsigned long *nr_segs, size_t *count,
 +unsigned long access_flags)

 Make it static and not inline, and the compiler will work it out.
Wow, I've just carefully checked and found more functions with duplicated code:
fs/xfs/linux-2.6/xfs_lrw.c:655 xfs_write()
fs/ntfs/file.c:2339 ntfs_file_aio_write_nolock()
So I think nobody will object to exporting generic_segment_checks()
and removing the duplicated code.

 This function name doesn't really imply that it returns you the
 nr_segs and count, but that's not a big deal I guess.

 You also don't say that nr_segs should be initialised to the amount
 you wish to write, while count must be initialised to zero.

 +{
 +unsigned long   seg;
 +for (seg = 0; seg < *nr_segs; seg++) {
 +const struct iovec *iv = &iov[seg];
 +
 +/*
 + * If any segment has a negative length, or the cumulative
 + * length ever wraps negative then return -EINVAL.
 + */
 +*count += iv->iov_len;
 +if (unlikely((ssize_t)(*count|iv->iov_len) < 0))
 +return -EINVAL;
 +if (access_ok(access_flags, iv->iov_base, iv->iov_len))
 +continue;

 Why not invert the above test, and put the below statements inside the
 branch? OTOH, that makes it less obviously cp from the others. Maybe
 a subsequent patch.

 +if (seg == 0)
 +return -EFAULT;
 +*nr_segs = seg;
 +*count -= iv->iov_len;  /* This segment is no good */
 +break;
 +}


 You could assign to *count here, once, and remove the requirement
 that the caller initialised it to zero?

 +return 0;
 +}
 +
  /**
   * generic_file_aio_read - generic filesystem read routine
   * @iocb:   kernel I/O control block
 @@ -1180,24 +1213,9 @@ generic_file_aio_read(struct kiocb *iocb, const 
 struct iovec *iov,
 loff_t *ppos = &iocb->ki_pos;
 
 count = 0;
 -for (seg = 0; seg < nr_segs; seg++) {
 -const struct iovec *iv = &iov[seg];
 -
 -/*
 - * If any segment has a negative length, or the cumulative
 - * length ever wraps negative then return -EINVAL.
 - */
 -count += iv->iov_len;
 -if (unlikely((ssize_t)(count|iv->iov_len) < 0))
 -return -EINVAL;
 -if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len))
 -continue;
 -if (seg == 0)
 -return -EFAULT;
 -nr_segs = seg;
 -count -= iv->iov_len;   /* This segment is no good */
 -break;
 -}
 +retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
 +if (retval)
 +return retval;
  
  /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
 if (filp->f_flags & O_DIRECT) {
 @@ -2094,30 +2112,14 @@ __generic_file_aio_write_nolock(struct kiocb *iocb, 
 const struct iovec *iov,
  size_t ocount;  /* original count */
  size_t count;   /* after file limit checks */
 struct inode *inode = mapping->host;
 -unsigned long   seg;
  loff_t  pos;
  ssize_t written;
  ssize_t err;
  
  ocount = 0;
 -for (seg = 0; seg < nr_segs; seg++) {
 -const struct iovec *iv = &iov[seg];
 -
 -/*
 - * If any segment has a negative length, or the cumulative
 - * length ever wraps negative then return -EINVAL.
 - */
 -ocount += iv->iov_len;
 -if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
 -return -EINVAL;
 -if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
 -continue;
 -if (seg == 0)
 -return -EFAULT;
 -nr_segs = seg;
 -ocount -= iv->iov_len;  /* This segment is no good */
 -break;
 -}
 +err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
 +if (err)
 +return err;
  
  count = ocount;
  pos = *ppos;
 -- 
 1.5.0.1
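A userspace sketch of the loop under review, written the way Nick suggests: accumulate into a local and assign *count once at the end, so the caller does not have to zero it first. Assumptions: range_ok() is a hypothetical stand-in for the kernel's access_ok(), and the access_flags argument is dropped because userspace has no VERIFY_READ/VERIFY_WRITE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for access_ok(): reject a NULL base with a
 * non-zero length. The real check is against the user address limit. */
static int range_ok(const void *base, size_t len)
{
	return base != NULL || len == 0;
}

/* Validate each iovec, trim *nr_segs at the first inaccessible segment
 * (unless it is the first one), and store the total byte count. */
static int segment_checks(const struct iovec *iov, unsigned long *nr_segs,
			  size_t *count)
{
	unsigned long seg;
	size_t cnt = 0;

	assert(iov != NULL && nr_segs != NULL && count != NULL);
	for (seg = 0; seg < *nr_segs; seg++) {
		const struct iovec *iv = &iov[seg];

		/* A negative length, or a running total that wraps
		 * negative, is invalid. */
		cnt += iv->iov_len;
		if ((ssize_t)(cnt | iv->iov_len) < 0)
			return -EINVAL;
		if (range_ok(iv->iov_base, iv->iov_len))
			continue;
		if (seg == 0)
			return -EFAULT;
		*nr_segs = seg;
		cnt -= iv->iov_len; /* this segment is no good */
		break;
	}
	*count = cnt;
	return 0;
}
```

A bad segment after the first one shortens the request instead of failing it, which is why both *nr_segs and *count are outputs.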
 

[PATCH 2/2] incorrect direct io error handling (v7)

2007-03-12 Thread Dmitriy Monakhov
Changes against v6:
 - Handle direct_io failure inside generic_file_direct_write(), as
   recommended by Andrew (during the v1 discussion) and by Nick (during
   the v6 discussion).
 - Change comments to make them clearer.
 - Re-check that __generic_file_aio_write_nolock() is always called
   under i_mutex for non-blkdev files.
Tested with: fsstress, manual direct_io tests

Log:
If generic_file_direct_write() fails (e.g. on an ENOSPC condition) inside
__generic_file_aio_write_nolock(), it may have instantiated a few blocks
outside i_size, and fsck will then complain about a wrong i_size
(ext2, ext3 and reiserfs treat a mismatch between i_size and the biggest
allocated block as an error). After fsck fixes the error, i_size is
increased to cover the biggest block, but these blocks contain garbage
from the previous write attempt. This is not an information leak, but it
is silent file data corruption. The issue affects filesystems regardless
of blocksize or pagesize, and of course only non-blkdev files.
We need to truncate any blocks beyond i_size after the write has failed,
as is done in the similar generic_file_buffered_write() error path. We
may safely call vmtruncate() here because i_mutex is always held for
non-blkdev files.
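The cleanup rule the patch adds can be restated as a small predicate. This is a hypothetical userspace restatement for illustration (must_trim_after_dio() is not a kernel function), assuming the comparison in the patch is "pos + count > isize".

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Trim back to the old i_size only when a direct write to a regular
 * (non-block-device) file failed and the attempted range reached past
 * the old end of file. */
static bool must_trim_after_dio(ssize_t written, bool is_blkdev,
				off_t pos, size_t count, off_t isize)
{
	assert(pos >= 0 && isize >= 0);
	if (written >= 0 || is_blkdev)
		return false; /* success, or a block device: nothing to trim */
	return pos + (off_t)count > isize;
}
```

For the test case below (a 100 MiB O_DIRECT write at offset 0 of an empty file failing with ENOSPC), the predicate is true, so the blocks instantiated past i_size = 0 would be truncated away.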
 
TEST_CASE:
open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aaa"..., 104857600) = -1 ENOSPC (No space left on device)

#stat /mnt/test/BIG_FILE
  File: `/mnt/test/BIG_FILE'
  Size: 0   Blocks: 110896 IO Block: 1024   regular empty file
file size is less than biggest block idx

Device: fe07h/65031d    Inode: 14  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2007-01-24 20:03:38.000000000 +0300
Modify: 2007-01-24 20:03:38.000000000 +0300
Change: 2007-01-24 20:03:39.000000000 +0300

#fsck.ext3 -f /dev/VG/test 
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Inode 14, i_size is 0, should be 56556544.  Fix<y>? yes
Pass 2: Checking directory structure

 

Signed-off-by: Monakhov Dmitriy [EMAIL PROTECTED]
---
 mm/filemap.c |   28 
 1 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8bd1ea4..95d49fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1932,8 +1932,10 @@ generic_file_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
+* i_mutex is held in case of DIO_LOCKING, which protects 
+* generic_osync_inode() from livelocking. If it is not held, then 
+* the filesystem must prevent this livelock. AIO O_DIRECT ops 
+* attempt to sync metadata here.
 */
if ((written >= 0 || written == -EIOCBQUEUED) &&
((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
@@ -2155,8 +2157,26 @@ __generic_file_aio_write_nolock(struct kiocb *iocb, 
const struct iovec *iov,
loff_t endbyte;
ssize_t written_buffered;
 
+   /*
+* In case of non blockdev we may fail to buffered I/O.
+* So i_mutex must be held.
+*/
+   if (!S_ISBLK(inode->i_mode))
+   BUG_ON(!mutex_is_locked(&inode->i_mutex));
+
written = generic_file_direct_write(iocb, iov, nr_segs, pos,
ppos, count, ocount);
+   /*
+* If host is not S_ISBLK generic_file_direct_write() may 
+* have instantiated a few blocks outside i_size  files
+* Trim these off again.
+*/
+   if (unlikely(written < 0) && !S_ISBLK(inode->i_mode)) {
+   loff_t isize = i_size_read(inode);
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
+   }
+
if (written < 0 || written == count)
goto out;
/*
@@ -2261,8 +2281,8 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const 
struct iovec *iov,
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
- * went wrong during pagecache shootdown.
+ * Called under i_mutex for writes to S_ISREG files in case of DIO_LOCKING.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-- 
1.5.0.1




[PATCH 1/2] mm: move common segment checks to separate helper function (v7)

2007-03-12 Thread Dmitriy Monakhov
Changes against v6
 - remove duplicated code from xfs, ntfs
 - export generic_segment_checks, because it is used by xfs and ntfs now.
 - change argument initialization policy according to Nick's comments.

Tested with: ltp readv/writev tests

Signed-off-by: Monakhov Dmitriy [EMAIL PROTECTED]
---
 fs/ntfs/file.c |   21 ++-
 fs/xfs/linux-2.6/xfs_lrw.c |   22 ++--
 include/linux/fs.h |3 ++
 mm/filemap.c   |   83 ---
 4 files changed, 55 insertions(+), 74 deletions(-)

diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index dbbac55..621de36 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2129,28 +2129,13 @@ static ssize_t ntfs_file_aio_write_nolock(struct kiocb 
*iocb,
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
loff_t pos;
-   unsigned long seg;
size_t count;   /* after file limit checks */
ssize_t written, err;
 
count = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   count += iv->iov_len;
-   if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-   continue;
-   if (!seg)
-   return -EFAULT;
-   nr_segs = seg;
-   count -= iv->iov_len;   /* This segment is no good */
-   break;
-   }
+   err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
+   if (err)
+   return err;
pos = *ppos;
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim. */
diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index ff8d64e..558076d 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -639,7 +639,6 @@ xfs_write(
xfs_fsize_t isize, new_size;
xfs_iocore_t*io;
bhv_vnode_t *vp;
-   unsigned long   seg;
int iolock;
int eventsent = 0;
bhv_vrwlock_t   locktype;
@@ -652,24 +651,9 @@ xfs_write(
vp = BHV_TO_VNODE(bdp);
xip = XFS_BHVTOI(bdp);
 
-   for (seg = 0; seg < segs; seg++) {
-   const struct iovec *iv = &iovp[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   ocount += iv->iov_len;
-   if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   segs = seg;
-   ocount -= iv->iov_len;  /* This segment is no good */
-   break;
-   }
+   error = generic_segment_checks(iovp, &segs, &ocount, VERIFY_READ);
+   if (error)
+   return error;
 
count = ocount;
pos = *offset;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6a3d22e..3b99450 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1778,6 +1778,9 @@ extern ssize_t generic_file_sendfile(struct file *, 
loff_t *, size_t, read_actor
 extern void do_generic_mapping_read(struct address_space *mapping,
struct file_ra_state *, struct file *,
loff_t *, read_descriptor_t *, 
read_actor_t);
+extern int generic_segment_checks(const struct iovec *iov,
+   unsigned long *nr_segs, size_t *count,
+   unsigned long access_flags);
 
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
diff --git a/mm/filemap.c b/mm/filemap.c
index 8e1849a..8bd1ea4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1159,6 +1159,46 @@ success:
return size;
 }
 
+/*
+ * Performs necessary checks before doing a write
+ * @iov:   io vector request
+ * @nr_segs:   number of segments in the iovec
+ * @count: number of bytes to write
+ * @access_flags: type of access: %VERIFY_READ or %VERIFY_WRITE
+ *
+ * Adjust number of segments and amount of bytes to write (nr_segs should be
+ * properly initialized first). Returns appropriate error code that caller 
+ * should return or zero in case that write should be allowed.
+ */
+int generic_segment_checks(const struct iovec *iov,
+   unsigned long *nr_segs, size_t *count,
+   unsigned long access_flags)
+{
+   unsigned 

Re: [ckrm-tech] [PATCH 0/2] resource control file system - aka containers on top of nsproxy!

2007-03-12 Thread Serge E. Hallyn
Quoting Srivatsa Vaddagiri ([EMAIL PROTECTED]):
 On Fri, Mar 09, 2007 at 02:09:35PM -0800, Paul Menage wrote:
   3. This next leads me to think that a 'tasks' file in each directory doesn't
  make sense for containers. In fact it can lend itself to error situations (by
  administrator/script mistake) when some tasks of a container are in one
  resource class while others are in a different class.
  
   Instead, from a container's pov, it may be useful to write
   a 'container id' (if such a thing exists) into the tasks file
   which will move all the tasks of the container into
   the new resource class. This is the same requirement we
   discussed long back of moving all threads of a process into new
   resource class.
  
  I think you need to give a more concrete example and use case of what
  you're trying to propose here. I don't really see what advantage
  you're getting.
 
 Ok, this is what I had in mind:
 
 
   mount -t container -o ns /dev/namespace
   mount -t container -o cpu /dev/cpu
 
 Let's say we have the namespaces/resource-groups created as follows:
 
   /dev/namespace
   |-- prof
   ||- tasks - (T1, T2)
   ||- container_id - 1 (doesnt exist today perhaps)
   |
   |-- student
   ||- tasks - (T3, T4)
   ||- container_id - 2 (doesnt exist today perhaps)
 
   /dev/cpu
  |-- prof
  ||-- tasks
  ||-- cpu_limit (40%)
  |
  |-- student
  ||-- tasks
  ||-- cpu_limit (20%)
  |
  |
 
 
 Is it possible to create the above structure in container patches? 
 /me thinks so.
 
 If so, then accidentally someone can do this:
 
   echo T1 > /dev/cpu/prof/tasks
   echo T2 > /dev/cpu/student/tasks
 
 with the result that tasks of the same container are now in different
 resource classes.

What's wrong with that?

 That's why, in the case of containers, I felt we shouldn't allow individual
 tasks to be cat'ed to the tasks file.
 
 Or rather, it may be nice to say :
 
   echo cid 2 > /dev/cpu/prof/tasks
 
 and have all tasks belonging to container id 2 move to the new resource
 group.

Adding that feature sounds fine, but don't go stopping me from putting
T1 into /dev/cpu/prof/tasks and T2 into /dev/cpu/student/tasks just
because you have your own notion of what each task is supposed to be.
Just because they're in the same namespaces doesn't mean they should get
the same resource allocations.  If you want to add that kind of policy,
well, it should be policy - user-definable.

-serge
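A toy model of the two operations being debated, assuming tasks carry a container id and a cpu group: per-task moves stay legal (so T1 and T2 may end up in different groups), and a "cid" bulk move is an optional extra on top. struct task, move_task() and move_container() are hypothetical names, not part of the container patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one task in two hierarchies. */
struct task {
	int pid;
	int container_id; /* namespace container, e.g. 1 = prof, 2 = student */
	int cpu_group;    /* group in the separately mounted cpu hierarchy */
};

/* Per-task move, like writing one pid into a group's tasks file. */
static int move_task(struct task *tasks, size_t n, int pid, int group)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tasks[i].pid == pid) {
			tasks[i].cpu_group = group;
			return 0;
		}
	}
	return -1; /* no such task */
}

/* Optional bulk move, like the proposed "echo cid N > tasks": moves every
 * task of one container and returns how many were moved. */
static size_t move_container(struct task *tasks, size_t n, int cid, int group)
{
	size_t i, moved = 0;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (tasks[i].container_id == cid) {
			tasks[i].cpu_group = group;
			moved++;
		}
	}
	return moved;
}
```

Keeping both operations leaves "same container implies same resource group" as a policy choice for userspace rather than a kernel rule.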


Re: [ckrm-tech] [PATCH 0/2] resource control file system - aka containers on top of nsproxy!

2007-03-12 Thread Srivatsa Vaddagiri
On Mon, Mar 12, 2007 at 10:56:43AM -0500, Serge E. Hallyn wrote:
 What's wrong with that?

I had been asking around on what is the fundamental unit of res mgmt
for vservers and the answer I got (from Herbert) was all tasks that are
in the same pid namespace. From what you are saying above, it seems to
be that there is no such fundamental unit. It can be a random mixture
of tasks (taken across vservers) whose resource consumption needs to be
controlled. Is that correct?

  echo cid 2 > /dev/cpu/prof/tasks
 
 Adding that feature sounds fine, 

OK, yes, that can be an optional feature.

-- 
Regards,
vatsa


Re: [ckrm-tech] [PATCH 0/2] resource control file system - aka containers on top of nsproxy!

2007-03-12 Thread Serge E. Hallyn
Quoting Srivatsa Vaddagiri ([EMAIL PROTECTED]):
 On Mon, Mar 12, 2007 at 10:56:43AM -0500, Serge E. Hallyn wrote:
  What's wrong with that?
 
 I had been asking around on what is the fundamental unit of res mgmt
 for vservers and the answer I got (from Herbert) was all tasks that are
 in the same pid namespace. From what you are saying above, it seems to
 be that there is no such fundamental unit. It can be a random mixture
 of tasks (taken across vservers) whose resource consumption needs to be
 controlled. Is that correct?

If I'm reading it right, yes.

If for vservers the fundamental unit of res mgmt is a vserver, that can
surely be done at a higher level than in the kernel.

Actually, these could be tied just by doing

mount -t container -o ns,cpuset /containers

So now any task in /containers/vserver1 or any subdirectory thereof
would have the same cpuset constraints as /containers.  OTOH, you could
mount them separately

mount -t container -o ns /nsproxy
mount -t container -o cpuset /cpuset

and now you have the freedom to split tasks in the same vserver
(under /nsproxy/vserver1) into different cpusets.

-serge

 echo cid 2 > /dev/cpu/prof/tasks
  
  Adding that feature sounds fine, 
 
 OK, yes, that can be an optional feature.
 
 -- 
 Regards,
 vatsa


Re: [ckrm-tech] [PATCH 0/2] resource control file system - aka containers on top of nsproxy!

2007-03-12 Thread Paul Jackson
vatsa wrote:
 This assumes that you can see the global vfs namespace right?
 
 What if you are inside a container/vserver which restricts your vfs
 namespace? i.e /dev/cpusets seen from one container is not same as what
 is seen from another container .

Well, yes.  But that restriction on the namespace is no doing of
cpusets.

It's some vfs namespace restriction, which should be an orthogonal
mechanism.

Well, it's probably not orthogonal at present.  Cpusets might not yet
handle a restricted vfs name space very well.

For example the /proc/pid/cpuset path, giving path below /dev/cpuset
of task pid's cpuset, might not be restricted.  And the set of all CPUs
and Memory Nodes that are online, which is visible in various /proc
files, and also visible in one's top cpuset, might be inconsistent if
restricted vfs namespace mapped you to a different top cpuset.

There are probably other loose ends as well.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.925.600.0401


Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code

2007-03-12 Thread Srivatsa Vaddagiri
On Sun, Mar 11, 2007 at 12:38:43PM -0700, Paul Jackson wrote:
 The primary reason for the cpuset double locking, as I recall, was because
 cpusets needs to access cpusets inside the memory allocator.  

"needs to access cpusets" - can you be more specific?

Being able to safely walk the cpuset->parent list - is it the only access
you are talking of or more?

-- 
Regards,
vatsa


Re: Keyboard stops working after *lock [Was: 2.6.21-rc2-mm1]

2007-03-12 Thread Jiri Kosina
(trimmed CC list a bit)

On Mon, 12 Mar 2007, Jiri Slaby wrote:

UHCI: Eliminate asynchronous skeleton Queue Headers
  Post it along with the usbmon log, and I'll try to figure out what happened.
 Here it comes:
 USBMON:
 f7525b40 1832950485 C Ii:004:01 0 8 = 53000000 00000000
 f7525b40 1832950517 S Ii:004:01 -115 8 
 f7525140 1832950540 S Co:004:00 s 21 09 0200 0000 0001 1 = 01
 f7525140 1832952485 C Co:004:00 0 1 
 Corresponds to numlock; 7; numlock; 7.

Jiri,

thanks. Could you also please redo the test with the offending uhci patch 
reverted and send the output of a working situation?

Thanks,

-- 
Jiri Kosina


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread Serge Belyshev
Mike Galbraith [EMAIL PROTECTED] writes:

[snip]
 And let's not lose sight of things with this one testcase.
 
 RSDL fixes
 - every starvation case
 - all fairness issues
 - is better 95% of the time on the desktop

 I don't know where you got that 95% number from.  For the most part, the
 existing scheduler does well.  If it sucked 95% of the time, it would
 have been shredded a long time ago.


I tell you.

http://article.gmane.org/gmane.linux.kernel/500027
http://article.gmane.org/gmane.linux.kernel/502996
http://article.gmane.org/gmane.linux.kernel/500119
http://article.gmane.org/gmane.linux.kernel/500784
http://article.gmane.org/gmane.linux.kernel/500768
http://article.gmane.org/gmane.linux.kernel/502255
http://article.gmane.org/gmane.linux.kernel/502282
http://article.gmane.org/gmane.linux.kernel/503650
http://article.gmane.org/gmane.linux.kernel/503695
http://article.gmane.org/gmane.linux.kernel.ck/6512
http://article.gmane.org/gmane.linux.kernel.ck/6539
http://article.gmane.org/gmane.linux.kernel.ck/6565


Also, count my email too.
I'm using RSDL since day one on my laptop and my router/compute server
and I won't go back to mainline, needless to say why.


[PATCH] mtd: PMC MSP71xx flash/rootfs mappings

2007-03-12 Thread Marc St-Jean
[PATCH] mtd: PMC MSP71xx flash/rootfs mappings

Patch to add flash and rootfs mappings for the PMC-Sierra
MSP71xx devices.

This patch references some platform support files previously
submitted to the [EMAIL PROTECTED] list.

Thanks,
Marc

Signed-off-by: Marc St-Jean [EMAIL PROTECTED]
---
This patch was first posted on Feb. 23rd. I didn't receive any
feedback but I'm reposting based on feedback to other patches.

If this is no longer the maintainer address please let me know.

Changes:
-Cleanup on style and formatting for comments, macros, etc.

 Kconfig  |   33 +
 Makefile |2 
 pmcmsp-flash.c   |  184 +++
 pmcmsp-ramroot.c |  105 +++
 4 files changed, 324 insertions(+)

diff --git a/drivers/mtd/maps/Kconfig b/drivers/mtd/maps/Kconfig
index bbf0553..e28a1ad 100644
--- a/drivers/mtd/maps/Kconfig
+++ b/drivers/mtd/maps/Kconfig
@@ -69,6 +69,39 @@ config MTD_PHYSMAP_OF
  physically into the CPU's memory. The mapping description here is
  taken from OF device tree.
 
+config MTD_PMC_MSP_EVM
+   tristate "CFI Flash device mapped on PMC-Sierra MSP"
+   depends on PMC_MSP && MTD_CFI
+   select MTD_PARTITIONS
+   help
+     This provides a 'mapping' driver which supports the way
+     in which user-programmable flash chips are connected on the
+     PMC-Sierra MSP eval/demo boards.
+
+choice
+   prompt "Maximum mappable memory available for flash IO"
+   depends on MTD_PMC_MSP_EVM
+   default MSP_FLASH_MAP_LIMIT_32M
+
+config MSP_FLASH_MAP_LIMIT_32M
+   bool "32M"
+
+endchoice
+
+config MSP_FLASH_MAP_LIMIT
+   hex
+   default 0x02000000
+   depends on MSP_FLASH_MAP_LIMIT_32M
+
+config MTD_PMC_MSP_RAMROOT
+   tristate "Embedded RAM block device for root on PMC-Sierra MSP"
+   depends on PMC_MSP_EMBEDDED_ROOTFS && \
+   (MTD_BLOCK || MTD_BLOCK_RO) && \
+   MTD_RAM
+   help
+     This provides support for the embedded root file system
+     on PMC MSP devices.  This memory is mapped as an MTD block device.
+
 config MTD_SUN_UFLASH
    tristate "Sun Microsystems userflash support"
    depends on SPARC && MTD_CFI
diff --git a/drivers/mtd/maps/Makefile b/drivers/mtd/maps/Makefile
index 071d0bf..de036c5 100644
--- a/drivers/mtd/maps/Makefile
+++ b/drivers/mtd/maps/Makefile
@@ -27,6 +27,8 @@ obj-$(CONFIG_MTD_CEIVA)   += ceiva.o
 obj-$(CONFIG_MTD_OCTAGON)  += octagon-5066.o
 obj-$(CONFIG_MTD_PHYSMAP)  += physmap.o
 obj-$(CONFIG_MTD_PHYSMAP_OF)   += physmap_of.o
+obj-$(CONFIG_MTD_PMC_MSP_EVM)   += pmcmsp-flash.o
+obj-$(CONFIG_MTD_PMC_MSP_RAMROOT) += pmcmsp-ramroot.o
 obj-$(CONFIG_MTD_PNC2000)  += pnc2000.o
 obj-$(CONFIG_MTD_PCMCIA)   += pcmciamtd.o
 obj-$(CONFIG_MTD_RPXLITE)  += rpxlite.o
diff --git a/drivers/mtd/maps/pmcmsp-flash.c b/drivers/mtd/maps/pmcmsp-flash.c
new file mode 100644
index 0000000..24cd8c0
--- /dev/null
+++ b/drivers/mtd/maps/pmcmsp-flash.c
@@ -0,0 +1,184 @@
+/*
+ * Mapping of a custom board with both AMD CFI and JEDEC flash in partitions.
+ * Config with both CFI and JEDEC device support.
+ *
+ * Basically physmap.c with the addition of partitions and
+ * an array of mapping info to accommodate more than one flash type per board.
+ *
+ * Copyright 2005-2007 PMC-Sierra, Inc.
+ *
+ *  This program is free software; you can redistribute  it and/or modify it
+ *  under  the terms of  the GNU General  Public License as published by the
+ *  Free Software Foundation;  either version 2 of the  License, or (at your
+ *  option) any later version.
+ *
+ *  THIS  SOFTWARE  IS PROVIDED   ``AS  IS'' AND   ANY  EXPRESS OR IMPLIED
+ *  WARRANTIES,   INCLUDING, BUT NOT  LIMITED  TO, THE IMPLIED WARRANTIES OF
+ *  MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN
+ *  NO  EVENT  SHALL   THE AUTHOR  BE LIABLE FOR ANY   DIRECT, INDIRECT,
+ *  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ *  NOT LIMITED   TO, PROCUREMENT OF  SUBSTITUTE GOODS  OR SERVICES; LOSS OF
+ *  USE, DATA,  OR PROFITS; OR  BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ *  ANY THEORY OF LIABILITY, WHETHER IN  CONTRACT, STRICT LIABILITY, OR TORT
+ *  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ *  THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ *  You should have received a copy of the  GNU General Public License along
+ *  with this program; if not, write  to the Free Software Foundation, Inc.,
+ *  675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/mtd/mtd.h>
+#include <linux/mtd/map.h>
+#include <linux/mtd/partitions.h>
+
+#include <asm/io.h>
+
+#include <msp_prom.h>
+#include <msp_regs.h>
+
+
+static struct mtd_info **msp_flash;
+static struct mtd_partition **msp_parts;
+static struct map_info *msp_maps;

Re: [RFC][PATCH 2/7] RSS controller core

2007-03-12 Thread Herbert Poetzl
On Mon, Mar 12, 2007 at 12:02:01PM +0300, Pavel Emelianov wrote:
  Maybe you have some ideas how we can decide on this?
  We need to work out what the requirements are before we can 
  settle on an implementation.
  
  Linux-VServer (and probably OpenVZ):
  
   - shared mappings of 'shared' files (binaries 
 and libraries) to allow for reduced memory
 footprint when N identical guests are running
 
 This is done in current patches.

nice, but the question was about _requirements_
(so your requirements are?)

   - virtual 'physical' limit should not cause
 swap out when there are still pages left on
 the host system (but pages of over limit guests
 can be preferred for swapping)
 
 So what to do when virtual physical limit is hit?
 OOM-kill current task?

when the RSS limit is hit, but there _are_ enough
pages left on the physical system, there is no
good reason to swap out the page at all

 - there is no benefit in doing so (performance
   wise, that is)

 - it actually hurts performance, and could
   become a separate source for DoS

what should happen instead (in an ideal world :)
is that the page is considered swapped out for
the guest (add guest penalty for swapout), and 
when the page would be swapped in again, the guest
takes a penalty (for the 'virtual' page in) and
the page is returned to the guest, possibly kicking
out (again virtually) a different page
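A minimal user-space sketch of the accounting described above, assuming a hypothetical per-guest structure: `struct guest`, `guest_evict_virtual()`, `guest_fault_virtual()` and the penalty constants are illustrative names, not taken from any posted patch. When a guest goes over its RSS limit while the host still has free pages, the page stays resident and is only *accounted* as swapped out; touching it again charges a virtual swap-in and may virtually evict another page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-guest accounting state (not an existing kernel API). */
struct guest {
	size_t rss;             /* pages currently charged to the guest */
	size_t rss_limit;       /* the guest's virtual 'physical' limit */
	size_t vswapped;        /* pages counted as swapped out, still resident */
	unsigned long penalty;  /* accumulated penalty for (virtual) paging */
};

#define VSWAP_OUT_PENALTY 1UL   /* illustrative cost of a virtual swap-out */
#define VSWAP_IN_PENALTY  1UL   /* illustrative cost of a virtual swap-in */

/*
 * Over the limit but the host has free pages: write nothing to swap,
 * just re-account one page as swapped out and charge the guest.
 * Returns false when the host itself is short of memory, in which case
 * real reclaim would be needed (not modelled here).
 */
static bool guest_evict_virtual(struct guest *g, bool host_has_free_pages)
{
	if (!host_has_free_pages)
		return false;
	assert(g->rss > 0);
	g->rss--;
	g->vswapped++;
	g->penalty += VSWAP_OUT_PENALTY;
	return true;
}

/*
 * The guest touches a virtually swapped-out page: hand it back at the
 * cost of a penalty, and if that puts the guest over its limit,
 * virtually evict another page.
 */
static void guest_fault_virtual(struct guest *g, bool host_has_free_pages)
{
	assert(g->vswapped > 0);
	g->vswapped--;
	g->rss++;
	g->penalty += VSWAP_IN_PENALTY;
	if (g->rss > g->rss_limit)
		(void)guest_evict_virtual(g, host_has_free_pages);
}
```

The point of the design is that the penalty, not actual I/O, is what throttles a guest that exceeds its limit while memory is plentiful.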

   - accounting and limits have to be consistent
 and should roughly represent the actual used
 memory/swap (modulo optimizations, I can go
 into detail here, if necessary)
 
 This is true for current implementation for
 both - this patchset and OpenVZ beancounters.
 
 If you sum up the physpages values for all containers
 you'll get the exact number of RAM pages used.

hmm, including or excluding the host pages?

   - OOM handling on a per guest basis, i.e. some
 out of memory condition in guest A must not
 affect guest B
 
 This is done in current patches.

 Herbert, did you look at the patches before
 sending this mail or do you just want to
 'take part' in conversation w/o understanding
 of hat is going on?

again, the question was about requirements, not
your patches, and yes, I had a look at them _and_
the OpenVZ implementations ...

best,
Herbert

PS: hat is going on? :)

  HTC,
  Herbert
  
  Sigh.  Who is running this show?   Anyone?
 
  You can actually do a form of overcommittment by allowing multiple
  containers to share one or more of the zones. Whether that is
  sufficient or suitable I don't know. That depends on the requirements,
  and we haven't even discussed those, let alone agreed to them.
 
  ___
  Containers mailing list
  [EMAIL PROTECTED]
  https://lists.osdl.org/mailman/listinfo/containers
  


[PATCH 0/3] swsusp: Stop using page flags

2007-03-12 Thread Rafael J. Wysocki
Hi,

The following three patches make swsusp use its own data structures for memory
management instead of special page flags, so that these page flags can be used
for other purposes.

Greetings,
Rafael


-- 
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King



[PATCH 1/3] swsusp: Use inline functions for changing page flags

2007-03-12 Thread Rafael J. Wysocki
From: Rafael J. Wysocki [EMAIL PROTECTED]

Replace direct invocations of SetPageNosave(), SetPageNosaveFree() etc. with
calls to inline functions that can be changed in subsequent patches without
modifying the code calling them.

Signed-off-by: Rafael J. Wysocki [EMAIL PROTECTED]
Acked-by: Pavel Machek [EMAIL PROTECTED]
---
 include/linux/suspend.h |   33 +
 kernel/power/snapshot.c |   48 +---
 mm/page_alloc.c |6 +++---
 3 files changed, 61 insertions(+), 26 deletions(-)

Index: linux-2.6.21-rc2/include/linux/suspend.h
===
--- linux-2.6.21-rc2.orig/include/linux/suspend.h	2007-03-02 09:05:53.000000000 +0100
+++ linux-2.6.21-rc2/include/linux/suspend.h	2007-03-02 09:24:02.000000000 +0100
@@ -8,6 +8,7 @@
 #include <linux/notifier.h>
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/mm.h>
 
 /* struct pbe is used for creating lists of pages that should be restored
  * atomically during the resume from disk, because the page frames they have
@@ -49,6 +50,38 @@ void __save_processor_state(struct saved
 void __restore_processor_state(struct saved_context *ctxt);
 unsigned long get_safe_page(gfp_t gfp_mask);
 
+/* Page management functions for the software suspend (swsusp) */
+
+static inline void swsusp_set_page_forbidden(struct page *page)
+{
+   SetPageNosave(page);
+}
+
+static inline int swsusp_page_is_forbidden(struct page *page)
+{
+   return PageNosave(page);
+}
+
+static inline void swsusp_unset_page_forbidden(struct page *page)
+{
+   ClearPageNosave(page);
+}
+
+static inline void swsusp_set_page_free(struct page *page)
+{
+   SetPageNosaveFree(page);
+}
+
+static inline int swsusp_page_is_free(struct page *page)
+{
+   return PageNosaveFree(page);
+}
+
+static inline void swsusp_unset_page_free(struct page *page)
+{
+   ClearPageNosaveFree(page);
+}
+
 /*
  * XXX: We try to keep some more pages free so that I/O operations succeed
  * without paging. Might this be more?
Index: linux-2.6.21-rc2/kernel/power/snapshot.c
===
--- linux-2.6.21-rc2.orig/kernel/power/snapshot.c	2007-03-02 09:05:53.000000000 +0100
+++ linux-2.6.21-rc2/kernel/power/snapshot.c	2007-03-02 09:27:06.000000000 +0100
@@ -67,15 +67,15 @@ static void *get_image_page(gfp_t gfp_ma
 
res = (void *)get_zeroed_page(gfp_mask);
if (safe_needed)
-   while (res && PageNosaveFree(virt_to_page(res))) {
+   while (res && swsusp_page_is_free(virt_to_page(res))) {
/* The page is unsafe, mark it for swsusp_free() */
-   SetPageNosave(virt_to_page(res));
+   swsusp_set_page_forbidden(virt_to_page(res));
allocated_unsafe_pages++;
res = (void *)get_zeroed_page(gfp_mask);
}
if (res) {
-   SetPageNosave(virt_to_page(res));
-   SetPageNosaveFree(virt_to_page(res));
+   swsusp_set_page_forbidden(virt_to_page(res));
+   swsusp_set_page_free(virt_to_page(res));
}
return res;
 }
@@ -91,8 +91,8 @@ static struct page *alloc_image_page(gfp
 
page = alloc_page(gfp_mask);
if (page) {
-   SetPageNosave(page);
-   SetPageNosaveFree(page);
+   swsusp_set_page_forbidden(page);
+   swsusp_set_page_free(page);
}
return page;
 }
@@ -110,9 +110,9 @@ static inline void free_image_page(void 
 
page = virt_to_page(addr);
 
-   ClearPageNosave(page);
+   swsusp_unset_page_forbidden(page);
if (clear_nosave_free)
-   ClearPageNosaveFree(page);
+   swsusp_unset_page_free(page);
 
__free_page(page);
 }
@@ -615,7 +615,8 @@ static struct page *saveable_highmem_pag
 
BUG_ON(!PageHighMem(page));
 
-   if (PageNosave(page) || PageReserved(page) || PageNosaveFree(page))
+   if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
+   PageReserved(page))
return NULL;
 
return page;
@@ -681,7 +682,7 @@ static struct page *saveable_page(unsign
 
BUG_ON(PageHighMem(page));
 
-   if (PageNosave(page) || PageNosaveFree(page))
+   if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
return NULL;
 
if (PageReserved(page) && pfn_is_nosave(pfn))
@@ -821,9 +822,10 @@ void swsusp_free(void)
if (pfn_valid(pfn)) {
struct page *page = pfn_to_page(pfn);
 
-   if (PageNosave(page) && PageNosaveFree(page)) {
-   ClearPageNosave(page);
-   ClearPageNosaveFree(page);
+   if 

[PATCH 3/3] mm: Remove unused page flags

2007-03-12 Thread Rafael J. Wysocki
From: Rafael J. Wysocki [EMAIL PROTECTED]

Remove the two page flags that were previously used by swsusp and are no longer
needed.

Signed-off-by: Rafael J. Wysocki [EMAIL PROTECTED]
Acked-by: Pavel Machek [EMAIL PROTECTED]
---
 include/linux/page-flags.h |   12 
 1 file changed, 12 deletions(-)

Index: linux-2.6.21-rc3/include/linux/page-flags.h
===
--- linux-2.6.21-rc3.orig/include/linux/page-flags.h
+++ linux-2.6.21-rc3/include/linux/page-flags.h
@@ -82,13 +82,11 @@
 #define PG_private 11  /* If pagecache, has fs-private data */
 
 #define PG_writeback   12  /* Page is under writeback */
-#define PG_nosave  13  /* Used for system suspend/resume */
 #define PG_compound    14  /* Part of a compound page */
 #define PG_swapcache   15  /* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk    16  /* Has blocks allocated on-disk */
 #define PG_reclaim 17  /* To be reclaimed asap */
-#define PG_nosave_free 18  /* Used for system suspend/resume */
 #define PG_buddy   19  /* Page is free, on buddy lists */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
@@ -214,16 +212,6 @@ static inline void SetPageUptodate(struc
ret;\
})
 
-#define PageNosave(page)   test_bit(PG_nosave, &(page)->flags)
-#define SetPageNosave(page)    set_bit(PG_nosave, &(page)->flags)
-#define TestSetPageNosave(page)    test_and_set_bit(PG_nosave, &(page)->flags)
-#define ClearPageNosave(page)  clear_bit(PG_nosave, &(page)->flags)
-#define TestClearPageNosave(page)  test_and_clear_bit(PG_nosave, &(page)->flags)
-
-#define PageNosaveFree(page)   test_bit(PG_nosave_free, &(page)->flags)
-#define SetPageNosaveFree(page)    set_bit(PG_nosave_free, &(page)->flags)
-#define ClearPageNosaveFree(page)  clear_bit(PG_nosave_free, &(page)->flags)
-
 #define PageBuddy(page)    test_bit(PG_buddy, &(page)->flags)
 #define __SetPageBuddy(page)   __set_bit(PG_buddy, &(page)->flags)
 #define __ClearPageBuddy(page) __clear_bit(PG_buddy, &(page)->flags)


[PATCH 2/3] swsusp: Do not use page flags

2007-03-12 Thread Rafael J. Wysocki
From: Rafael J. Wysocki [EMAIL PROTECTED]

Make swsusp use memory bitmaps instead of page flags for marking 'nosave' and
free pages.  This allows us to 'recycle' two page flags that can be used for other
purposes.  Also, the memory needed to store the bitmaps is allocated when
necessary (i.e. before the suspend) and freed after the resume, which is more
reasonable.

The patch is designed to minimize the amount of changes and there are some nice
simplifications and optimizations possible on top of it.  I am going to
implement them separately in the future.
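To illustrate the general idea of replacing per-page flags with a bitmap that exists only while needed, here is a minimal user-space sketch that keeps one bit per page frame number (pfn) in a flat array. The memory bitmaps in kernel/power/snapshot.c are organised differently (in blocks of bit chunks covering pfn ranges); `struct pfn_bitmap` and its helpers are hypothetical names, and `calloc()`/`free()` stand in for kernel allocation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define BM_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical flat bitmap covering pfns [start_pfn, end_pfn). */
struct pfn_bitmap {
	unsigned long start_pfn;
	unsigned long end_pfn;
	unsigned long *bits;
};

/* Allocate only when needed (e.g. before suspend); all bits start clear. */
static bool pfn_bitmap_create(struct pfn_bitmap *bm,
			      unsigned long start_pfn, unsigned long end_pfn)
{
	size_t nlongs = (end_pfn - start_pfn + BM_BITS_PER_LONG - 1) /
			BM_BITS_PER_LONG;

	bm->start_pfn = start_pfn;
	bm->end_pfn = end_pfn;
	bm->bits = calloc(nlongs ? nlongs : 1, sizeof(unsigned long));
	return bm->bits != NULL;
}

/* Release the bitmap again (e.g. after resume). */
static void pfn_bitmap_destroy(struct pfn_bitmap *bm)
{
	free(bm->bits);
	bm->bits = NULL;
}

/* Translate a pfn into its bit index inside the bitmap. */
static unsigned long pfn_bit(const struct pfn_bitmap *bm, unsigned long pfn)
{
	assert(pfn >= bm->start_pfn && pfn < bm->end_pfn);
	return pfn - bm->start_pfn;
}

static void pfn_bitmap_set(struct pfn_bitmap *bm, unsigned long pfn)
{
	unsigned long bit = pfn_bit(bm, pfn);

	bm->bits[bit / BM_BITS_PER_LONG] |= 1UL << (bit % BM_BITS_PER_LONG);
}

static void pfn_bitmap_clear(struct pfn_bitmap *bm, unsigned long pfn)
{
	unsigned long bit = pfn_bit(bm, pfn);

	bm->bits[bit / BM_BITS_PER_LONG] &= ~(1UL << (bit % BM_BITS_PER_LONG));
}

static bool pfn_bitmap_test(const struct pfn_bitmap *bm, unsigned long pfn)
{
	unsigned long bit = pfn_bit(bm, pfn);

	return (bm->bits[bit / BM_BITS_PER_LONG] >> (bit % BM_BITS_PER_LONG)) & 1UL;
}
```

With two such bitmaps, one for 'forbidden' (nosave) pages and one for free pages, helpers such as `swsusp_page_is_forbidden()` could test a bit by pfn instead of a page flag, and no page flag bits need to be reserved permanently.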

Signed-off-by: Rafael J. Wysocki [EMAIL PROTECTED]
Acked-by: Pavel Machek [EMAIL PROTECTED]
---
 arch/x86_64/kernel/e820.c |   26 +---
 include/linux/suspend.h   |   58 +++---
 kernel/power/disk.c   |   23 +++-
 kernel/power/power.h  |2 
 kernel/power/snapshot.c   |  250 +++---
 kernel/power/user.c   |4 
 6 files changed, 281 insertions(+), 82 deletions(-)

Index: linux-2.6.21-rc3/include/linux/suspend.h
===
--- linux-2.6.21-rc3.orig/include/linux/suspend.h
+++ linux-2.6.21-rc3/include/linux/suspend.h
@@ -24,63 +24,41 @@ struct pbe {
 extern void drain_local_pages(void);
 extern void mark_free_pages(struct zone *zone);
 
-#ifdef CONFIG_PM
-/* kernel/power/swsusp.c */
-extern int software_suspend(void);
-
-#if defined(CONFIG_VT) && defined(CONFIG_VT_CONSOLE)
+#if defined(CONFIG_PM) && defined(CONFIG_VT) && defined(CONFIG_VT_CONSOLE)
 extern int pm_prepare_console(void);
 extern void pm_restore_console(void);
 #else
 static inline int pm_prepare_console(void) { return 0; }
 static inline void pm_restore_console(void) {}
-#endif /* defined(CONFIG_VT) && defined(CONFIG_VT_CONSOLE) */
+#endif
+
+#if defined(CONFIG_PM) && defined(CONFIG_SOFTWARE_SUSPEND)
+/* kernel/power/swsusp.c */
+extern int software_suspend(void);
+/* kernel/power/snapshot.c */
+extern void __init register_nosave_region(unsigned long, unsigned long);
+extern int swsusp_page_is_forbidden(struct page *);
+extern void swsusp_set_page_free(struct page *);
+extern void swsusp_unset_page_free(struct page *);
+extern unsigned long get_safe_page(gfp_t gfp_mask);
 #else
 static inline int software_suspend(void)
 {
printk("Warning: fake suspend called\n");
return -ENOSYS;
 }
-#endif /* CONFIG_PM */
+
+static inline void register_nosave_region(unsigned long b, unsigned long e) {}
+static inline int swsusp_page_is_forbidden(struct page *p) { return 0; }
+static inline void swsusp_set_page_free(struct page *p) {}
+static inline void swsusp_unset_page_free(struct page *p) {}
+#endif /* defined(CONFIG_PM) && defined(CONFIG_SOFTWARE_SUSPEND) */
 
 void save_processor_state(void);
 void restore_processor_state(void);
 struct saved_context;
 void __save_processor_state(struct saved_context *ctxt);
 void __restore_processor_state(struct saved_context *ctxt);
-unsigned long get_safe_page(gfp_t gfp_mask);
-
-/* Page management functions for the software suspend (swsusp) */
-
-static inline void swsusp_set_page_forbidden(struct page *page)
-{
-   SetPageNosave(page);
-}
-
-static inline int swsusp_page_is_forbidden(struct page *page)
-{
-   return PageNosave(page);
-}
-
-static inline void swsusp_unset_page_forbidden(struct page *page)
-{
-   ClearPageNosave(page);
-}
-
-static inline void swsusp_set_page_free(struct page *page)
-{
-   SetPageNosaveFree(page);
-}
-
-static inline int swsusp_page_is_free(struct page *page)
-{
-   return PageNosaveFree(page);
-}
-
-static inline void swsusp_unset_page_free(struct page *page)
-{
-   ClearPageNosaveFree(page);
-}
 
 /*
  * XXX: We try to keep some more pages free so that I/O operations succeed
Index: linux-2.6.21-rc3/kernel/power/snapshot.c
===
--- linux-2.6.21-rc3.orig/kernel/power/snapshot.c
+++ linux-2.6.21-rc3/kernel/power/snapshot.c
@@ -21,6 +21,7 @@
 #include <linux/kernel.h>
 #include <linux/pm.h>
 #include <linux/device.h>
+#include <linux/init.h>
 #include <linux/bootmem.h>
 #include <linux/syscalls.h>
 #include <linux/console.h>
@@ -34,6 +35,10 @@
 
 #include "power.h"
 
+static int swsusp_page_is_free(struct page *);
+static void swsusp_set_page_forbidden(struct page *);
+static void swsusp_unset_page_forbidden(struct page *);
+
 /* List of PBEs needed for restoring the pages that were allocated before
  * the suspend and included in the suspend image, but have also been
  * allocated by the resume kernel, so their contents cannot be written
@@ -224,11 +229,6 @@ static void chain_free(struct chain_allo
  * of type unsigned long each).  It also contains the pfns that
  * correspond to the start and end of the represented memory area and
  * the number of bit chunks in the block.
- *
- * NOTE: Memory bitmaps are used for two types of operations only:
- * set a bit and find the next bit set.  Moreover, the searching
- * is always 

RE: [PATCH] MPT FUSION: Delete unused header files.

2007-03-12 Thread Moore, Eric

 If you're going to include it just for the sake of including it, not
 because the code in question actually uses types or function
 declarations defined in there, don't bother, you're just using an
 anti-social mechanism to keep this header file in the tree.
 
 Please, let's kill this header file if it is unused.


Besides including the header, I plan to use every define in that header
somewhere in the source code.

Now can I keep the header?



Re: [PATCH] MPT FUSION: Delete unused header files.

2007-03-12 Thread David Miller
From: Moore, Eric [EMAIL PROTECTED]
Date: Mon, 12 Mar 2007 15:29:45 -0600

 Beside including the header I plan to use every define in that header
 defined someplace in the source code.  
 
 Now can I keep the header?

For sure :-)


Re: [ck] Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread jos poortvliet
On Monday 12 March 2007, Con Kolivas wrote:
   If we fix 95% of the desktop and worsen 5% is that bad given how much
   else we've gained in the process?
 
  Killing the known corner case starvation scenarios is wonderful, but
  let's not just pretend that interactive tasks don't have any special
  requirements.

 Now you're really making a stretch of things. Where on earth did I say that
 interactive tasks don't have special requirements? It's a fundamental
 feature of this scheduler that I go to great pains to get them as low
 latency as possible and their fair share of cpu despite having a completely
 fair cpu distribution.

As far as I understand it, RSDL always gives each task an equal share of the
cpu, but interactive tasks can get lower latency, right? So you only get into
trouble with interactive tasks when their share isn't enough to do what they
have to do in that period, e.g. on a heavily (over)loaded box. Staircase, like
mainline, gave them MORE than their share, which would handle that case
(though this comes at a price).

So, if your box is overloaded to a great extent, X, which can use a lot of
cpu, can become unresponsive, unless it's negatively niced. But most other apps
aren't as demanding as X, so they won't really suffer. Thus the problem is
mostly X. And at least part of that problem is being solved: X wasting cpu
cycles. Also, CPUs are getting faster, and I think it's likely X's
relative CPU usage will go down as well.

In the long term, RSDL seems like the best way to go. Renice X to a negative
value and you have dealt with most of the disadvantages, while you still have
the perfect fairness and no stalls or starvation ;-)

If RSDL can be improved to help X, great. But reintroducing the problem
that RSDL was supposed to solve would be pretty pointless. I think that's
what grumpy Con is trying to say, and he's right about it.

Regards,

Jos

-- 
Disclaimer:

Everything I do, think and say is based on the worldview I currently hold.
I am not responsible for changes in the world, or in my view of it, nor for
my own behaviour resulting from them.
Everything I say is meant kindly, unless explicitly stated otherwise.




Re: refcounting drivers' data structures used in sysfs buffers

2007-03-12 Thread Richard Purdie
On Mon, 2007-03-12 at 16:31 -0400, Dmitry Torokhov wrote:
 On 3/12/07, Alan Stern [EMAIL PROTECTED] wrote:
  On Mon, 12 Mar 2007, Oliver Neukum wrote:
  I don't like reverting my own code. But I predict he'll tell you that a
 driver's bond with a device should be represented in a data structure
 that is to be refcounted.
 
  There still would be a synchronization problem.  Refcounts don't solve
  races; they only solve lifetime problems.  And you would still have to
  change the sysfs API, plus all the other stuff...
 
  Do you think Linus would listen if all three of us (plus maybe Greg) tried
  to convince him?
 
 
 If we'd accompany the argument with the patch that changes scsi to use
 wq to perform deletion so we don't have deadlock regression in the
 kernel he might be more perceptive... He is right about lifetime
 issues but this is not strictly lifetime issue as you correctly point
 out. Plus, refcounting also bloats the kernel so I don't really want to
 use refcount for every integer I happen to export through sysfs if I
 can simply revoke access.

For what it's worth, I think it makes sense if the driver no longer has
to worry about sysfs attributes after they've been removed. This is
something the core should look after, not each and every driver.

http://marc.theaimsgroup.com/?l=linux-kernel&m=117355959020831&w=2

makes a lot of sense, particularly that "No driver callbacks occur after
unregistration." When writing the backlight class code, I remember
checking into this, concluding that this seemed to be the design of sysfs,
and thinking it a sane design.
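A minimal user-space sketch of the "no driver callbacks after unregistration" guarantee, handled centrally rather than in each driver. It pairs a count of in-flight callbacks with a "dead" flag: a refcount alone would only keep the object alive (Alan's lifetime point), whereas the flag is what revokes access. `struct attr_node`, `attr_call_show()`, `attr_unregister()` and `show_int()` are hypothetical names, not the sysfs API; C11 atomics stand in for kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical revocable attribute; not the real sysfs data structure. */
struct attr_node {
	atomic_int active;        /* callbacks currently inside the driver */
	atomic_bool dead;         /* set once unregistration has begun */
	int (*show)(void *drvdata);
	void *drvdata;
};

/* Example driver callback: reports an integer owned by the driver. */
static int show_int(void *drvdata)
{
	return *(int *)drvdata;
}

static void attr_init(struct attr_node *a, int (*show)(void *), void *drvdata)
{
	atomic_init(&a->active, 0);
	atomic_init(&a->dead, false);
	a->show = show;
	a->drvdata = drvdata;
}

/*
 * Enter the driver only while the attribute is live.  Incrementing
 * 'active' before checking 'dead' (both sequentially consistent) means
 * attr_unregister() either sees this caller in 'active' or the caller
 * sees 'dead' and backs out.  Returns -1 (think -ENODEV) once revoked.
 */
static int attr_call_show(struct attr_node *a, int *out)
{
	atomic_fetch_add(&a->active, 1);
	if (atomic_load(&a->dead)) {
		atomic_fetch_sub(&a->active, 1);
		return -1;
	}
	*out = a->show(a->drvdata);
	atomic_fetch_sub(&a->active, 1);
	return 0;
}

/*
 * Revoke access and wait for in-flight callbacks to finish.  Once this
 * returns, the driver may free drvdata.  A kernel version would sleep
 * on a wait queue instead of spinning.
 */
static void attr_unregister(struct attr_node *a)
{
	atomic_store(&a->dead, true);
	while (atomic_load(&a->active) != 0)
		;   /* busy-wait for callers to drain */
}
```

Because the check lives in the core wrapper, each driver only needs to call the unregister step before freeing its data; no per-driver refcounting is involved.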

The alternative is to force each and every driver to do its own
refcounting. My experience with locking in the extremely simple
backlight class shows nobody reads the documentation or writes the code
correctly. With that, I've given up and added suitable locking to the
core even if not every driver needs it. In doing so, I made a net
removal of a few hundred lines of broken ticking timebomb style code.
I dread to think what would happen if every driver had to deal with
sysfs refcounting.

So count me as a vote for handling this in the sysfs core, not the
drivers.

Richard




Re: [ck] Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread michael chang

On 3/12/07, Con Kolivas [EMAIL PROTECTED] wrote:

On Tuesday 13 March 2007 07:11, Mike Galbraith wrote:
 On Tue, 2007-03-13 at 05:49 +1100, Con Kolivas wrote:
  On Tuesday 13 March 2007 01:34, Mike Galbraith wrote:
   On Mon, 2007-03-12 at 22:23 +1100, Con Kolivas wrote:
Mike the cpu is being proportioned out perfectly according to
fairness as I mentioned in the prior email, yet X is getting the
lower latency scheduling. I'm not sure within the bounds of fairness
what more would you have happen to your liking with this test case?
  
   It has been said that perfection is the enemy of good.  The two
   interactive tasks receiving 40% cpu while two niced background jobs
   receive 60% may well be perfect, but it's damn sure not good.
 
  Again I think your test is not a valid testcase. Why use two threads for
  your encoding with one cpu? Is that what other dedicated desktop OSs
  would do?

 The testcase is perfectly valid.  My buddies box has two full cores, so
 we used two encoders such that whatever bandwidth is not being actively
 consumed by more important things gets translated into mp3 encoding.

 How would you go about ensuring that there won't be any cycles wasted?

 _My_ box has 1 core that if fully utilized translates to 1.2 cores.. or
 whatever, depending on the phase of the moon.  But no matter, logical vs
 physical cpu argument is pure hand-waving.  What really matters here is
 the bottom line: your fair scheduler ignores the very real requirements
 of interactivity.

Definitely not. It does not give unfair cpu towards interactive tasks. That's
a very different argument.


I think the issue here is that the scheduler is doing what Con expects
it to do, but not what Mike Galbraith here feels it should do. Maybe
Con and Mike here are using different definitions, as such, for
interactivity, or at least have different ideas of how this is
supposed to be accomplished. Does that sound right?

I've begun using RSDL on my machines here, and so far there haven't
been any issues with it, in my opinion. From a feel standpoint, it's
not what I would call perfectly smooth, but it is better than the
other schedulers I've seen (and the one case where there are still
problems it is an issue of I/O contention, not CPU -- using RSDL has
made a surprisingly large impact regardless).

Mike, do you perhaps feel that it should be possible to use
the CPU at 100% for some task and still maintain excellent
interactivity? (It has always seemed to me that if you wanted
interactivity, you had to have the CPU idle at least a couple percent
of the time. How much or how little that many percent had to be was
usually affected by how much preempting you put in the kernel, and
what CPU scheduler was in it at the time.)

Considering the concepts put out by projects such as BOINC and
[EMAIL PROTECTED], I wouldn't be thoroughly surprised by this ideology,
although I do question the particular way this test case is being run.

That said, I haven't run the test case in particular yet, although I
will see if I can get the time to do so soon. In any case, I
personally do have a few qualms about this test case being run on HT
virtual cores:

* I am curious about why splitting a task and running them on separate
HT virtual cores improves interactivity any. (If it was Amarok on one
virtual CPU and one lame on the other, I would get it. But I see two
lame processes here -- wouldn't they just be allocated one to each
virtual CPU, leaving Amarok out most of the time? How do you get
interactivity with that?) Does using HT really fill up the CPU better
than having the CPU announce itself as the single core it is? My
understanding is that throughput goes down somewhat even just by using
multiple threads with HT, compared to the single thread on the single
core, and why would you use more than one lame thread unless you seek
throughput?

* Where are the lame processes encoding to/from? For example, are the
results for both being sent to /dev/null? To a hard drive? etc. etc.
In a real-world test case, I would imagine a user running TWO lame
processes would be encoding from two sources to the same hard drive.
(Or, they might even be both encoding FROM that same hard drive. Or
both.) The need for the single HD to seek so much reduces throughput
in most of these cases with HT, IIRC, which may be a factor that would
probably defeat the point of this case for most users. Of course, my
point is negated if they have multiple drives for their use of lame,
and/or if they have sufficient memory and bandwidth to handle the
issue, or if encoding throughput isn't their aim.

The only reason I can think of that running two lame processes would
improve interactivity would be so that if one particular portion
gets stuck, then there's a chance the other thread will be working on
an easier portion, making it appear like more is being done.  This
occurs, for example, with POV-Ray and Blender, where some parts of the
image may require 

Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread Mike Galbraith
On Tue, 2007-03-13 at 00:05 +0300, Serge Belyshev wrote:
 Mike Galbraith [EMAIL PROTECTED] writes:
 
 [snip]
  And let's not lose sight of things with this one testcase.
  
  RSDL fixes
  - every starvation case
  - all fairness issues
  - is better 95% of the time on the desktop
 
  I don't know where you got that 95% number from.  For the most part, the
  existing scheduler does well.  If it sucked 95% of the time, it would
  have been shredded a long time ago.
 
 
 I tell you.
 
 http://article.gmane.org/gmane.linux.kernel/500027
 http://article.gmane.org/gmane.linux.kernel/502996
 http://article.gmane.org/gmane.linux.kernel/500119
 http://article.gmane.org/gmane.linux.kernel/500784
 http://article.gmane.org/gmane.linux.kernel/500768
 http://article.gmane.org/gmane.linux.kernel/502255
 http://article.gmane.org/gmane.linux.kernel/502282
 http://article.gmane.org/gmane.linux.kernel/503650
 http://article.gmane.org/gmane.linux.kernel/503695
 http://article.gmane.org/gmane.linux.kernel.ck/6512
 http://article.gmane.org/gmane.linux.kernel.ck/6539
 http://article.gmane.org/gmane.linux.kernel.ck/6565

Thanks, but I've already read them.  They are part of the reason I
decided to spend some time testing.

-Mike



Re: Move to unshared VMAs in NOMMU mode?

2007-03-12 Thread Robin Getz
On Fri 9 Mar 2007 09:12, David Howells pondered:
 I've been considering how to deal with the SYSV SHM problem, and I think we
 may have to move to unshared VMAs in NOMMU mode to deal with this. 

Thanks for putting some good thoughts down.

 Currently, what we have is each mm_struct has in its arch-specific context
 argument a list of VMLs.  Take the FRV context for example:

   [include/asm-frv/mmu.h]
   typedef struct {
   #ifdef CONFIG_MMU
   ...
   #else
   struct vm_list_struct   *vmlist;
   unsigned long   end_brk;

   #endif
   ...
   } mm_context_t;

 Each VML struct contains a pointer to a systemwide VMA and the next VML in
 the list:

   struct vm_list_struct {
   struct vm_list_struct   *next;
   struct vm_area_struct   *vma;
   };

 The VMAs themselves are kept in an rb-tree in mm/nommu.c:

   /* list of shareable VMAs */
   struct rb_root nommu_vma_tree = RB_ROOT;

 which can then be displayed through /proc/maps.

 There are some restrictions on this system, mainly due to the NOMMU
 constraints:

  (*) mmap() may not be used to overlay one mapping upon another.

  (*) mmap() may not be used with MAP_FIXED.

  (*) mmap()s of the same part of the same file will result in multiple
      mappings returning the same base address, assuming the maps are
      shareable.  If they aren't shareable, they'll be at different base
      addresses.

  (*) for normal shareable file mappings, two mappings will only be shared
      if they precisely match offset, size and protection; otherwise a new
      mapping will be created (this is because VMAs will be shared).
      Splitting VMAs would relax this restriction: subsequent mappings would
      still have to be bounded by the first mapping, but wouldn't have to be
      the same size.

  (*) munmap() may only unmap a precise match amongst the mappings made; it
      may not be used to cut down or punch a hole in an existing mapping.

 The VMAs for private file mappings, private blockdev mappings and anonymous
 mappings, be they shared[*] or unshared, hold a pointer to the kmalloc()'d
 region of memory in which the mapping contents reside.  This region is
 discarded when the VMA is deleted.  When a region can be shared the VMA is
 also shared, and so no reference counting need take place on the mapping
 contents as that is implied by the VMA.

 [*] MAP_PRIVATE+!PROT_WRITE+!PT_PTRACED regions may be shared

 Note that for mappable chardevs with special BDI capability flags, extra
 VMAs may be allocated because (a) they may need to overlap non-exactly, and
 (b) the chardev itself pins the backing storage, if the backing storage is
 potentially transient.


 If VMAs are not shared for shared memory regions then some other means of
 retaining the actual allocated memory region must be found.  The obvious
 way to do this is to have the VMA point to a shared, refcounted record that
 keeps track of the region:

   struct vm_region {
   /* the first parameters define the region as for the VMA */
   pgprot_t        vm_page_prot;
   unsigned long   vm_start;
   unsigned long   vm_end;
   unsigned long   vm_pgoff;
   struct file     *vm_file;

   atomic_t        vm_usage;   /* region usage count */
   struct rb_node  vm_rb;  /* region tree */
   };

 The VMA itself would then have to be modified to include a pointer to this,
 but wouldn't then need its own refcount.  VMAs would belong, once again, to
 the mm_struct, the VML struct would vanish, and the VML list rooted in
 mm_context_t would vanish.
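A minimal userspace sketch of the shared, refcounted region record proposed above, assuming C11 atomics in place of the kernel's atomic_t. The struct layouts and the region_alloc()/vma_attach()/vma_detach() helpers are hypothetical and only model the idea: each VMA holds one reference on the shared region and carries no refcount of its own, and the backing memory is released when the last VMA lets go.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the proposed struct vm_region. */
struct vm_region {
	unsigned long vm_start;
	unsigned long vm_end;
	atomic_int vm_usage;   /* number of VMAs referring to this region */
	void *contents;        /* models the kmalloc()'d mapping contents */
};

/* Hypothetical per-mm VMA: no refcount of its own, just a region pointer. */
struct vma {
	struct vm_region *region;
};

static struct vm_region *region_alloc(unsigned long size)
{
	struct vm_region *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;
	r->contents = calloc(1, size);
	if (!r->contents) {
		free(r);
		return NULL;
	}
	r->vm_start = (unsigned long)r->contents;
	r->vm_end = r->vm_start + size;
	atomic_init(&r->vm_usage, 0);
	return r;
}

/* Attach a VMA to a region, taking one reference. */
static void vma_attach(struct vma *vma, struct vm_region *r)
{
	atomic_fetch_add(&r->vm_usage, 1);
	vma->region = r;
}

/* Detach a VMA; returns true if it dropped the last reference, in which
 * case the region and its contents have been freed. */
static bool vma_detach(struct vma *vma)
{
	struct vm_region *r = vma->region;

	assert(r && atomic_load(&r->vm_usage) > 0);
	vma->region = NULL;
	if (atomic_fetch_sub(&r->vm_usage, 1) == 1) {
		free(r->contents);
		free(r);
		return true;
	}
	return false;
}
```

Two VMAs attached to one region share its contents; detaching the first leaves the region alive, and detaching the second frees it.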

 For R/O shareable file mappings, it might be possible to actually use the
 target file's pagecache for the mapping.  I do something of that sort for
 shared-writable mappings on ramfs files (to support POSIX SHM and SYSV
 SHM).

 The downside of allocating all these extra VMAs is that, of course, it
 takes up more memory, though that may not be too bad, especially if it's at
 the gain of additional consistency with the MM code.

I guess I don't see consistency with the MM code as the primary
request, but rather consistency in operation with the MM code from a user
space perspective; hopefully the two goals are not divergent.

 However, consistency isn't for the most part a real issue.  As I see it,
 drivers and filesystems should not concern themselves with anything other
 than the VMA they're given, and so it doesn't matter if these are shared or
 not.

 That brings us on to the problem with SYSV SHM which keeps an attachment
 count that the VMA mmap(), open() and release() ops manipulate.  This means
 that the nattch count comes out wrong on NOMMU systems.  Note that on MMU
 systems, doing a munmap() in the middle of an attached region will *also*
 break the nattch count, though this is self-correcting.

 Another way of dealing with the nattch count on NOMMU systems is to do it
 through 

Re: _proxy_pda still makes linking modules fail

2007-03-12 Thread Rusty Russell
On Mon, 2007-03-12 at 10:48 +0100, Andi Kleen wrote:
  Rusty's pda-per_cpu patch will deal with this once and for all; have
 
 Not on x86-64.

Indeed.  Perhaps it's time I join the modern world and compile a 64-bit
kernel...

Will prepare patches,
Rusty.




Re: SMP performance degradation with sysbench

2007-03-12 Thread Anton Blanchard
 
Hi Nick,

 Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
 at this, send me a mail (eg. especially with the sched_setscheduler issue,
 you might be able to do something better).

I took a look at this today and figured I'd document it:

http://ozlabs.org/~anton/linux/sysbench/

Bottom line: it looks like issues in the glibc malloc library; replacing
it with the Google malloc library (tcmalloc) fixes the negative scaling:

# apt-get install libgoogle-perftools0
# LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Anton


Re: [RFC PATCH 1/3] Add ability to keep track of callers of symbol_(get|put)

2007-03-12 Thread Rusty Russell
Hi Trent,

Patch looks good, just one comment:

On Mon, 2007-03-12 at 07:07 -0700, Trent Piepho wrote:
 +	use = already_uses(a, b);
 +	if (!use) {
 +		printk(KERN_ERR "module %s trying to un-use a module, %s, which "
 +		       "it is not using", a->name, b->name);
 +		return 0;
 +	}

s/return 0/BUG()/.  This is potentially quite a nasty bug.

Thanks!
Rusty.



Re: [PATCH] BUILD_BUG_ON_ZERO - BUILD_BUG_OR_ZERO

2007-03-12 Thread Rusty Russell
On Mon, 2007-03-12 at 15:14 +0100, Stefan Richter wrote:
 Robert P. J. Day wrote:
  On Mon, 12 Mar 2007, Stefan Richter wrote:
  Rusty Russell wrote:
   OTOH, BUILD_BUG_OR_ZERO says what happens: either it's a build bug, or
   it's zero.
 
  What about ZERO_UNLESS_BUILD_BUG_ON(e)? It's long though...
  
  how often is this going to be used?  it's not like the tree is
  currently awash in calls to BUILD_BUG_ON_ZERO as it is.
 
 Most of the time it will be hidden as a macro-in-a-macro, like in
 ARRAY_SIZE().  So the length of the name doesn't matter much.  But then,
 the _name_ itself doesn't matter much because authors of public macros
 are the primary user group, not John Driverhacker.

Well, there's a four line comment above it, so *someone* thought it
worth documenting.  Even if the new name isn't great, the old name is
actively misleading.  That's a 13, and we could be a 4.

http://ozlabs.org/~rusty/ols-2003-keynote/img52.html
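For reference, a userspace re-creation of the macro under the proposed BUILD_BUG_OR_ZERO name, used inside an ARRAY_SIZE()-style macro as Stefan describes. It is a sketch, not a copy of include/linux/kernel.h: it assumes GCC or Clang (__typeof__ and __builtin_types_compatible_p) and the classic negative-array-size trick, and MUST_BE_ARRAY is an illustrative name. The condition must be a compile-time constant; when it is true the array size becomes -1 and the build fails, otherwise the expression is a size_t zero that can be added to anything.

```c
#include <stddef.h>

/* Build error if e is a true constant; otherwise a size_t 0, so it can be
 * added to another expression (e.g. inside a macro or an initializer). */
#define BUILD_BUG_OR_ZERO(e) (sizeof(char[1 - 2 * !!(e)]) - 1)

/* The condition is true (hence a build error) only when 'a' is a pointer:
 * then its type matches that of &a[0]. GCC/Clang extension. */
#define MUST_BE_ARRAY(a) \
	BUILD_BUG_OR_ZERO(__builtin_types_compatible_p(__typeof__(a), \
						       __typeof__(&(a)[0])))

/* Element count of an array; refuses to compile when given a pointer. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + MUST_BE_ARRAY(arr))
```

ARRAY_SIZE() on an int[5] yields 5; passing an int * instead fails to compile, which is the "build bug, or zero" behaviour the new name describes.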

Cheers,
Rusty.




Re: Keyboard stops working after *lock [Was: 2.6.21-rc2-mm1]

2007-03-12 Thread Jiri Slaby

Jiri Kosina wrote:

(trimmed CC list a bit)

On Mon, 12 Mar 2007, Jiri Slaby wrote:


 UHCI: Eliminate asynchronous skeleton Queue Headers

Post it along with the usbmon log, and I'll try to figure out what happened.

Here it comes:
USBMON:
f7525b40 1832950485 C Ii:004:01 0 8 = 5300 
f7525b40 1832950517 S Ii:004:01 -115 8 
f7525140 1832950540 S Co:004:00 s 21 09 0200  0001 1 = 01
f7525140 1832952485 C Co:004:00 0 1 
Corresponds to numlock; 7; numlock; 7.


Alan, sorry for the previous bad post, I mismatched 2 files. This is 
hopefully correct.


thanks. Could you also please redo the test with the offending uhci patch 
reverted and send the output of a working situation?


- BAD kernel:

USBMON output:
d28dba40 1882513063 C Ii:008:01 0 8 = 5300 
d28dba40 1882513090 S Ii:008:01 -115 8 
f7b31340 1882515363 S Co:008:00 s 21 09 0200  0001 1 = 00
f7b31340 1882517065 C Co:008:00 0 1 
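The "S Co:..." lines in both logs are control transfers on endpoint 0. Read as a standard USB setup packet, bmRequestType 0x21 is a class request, host-to-device, addressed to an interface; bRequest 0x09 is the HID class SET_REPORT request; and wValue 0x0200 selects an output report (report type 2 in the high byte, report ID 0). For a boot-protocol keyboard that one-byte output report is the LED bitmap, so the data byte 01 means only Num Lock is lit and 00 means all LEDs are off. A small decoder, assuming the HID boot-keyboard LED bit layout; the helper names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* bmRequestType fields of a USB setup packet. */
static bool req_is_host_to_device(uint8_t bmRequestType)
{
	return (bmRequestType & 0x80) == 0;
}

static unsigned req_type(uint8_t bmRequestType)      /* 0 std, 1 class, 2 vendor */
{
	return (bmRequestType >> 5) & 0x3;
}

static unsigned req_recipient(uint8_t bmRequestType) /* 0 device, 1 interface */
{
	return bmRequestType & 0x1f;
}

#define HID_REQ_SET_REPORT     0x09
#define HID_REPORT_TYPE_OUTPUT 2

/* Boot-keyboard LED output report bits. */
enum {
	LED_NUM_LOCK    = 1 << 0,
	LED_CAPS_LOCK   = 1 << 1,
	LED_SCROLL_LOCK = 1 << 2,
};

/* True if a setup packet is a HID class SET_REPORT for an output report. */
static bool is_led_set_report(uint8_t bmRequestType, uint8_t bRequest,
			      uint16_t wValue)
{
	return req_is_host_to_device(bmRequestType) &&
	       req_type(bmRequestType) == 1 &&
	       req_recipient(bmRequestType) == 1 &&
	       bRequest == HID_REQ_SET_REPORT &&
	       (wValue >> 8) == HID_REPORT_TYPE_OUTPUT;
}
```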




UHCI snapshot before hang:
Root-hub state: running   FSBR: 0
HC status
  usbcmd= 00c1   Maxp64 CF RS
  usbstat   = 
  usbint= 000f
  usbfrnum  =   (1)764
  flbaseadd = 0303d764
  sof   =   40
  stat1 = 01a5   LowSpeed Enabled Connected
  stat2 = 0095   Enabled Connected
Most recent frame: 75a2 (418)   Last ISO frame: 75a2 (418)
Periodic load table
12  0   0   0   127 0   0   0
0   0   0   0   127 0   0   0
0   0   0   0   127 0   0   0
0   0   0   0   127 0   0   0
Total: 520, #INT: 4, #ISO: 0
Frame List
Skeleton QHs
- skel_unlink_qh
[c3c41000] Skel QH link (0001) element (0001)
  queue is empty
- skel_iso_qh
[c3c41060] Skel QH link (0001) element (0001)
  queue is empty
- skel_int128_qh
[c3c410c0] Skel QH link (03c41542) element (0001)
  queue is empty
[c3c41540] INT QH link (03c41362) element (02c4a0f0)
period 128 phase 0 load 12 us
urb_priv [f7b2da4c] urb [f7b314c0] qh [c3c41540] Dev=7 EP=1(IN) INT Actlen=0
1: [c2c4a0f0] link (02c4a0c0) e3 IOC Active NAK Length=7ff MaxLen=0 
DT1 EndPt=1 Dev=7, PID=69(IN) (buf=36a4a040)

  Dummy TD
[c2c4a0c0] link (02c4a120) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=0, 
PID=e1(OUT) (buf=)

- skel_int64_qh
[c3c41120] Skel QH link (03c41362) element (0001)
  queue is empty
- skel_int32_qh
[c3c41180] Skel QH link (03c41362) element (0001)
  queue is empty
- skel_int16_qh
[c3c411e0] Skel QH link (03c41362) element (0001)
  queue is empty
- skel_int8_qh
[c3c41240] Skel QH link (03c41482) element (0001)
  queue is empty
[c3c41480] INT QH link (03c41602) element (02c4a030)
period 8 phase 4 load 93 us
urb_priv [f7b2d3bc] urb [d28dbc40] qh [c3c41480] Dev=2 EP=1(IN) INT Actlen=0
1: [c2c4a030] link (02c4a060) e3 LS IOC Active NAK Length=7ff 
MaxLen=3 DT0 EndPt=1 Dev=2, PID=69(IN) (buf=037c5000)

  Dummy TD
[c2c4a060] link (02c4a0f0) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=0, 
PID=e1(OUT) (buf=)

[c3c41600] INT QH link (03c41662) element (02c4a150)
period 8 phase 4 load 17 us
urb_priv [f7b2da30] urb [d28dba40] qh [c3c41600] Dev=8 EP=1(IN) INT Actlen=0
1: [c2c4a150] link (02c4a120) e3 IOC Active NAK Length=7ff MaxLen=7 
DT1 EndPt=1 Dev=8, PID=69(IN) (buf=037c5180)

  Dummy TD
[c2c4a120] link (02c4a180) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=0, 
PID=e1(OUT) (buf=)

[c3c41660] INT QH link (03c41362) element (02c4a1b0)
period 8 phase 4 load 17 us
urb_priv [f7b2d9f8] urb [d1622840] qh [c3c41660] Dev=8 EP=2(IN) INT Actlen=0
1: [c2c4a1b0] link (02c4a1e0) e3 IOC Active NAK Length=7ff MaxLen=4 
DT0 EndPt=2 Dev=8, PID=69(IN) (buf=037c5300)

  Dummy TD
[c2c4a1e0] link (02c4a210) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=0, 
PID=e1(OUT) (buf=)

- skel_int4_qh
[c3c412a0] Skel QH link (03c41362) element (0001)
  queue is empty
- skel_int2_qh
[c3c41300] Skel QH link (03c41362) element (0001)
  queue is empty
- skel_async_qh
[c3c41360] Skel QH link (0001) element (02c4a000)
  queue is empty
- skel_term_qh
[c3c413c0] Skel QH link (0001) element (02c4a000)
  queue is empty




UHCI snapshot after hang:
Root-hub state: running   FSBR: 0
HC status
  usbcmd= 00c1   Maxp64 CF RS
  usbstat   = 
  usbint= 000f
  usbfrnum  =   (1)c2c
  flbaseadd = 0303dc2c
  sof   =   40
  stat1 = 01a5   LowSpeed Enabled Connected
  stat2 = 0095   Enabled Connected
Most recent frame: 9efc (764)   Last ISO frame: 9efc (764)
Periodic load table
12  0   0   0   127 0   0   0
0   0   0   0   127 0   0   0
0   0   0   0   127 0   0   0
0   0   0   0   127 0   0   0
Total: 520, 

Djprobes questions

2007-03-12 Thread Mathieu Desnoyers
Hi Masami,

I recently had to add support for inline code patching on i386 to my
marker infrastructure. Clearly, it looks like what is done in djprobes,
with the main difference that I only patch the immediate value of a 2
bytes load immediate instruction.

I think I found a solution to one of the main issues with djprobes: it
currently has to wait for each CPU to hit the probe before being sure
that it's safe to patch the code with something other than an int3. This
is due to PIII erratum 49, which says that a CPU must execute a
serializing instruction before executing cross-modified code.

Here is what I do: while a breakpoint makes any CPU that hits the site
currently being modified fall into a trap, I also send an IPI to all
CPUs so they execute cpuid. Once that returns, I am sure that every
CPU has executed a serializing instruction, which lets me go on with
the complete code modification and finally remove the initial
breakpoint.
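The sequence above can be sketched as a userspace simulation on a two-byte buffer. This is not the code from the linked marker.c: sync_all_cpus() is a hypothetical callback standing in for the IPI that makes every CPU execute cpuid, the patched instruction is assumed to be "mov $imm8, %al" (opcode 0xb0 followed by the immediate), and the second synchronization before removing the breakpoint is a conservative assumption of this sketch rather than something the mail describes.

```c
#include <stdint.h>

#define OPC_INT3   0xcc /* x86 one-byte breakpoint instruction */
#define OPC_MOV_AL 0xb0 /* "mov $imm8, %al": opcode byte + 1 immediate byte */

/* Hypothetical hook: in the kernel this role is played by an IPI that
 * makes every CPU execute a serializing instruction (cpuid) and return. */
typedef void (*sync_fn)(void *ctx);

/*
 * Replace the immediate operand of a two-byte "mov $imm8, %al".
 * Intended property: a CPU racing with the patch executes either the old
 * instruction, the int3 (and traps), or the new instruction.
 */
static void patch_imm8(volatile uint8_t *insn, uint8_t new_imm,
		       sync_fn sync_all_cpus, void *ctx)
{
	uint8_t opcode = insn[0];

	insn[0] = OPC_INT3;   /* 1. arm the breakpoint on the first byte   */
	sync_all_cpus(ctx);   /* 2. every CPU has now serialized           */
	insn[1] = new_imm;    /* 3. rewrite the byte behind the breakpoint */
	sync_all_cpus(ctx);   /* 4. serialize again before exposing it     */
	insn[0] = opcode;     /* 5. remove the breakpoint                  */
}

/* Test hook: counts syncs and records the first byte seen at each one. */
struct sync_trace {
	const volatile uint8_t *insn;
	int calls;
	uint8_t first_byte_seen[2];
};

static void record_sync(void *ctx)
{
	struct sync_trace *t = ctx;

	if (t->calls < 2)
		t->first_byte_seen[t->calls] = t->insn[0];
	t->calls++;
}
```

At every synchronization point the first byte is the int3, so no CPU can observe the old opcode paired with a half-written immediate.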

Here is my code:

http://ltt.polymtl.ca/cgi-bin/gitweb.cgi?p=linux-2.6-lttng.git;a=blob;f=arch/i386/kernel/marker.c;h=89b06f02f0966685be260d6364a0dd94c3d14456;hb=v2.6.20-lttng

(Comments are welcome)

On a second note, looking at the djprobes code raised a question
in my mind about the safety of using a worker thread to make sure
every interrupt context has returned (so that no saved IP points into
the modified code). The following scenario might be possible: an
interrupt handler (or trap handler) re-enables interrupts and does
irq_exit() or nmi_exit() (which re-enables preemption), but has not
executed iret yet. My understanding is that it could then be scheduled
out with a return IP pointing into the code being modified. Am I right?

Regards,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] kthread_should_stop_check_freeze (was: Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread)

2007-03-12 Thread Pavel Machek
Hi!

   Looks good to me!  The other kthread_should_stop() calls in
   rcutorture.c should also become
   kthread_should_stop_check_freeze().

  Why is it useful?
 
 Because we want to avoid repeating
 
 while (!kthread_should_stop()) {
   try_to_freeze();
   ...
 }
 
 in many places?

Do not do it, then. Confusion it causes is not worth saving one line
of code. 

You do less typing, but the resulting code is _less_ readable, not
more.

NAK.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [RFC][PATCH 2/7] RSS controller core

2007-03-12 Thread Herbert Poetzl
On Mon, Mar 12, 2007 at 11:42:59AM -0700, Dave Hansen wrote:
 How about we drill down on these a bit more.
 
 On Mon, 2007-03-12 at 02:00 +0100, Herbert Poetzl wrote:
   - shared mappings of 'shared' files (binaries 
 and libraries) to allow for reduced memory
 footprint when N identical guests are running
 
 So, it sounds like this can be phrased as a requirement like:
 
   Guests must be able to share pages.
 
 Can you give us an idea why this is so? 

sure, one reason for this is that guests tend to
be similar (or almost identical) which results
in quite a lot of 'shared' libraries and executables
which would otherwise get cached for each guest and
would also be mapped for each guest separately

 On a typical vserver system,

there is nothing like a typical Linux-VServer system :)

 how much memory would be lost if guests were not permitted 
 to share pages like this? 

let me give a real world example here:

 - typical guest with 600MB disk space
 - about 100MB guest specific data (not shared)
 - assumed that 80% of the libs/tools are used

gives 400MB of shared read only data

assumed you are running 100 guests on a host,
that makes ~39GB of virtual memory which will
get paged in and out over and over again ...

.. compared to 400MB shared pages in memory :)

 How much does this decrease the density of vservers?

well, let's look at the overall memory resource
function with the above assumptions:

 with sharing:  f(N) = N*80M + 400M
 without sharing:   g(N) = N*480M

so the decrease as N -> inf: g/f -> 6 (factor)

which is quite realistic, if you consider that
there are only so many distributions, OTOH, the
factor might become less important when the 
guest specific data grows ...
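The arithmetic above can be checked with a few lines of C, using the example's figures in MB: 400MB of shared read-only data plus the N*80M per-guest term of f(N). The helper names are only for illustration.

```c
/* Memory footprint in MB for n guests, using the figures above. */
static double with_sharing(unsigned n)    { return n * 80.0 + 400.0; } /* f(N) */
static double without_sharing(unsigned n) { return n * 480.0; }        /* g(N) */

/* g/f: approaches 480 / 80 = 6 as n grows. */
static double saving_factor(unsigned n)
{
	return without_sharing(n) / with_sharing(n);
}

/* Shared read-only data duplicated once per guest when nothing is shared. */
static unsigned long duplicated_mb(unsigned n)
{
	return n * 400UL;
}
```

For 100 guests this gives f = 8400MB against g = 48000MB, a factor of about 5.7, and duplicated_mb(100) is 40000MB, which is the ~39GB mentioned earlier.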

   - virtual 'physical' limit should not cause
 swap out when there are still pages left on
 the host system (but pages of over limit guests
 can be preferred for swapping)
 
 Is this a really hard requirement?  

no, not hard, but a reasonable optimization ...

let me note once again, that for full isolation
you had better go with Xen or some other hypervisor
because if you make it work like Xen, it will
become as slow and resource hungry as any other
paravirtualization solution ...

 It seems a bit fluffy to me.  

most optimizations might look strange at first
glance, but when you check what the limiting
factors for OS-Level virtualizations are, you
will find that it looks like this:

(in order of decreasing relevance)

 - I/O subsystem
 - available memory 
 - network performance
 - CPU performance

note: this is for 'typical' guests, not for
number crunching or special database, or pure
network bound applications/guests ...

 An added bonus if we can do it, but certainly not the 
 most important requirement in the bunch.

nope, not the _most_ important one, but it
all sums up :)

 What are the consequences if this isn't done?  Doesn't 
 a loaded system eventually have all of its pages used 
 anyway, so won't this always be a temporary situation?

let's consider a quite limited guest (or several
of them) which have a 'RAM' limit of 64MB and 
additional 64MB of 'virtual swap' assigned ...

if they use roughly 96MB (memory footprint) then
having this 'fluffy' optimization will keep them
running without any effect on the host side, but
without, they will continuously swap in and out
which will affect not only the host, but also the
other guests ...

 This also seems potentially harmful if we aren't able 
 to get pages *back* that we've given to a guest.  

no, the idea is not to keep them unconditionally,
the concept is to allow them to stay, even if the
guest has reached the RSS limit and a 'real' system
would have to swap pages out (or simply drop them)
to get other pages mapped ...

 Tasks can pin pages in lots of creative ways.

sure, this is why we should have proper limits
for that too :)

   - accounting and limits have to be consistent
 and should roughly represent the actual used
 memory/swap (modulo optimizations, I can go
 into detail here, if necessary)
 
 So, consistency is important, but is precision?  

IMHO precision is not that important, of course,
the values should be in the same ballpark ...

 If we, for instance, used one of the hashing schemes, 
 we could have some imprecise decisions made but the 
 system would stay consistent overall.

it is also important that the lack of precision
cannot be exploited to allocate unreasonable
amounts of resources ... 

at least Linux-VServer could live with +/- 10%
(or probably more) as I said, it is mainly used
for preventing DoS or DoR attacks ...

 This requirement also doesn't seem to push us in the 
 direction of having distinct page owners, or some 
 sharing mechanism, because both would be consistent.

   - OOM handling on a per guest basis, i.e. some
 out of memory condition in guest A must not
 affect guest B
 
 I'll agree that this one is important and well stated 
 as-is.  Any disagreement on this one?

nope 

Re: [PATCH 2/2] pci: Repair pci_save/restore_state so we can restore one save many times.

2007-03-12 Thread Kok, Auke

Eric W. Biederman wrote:

Because we do not reserve space for the PCI-X and PCIe state in struct
pci_dev, we need to allocate it dynamically.  However, because we need
to support restore being called multiple times after a single save,
it is never safe to free the buffers we have allocated to hold the
state.

So this patch modifies the save routines to first check whether
we have already allocated a state buffer before allocating
a new one.  Then the restore routines are modified to not free
the state after restoring it.  It is simple, and it fixes some subtle
error path handling bugs that are hard to test for.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]



I tested this patch and the other 2 in this series:

[PATCH 0/2] Repair pci_restore_state when used with device resets
[PATCH 1/2] msi: Safer state caching.

against e1000 with suspend/resume functionality. Apart from a minor symmetry 
violation in e1000 for which I will send a patch later, these patches appear to 
work fine on my ich8 with 5 msi capable e1000 ports.


Feel free to add my Signed-off-by: Auke Kok [EMAIL PROTECTED]

Cheers,

Auke




---
 drivers/pci/pci.c   |   12 ++--
 include/linux/pci.h |5 -
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 6fb78df..b292c9a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -551,7 +551,9 @@ static int pci_save_pcie_state(struct pci_dev *dev)
 	if (pos <= 0)
 		return 0;
 
-	save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 4, GFP_KERNEL);
+	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_EXP);
+	if (!save_state)
+		save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 4, GFP_KERNEL);
 	if (!save_state) {
 		dev_err(&dev->dev, "Out of memory in pci_save_pcie_state\n");
 		return -ENOMEM;
@@ -582,8 +584,6 @@ static void pci_restore_pcie_state(struct pci_dev *dev)
 	pci_write_config_word(dev, pos + PCI_EXP_LNKCTL, cap[i++]);
 	pci_write_config_word(dev, pos + PCI_EXP_SLTCTL, cap[i++]);
 	pci_write_config_word(dev, pos + PCI_EXP_RTCTL, cap[i++]);
-	pci_remove_saved_cap(save_state);
-	kfree(save_state);
 }
 
 
@@ -597,7 +597,9 @@ static int pci_save_pcix_state(struct pci_dev *dev)
 	if (pos <= 0)
 		return 0;
 
-	save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
+	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_PCIX);
+	if (!save_state)
+		save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
 	if (!save_state) {
 		dev_err(&dev->dev, "Out of memory in pci_save_pcie_state\n");
 		return -ENOMEM;
@@ -622,8 +624,6 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
 	cap = (u16 *)&save_state->data[0];
 
 	pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
-	pci_remove_saved_cap(save_state);
-	kfree(save_state);
 }
 
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 78417e4..481ea06 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -209,11 +209,6 @@ static inline void pci_add_saved_cap(struct pci_dev *pci_dev,
 	hlist_add_head(&new_cap->next, &pci_dev->saved_cap_space);
 }
 
-static inline void pci_remove_saved_cap(struct pci_cap_saved_state *cap)
-{
-	hlist_del(&cap->next);
-}
-
 /*
  *  For PCI devices, the region numbers are assigned this way:
  *
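The find-before-allocate pattern the patch introduces can be sketched in userspace as follows. The types and helpers are hypothetical simplifications of struct pci_cap_saved_state, pci_find_saved_cap() and pci_add_saved_cap(), using a singly linked list instead of an hlist. One point the sketch makes explicit: a buffer found on the list must not be linked a second time, so the add step runs only for a freshly allocated buffer, and restore never frees anything.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified saved-capability record. */
struct saved_cap {
	int cap_nr;               /* capability ID, e.g. 0x10 for PCIe */
	struct saved_cap *next;
	size_t nwords;
	unsigned short data[];    /* saved config words */
};

struct fake_dev {
	struct saved_cap *saved;  /* list of saved capability states */
};

static struct saved_cap *find_saved_cap(struct fake_dev *dev, int cap_nr)
{
	for (struct saved_cap *s = dev->saved; s; s = s->next)
		if (s->cap_nr == cap_nr)
			return s;
	return NULL;
}

/* Save n words for capability cap_nr, reusing an existing buffer. */
static int save_cap(struct fake_dev *dev, int cap_nr,
		    const unsigned short *words, size_t n)
{
	struct saved_cap *s = find_saved_cap(dev, cap_nr);
	bool found = s != NULL;

	if (found && s->nwords < n)
		return -1;               /* existing buffer too small */
	if (!found) {
		s = calloc(1, sizeof(*s) + n * sizeof(s->data[0]));
		if (!s)
			return -1;
		s->cap_nr = cap_nr;
		s->nwords = n;
	}
	memcpy(s->data, words, n * sizeof(s->data[0]));
	if (!found) {                    /* link only a new buffer */
		s->next = dev->saved;
		dev->saved = s;
	}
	return 0;
}

/* Restore copies the state out but never frees it, so it can run again. */
static int restore_cap(struct fake_dev *dev, int cap_nr,
		       unsigned short *out, size_t n)
{
	struct saved_cap *s = find_saved_cap(dev, cap_nr);

	if (!s || s->nwords < n)
		return -1;
	memcpy(out, s->data, n * sizeof(out[0]));
	return 0;
}
```

Saving twice leaves a single list entry holding the latest values, and restore can then be called any number of times.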



Re: [PATCH] kthread_should_stop_check_freeze (was: Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread)

2007-03-12 Thread Anton Blanchard

 Do not do it, then. Confusion it causes is not worth saving one line
 of code. 
 
 You do less typing, but the resulting code is _less_ readable, not
 more.

Then please document it _clearly_ with the kthread code somewhere. The
reason I brought this up is that I had no idea we had to put the freezer gunk
in all kernel thread loops, and I've been writing kernel threads for years.

Anton


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread Con Kolivas

On 13/03/07, Mike Galbraith [EMAIL PROTECTED] wrote:

On Tue, 2007-03-13 at 07:38 +1100, Con Kolivas wrote:
 On Tuesday 13 March 2007 07:11, Mike Galbraith wrote:
 
  Killing the known corner case starvation scenarios is wonderful, but
  let's not just pretend that interactive tasks don't have any special
  requirements.

 Now you're really making a stretch of things. Where on earth did I say that
 interactive tasks don't have special requirements? It's a fundamental feature
 of this scheduler that I go to great pains to get them as low latency as
 possible and their fair share of cpu despite having a completely fair cpu
 distribution.

As soon as your cpu is fully utilized, fairness loses or interactivity
loses.  Pick one.


That's not true unless you refuse to prioritise your tasks
accordingly. Let's take this discussion in a different direction. You
already nice your lame processes. Why? You already have the concept
that you are prioritising things to normal or background tasks. You
say so yourself that lame is a background task. Stating the bleedingly
obvious, the unix way of prioritising things is via nice. You already
do that. So moving on from that...

In your test case you ask how you can maximise CPU usage. Well, you know
the answer already: you run two threads. I won't dispute that.

The debate seems to be centered on whether two tasks niced +5 or
higher are background. In my opinion, nice 5 is not background, just
relatively less CPU. You are already savvy enough to be using two
threads and nicing them. All I ask you to do when using RSDL is to
change your expectations slightly and your settings from nice 5 to
nice 10 or 15 or even 19. Why is that so offensive to you? Nice 5 gets
75% of the CPU of nice 0; nice 10 gets 50%, nice 15 gets 25%, and nice
19 gets 5%. If you're so intent on defining nice 5 as background, would it
be a matter of me just modifying nice 5 to be 25% instead? I suspect
your answer will be no because then you'll argue that you shouldn't
nice at all, but it should be interesting to see your response. You
seem to be advocating that the scheduler does everything and we need
to implement some complex flag instead. I don't believe that's the
right thing to do at all. So I offer you some options.
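The percentages quoted above fit a simple linear rule: a task at nice n (0 <= n <= 19) gets (20 - n)/20 of the CPU a nice 0 task gets, i.e. 5% less per nice level. A sketch of that rule, derived only from the numbers in this mail and not from the RSDL source:

```c
#include <assert.h>

/*
 * CPU share of a nice-n task relative to a nice-0 task, in percent,
 * matching the figures quoted above (nice 5 -> 75%, 10 -> 50%,
 * 15 -> 25%, 19 -> 5%). Only defined here for 0 <= n <= 19.
 */
static int relative_cpu_percent(int nice)
{
	assert(nice >= 0 && nice <= 19);
	return (20 - nice) * 5;
}
```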

1. Be happy with changing your nice from 5 to 15. I still don't think
this is in any way unreasonable.
2. Wait for me to fix the behaviour of negatively niced tasks, and give
your X a negative nice value. I plan to implement this change anyway,
not necessarily for X.
3. Have me redefine what nice 5 is, and tell me what percentage of CPU
you think is right.
4. Any combination of the above.

Please don't pick 5: none of the above. Please try to work with me on this.

--
-ck


[PATCH 1/1] mm: Inconsistent use of node IDs

2007-03-12 Thread Ethan Solomita

This patch corrects inconsistent use of node numbers (variously nid or
node) in the presence of fake NUMA.

Both AMD and Intel x86_64 discovery code will determine a CPU's physical
node and use that node when calling numa_add_cpu() to associate that CPU
with the node, but numa_add_cpu() treats the node argument as a fake
node. This physical node may not exist within the fake nodespace, and
even if it does, it will likely incorrectly associate a CPU with a fake
memory node that may not share the same underlying physical NUMA node.

Similarly, the PCI code which determines the node of the PCI bus saves
it in the pci_sysdata structure. This node then propagates down to other
buses and devices which hang off the PCI bus, and is used to specify a
node when allocating memory. The purpose is to provide NUMA locality,
but the node is a physical node, and the memory allocation code expects
a fake node argument.

Provide a routine (get_fake_node()) to map a physical node ID to a fake
node ID, where the fake node ID contains memory on the specified
physical node ID. This fake node's zonelist is tied to other close fake
nodes, maintaining NUMA locality. Also provide node_online_phys(), which
is the same as node_online() but takes a physical node ID.

Change init_cpu_to_node() and the x86_64 and PCI code to use
get_fake_node() and node_online_phys() to convert to an appropriate
fake ID.
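The physical-to-fake mapping described above can be modelled with a table recording, for each fake node, the physical node its memory comes from. This is a hypothetical userspace model: the table, MAX_FAKE_NODES and the function bodies are assumptions for illustration, not the patch's implementation of get_fake_node() or node_online_phys(). It simply returns the first online fake node backed by the requested physical node.

```c
#include <stdbool.h>

#define NUMA_NO_NODE   (-1)
#define MAX_FAKE_NODES 8   /* arbitrary size for this model */

/* For each fake node: the physical node backing its memory, and whether
 * the fake node is online. Filled in by whoever sets up the emulation. */
struct fake_node_info {
	int phys_node;
	bool online;
};

static struct fake_node_info fake_nodes[MAX_FAKE_NODES];

/* Model of node_online_phys(): is any online fake node carved out of
 * physical node phys? */
static bool phys_node_online(int phys)
{
	for (int i = 0; i < MAX_FAKE_NODES; i++)
		if (fake_nodes[i].online && fake_nodes[i].phys_node == phys)
			return true;
	return false;
}

/* Model of get_fake_node(): first online fake node whose memory lives on
 * physical node phys, or NUMA_NO_NODE if there is none. */
static int fake_node_for_phys(int phys)
{
	if (phys == NUMA_NO_NODE)
		return NUMA_NO_NODE;
	for (int i = 0; i < MAX_FAKE_NODES; i++)
		if (fake_nodes[i].online && fake_nodes[i].phys_node == phys)
			return i;
	return NUMA_NO_NODE;
}
```

With two physical nodes each split into two fake nodes (fake 0-1 on physical 0, fake 2-3 on physical 1), a CPU or PCI bus reported on physical node 1 maps to fake node 2.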

Signed-off-by: Ethan Solomita [EMAIL PROTECTED]
---
arch/i386/pci/acpi.c  |6 +++
arch/x86_64/kernel/setup.c|   14 
arch/x86_64/mm/numa.c |   70 +-
arch/x86_64/pci/k8-bus.c  |3 +
include/asm-x86_64/topology.h |8 
5 files changed, 85 insertions(+), 16 deletions(-)



diff -uprN -x install -X linux-2.6.21-rc3-mm2/Documentation/dontdiff linux-2.6.21-rc3-mm2/arch/i386/pci/acpi.c linux-2.6.21-rc3-mm2-phystofake/arch/i386/pci/acpi.c
--- linux-2.6.21-rc3-mm2/arch/i386/pci/acpi.c	2007-03-09 16:42:42.0 -0800
+++ linux-2.6.21-rc3-mm2-phystofake/arch/i386/pci/acpi.c	2007-03-12 12:36:50.0 -0700
@@ -35,8 +35,13 @@ struct pci_bus * __devinit pci_acpi_scan
 
 	pxm = acpi_get_pxm(device->handle);
 #ifdef CONFIG_ACPI_NUMA
-	if (pxm >= 0)
+	if (pxm >= 0) {
 		sd->node = pxm_to_node(pxm);
+#ifdef CONFIG_NUMA_EMU
+		if (sd->node != -1)
+			sd->node = get_fake_node(sd->node);
+#endif
+	}
 #endif
 
 	bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
diff -uprN -x install -X linux-2.6.21-rc3-mm2/Documentation/dontdiff linux-2.6.21-rc3-mm2/arch/x86_64/kernel/setup.c linux-2.6.21-rc3-mm2-phystofake/arch/x86_64/kernel/setup.c
--- linux-2.6.21-rc3-mm2/arch/x86_64/kernel/setup.c	2007-03-09 16:42:42.0 -0800
+++ linux-2.6.21-rc3-mm2-phystofake/arch/x86_64/kernel/setup.c	2007-03-12 12:44:31.0 -0700
@@ -476,20 +476,20 @@ static void __cpuinit display_cacheinfo(
 }
 
 #ifdef CONFIG_NUMA
-static int nearby_node(int apicid)
+static int __init nearby_node(int apicid)
 {
 	int i;
 	for (i = apicid - 1; i >= 0; i--) {
 		int node = apicid_to_node[i];
-		if (node != NUMA_NO_NODE && node_online(node))
+		if (node != NUMA_NO_NODE && node_online_phys(node))
 			return node;
 	}
 	for (i = apicid + 1; i < MAX_LOCAL_APIC; i++) {
 		int node = apicid_to_node[i];
-		if (node != NUMA_NO_NODE && node_online(node))
+		if (node != NUMA_NO_NODE && node_online_phys(node))
 			return node;
 	}
-	return first_node(node_online_map); /* Shouldn't happen */
+	return NUMA_NO_NODE; /* Shouldn't happen */
 }
 #endif
 
@@ -528,7 +528,7 @@ static void __init amd_detect_cmp(struct
 	node = c->phys_proc_id;
 	if (apicid_to_node[apicid] != NUMA_NO_NODE)
 		node = apicid_to_node[apicid];
-	if (!node_online(node)) {
+	if (!node_online_phys(node)) {
 		/* Two possibilities here:
 		   - The CPU is missing memory and no node was created.
 		   In that case try picking one from a nearby CPU
@@ -543,9 +543,10 @@ static void __init amd_detect_cmp(struct
 		    apicid_to_node[ht_nodeid] != NUMA_NO_NODE)
 			node = apicid_to_node[ht_nodeid];
 		/* Pick a nearby node */
-		if (!node_online(node))
+		if (!node_online_phys(node))
 			node = nearby_node(apicid);
 	}
+	node = get_fake_node(node);
 	numa_set_node(cpu, node);
 
 	printk(KERN_INFO "CPU %d/%x -> Node %d\n", cpu, apicid, node);
@@ -679,7 +680,7 @@ static int __cpuinit intel_num_cpu_cores
 	return 1;
 }
 
-static void srat_detect_node(void)
+static void __cpuinit srat_detect_node(void)
 {
 #ifdef CONFIG_NUMA
 	unsigned node;
@@ -689,6 +690,7 @@ static void srat_detect_node(void)
 	/* Don't do 

[PATCH] Fix vmi time header bug

2007-03-12 Thread Zachary Amsden
Some gcc versions put this function in .init.text because the annotation in
the header didn't match the function's definition.  For 2.6.21-rc.


Zach


Index: linux-2.6.21/include/asm-i386/vmi_time.h
===
--- linux-2.6.21.orig/include/asm-i386/vmi_time.h	2007-03-06 18:56:03.0 -0800
+++ linux-2.6.21/include/asm-i386/vmi_time.h	2007-03-12 13:55:16.0 -0800
@@ -54,7 +54,7 @@ extern unsigned long vmi_cpu_khz(void);
 
 #ifdef CONFIG_X86_LOCAL_APIC
 extern void __init vmi_timer_setup_boot_alarm(void);
-extern void __init vmi_timer_setup_secondary_alarm(void);
+extern void __devinit vmi_timer_setup_secondary_alarm(void);
 extern void apic_vmi_timer_interrupt(void);
 #endif
 


Re: /sys/devices/system/cpu/cpuX/online are missing

2007-03-12 Thread Giuliano Pochini

On Mon, 12 Mar 2007, Heiko Carstens wrote:

On Sun, Mar 11, 2007 at 10:26:52PM +0100, Giuliano Pochini wrote:

Since 2.6.20 /sys/devices/system/cpu/cpuX/online isn't there anymore. The
directories exist, though. I also tested linux-2.6.21rc3. I had a look at the
archives and I found nothing about the removal of that file, which is still
documented in Documentation/cpu-hotplug.txt. I don't know if other
architectures are affected.

$ uname -a
Linux Jay 2.6.20 #1 SMP Mon Feb 5 22:42:18 CET 2007 ppc 7455, altivec supported 
PowerMac3,6 GNU/Linux

No cpusets. CONFIG_HOTPLUG_CPU=y


Somebody inverted the logic for when and whether the 'online' attribute for cpu
devices appears. See 72486f1f8f0a2bc828b9d30cf4690cf2dd6807fc.
The fix for s390 is this: 6721f77810dfcb7cbf8e97be6fa43fe2740dd0aa.
Looks like arch/ppc was left out as well.


I had a look at arch/powerpc/kernel/smp.c but I'm not familiar at all with 
those parts of the kernel. I'm cc'ing this message to linuxppc-dev.



--
Giuliano.


Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)

2007-03-12 Thread Blaisorblade
On Wednesday 07 March 2007 11:02, Nick Piggin wrote:
 On Wed, Mar 07, 2007 at 10:49:47AM +0100, Nick Piggin wrote:
  On Wed, Mar 07, 2007 at 01:44:20AM -0800, Bill Irwin wrote:
   On Wed, Mar 07, 2007 at 10:28:21AM +0100, Nick Piggin wrote:
Depending on whether anyone wants it, and what features they want, we
could emulate the old syscall, and make a new restricted one which is
much less intrusive.
For example, if we can operate only on MAP_ANONYMOUS memory and
specify that nonlinear mappings effectively mlock the pages, then we
can get rid of all the objrmap and unmap_mapping_range handling,
forget about the writeout and msync problems...
  
   Anonymous-only would make it a doorstop for Oracle, since its entire
   motive for using it is to window into objects larger than user virtual
 
  Uh, duh yes I don't mean MAP_ANONYMOUS, I was just thinking of the shmem
  inode that sits behind MAP_ANONYMOUS|MAP_SHARED. Of course if you don't
  have a file descriptor to get a pgoff, then remap_file_pages is a
  doorstop for everyone ;)
 
   address spaces (this likely also applies to UML, though they should
   really chime in to confirm). Restrictions to tmpfs and/or ramfs would
   likely be liveable, though I suspect some things might want to do it to
   shm segments (I'll ask about that one). There's definitely no need for
   a persistent backing store for the object to be remapped in Oracle's
   case, in any event. It's largely the in-core destination and source of
   IO, not something saved on-disk itself.
 
  Yeah, tmpfs/shm segs are what I was thinking about. If UML can live with
  that as well, then I think it might be a good option.

 Oh, hmm if you can truncate these things then you still need to
 force unmap so you still need i_mmap_nonlinear.

Well, we don't need truncate(), but we do need MADV_REMOVE for memory
hot-unplug, which is quite similar, I guess.

About the restriction to tmpfs, I have just discovered 
'[PATCH] mm: tracking shared dirty pages' (commit 
d08b3851da41d0ee60851f2c75b118e1f7a5fc89), which already partially conflicts 
with remap_file_pages for file-based mmaps (and that's fully fine, for now).

Even if UML does not need it, till now if there is a VMA protection and a page 
hasn't been remapped with remap_file_pages, the VMA protection is used (just 
because it makes sense).

However, it is only used when the PTE is first created (we can never change
protections on a VMA), so if vma_wants_writenotify() is true (on all
file-based and on no shmfs-based mappings, right?) and we write-protect the
VMA, it will always be write-protected.

That's no problem for UML, but it is for any other user (I guess I'll have to
prevent callers from trying such stuff; I started from a pretty generic
patch).

 But come to think of it, I still don't think nonlinear mappings are
 too bad as they are ;)

Btw, I really like removing ->populate and merging the common code together.
filemap_populate and shmem_populate are so obnoxiously different that I
already wanted to do that (after merging remap_file_pages() core).

Also, I'm curious: since my patches are already changing remap_file_pages()
code, do they have to be merged after yours?
-- 
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade


Re: Make sure we populate the initroot filesystem late enough

2007-03-12 Thread Paul TBBle Hampson
On Thu, Mar 01, 2007 at 09:30:56AM +0900, Michael Ellerman wrote:
 On Wed, 2007-02-28 at 10:13 +, David Woodhouse wrote:
 On Wed, 2007-02-28 at 07:43 +0100, Benjamin Herrenschmidt wrote:
  I wouldn't be that sure ... I've had problems in the past with PMU based
  cpufreq... looks like flushing all caches and hard-resetting the
  processor on the fly when there can be pending DMAs might be a source of
  trouble... especially on CPUs that don't have working cache flush HW
  assist. 
 
 I've seen it on a PowerMac3,1 (400MHz G4) where we don't have cpufreq.
 I've also seen it on the latest 1.5GHz Mac Mini, and on my shinybook.
 They all fall over with the latest kernel, although the shinybook only
 does so immediately when booted with mem=512M. The shinybook does crash
 later with new kernels though; I don't yet know why. It could be the
 same thing, or it could be something different. That one seemed to
 appear between Fedora's 2.6.19-1.2913 and 2.6.19-1.2914 kernels, where
 we did nothing but turn CONFIG_SYSFS_DEPRECATED on.
 
 I don't blame cpufreq. At various times I've been equally convinced that
 it was due to CONFIG_KPROBES, and Linus' initrd-moving patch.

 Is there any pattern to the way it dies? Or is it just randomly dying
 somewhere depending on which config options you have enabled?

 This is starting to sound reminiscent of a bug I chased for a while last
 year on Power5, but didn't find. It was fixed on some machines by
 disabling CONFIG_KEXEC, and/or other random unrelated CONFIG options.
 Unfortunately it magically stopped reproducing so I never caught it :/

Hmm. The crash came back after I booted into Mac OS X and back. It was,
however, a different crash. I believe it was coming from the USB modules (as
it would keep going when it happened, and get another crash, which tended to
scroll away too fast for me to capture), but I believe it was still getting
down into the slab code and actually dying there.

However, reverting the reversion of
8d610dd52dd1da696e199e4b4545f33a2a5de5c6 and instead applying
the following patch:

diff -ru linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c linux-source-2.6.20/arch/powerpc/mm/init_32.c
--- linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c	2007-02-05 05:44:54.0 +1100
+++ linux-source-2.6.20/arch/powerpc/mm/init_32.c	2007-03-10 11:03:56.0 +1100
@@ -244,7 +244,8 @@
 void free_initrd_mem(unsigned long start, unsigned long end)
 {
 	if (start < end)
-		printk ("Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
+		printk ("NOT Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
+	return;
 	for (; start < end; start += PAGE_SIZE) {
 		ClearPageReserved(virt_to_page(start));
 		init_page_count(virt_to_page(start));

which if I recall correctly David Woodhouse posted to this thread,
seems to have fixed it.

I dunno if it's relevant, but my initrd.img is 13193315 bytes long,
(ie 99 bytes over 12884k) and the above logs:
NOT Freeing initrd memory: 12888k freed
which makes sense...

I of course completely failed to think to check this with the crashing
kernel, if it seems relevant I can roll back to it and get the numbers.

-- 
---
Paul TBBle Hampson, B.Sc, LPI, MCSE
On-hiatus Asian Studies student, ANU
The Boss, Bubblesworth Pty Ltd (ABN: 51 095 284 361)
[EMAIL PROTECTED]

Of course Pacman didn't influence us as kids. If it did,
we'd be running around in darkened rooms, popping pills and
listening to repetitive music.
 -- Kristian Wilson, Nintendo, Inc, 1989

License: http://creativecommons.org/licenses/by/2.1/au/
---




Re: [RFC][PATCH 2/7] RSS controller core

2007-03-12 Thread Dave Hansen
On Mon, 2007-03-12 at 23:41 +0100, Herbert Poetzl wrote:
 On Mon, Mar 12, 2007 at 11:42:59AM -0700, Dave Hansen wrote:
  How about we drill down on these a bit more.
  
  On Mon, 2007-03-12 at 02:00 +0100, Herbert Poetzl wrote:
- shared mappings of 'shared' files (binaries 
  and libraries) to allow for reduced memory
  footprint when N identical guests are running
  
  So, it sounds like this can be phrased as a requirement like:
  
  Guests must be able to share pages.
  
  Can you give us an idea why this is so? 
 
 sure, one reason for this is that guests tend to
 be similar (or almost identical) which results
 in quite a lot of 'shared' libraries and executables
 which would otherwise get cached for each guest and
 would also be mapped for each guest separately
 
  On a typical vserver system,
 
 there is nothing like a typical Linux-VServer system :)
 
  how much memory would be lost if guests were not permitted 
  to share pages like this? 
 
 let me give a real world example here:
 
  - typical guest with 600MB disk space
  - about 100MB guest specific data (not shared)
  - assumed that 80% of the libs/tools are used

I get the general idea here, but I just don't think those numbers are
very accurate.  My laptop has a bunch of gunk open (xterm, evolution,
firefox, xchat, etc...).  I ran this command:

lsof | egrep '/(usr/|lib.*\.so)' | awk '{print $9}' | sort | uniq | xargs du -Dcs

and got:

113840  total

On a web/database server that I have (ps aux | wc -l == 128), I just ran
the same:

39168   total

That's assuming that all of the libraries are fully read in and
populated, just by their on-disk sizes. Is that not a reasonable measure
of the kinds of things that we can expect to be shared in a vserver?  If
so, it's a long way from 400MB.

Could you try a similar measurement on some of your machines?  Perhaps
mine are just weird.

- virtual 'physical' limit should not cause
  swap out when there are still pages left on
  the host system (but pages of over limit guests
  can be preferred for swapping)
  
  Is this a really hard requirement?  
 
 no, not hard, but a reasonable optimization ...
 
 let me note once again, that for full isolation
 you better go with Xen or some other Hypervisor
 because if you make it work like Xen, it will
 become as slow and resource hungry as any other
 paravirtualization solution ...

Believe me, _I_ don't want Xen. :)

  It seems a bit fluffy to me.  
 
 most optimizations might look strange at first
 glance, but when you check what the limiting
 factors for OS-Level virtualizations are, you
 will find that it looks like this:
 
 (in order of decreasing relevance)
 
  - I/O subsystem
  - available memory 
  - network performance
  - CPU performance
 
 note: this is for 'typical' guests, not for
 number crunching or special database, or pure
 network bound applications/guests ...

I don't doubt this, but doing this two-level page-out thing for
containers/vservers over their limits is surely something that we should
consider farther down the road, right?

It's important to you, but you're obviously not doing any of the
mainline coding, right?

  What are the consequences if this isn't done?  Doesn't 
  a loaded system eventually have all of its pages used 
  anyway, so won't this always be a temporary situation?
 
 let's consider a quite limited guest (or several
 of them) which have a 'RAM' limit of 64MB and 
 additional 64MB of 'virtual swap' assigned ...
 
 if they use roughly 96MB (memory footprint) then
 having this 'fluffy' optimization will keep them
 running without any effect on the host side, but
 without it, they will continuously swap in and out
 which will affect not only the host, but also the
 other guests ...

All workloads that use $limit+1 pages of memory will always pay the
price, right?  :)

-- Dave



Re: [PATCH] x86_64, i386: Add command line length to boot protocol

2007-03-12 Thread Pavel Machek
Hi!

  +cmdline_size:   .long   COMMAND_LINE_SIZE-1 #length of the command line,

Why a long? It's unlikely that someone is going to have a command line
bigger than 0x.
   
   Well, I could imagine overflowing that. Describing your numa setup,
   excluding a few bad bits of ram using memmap=exact, setting up your boot
   over iscsi on the cmdline: these are likely to eat an insane amount of
   cmdline space.
 
 65535 characters? Are you for real?
 Stop and think about just how big that is. If you have to create
 a boot command line that long, you have serious, serious issues.

Well, it is about the same size as my .config...

I agree we are unlikely to hit it any time soon... I could imagine
some (ab)uses, like fixed_acpi_bios=lots of hex digits, but those
are ugly. I could also imagine some uses where entire embedded machine
is described at kernel commandline.

Yes, all those are ugly/unlikely. OTOH saving 2 bytes does not seem
like that great a goal.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


CONFIG_REORDER Kconfig help strange sentence.

2007-03-12 Thread Rusty Russell
OK, this confused me:

Function reordering (REORDER) [N/y/?] (NEW) ?

This option enables the toolchain to reorder functions for a more 
optimal TLB usage. If you have pretty much any version of binutils, 
this can increase your kernel build time by roughly one minute.

"If you have pretty much any version of binutils"?  Huh?

You mean "This will slow your kernel build by about a minute"?

Rusty.



Re: [PATCH 1/1] mm: Inconsistent use of node IDs

2007-03-12 Thread Andi Kleen
On Monday 12 March 2007 23:51, Ethan Solomita wrote:
 This patch corrects inconsistent use of node numbers (variously nid or
 node) in the presence of fake NUMA.

I think it's very consistent -- your patch would make it inconsistent though.

 Both AMD and Intel x86_64 discovery code will determine a CPU's physical
 node and use that node when calling numa_add_cpu() to associate that CPU
 with the node, but numa_add_cpu() treats the node argument as a fake
 node. This physical node may not exist within the fake nodespace, and
 even if it does, it will likely incorrectly associate a CPU with a fake
 memory node that may not share the same underlying physical NUMA node.
 
 Similarly, the PCI code which determines the node of the PCI bus saves
 it in the pci_sysdata structure. This node then propagates down to other
 buses and devices which hang off the PCI bus, and is used to specify a
 node when allocating memory. The purpose is to provide NUMA locality,
 but the node is a physical node, and the memory allocation code expects
 a fake node argument.

Sorry, but when you ask for NUMA emulation you will get it. I don't see
any point in a halfway "only for some subsystems I like NUMA emulation".
It's unlikely that your ideas of where it is useful and where it is not
match other NUMA emulation users' ideas, either.

Besides, adding such a secondary node space would likely be a huge long-term
maintenance issue. I can just see it breaking with every non-trivial change.

NACK.

-Andi


Re: [PATCH] x86_64, i386: Add command line length to boot protocol

2007-03-12 Thread Dave Jones
On Tue, Mar 13, 2007 at 12:12:20AM +0100, Pavel Machek wrote:

   65535 characters? Are you for real?
   Stop and think about just how big that is. If you have to create
   a boot command line that long, you have serious, serious issues.
  
  Well, it is about the same size as my .config...

So? That has *nothing* to do with the boot command line

  I agree we are unlikely to hit it any time soon... I could imagine
  some (ab)uses, like fixed_acpi_bios=lots of hex digits, but those
  are ugly.

That's beyond ugly, and rapidly heading towards 'loony'.

  I could also imagine some uses where entire embedded machine
  is described at kernel commandline.

There are far better ways to get configuration into the kernel
than the boot command line.

Anyways, I'm tired of arguing for the sake of arguing.
I really could care less.

Dave

-- 
http://www.codemonkey.org.uk


Re: [RFC][PATCH 2/7] RSS controller core

2007-03-12 Thread Herbert Poetzl
On Mon, Mar 12, 2007 at 03:25:07PM +0530, Balbir Singh wrote:
  doesn't look so good for me, mainly because of the
  additional per page data and per page processing
 
  on 4GB memory, with 100 guests, 50% shared for each
  guest, this basically means ~1M pages, 500k shared
  and 1500k x sizeof(page_container) entries, which
  roughly boils down to ~25MB of wasted memory ...
 
  increase the amount of shared pages and it starts
  getting worse, but maybe I'm missing something here
 
   We need to decide whether we want to do per-container memory
   limitation via these data structures, or whether we do it via
   a physical scan of some software zone, possibly based on Mel's
   patches.
 
  why not do simple page accounting (as done currently
  in Linux) and use that for the limits, without
  keeping the reference from container to page?
 
  best,
  Herbert
 
 
 Herbert,
 
 You lost me in the cc list and I almost missed this part of the
 thread. 

hmm, it is very unlikely that this would happen,
for several reasons ... and indeed, checking the 
thread in my mailbox shows that akpm dropped you ...


Subject: [RFC][PATCH 2/7] RSS controller core
From: Pavel Emelianov [EMAIL PROTECTED]
To: Andrew Morton [EMAIL PROTECTED], Paul Menage [EMAIL PROTECTED],
Srivatsa Vaddagiri [EMAIL PROTECTED],
Balbir Singh [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED],
Linux Kernel Mailing List linux-kernel@vger.kernel.org
Date: Tue, 06 Mar 2007 17:55:29 +0300

Subject: Re: [RFC][PATCH 2/7] RSS controller core
From: Andrew Morton [EMAIL PROTECTED]
To: Pavel Emelianov [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED],
Paul Menage [EMAIL PROTECTED],
List linux-kernel@vger.kernel.org
Date: Tue, 6 Mar 2007 14:00:36 -0800

that's the one I 'group' replied to ...

 Could you please not modify the cc list.

I never modify the cc unless explicitly asked
to do so. I wish others would have it that way
too :)

best,
Herbert

 Thanks,
 Balbir


Re: RSDL v0.30 cpu scheduler for mainline kernels

2007-03-12 Thread David Miller
From: Con Kolivas [EMAIL PROTECTED]
Date: Mon, 12 Mar 2007 10:58:11 +1100

 http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-sched-rsdl-0.30.patch

FWIW, this boots and seems to work well on sparc64.  Tested
on UP SunBlade1500 and 24cpu Niagara T1000.


Re: irda rmmod lockdep trace.

2007-03-12 Thread David Miller
From: Samuel Ortiz [EMAIL PROTECTED]
Date: Mon, 12 Mar 2007 02:38:43 +0200

 On Sat, Mar 10, 2007 at 07:43:26PM +0200, Samuel Ortiz wrote:
  Hi Dave,
  
  On Thu, Mar 08, 2007 at 05:54:36PM -0500, Dave Jones wrote:
   modprobe irda ; rmmod irda in 2.6.21rc3 gets me the spew below..
  Well it seems that we call __irias_delete_object() from hashbin_delete().
  Then __irias_delete_object() itself calls hashbin_delete() again. We're
  trying to get the lock recursively.
 Looking at the code more carefully, this seems to be a false positive:
 iriap_cleanup and __irias_delete_object are taking 2 different locks from
 2 different hashbin instances. The locks belong to the same lock class but
 they are hierarchically different. We need to tell the validator about it and
 the following patch does that. Comments are welcomed as I'm planning to push
 it to netdev soon:

I would strongly caution against adding any run-time overhead just to
cure a false lockdep warning.  Even adding a new function argument
is too much IMHO.

Make the cost show up for lockdep only, perhaps by putting each
hashbin lock into a separate locking class?



Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-12 Thread Herbert Poetzl
On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:
 On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
  
  For these you essentially need a per-container page->_mapcount counter,
  otherwise you can't detect whether rss group still has the page 
  in question being mapped in its processes' address spaces or not. 

 What do you mean by this?  You can always tell whether a process has a
 particular page mapped.  Could you explain the issue a bit more.  I'm
 not sure I get it.

OpenVZ wants to account _shared_ pages in a guest
differently from separate pages, so that the RSS
accounted values reflect the actual used RAM instead
of the sum of all processes' RSS pages, which for
sure is more relevant to the administrator, but IMHO
not so terribly important as to justify memory-consuming
structures and sacrificing performance to get it right

YMMV, but maybe we can find a smart solution to the
issue too :)

best,
Herbert

 -- Dave
 


Re: [PATCH 1/1] mm: Inconsistent use of node IDs

2007-03-12 Thread Ethan Solomita

Andi Kleen wrote:

On Monday 12 March 2007 23:51, Ethan Solomita wrote:

This patch corrects inconsistent use of node numbers (variously nid or
node) in the presence of fake NUMA.


I think it's very consistent -- your patch would make it inconsistent though.


	It's consistent to call node_online() with a physical node ID when the 
online node mask is composed of fake nodes?



Sorry, but when you ask for NUMA emulation you will get it. I don't see
any point in a halfway "only for some subsystems I like NUMA emulation".
It's unlikely that your ideas of where it is useful and where it is not
match other NUMA emulation users' ideas, either.


	I don't understand your comments. My code is intended to work for all 
systems. If the system is non-NUMA by nature, then all CPUs map to fake 
node 0.


	As an example, on a two-chip dual-core AMD Opteron system, there are 4
CPUs, where CPUs 0 and 1 are close to the first half of memory, and
CPUs 2 and 3 are close to the second half. Without this change CPUs 2
and 3 are mapped to fake node 1. This results in awful performance. With
this change, CPUs 2 and 3 are mapped to (roughly) 1/2 the fake node
count. Their zonelists[] are ordered to do allocations preferentially
from zones that are local to CPUs 2 and 3.


Can you tell me the scenario where my code makes things worse?

Besides, adding such a secondary node space would likely be a huge long-term
maintenance issue. I can just see it breaking with every non-trivial change.


	I'm adding no data structures to do this. The current code already has
get_phys_node. My changes use the existing information about node
layout, both the physical and fake, and define a mapping. The current
mapping just takes a physical node and says it's the fake node too.



NACK.


	I wish you would include some specifics as to why you think what you
do. You're suggesting we leave in place a system that destroys NUMA
locality when using fake numa, and passes around physical node ids as an
index into nodes[], which is indexed by fake nodes. My change has no
effect without fake numa, and harms no one with fake numa.

-- Ethan


Re: CONFIG_REORDER Kconfig help strange sentence.

2007-03-12 Thread Andi Kleen
On Tue, Mar 13, 2007 at 10:18:03AM +1100, Rusty Russell wrote:
 OK, this confused me:
 
 Function reordering (REORDER) [N/y/?] (NEW) ?
 
 This option enables the toolchain to reorder functions for a more 
 optimal TLB usage. If you have pretty much any version of binutils, 
 this can increase your kernel build time by roughly one minute.
 
 "If you have pretty much any version of binutils"?  Huh?
 
 You mean "This will slow your kernel build by about a minute"?

Yes. Lots of sections seem to trigger some quadratic behaviour in ld.

It might be fixed in some unreleased CVS version though (not 100% sure) 

-Andi


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread David Lang

On Mon, 12 Mar 2007, Mike Galbraith wrote:


On Tue, 2007-03-13 at 07:38 +1100, Con Kolivas wrote:

On Tuesday 13 March 2007 07:11, Mike Galbraith wrote:


Killing the known corner case starvation scenarios is wonderful, but
let's not just pretend that interactive tasks don't have any special
requirements.


Now you're really making a stretch of things. Where on earth did I say that
interactive tasks don't have special requirements? It's a fundamental feature
of this scheduler that I go to great pains to get them as low latency as
possible and their fair share of cpu despite having a completely fair cpu
distribution.


As soon as your cpu is fully utilized, fairness loses or interactivity
loses.  Pick one.


correct.

the problem is that it's hard (if not impossible) to properly identify what is 
needed to make a system have good interactivity. in some cases it's a matter of 
low latency (wake up a process as quickly as you can when whatever it was 
waiting on is available), but in others it's a matter of allocating the _right_ 
process enough CPU (X needs enough CPU to do things)


where it's a matter of needing low-latency, it's possible to design a scheduler 
that will do things in a predictable enough way that you know the max latency 
you have to deal with (and the RSDL seems to do this)


the problem comes when this isn't enough. if you have several CPU hogs on a 
system, and they are all around the same priority level, how can the scheduler 
know which one needs the CPU the most for good interactivity?


in some cases you may be able to directly detect that your high-priority process 
is waiting for another one (tracing pipes and local sockets for example), but 
what if you are waiting for several of them? (think a multimedia desktop waiting 
for the sound card, CDRom, hard drive, and video all at once) which one needs 
the extra CPU the most?


Fairness is much easier to enforce (and much easier to understand)

the RSDL is concentrating on enforcing fairness, with bounded (and predictable) 
latencies.


if you are willing to tell the system what you consider more important (and how
much more important you consider it), then it's much easier to figure out who to
give the CPU to. Con is just asking you to do this (and you already do, by doing
a nice -5, but it sounds like you want that to mean more than it currently does)


David Lang




Re: [PATCH] Delete superfluous source file net/wanrouter/af_wanpipe.c.

2007-03-12 Thread David Miller
From: Robert P. J. Day [EMAIL PROTECTED]
Date: Sat, 10 Mar 2007 03:49:52 -0500 (EST)

 
   Delete the apparently superfluous source file
 net/wanrouter/af_wanpipe.c.
 
 Signed-off-by: Robert P. J. Day [EMAIL PROTECTED]

Applied, thanks Robert.

This thing isn't even built in 2.4.x :-)  Although there is
some ancient reference to the build module in
Documentation/networking/wan-router.txt, a heavily out of date
document.


Re: [PATCH 2/2] xfs: stop using kmalloc in xfs_buf_get_noaddr

2007-03-12 Thread Timothy Shimmin

Hi,

--On 9 March 2007 12:55:11 PM +0100 Christoph Hellwig [EMAIL PROTECTED] wrote:


Ed Cashin found a bug in the error handling code for the case where
a page allocation fails.  Here's the updated version:

Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-03-08 19:08:38.0 +0100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c	2007-03-09 08:59:15.0 +0100



+	for (i = 0; i < page_count; i++) {
+		bp->b_pages[i] = alloc_page(GFP_KERNEL);
+		if (!bp->b_pages[i])
+			goto fail_free_mem;
+	}
+	bp->b_flags |= _XBF_PAGES;
+
+	error = _xfs_buf_map_pages(bp, XBF_MAPPED);
+	if (unlikely(error)) {
+		printk(KERN_WARNING "%s: failed to map pages\n",
+				__FUNCTION__);
 		goto fail_free_mem;
-	bp->b_flags |= _XBF_KMEM_ALLOC;
+	}
 
 	xfs_buf_unlock(bp);
 
 	XB_TRACE(bp, "no_daddr", data);
 	return bp;
+
  fail_free_mem:
-	kmem_free(data, malloc_len);
+	for ( ; i >= 0; i--)
+		__free_page(bp->b_pages[i]);
  fail_free_buf:
 	xfs_buf_free(bp);
  fail:


It looks like you might need: for (i--; i >= 0; i--)
(or: for (j = 0; j < i; j++) etc.)

Because if the initial alloc_page loop goes to completion then:
 i == page_count
and if the alloc_page loop terminates early then:
 bp->b_pages[i] == NULL
So we have gone one too far in both cases and need to
start freeing one back.
Unless I missed something.
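Tim's point can be checked with a small userspace sketch of the same unwind. It assumes malloc()/free() stand in for alloc_page()/__free_page(); the names `alloc_pages_array`, `free_pages_array` and the `fail_at` fault-injection parameter are hypothetical, not XFS code. When the loop fails at index i, slot i holds NULL, so cleanup starts at i - 1.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: fill pages[0..count-1] with `size`-byte buffers.
 * fail_at simulates alloc_page() returning NULL at that index (-1: never).
 * Returns 0 on success, or -1 with every buffer already freed.
 */
static int alloc_pages_array(void **pages, int count, size_t size, int fail_at)
{
	int i;

	for (i = 0; i < count; i++) {
		pages[i] = (i == fail_at) ? NULL : malloc(size);
		if (!pages[i])
			goto fail_free_mem;
	}
	return 0;

fail_free_mem:
	/* pages[i] is the failed (NULL) slot: step back one before freeing. */
	for (i--; i >= 0; i--) {
		free(pages[i]);
		pages[i] = NULL;
	}
	return -1;
}

/* Release a fully populated array; valid slots are 0 .. count-1. */
static void free_pages_array(void **pages, int count)
{
	int i;

	for (i = count - 1; i >= 0; i--) {
		free(pages[i]);
		pages[i] = NULL;
	}
}
```

The same `i--` start is also right for the later map-failure path in the patch, which reaches the label with i == page_count.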

--Tim





Re: [ck] Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread Thibaut VARENE

On 3/12/07, michael chang [EMAIL PROTECTED] wrote:


Considering the concepts put out by projects such as BOINC and
[EMAIL PROTECTED], I wouldn't be thoroughly surprised by this ideology,
although I do question the particular way this test case is being run.


If Con actually implements SCHED_IDLEPRIO in RSDL, life is good even
in that case.


This seems to me like he's saying that there has to be a mechanism
(outside of nice) that can be used to treat processes that I want to
be interactive all special-like. It feels like something that would
have been said in the design of what the scheduler was in -ck and is
currently in vanilla.


Exactly. Driving us again toward the fact that different workloads
might benefit from different schedulers (eg: RSDL is cool for server
loads, previous staircase did an excellent job on desktop, etc) and
thus that having a choice of schedulers might be something that would
satisfy (some) people...


To me, that fundamentally clashes with the design behind RSDL. That
said, I could be wrong -- Con appears to have something that could be
very promising up his sleeve that could come out sooner or later. Once
he's written it, of course. In any case, RSDL seems very promising,
for the most part.


It certainly is. Negative feedback can be a good thing too, as it
helps improve it anyway. It's nonetheless true that it's practically
impossible to satisfy 100% of use cases with a single design, so
choices will have to be made.

HTH

T-Bone

--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/


Re: /sys/devices/system/cpu/cpuX/online are missing

2007-03-12 Thread Andreas Schwab
Giuliano Pochini [EMAIL PROTECTED] writes:

 I had a look at arch/powerpc/kernel/smp.c but I'm not familiar at all with 
 those parts of the kernel.

See arch/powerpc/kernel/sysfs.c:topology_init.  I don't think there is
anything to do here.  You probably don't have CONFIG_HOTPLUG_CPU enabled.

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.


Re: sys_write() racy for multi-threaded append?

2007-03-12 Thread Michael K. Edwards

On 3/12/07, Bodo Eggert [EMAIL PROTECTED] wrote:

On Mon, 12 Mar 2007, Michael K. Edwards wrote:
 That's fine when you're doing integration test, and should probably be
 the default during development.  But if the race is first exposed in
 the field, or if the developer is trying to concentrate on a different
 problem, spectacular crash and burn may do more harm than good.
 It's easy enough to refactor the f_pos handling in the kernel so that
 it all goes through three or four inline accessor functions, at which
 point you can choose your trade-off between speed and idiot-proofness
 -- at _kernel_ compile time, or (given future hardware that supports
 standardized optionally-atomic-based-on-runtime-flag operations) per
 process at run-time.

CONFIG_WOMBAT

Waste memory, brain and time in order to grant an atomic write which is
neither guaranteed by the standard nor expected by any sane programmer,
just in case some idiot tries to write to one file from multiple
processes.

Warning: Programs expecting this behaviour are buggy and non-portable.


OK, I laughed out loud at this.  But I think you're missing my point,
which is that there's a time to be hard-core about code quality and
there's a time to be hard-core about _product_ quality.  Face it, all
products containing software more or less suck.  This is because most
programmers write crap code most of the time.  The only way to cope
with this, outside the confines of the European defense industry and
other niches insulated from economic reality, is to make the
production environment gentler on _application_ code than the
development environment is.  Hence CONFIG_WOMBAT.  (I like that name.
I'm going to use it in my patch, with your permission.  :-)

Writing to a file from multiple processes is not usually the problem.
Writing to a common struct file from multiple threads is.  99.999%
of the time it will work, because you're only writing as far as VFS
cache and then bumping f_pos, and your threads are probably on the
same processor anyway.  0.001% of the time the second thread will see
a stale f_pos and clobber the first write.  This is true even on file
types that can never return a short write.  If you remember to open
with O_APPEND so the pos argument to vfs_write is silently ignored, or
if the implementation underlying vfs_write effectively ignores the pos
argument irrespective of flags, you're OK.  If the pos argument isn't
ignored, or if you ever look at the result of a relative seek on any
fd that maps to that struct file, you're screwed.
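The lost update described above is a read-modify-write on a shared offset. The sketch below replays one bad interleaving deterministically in a single thread; `struct fake_file`, `fake_write_at` and `replay_stale_fpos` are hypothetical stand-ins for struct file and vfs_write(), not kernel code.

```c
#include <string.h>

/* Hypothetical stand-in for a struct file: a data buffer plus a shared offset. */
struct fake_file {
	char data[64];
	long f_pos;
};

/* Models vfs_write() at an explicit position: copy len bytes to data[pos]. */
static void fake_write_at(struct fake_file *f, const char *buf, size_t len, long pos)
{
	memcpy(f->data + pos, buf, len);
}

/*
 * Replay the race: both writers sample f_pos before either one updates it,
 * so the second write lands on top of the first.
 */
static void replay_stale_fpos(struct fake_file *f)
{
	long pos_a = f->f_pos;		/* thread A reads f_pos */
	long pos_b = f->f_pos;		/* thread B reads the same, soon stale, value */

	fake_write_at(f, "AAAA", 4, pos_a);
	f->f_pos = pos_a + 4;		/* A bumps f_pos */

	fake_write_at(f, "BBBB", 4, pos_b);	/* B clobbers A's bytes */
	f->f_pos = pos_b + 4;		/* f_pos ends at 4, not 8 */
}
```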

(Note to the alert reader:  yes, this means shell scripts should
always use >> rather than > when routing stdout and/or stderr to a
file.  You're just as vulnerable to interleaving due to stdio
buffering issues as you are when stdout and stderr are sent to the tty,
and short writes may still be a problem if you are so foolish as to
use a filesystem that generates them on anything short of a
catastrophic error, but at least you get O_APPEND and sane behavior on
ftruncate().)
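The O_APPEND behaviour this note relies on is easy to observe: on Linux, every write on such a descriptor goes to the current end of file, even after an lseek() back to the start. A minimal sketch; `append_demo` is a hypothetical helper and the caller picks the path.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical demo: write "abc", seek back to offset 0, write "def".
 * With O_APPEND the second write still goes to the end of the file.
 * Returns the resulting file size, or -1 on error.
 */
static long append_demo(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
	off_t end;

	if (fd < 0)
		return -1;
	if (write(fd, "abc", 3) != 3 || lseek(fd, 0, SEEK_SET) != 0 ||
	    write(fd, "def", 3) != 3) {
		close(fd);
		return -1;
	}
	end = lseek(fd, 0, SEEK_END);
	close(fd);
	return (long)end;
}
```

Without O_APPEND in the open() flags, the second write would overwrite "abc" and the function would return 3.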


 Frankly, I think that unless application programmers poke at some sort
 of magic "I promise to handle short writes correctly" bit, write()
 should always return either the full number of bytes requested or an
 error code.

If you assume that you won't have short writes, your programs may fail on
e.g. Solaris. There may be reasons for Linux to use the same semantics at
some time in the future, you never know.


So what?  My products are shipping _now_.  Future kernels are
guaranteed to break them anyway because sysfs is a moving target.
Solaris is so not in the game for my kind of embedded work, it's not
even funny.  If POSIX mandates stupid shit, and application
programmers don't read that part of the manual anyway (and don't code
on that assumption in practice), to hell with POSIX.  On many file
descriptors, short writes simply can't happen -- and code that
purports to handle short writes but has never been exercised is
arguably worse than code that simply bombs on short write.  So if I
can't shim in an induce-short-writes-randomly-on-purpose mechanism
during development, I don't want short writes in production, period.

In my world, GNU/Linux is not a crappy imitation Solaris that you get
to pay out the wazoo for to Red Hat (and get no documentation and
lousy tech support that doesn't even cover your hardware).  It's a
full-source-code platform on which you can engineer robust industrial
and consumer products, because you can control the freeze and release
schedule component-by-component, and you can point fix anything in the
system at any time.  If, that is, you understand that the source code
is not the software, and that you can't retrofit stability and
security overnight onto code that was written with no thought of
anything but performance.


If you assume you *may* have short writes, you have no problem.


Sure -- until the one code path in a hundred that handles the short
write case incorrectly gets traversed in production, after having

[PATCH] i386: Simplify smp_call_function*() by using common implementation

2007-03-12 Thread Jeremy Fitzhardinge
Subject: Simplify smp_call_function*() by using common implementation

smp_call_function and smp_call_function_single are almost complete
duplicates of the same logic.  This patch combines them by
implementing them in terms of the more general
smp_call_function_mask().
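The heart of smp_call_function_mask() is mask arithmetic: drop the calling CPU, intersect with the online map, count the targets, and use the cheaper all-but-self IPI when the result covers every other online CPU. A userspace sketch of just that step, assuming a 64-bit integer as a stand-in for cpumask_t and GCC's `__builtin_popcountll` in place of cpus_weight(); `cpumask64` and `target_count` are hypothetical names.

```c
#include <stdint.h>

/* Stand-in for cpumask_t, up to 64 CPUs: bit n set means CPU n. */
typedef uint64_t cpumask64;

/*
 * Restrict `mask` to online CPUs other than `self`, store the result in
 * *targets, and return how many CPUs would receive the IPI.
 * *all_but_self is set when the send_IPI_allbutself() path would apply.
 */
static int target_count(cpumask64 mask, cpumask64 online, int self,
			cpumask64 *targets, int *all_but_self)
{
	cpumask64 allbutself = online & ~(UINT64_C(1) << self);

	mask &= allbutself;
	*targets = mask;
	*all_but_self = (mask == allbutself);
	return __builtin_popcountll(mask);
}
```

With that step factored out, smp_call_function() reduces to passing the whole online map as the mask, which is what the patch does.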

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: Stephane Eranian [EMAIL PROTECTED]
Cc: Andrew Morton [EMAIL PROTECTED]
Cc: Andi Kleen [EMAIL PROTECTED]
Cc: Randy.Dunlap [EMAIL PROTECTED]
Cc: Ingo Molnar [EMAIL PROTECTED]

---
 arch/i386/kernel/smp.c |  213 ++--
 1 file changed, 102 insertions(+), 111 deletions(-)

===
--- a/arch/i386/kernel/smp.c
+++ b/arch/i386/kernel/smp.c
@@ -515,6 +515,73 @@ void unlock_ipi_call_lock(void)
 
 static struct call_data_struct *call_data;
 
+
+/**
+ * smp_call_function_mask(): Run a function on a set of other CPUs.
+ * @mask: The set of cpus to run on.  Must not include the current cpu.
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @wait: If true, wait (atomically) until function has completed on other CPUs.
+ *
+ * Returns 0 on success, else a negative status code. Does not return until
+ * remote CPUs are nearly ready to execute func or are or have finished.
+ *
+ * You must not call this function with disabled interrupts or from a
+ * hardware interrupt handler or from a bottom half handler.
+ */
+int smp_call_function_mask(cpumask_t mask,
+  void (*func)(void *), void *info,
+  int wait)
+{
+   struct call_data_struct data;
+   cpumask_t allbutself;
+   int cpus;
+
+   /* Can deadlock when called with interrupts disabled */
+   WARN_ON(irqs_disabled());
+
+   /* Holding any lock stops cpus from going down. */
+   spin_lock(&call_lock);
+
+   allbutself = cpu_online_map;
+   cpu_clear(smp_processor_id(), allbutself);
+
+   cpus_and(mask, mask, allbutself);
+   cpus = cpus_weight(mask);
+
+   if (!cpus) {
+   spin_unlock(&call_lock);
+   return 0;
+   }
+
+   data.func = func;
+   data.info = info;
+   atomic_set(&data.started, 0);
+   data.wait = wait;
+   if (wait)
+   atomic_set(&data.finished, 0);
+
+   call_data = &data;
+   mb();
+
+   /* Send a message to other CPUs */
+   if (cpus_equal(mask, allbutself))
+   send_IPI_allbutself(CALL_FUNCTION_VECTOR);
+   else
+   send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
+
+   /* Wait for response */
+   while (atomic_read(&data.started) != cpus)
+   cpu_relax();
+
+   if (wait)
+   while (atomic_read(&data.finished) != cpus)
+   cpu_relax();
+   spin_unlock(&call_lock);
+
+   return 0;
+}
+
 /**
  * smp_call_function(): Run a function on all other CPUs.
  * @func: The function to run. This must be fast and non-blocking.
@@ -528,48 +595,43 @@ static struct call_data_struct *call_dat
  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler.
  */
-int smp_call_function (void (*func) (void *info), void *info, int nonatomic,
-   int wait)
-{
-   struct call_data_struct data;
-   int cpus;
-
-   /* Holding any lock stops cpus from going down. */
-   spin_lock(&call_lock);
-   cpus = num_online_cpus() - 1;
-   if (!cpus) {
-   spin_unlock(&call_lock);
-   return 0;
-   }
-
-   /* Can deadlock when called with interrupts disabled */
-   WARN_ON(irqs_disabled());
-
-   data.func = func;
-   data.info = info;
-   atomic_set(&data.started, 0);
-   data.wait = wait;
-   if (wait)
-   atomic_set(&data.finished, 0);
-
-   call_data = &data;
-   mb();
-   
-   /* Send a message to all other CPUs and wait for them to respond */
-   send_IPI_allbutself(CALL_FUNCTION_VECTOR);
-
-   /* Wait for response */
-   while (atomic_read(&data.started) != cpus)
-   cpu_relax();
-
-   if (wait)
-   while (atomic_read(&data.finished) != cpus)
-   cpu_relax();
-   spin_unlock(&call_lock);
-
-   return 0;
+int smp_call_function(void (*func) (void *info), void *info, int nonatomic,
+ int wait)
+{
+   return smp_call_function_mask(cpu_online_map, func, info, wait);
 }
 EXPORT_SYMBOL(smp_call_function);
+
+/*
+ * smp_call_function_single - Run a function on another CPU
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @nonatomic: Currently unused.
+ * @wait: If true, wait until function has completed on other CPUs.
+ *
+ * Returns 0 on success, else a negative status code.
+ *
+ * Does 

Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)

2007-03-12 Thread Nick Piggin
On Tue, Mar 13, 2007 at 12:01:13AM +0100, Blaisorblade wrote:
 On Wednesday 07 March 2007 11:02, Nick Piggin wrote:
  
   Yeah, tmpfs/shm segs are what I was thinking about. If UML can live with
   that as well, then I think it might be a good option.
 
  Oh, hmm if you can truncate these things then you still need to
  force unmap so you still need i_mmap_nonlinear.
 
 Well, we don't need truncate(), but MADV_REMOVE for memory hotunplug, which
 is way similar I guess.
 
 About the restriction to tmpfs, I have just discovered 
 '[PATCH] mm: tracking shared dirty pages' (commit 
 d08b3851da41d0ee60851f2c75b118e1f7a5fc89), which already partially conflicts 
 with remap_file_pages for file-based mmaps (and that's fully fine, for now).
 
 Even if UML does not need it, till now if there is a VMA protection and a
 page hasn't been remapped with remap_file_pages, the VMA protection is used
 (just because it makes sense).
 
 However, it is only used when the PTE is first created - we can never change
 protections on a VMA - so if vma_wants_writenotify() is true (on all
 file-based and on no shmfs-based mappings, right?) and we write-protect the
 VMA, it will always be write-protected.

Yes, I believe that is the case, however I wonder if that is going to be
a problem for you to distinguish between write faults for clean writable
ptes, and write faults for readonly ptes?

 That's no problem for UML, but for any other user (I guess I'll have to 
 prevent callers from trying such stuff - I started from a pretty generic 
 patch).
 
  But come to think of it, I still don't think nonlinear mappings are
  too bad as they are ;)
 
 Btw, I really like removing ->populate and merging the common code together.
 filemap_populate and shmem_populate are so obnoxiously different that I 
 already wanted to do that (after merging remap_file_pages() core).

Yeah they are also frustratingly similar to filemap_nopage and shmem_nopage,
and duplicate a lot of the same code ;)

 Also, I'm curious. Since my patches are already changing remap_file_pages() 
 code, should they be absolutely merged after yours?

Is there a big clash? I don't think I did a great deal to fremap.c (mainly
just removing stuff)...


Re: sys_write() racy for multi-threaded append?

2007-03-12 Thread Alan Cox
 Writing to a file from multiple processes is not usually the problem.
 Writing to a common struct file from multiple threads is.

Not normally, because POSIX sensibly invented pread/pwrite. They forgot
preadv/pwritev, but they did the basics, and that's the end of the problem.
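pwrite() takes the offset as an argument and leaves the file offset alone, so concurrent writers never share f_pos. One way to build a multi-threaded append on top of it, assuming the writers share an in-memory counter rather than relying on O_APPEND, is to reserve each region with an atomic fetch-and-add before writing. `struct append_log`, `log_open` and `log_append` are hypothetical names.

```c
#include <fcntl.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical shared log: a descriptor plus the next unreserved offset. */
struct append_log {
	int fd;
	atomic_llong next;
};

/* Open (and truncate) the log file and reset the reservation counter. */
static int log_open(struct append_log *lg, const char *path)
{
	lg->fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	atomic_init(&lg->next, 0);
	return lg->fd < 0 ? -1 : 0;
}

/*
 * Reserve [off, off + len) atomically, then fill it with pwrite(), which
 * neither reads nor moves the file offset.  Returns 0 on success, -1 on error.
 */
static int log_append(struct append_log *lg, const void *buf, size_t len)
{
	long long off = atomic_fetch_add(&lg->next, (long long)len);
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(lg->fd, p, len, (off_t)off);

		if (n <= 0)
			return -1;
		p += n;			/* short write: continue where it stopped */
		off += n;
		len -= (size_t)n;
	}
	return 0;
}
```

A writer that dies between reserving and writing leaves a hole in the file, a trade-off O_APPEND does not have.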

 So what?  My products are shipping _now_.  

That doesn't inspire confidence.

 even funny.  If POSIX mandates stupid shit, and application
 programmers don't read that part of the manual anyway (and don't code
 on that assumption in practice), to hell with POSIX.  On many file

That's funny, you were talking about quality a moment ago.

 descriptors, short writes simply can't happen -- and code that

There is almost no descriptor this is true for. Any file I/O can and will
end up short on disk full or resource limit exceeded or quota exceeded or
NFS server exploded or ...

And on the device side about the only thing with the vaguest guarantees
is pipe().
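Because nearly any descriptor can return short, the defensive idiom is to loop until every byte is accepted, retrying on EINTR and stopping on a real error. A sketch; `write_all` is a hypothetical helper name, not a libc function.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all len bytes of buf to fd.
 * Returns 0 on success, or -1 with errno describing the failure.
 */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before writing anything: retry */
			return -1;		/* real error, e.g. ENOSPC or EDQUOT */
		}
		if (n == 0) {
			/* Not expected for len > 0; report it instead of spinning. */
			errno = EIO;
			return -1;
		}
		p += n;				/* short write: skip what was accepted */
		len -= (size_t)n;
	}
	return 0;
}
```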

 purports to handle short writes but has never been exercised is
 arguably worse than code that simply bombs on short write.  So if I
 can't shim in an induce-short-writes-randomly-on-purpose mechanism
 during development, I don't want short writes in production, period.

Easy enough to do and gcov plus dejagnu or similar tools will let you
coverage analyse the resulting test set and replay it.
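One low-tech way to exercise short-write handling before it reaches production is to route writes through a function pointer and, in tests, substitute a writer that accepts fewer bytes than requested. This is an alternative to the ptrace-based syscall hooking mentioned below, not a description of it; `writer_fn`, `short_writer` and `write_all_with` are hypothetical names.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Same shape as write(2), so production code can pass write directly. */
typedef ssize_t (*writer_fn)(int fd, const void *buf, size_t len);

/* Test-only writer: never accepts more than 3 bytes per call. */
static ssize_t short_writer(int fd, const void *buf, size_t len)
{
	return write(fd, buf, len > 3 ? 3 : len);
}

/* Write everything through `wr`; returns the number of calls made, or -1. */
static int write_all_with(writer_fn wr, int fd, const void *buf, size_t len)
{
	const char *p = buf;
	int calls = 0;

	while (len > 0) {
		ssize_t n = wr(fd, p, len);

		calls++;
		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;
		p += n;
		len -= (size_t)n;
	}
	return calls;
}
```

Production code passes `write` itself, whose prototype matches `writer_fn`; tests pass `short_writer` to force the short-write path on every call.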

 Sure -- until the one code path in a hundred that handles the short
 write case incorrectly gets traversed in production, after having
 gone untested in a development environment that used a different
 filesystem that never happened to trigger it.

Competent QA and testing people test all the returns in the manual as
well as all the returns they can find in the code. See ptrace(2) if you
don't want to do a lot of relinking and strace for some useful worked
examples of syscall hooking.

Alan


Stracing Amanda (was: RSDL for 2.6.21-rc3- 0.29)

2007-03-12 Thread Douglas McNaught
Patrick Mau [EMAIL PROTECTED] writes:

 Why not temporarly replace /bin/tar with a shell script that does:

 #!/bin/sh
 exec strace -f -o output /bin/real.tar "$@"

You beat me to it.  :) I've done that before; it's a great suggestion.

Except that if you expect 'tar' to be invoked multiple times in a run,
you should probably use 'output.$$' for the output filename so things
don't get clobbered.

-Doug


Fwd: PROBLEM: 2.6.20-1 not working on ibook g4 (BUG/Oops)

2007-03-12 Thread young dave

-- Forwarded message --
Hi,
I have tested on my mac mini g4.

2.6.21-rc2 causes an oops like the one in the post above.

And with the new 2.6.21-rc3-git7, the kernel loads OK and the penguin pixmap
appears, but then it stops; there are no error messages either.

Regards
dave


2007/3/7, Benjamin Herrenschmidt [EMAIL PROTECTED]:

On Wed, 2007-03-07 at 17:53 +1300, Paul Collins wrote:
 David Woodhouse [EMAIL PROTECTED] writes:

  On Tue, 2007-03-06 at 14:53 +1300, Paul Collins wrote:
  In case it's of interest, 2.6.20 has been running fine on my
  PowerBook5,4.
 
  How much memory? What if you boot with mem=512M or mem=256M?

 1GB.  Also works fine when booted with those options.

Can you try 2.6.21-rc3 ? We just fixed a nasty bug causing memory
corruption.

Ben.


___
Linuxppc-dev mailing list
[EMAIL PROTECTED]
https://ozlabs.org/mailman/listinfo/linuxppc-dev




Re: [RFC][PATCH 2/7] RSS controller core

2007-03-12 Thread Balbir Singh

hmm, it is very unlikely that this would happen,
for several reasons ... and indeed, checking the
thread in my mailbox shows that akpm dropped you ...



But, I got Andrew's email.



Subject: [RFC][PATCH 2/7] RSS controller core
From: Pavel Emelianov [EMAIL PROTECTED]
To: Andrew Morton [EMAIL PROTECTED], Paul Menage [EMAIL PROTECTED],
Srivatsa Vaddagiri [EMAIL PROTECTED],
Balbir Singh [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED],
Linux Kernel Mailing List linux-kernel@vger.kernel.org
Date: Tue, 06 Mar 2007 17:55:29 +0300

Subject: Re: [RFC][PATCH 2/7] RSS controller core
From: Andrew Morton [EMAIL PROTECTED]
To: Pavel Emelianov [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED],
Paul Menage [EMAIL PROTECTED],
List linux-kernel@vger.kernel.org
Date: Tue, 6 Mar 2007 14:00:36 -0800

that's the one I 'group' replied to ...

 Could you please not modify the cc list.

I never modify the cc unless explicitly asked
to do so. I wish others would have it that way
too :)



That's good to know, but my mailer shows


Andrew Morton [EMAIL PROTECTED]
to  Pavel Emelianov [EMAIL PROTECTED]   
cc  
Paul Menage [EMAIL PROTECTED],
Srivatsa Vaddagiri [EMAIL PROTECTED],
Balbir Singh [EMAIL PROTECTED] (see I am HERE),
devel@openvz.org,
Linux Kernel Mailing List linux-kernel@vger.kernel.org,
[EMAIL PROTECTED],
Kirill Korotaev [EMAIL PROTECTED]   
dateMar 7, 2007 3:30 AM 
subject Re: [RFC][PATCH 2/7] RSS controller core
mailed-by   vger.kernel.org 
On Tue, 06 Mar 2007 17:55:29 +0300

and your reply as

Andrew Morton [EMAIL PROTECTED],
Pavel Emelianov [EMAIL PROTECTED],
[EMAIL PROTECTED],
[EMAIL PROTECTED],
[EMAIL PROTECTED],
Paul Menage [EMAIL PROTECTED],
List linux-kernel@vger.kernel.org   
to  Andrew Morton [EMAIL PROTECTED] 
cc  
Pavel Emelianov [EMAIL PROTECTED],
[EMAIL PROTECTED],
[EMAIL PROTECTED],
[EMAIL PROTECTED],
Paul Menage [EMAIL PROTECTED],
List linux-kernel@vger.kernel.org   
dateMar 9, 2007 10:18 PM
subject Re: [RFC][PATCH 2/7] RSS controller core
mailed-by   vger.kernel.org

I am not sure what went wrong. Could you please check your mail
client? It seemed to even change the email address to smtp.osdl.org,
which bounced back when I wrote to you earlier.


best,
Herbert



Cheers,
Balbir


rmmod uhci_hcd - BUG: atomic counter underflow

2007-03-12 Thread Jiri Slaby

Hi.

After rmmod-ing uhci_hcd on a freshly booted 2.6.21-rc3-mm2, I got this:

BUG: atomic counter underflow at:
 [c0104f0b] show_trace_log_lvl+0x1a/0x30
 [c01055f3] show_trace+0x12/0x14
 [c010567a] dump_stack+0x16/0x18
 [c01dc41b] kref_put+0x4d/0xb2
 [c01db754] kobject_put+0x14/0x16
 [c01db8a3] kobject_unregister+0x22/0x25
 [c024c987] bus_remove_driver+0x75/0x82
 [c024d3b8] driver_unregister+0xb/0x18
 [c01e7020] pci_unregister_driver+0x13/0x73
 [f88dbbd9] uhci_hcd_cleanup+0xd/0x2d [uhci_hcd]
 [c013fb69] sys_delete_module+0x133/0x195
 [c0103fe0] syscall_call+0x7/0xb
 ===

Note that the following devices are connected:
Bus 004 Device 001: ID :
Bus 003 Device 001: ID :
Bus 002 Device 004: ID 0458:004c KYE Systems Corp. (Mouse Systems) Slimstar Pro Keyboard
Bus 002 Device 003: ID 04b4:2050 Cypress Semiconductor Corp.
Bus 002 Device 002: ID 045e:00f0 Microsoft Corp.
Bus 002 Device 001: ID :
Bus 001 Device 001: ID :
Bus 005 Device 001: ID :

What other info do you want me to post?

regards,
--
http://www.fi.muni.cz/~xslaby/Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8  22A0 32CC 55C3 39D4 7A7E



Re: rmmod uhci_hcd - BUG: atomic counter underflow

2007-03-12 Thread Alan Stern
On Mon, 12 Mar 2007, Jiri Slaby wrote:

 Hi.
 
 After rmmoding of uhci_hcd on fresh booted 2.6.21-rc3-mm2 I got this:
 
 BUG: atomic counter underflow at:
   [c0104f0b] show_trace_log_lvl+0x1a/0x30
   [c01055f3] show_trace+0x12/0x14
   [c010567a] dump_stack+0x16/0x18
   [c01dc41b] kref_put+0x4d/0xb2
   [c01db754] kobject_put+0x14/0x16
   [c01db8a3] kobject_unregister+0x22/0x25
   [c024c987] bus_remove_driver+0x75/0x82
   [c024d3b8] driver_unregister+0xb/0x18
   [c01e7020] pci_unregister_driver+0x13/0x73
   [f88dbbd9] uhci_hcd_cleanup+0xd/0x2d [uhci_hcd]
   [c013fb69] sys_delete_module+0x133/0x195
   [c0103fe0] syscall_call+0x7/0xb
   ===
 
 Note, that this is connected:
 Bus 004 Device 001: ID :
 Bus 003 Device 001: ID :
 Bus 002 Device 004: ID 0458:004c KYE Systems Corp. (Mouse Systems) Slimstar Pro Keyboard
 Bus 002 Device 003: ID 04b4:2050 Cypress Semiconductor Corp.
 Bus 002 Device 002: ID 045e:00f0 Microsoft Corp.
 Bus 002 Device 001: ID :
 Bus 001 Device 001: ID :
 Bus 005 Device 001: ID :
 
 What other info do you want me to post?

My guess is that this was caused by changes to the driver core, not by 
anything connected to USB.

Would it be possible for you to add the atomic counter underflow check to 
2.6.21-rc3 and see if the problem still occurs?  If it doesn't, that's a 
good indication the USB stack isn't guilty -- the bus registration code 
hasn't changed for several kernel releases.
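The check in question fires when a reference count is dropped below zero, which usually means one put too many somewhere in the unregister path. A userspace sketch of the idea with C11 atomics; `struct ref` and `ref_put` are hypothetical, and this is neither the kernel's kref nor its atomic_t debugging code.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical refcounted object header. */
struct ref {
	atomic_int count;
};

/*
 * Drop one reference.  Returns 1 if this was the last reference (the caller
 * releases the object), 0 if references remain, and -1 after reporting an
 * underflow, i.e. a put on an object whose count was already zero.
 */
static int ref_put(struct ref *r, const char *who)
{
	int old = atomic_fetch_sub(&r->count, 1);

	if (old <= 0) {
		fprintf(stderr, "BUG: refcount underflow in %s\n", who);
		atomic_fetch_add(&r->count, 1);	/* undo the bogus decrement */
		return -1;
	}
	return old == 1;
}
```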

Alan Stern



Re: rmmod uhci_hcd - BUG: atomic counter underflow

2007-03-12 Thread Jiri Slaby

Alan Stern napsal(a):

On Mon, 12 Mar 2007, Jiri Slaby wrote:

After rmmoding of uhci_hcd on fresh booted 2.6.21-rc3-mm2 I got this:

BUG: atomic counter underflow at:

[...]

  [c01db754] kobject_put+0x14/0x16
  [c01db8a3] kobject_unregister+0x22/0x25
  [c024c987] bus_remove_driver+0x75/0x82
  [c024d3b8] driver_unregister+0xb/0x18
  [c01e7020] pci_unregister_driver+0x13/0x73
  [f88dbbd9] uhci_hcd_cleanup+0xd/0x2d [uhci_hcd]

[...]
Would it be possible for you to add the atomic counter underflow check to 
2.6.21-rc3 and see if the problem still occurs?  If it doesn't, that's a 
good indication the USB stack isn't guilty -- the bus registration code 
hasn't changed for several kernel releases.


Yes.

regards,
--
http://www.fi.muni.cz/~xslaby/Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8  22A0 32CC 55C3 39D4 7A7E

Hnus [EMAIL PROTECTED] is an alias for /dev/null


Re: rmmod uhci_hcd - BUG: atomic counter underflow

2007-03-12 Thread Jiri Slaby

Jiri Slaby napsal(a):

Alan Stern napsal(a):

On Mon, 12 Mar 2007, Jiri Slaby wrote:

After rmmoding of uhci_hcd on fresh booted 2.6.21-rc3-mm2 I got this:

BUG: atomic counter underflow at:

[...]

  [c01db754] kobject_put+0x14/0x16
  [c01db8a3] kobject_unregister+0x22/0x25
  [c024c987] bus_remove_driver+0x75/0x82
  [c024d3b8] driver_unregister+0xb/0x18
  [c01e7020] pci_unregister_driver+0x13/0x73
  [f88dbbd9] uhci_hcd_cleanup+0xd/0x2d [uhci_hcd]

[...]
Would it be possible for you to add the atomic counter underflow check 
to 2.6.21-rc3 and see if the problem still occurs?  If it doesn't, 
that's a good indication the USB stack isn't guilty -- the bus 
registration code hasn't changed for several kernel releases.


Yes.


I can confirm that this issue made it upstream and is currently present there.

regards,
--
http://www.fi.muni.cz/~xslaby/Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8  22A0 32CC 55C3 39D4 7A7E

Hnus [EMAIL PROTECTED] is an alias for /dev/null


Re: [PATCH] usb-serial regression (Oops) in 2.6.21-rc*

2007-03-12 Thread Greg KH
On Mon, Mar 12, 2007 at 04:22:22PM -0400, Mark Lord wrote:
 Oliver Neukum wrote:
 Mark Lord wrote:
 Okay, from that part (above), the problem is obvious:
 in that the MCT U232 converter now disconnected appears,
 and then we continue to try and call the driver's method.. Oops!
 ..
 IMHO shutdown() is using serial->port[] and bombs.
 Could you reverse the order here?
 
 Yup.  Fixed.  Tested.  Works.
 
 This patch fixes the Oops that otherwise occurs whenever
 a USB serial adapter is unplugged from a system, as well
 the Oops seen when one is in use before resume (to RAM).
 
 GregKH:  This needs to go into 2.6.21-rc*.
 
 Signed-off-by:  Mark Lord [EMAIL PROTECTED]
 ---
 --- 2.6.21-rc3/drivers/usb/serial/usb-serial.c	2007-03-12 11:22:43.0 -0400
 +++ linux/drivers/usb/serial/usb-serial.c	2007-03-12 16:12:53.0 -0400
 @@ -141,6 +141,9 @@
   for (i = 0; i < serial->num_ports; ++i)
   serial->port[i]->open_count = 0;
 
 + if (serial->type->shutdown)
 + serial->type->shutdown(serial);
 +
   /* the ports are cleaned up and released in port_release() */
   for (i = 0; i < serial->num_ports; ++i)
   if (serial->port[i]->dev.parent != NULL) {
 @@ -148,9 +151,6 @@
   serial->port[i] = NULL;
   }
 
 - if (serial->type->shutdown)
 - serial->type->shutdown(serial);
 -


Argh, no, this change was done to help the ftdi drivers out.

Look at changeset d9a7ecacac5f8274d2afce09aadcf37bdb42b93a in Linus's
tree from Jim Radford:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d9a7ecacac5f8274d2afce09aadcf37bdb42b93a

It makes this change because the usb-serial drivers need the port
devices when the port_remove() callbacks happen.  Otherwise you get an
oops that way.

Jim, can you take a look at this and see if you can figure something
out?

thanks,

greg k-h


Re: [linux-usb-devel] 2.6.21-rc3-mm1

2007-03-12 Thread Mariusz Kozlowski
Hello, 

  Any thoughts?
 
 Another mistake on my part.  The correct command is
 
   echo -n '2-2:1.0' > /sys/bus/usb/drivers/usbhid/unbind
 
 Without the -n, the system thinks that the newline character at the end 
 of the line written by echo is part of the filename.

Nice tip. Thanks. I've run some tests and as expected - no failure so far.
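For anyone doing this from C instead of the shell, the equivalent of `echo -n` is writing exactly strlen() bytes, with no trailing newline. A sketch; `sysfs_write` is a hypothetical helper, and a real caller would pass an attribute path such as /sys/bus/usb/drivers/usbhid/unbind.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `value` to a sysfs-style attribute file
 * without appending a newline (the equivalent of `echo -n`).
 * Returns 0 on success, -1 on error.
 */
static int sysfs_write(const char *path, const char *value)
{
	size_t len = strlen(value);	/* no '\n' is added */
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, value, len);
	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}
```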



Regards,

Mariusz Kozlowski


Re: [linux-usb-devel] 2.6.21-rc3-mm1

2007-03-12 Thread Jiri Kosina
On Mon, 12 Mar 2007, Mariusz Kozlowski wrote:

  echo -n '2-2:1.0' > /sys/bus/usb/drivers/usbhid/unbind
  Without the -n, the system thinks that the newline character at the end 
  of the line written by echo is part of the filename.
 Nice tip. Thanks. I've run some tests and as expected - no failure so far.

Thanks for testing. The patch fixing this already went to Linus in today's
HID/USB HID update (which has not yet been merged).

Thanks,

-- 
Jiri Kosina


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-12 Thread David Chinner
On Mon, Mar 12, 2007 at 12:40:47PM +0100, Miklos Szeredi wrote:
   I have no idea how serious the scalability problems with this are.  If
   they are serious, different solutions can probably be found for the
   above, but this is certainly the simplest.
  
  Atomic operations to a single per-backing device from all CPUs at once?
  That's a pretty serious scalability issue and it will cause a major
  performance regression for XFS.
 
 OK.  How about just accounting writeback pages?  That should be much
 less of a problem, since normally writeback is started from
 pdflush/kupdate in large batches without any concurrency.

Except when you are throttling you bounce the cacheline around
each cpu as it triggers foreground writeback.

 Or is it possible to export the state of the device queue to mm?
 E.g. could balance_dirty_pages() query the backing dev if there are
 any outstanding write requests?

Not directly - writeback_in_progress(bdi) is a coarse measure
indicating pdflush is active on this bdi (which implies outstanding
write requests).

  I'd call this a showstopper right now - maybe you need to look at
  something like the ZVC code that Christoph Lameter wrote, perhaps?
 
 That's rather a heavyweight approach for this I think.

But if you want to use per-page accounting, you are going to
need a per-cpu or per-zone set of counters on each bdi to do
this without introducing regressions.
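A split counter is one way to sketch the per-CPU counters suggested here. Each update touches only the updating CPU's slot. The shared total is written only when a local delta reaches a batch threshold, so the cacheline moves between CPUs once per batch instead of once per page. This is a single-threaded userspace model under stated assumptions. `struct split_counter`, `SIM_NR_CPUS` and `SIM_BATCH` are hypothetical names, loosely in the spirit of the kernel's percpu_counter but not its API. Real code would also need preemption disabled and a lock around the fold.

```c
#include <assert.h>

#define SIM_NR_CPUS 4    /* assumed CPU count for the model */
#define SIM_BATCH   32   /* assumed fold threshold, in pages */

/*
 * Hypothetical split counter for one bdi. In the kernel each local[]
 * slot would live in per-CPU memory; a plain array is used here.
 */
struct split_counter {
    long total;                 /* shared: written only when a delta is folded */
    long local[SIM_NR_CPUS];    /* per-CPU: written on every update */
};

static void split_counter_add(struct split_counter *c, int cpu, long delta)
{
    assert(cpu >= 0 && cpu < SIM_NR_CPUS);
    c->local[cpu] += delta;
    if (c->local[cpu] >= SIM_BATCH || c->local[cpu] <= -SIM_BATCH) {
        c->total += c->local[cpu];   /* would need a lock in real code */
        c->local[cpu] = 0;
    }
}

/* Cheap read; off by at most SIM_NR_CPUS * (SIM_BATCH - 1). */
static long split_counter_read_fast(const struct split_counter *c)
{
    return c->total;
}

/* Exact read: fold in every CPU's pending delta. */
static long split_counter_sum(const struct split_counter *c)
{
    long sum = c->total;

    for (int i = 0; i < SIM_NR_CPUS; i++)
        sum += c->local[i];
    return sum;
}
```

This is the trade-off being weighed in the thread. Writers no longer share a cacheline on every update, but an exact per-bdi count now costs a walk over all CPUs.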

 The only info balance_dirty_pages() really needs is whether there are
 any dirty+writeback bound for the backing dev or not.

Writeback bound (i.e. writing as fast as we can) is probably
indicated fairly reliably by bdi_congested(bdi).

Now all you need is the number of dirty pages.

 It knows about the dirty pages, since it calls writeback_inodes() which
 scans the dirty pages for this backing dev looking for ones to write
 out.

It scans the dirty inode list for dirty inodes which indirectly finds
the dirty pages. It does not know about the number of dirty pages
directly...

 If after returning from writeback_inodes() wbc->nr_to_write
 didn't decrease and wbc->pages_skipped is zero then we know that there
 are no more dirty pages for the device.  Or at least there are no
 dirty pages which aren't already under writeback.

Sure, you can tell if there are _no_ dirty pages on the bdi, but
if there are dirty pages, you can't tell how many there are. Your
followup patches need to know how many dirty+writeback pages there
are on the bdi, so I don't really see any way you can solve the
deadlock in this manner without scalable bdi->nr_dirty accounting.



IIUC, your problem is that there's another bdi that holds all the
dirty pages, and this throttle loop never flushes pages from that
other bdi and we sleep instead. It seems to me that the fundamental
problem is that to clean the pages we need to flush both bdi's, not
just the bdi we are directly dirtying.

How about a dependent bdi link? i.e. if you have a loopback
filesystem, it has a direct bdi (the loopback device) and a
dependent bdi - the bdi that belongs to the underlying filesystem.

When we enter the throttle loop we flush from the direct bdi
and if we fail to flush all the pages we require, we flush
the dependent bdi (maybe even just kick pdflush for that bdi)
before we call congestion_wait() and go to sleep. This way
we are always making progress cleaning pages on the machine,
not just transferring dirty pages from one bdi to another.

Wouldn't that solve the deadlock without needing painful
accounting?
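The dependent-bdi idea above can be modelled roughly as follows. This is a hypothetical single-threaded sketch. `struct sim_bdi`, `sim_flush()` and `throttle_pass()` are made-up stand-ins for backing_dev_info and the real writeback path. Flushing here only decrements a count; it does not move pages from one bdi to the other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a backing_dev_info. */
struct sim_bdi {
    long dirty;                 /* dirty pages attributed to this bdi */
    struct sim_bdi *dependent;  /* e.g. the bdi under a loopback fs, or NULL */
};

/* Pretend writeback: clean up to `chunk` pages and report how many. */
static long sim_flush(struct sim_bdi *bdi, long chunk)
{
    long n = bdi->dirty < chunk ? bdi->dirty : chunk;

    bdi->dirty -= n;
    return n;
}

/*
 * One pass of the proposed throttle loop: flush the direct bdi, and if
 * that cleaned fewer than `chunk` pages, flush the dependent bdi too
 * before the real code would go to sleep in congestion_wait().
 */
static long throttle_pass(struct sim_bdi *bdi, long chunk)
{
    long cleaned;

    assert(chunk > 0);
    cleaned = sim_flush(bdi, chunk);
    if (cleaned < chunk && bdi->dependent != NULL)
        cleaned += sim_flush(bdi->dependent, chunk - cleaned);
    return cleaned;   /* 0: neither bdi had anything to write */
}
```

In this model a return value of 0 means neither bdi had anything to write, so sleeping and retrying could not make progress.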

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-12 Thread Miklos Szeredi
I'll try to explain the reason for the deadlock first.

 IIUC, your problem is that there's another bdi that holds all the
 dirty pages, and this throttle loop never flushes pages from that
 other bdi and we sleep instead. It seems to me that the fundamental
 problem is that to clean the pages we need to flush both bdi's, not
 just the bdi we are directly dirtying.

This is what happens:

write fault on upper filesystem
  balance_dirty_pages
    submit write requests
  loop ...
--- fuse IPC ---
[fuse loopback fs thread 1]
read request
sys_write
  mutex_lock(i_mutex)
  ...
    balance_dirty_pages
      submit write requests
      loop ... write requests completed ... dirty still over limit ...
      ... loop forever

[fuse loopback fs thread 2]
read request
sys_write
  mutex_lock(i_mutex) blocks

So the queue for the upper filesystem is full.  The queue for the
lower filesystem is empty.  There are no dirty pages in the lower
filesystem.

So kicking pdflush for the lower filesystem doesn't help, there's
nothing to do.  balance_dirty_pages() for the lower filesystem should
just realize that there's nothing to do and return, and then there
would be progress.

So there's really no need to do any accounting, just some
logic to determine that a backing dev is nearly or completely
quiescent.

And getting out of this tight situation doesn't have to be efficient.
This is probably a very rare corner case, that almost never happens in
real life, only with aggressive test tools like bash_shared_mapping.

  OK.  How about just accounting writeback pages?  That should be much
  less of a problem, since normally writeback is started from
  pdflush/kupdate in large batches without any concurrency.
 
 Except when you are throttling you bounce the cacheline around
 each cpu as it triggers foreground writeback.

Yeah, we'd lose a bit of CPU, but not any write performance, since it
is being throttled back anyway.

  Or is it possible to export the state of the device queue to mm?
  E.g. could balance_dirty_pages() query the backing dev if there are
  any outstanding write requests?
 
 Not directly - writeback_in_progress(bdi) is a coarse measure
 indicating pdflush is active on this bdi (which implies outstanding
 write requests).

Hmm, not quite what I need.

   I'd call this a showstopper right now - maybe you need to look at
   something like the ZVC code that Christoph Lameter wrote, perhaps?
  
  That's rather a heavyweight approach for this I think.
 
 But if you want to use per-page accounting, you are going to
 need a per-cpu or per-zone set of counters on each bdi to do
 this without introducing regressions.

Yes, this is an option, but I hope for a simpler solution.

Thanks,
Miklos


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-12 Thread David Chinner
On Mon, Mar 12, 2007 at 11:36:16PM +0100, Miklos Szeredi wrote:
 I'll try to explain the reason for the deadlock first.

Ah, thanks for that.

  IIUC, your problem is that there's another bdi that holds all the
  dirty pages, and this throttle loop never flushes pages from that
  other bdi and we sleep instead. It seems to me that the fundamental
  problem is that to clean the pages we need to flush both bdi's, not
  just the bdi we are directly dirtying.
 
 This is what happens:
 
 write fault on upper filesystem
   balance_dirty_pages
 submit write requests
   loop ...

Isn't this loop transferring the dirty state from the upper
filesystem to the lower filesystem? What I don't see here is
how the pages on this filesystem are not getting cleaned if
the lower filesystem is being flushed properly.

I'm probably missing something big and obvious, but I'm not
familiar with the exact workings of FUSE so please excuse my
ignorance.

 --- fuse IPC ---
 [fuse loopback fs thread 1]

This is the lower filesystem? Or a callback thread for
doing the write requests to the lower filesystem?

 read request
 sys_write
   mutex_lock(i_mutex)
   ...
  balance_dirty_pages
 submit write requests
 loop ... write requests completed ... dirty still over limit ... 
   ... loop forever

Hmmm - the situation in balance_dirty_pages() after an attempt
to writeback_inodes(wbc) that has written nothing because there
is nothing to write would be:

wbc->nr_to_write == write_chunk &&
wbc->pages_skipped == 0 &&
wbc->encountered_congestion == 0 &&
!bdi_congested(wbc->bdi)

What happens if you make that an exit condition to the loop?
Or alternatively, adding another bit to the wbc structure to
say there was nothing to do and setting that if we find
list_empty(&sb->s_dirty) when trying to flush dirty inodes.
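The exit test described above can be written as a small predicate. The struct below is a hypothetical mirror of the writeback_control fields named in the condition, not the kernel's definition. `bdi_is_congested` stands in for the result of bdi_congested(). This is a sketch of the condition only, not a patch to balance_dirty_pages().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the writeback_control fields named above. */
struct sim_wbc {
    long nr_to_write;             /* pages still allowed to be written */
    long pages_skipped;           /* pages that could not be written */
    bool encountered_congestion;
};

/*
 * True if a pass that started with nr_to_write == write_chunk wrote
 * nothing, skipped nothing and met no congestion, and the bdi is not
 * congested either: a hint that there is nothing left to do on this bdi.
 */
static bool nothing_to_write(const struct sim_wbc *wbc, long write_chunk,
                             bool bdi_is_congested)
{
    assert(write_chunk > 0);
    return wbc->nr_to_write == write_chunk &&
           wbc->pages_skipped == 0 &&
           !wbc->encountered_congestion &&
           !bdi_is_congested;
}
```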

[ FWIW, this may also solve another problem of fast block devices
being throttled incorrectly when a slow block dev is consuming
all the dirty pages... ]

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] [Patch 1/1] IBAC Patch

2007-03-12 Thread Mimi Zohar
On Thu, 2007-03-08 at 22:19 -0500, [EMAIL PROTECTED] wrote:
 On Thu, 08 Mar 2007 17:58:16 EST, Mimi Zohar said:
  This is a request for comments for a new Integrity Based Access
  Control(IBAC) LSM module which bases access control decisions
  on the new integrity framework services. 
  
  (Hopefully this will help clarify the interaction between an LSM 
  module and LIM module.)
 
 OK, between this and the additional LIM hooks I didn't notice in an earlier
 patch, we're starting to see the API.   The only problem is that although
 it may be the right API for *your* code, I suspect it's a non-starter without
 a discussion about whether it's the right *generic* API for an LIM (which will
 require at least one dramatic bun fight about what Integrity means).

Absolutely, we need to make sure that the set of LIM hooks is complete and that
nothing is missing in order to implement different types of LIM providers.  I'm 
copying the digsig mailing list for their input on requirements, which this API 
might not satisfy or perhaps address.

  Index: linux-2.6.21-rc3-mm2/security/ibac/Kconfig
 
 Minor cognitive-dissonance alert:
 
  +config SECURITY_IBAC_BOOTPARAM
  +   bool "IBAC boot parameter"
  +   depends on SECURITY_IBAC
  +   default y
 
  + If you are unsure how to answer this question, answer N.
 
 The 'default' should in general match the hint we give the user.

Oops, blush.  It will obviously be corrected in the next IBAC patch
release.

Mimi Zohar



Re: Odd suspend regression in 2.6.21-rc[123]

2007-03-12 Thread Len Brown
On Saturday 10 March 2007 01:18, Ray Lee wrote:
 Ray Lee wrote:
  In 2.6.21-rc1,2,3, my laptop will fully suspend to ram, but then
  *immediately* resumes back from suspension. (It resumes just fine, as well.)
 [...]
  HP/Compaq NX6125 system, AMD64, dmesg attached.
 
 hg bisect found the below patch as the culprit, and reverting it does
 fix the regression. It's supposed to address "sometime ac/battery update
 stops after resume from disk". This thread:
 http://lkml.org/lkml/2007/2/24/111 appears to talk about the same issue,
 and therefore it may be solved without the below patch, so perhaps we
 can all be happy.
 
 Regardless, I think my laptop no longer being able to go into S3 sleep
 is a bit more important than someone else's laptop merely not showing
 the correct AC status :-).
 
 Please revert. (git patch id ed41dab90eb40ac4911e60406bc653661f0e4ce1)

I'd rather not break the Acer, if possible.

Ray, Please test the incremental patch below.

---
Subject: ACPI: resolve GPE immediate wakeup regression
From: Alexey Starikovskiy [EMAIL PROTECTED]

Removing disabling of GPEs from enter_sleep function causes regression on
nx6125. Doing disable_all_gpes both in prepare to sleep and in enter sleep
resolves the regression, while still fixing Acer notebooks.

Signed-off-by: Alexey Starikovskiy [EMAIL PROTECTED]
Signed-off-by: Len Brown [EMAIL PROTECTED]
---

 drivers/acpi/hardware/hwsleep.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/acpi/hardware/hwsleep.c b/drivers/acpi/hardware/hwsleep.c
index 8fa9312..c84b1fa 100644
--- a/drivers/acpi/hardware/hwsleep.c
+++ b/drivers/acpi/hardware/hwsleep.c
@@ -300,6 +300,11 @@ acpi_status asmlinkage acpi_enter_sleep_state(u8 sleep_state)
 	/*
 	 * 2) Enable all wakeup GPEs
 	 */
+	status = acpi_hw_disable_all_gpes();
+	if (ACPI_FAILURE(status)) {
+		return_ACPI_STATUS(status);
+	}
+
 	acpi_gbl_system_awake_and_running = FALSE;
 
 	status = acpi_hw_enable_all_wakeup_gpes();
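A toy model can illustrate why the extra acpi_hw_disable_all_gpes() call might matter. This rests on one assumption: "enable all wakeup GPEs" is treated as writing each register's wake mask and skipping registers that have no wake GPEs. Under that assumption, a runtime GPE left enabled in such a register survives into sleep unless everything is disabled first. The `sim_*` names are hypothetical, and this is not ACPICA code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_GPE_REGS 2   /* assumed: two 8-bit GPE enable registers */

/* Hypothetical model of a GPE block; bit set = GPE enabled / wake-capable. */
struct sim_gpe_block {
    uint8_t enable[SIM_GPE_REGS];
    uint8_t wake[SIM_GPE_REGS];
};

static void sim_disable_all_gpes(struct sim_gpe_block *b)
{
    for (int i = 0; i < SIM_GPE_REGS; i++)
        b->enable[i] = 0;
}

/* Assumed behaviour: write the wake mask, skipping registers with no wake GPEs. */
static void sim_enable_all_wakeup_gpes(struct sim_gpe_block *b)
{
    for (int i = 0; i < SIM_GPE_REGS; i++)
        if (b->wake[i])
            b->enable[i] = b->wake[i];
}

/* disable_first != 0 models the ordering the patch adds to acpi_enter_sleep_state(). */
static void sim_enter_sleep(struct sim_gpe_block *b, int disable_first)
{
    assert(b != NULL);
    if (disable_first)
        sim_disable_all_gpes(b);
    sim_enable_all_wakeup_gpes(b);
}
```

In the model, a runtime-only GPE in the second register stays armed across sleep unless all GPEs are disabled first. A stray event of that kind could plausibly wake the machine immediately.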


ACPI: resolve GPE immediate wakeup regression

2007-03-12 Thread Ray Lee
Len Brown wrote:
 On Saturday 10 March 2007 01:18, Ray Lee wrote:
 Ray Lee wrote:
 In 2.6.21-rc1,2,3, my laptop will fully suspend to ram, but then
 *immediately* resumes back from suspension. (It resumes just fine, as well.)
 [...]
 HP/Compaq NX6125 system, AMD64, dmesg attached.
 
 I'd rather not break the Acer, if possible.
 
 Ray, Please test the incremental patch below.

Tested and Alexey's patch (copied below) fixes the problem. I added a
signed-off-by just in case; feel free to yank it if inappropriate.
Regardless, please apply.

Thanks Len, Alexey.

Ray
---
Subject: ACPI: resolve GPE immediate wakeup regression
From: Alexey Starikovskiy [EMAIL PROTECTED]

Removing disabling of GPEs from enter_sleep function causes regression
on nx6125. Doing disable_all_gpes both in prepare to sleep and in enter
sleep resolves the regression, while still fixing Acer notebooks.

Signed-off-by: Alexey Starikovskiy [EMAIL PROTECTED]
Signed-off-by: Len Brown [EMAIL PROTECTED]
Signed-off-by: Ray Lee [EMAIL PROTECTED]
---

 drivers/acpi/hardware/hwsleep.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/acpi/hardware/hwsleep.c b/drivers/acpi/hardware/hwsleep.c
index 8fa9312..c84b1fa 100644
--- a/drivers/acpi/hardware/hwsleep.c
+++ b/drivers/acpi/hardware/hwsleep.c
@@ -300,6 +300,11 @@ acpi_status asmlinkage acpi_enter_sleep_state(u8 sleep_state)
 	/*
 	 * 2) Enable all wakeup GPEs
 	 */
+	status = acpi_hw_disable_all_gpes();
+	if (ACPI_FAILURE(status)) {
+		return_ACPI_STATUS(status);
+	}
+
 	acpi_gbl_system_awake_and_running = FALSE;
 
 	status = acpi_hw_enable_all_wakeup_gpes();




Re: ACPI: resolve GPE immediate wakeup regression

2007-03-12 Thread Len Brown
On Monday 12 March 2007 12:59, Ray Lee wrote:
 Len Brown wrote:
  On Saturday 10 March 2007 01:18, Ray Lee wrote:
  Ray Lee wrote:
  In 2.6.21-rc1,2,3, my laptop will fully suspend to ram, but then
  *immediately* resumes back from suspension. (It resumes just fine, as well.)
  [...]
  HP/Compaq NX6125 system, AMD64, dmesg attached.
  
  I'd rather not break the Acer, if possible.
  
  Ray, Please test the incremental patch below.
 
 Tested and Alexey's patch (copied below) fixes the problem. I added a
 signed-off-by just in case; feel free to yank it if inappropriate.
 Regardless, please apply.
 
 Thanks Len, Alexey.

Thanks for testing Ray,
I'll apply this one.

-Len

 ---
 Subject: ACPI: resolve GPE immediate wakeup regression
 From: Alexey Starikovskiy [EMAIL PROTECTED]
 
 Removing disabling of GPEs from enter_sleep function causes regression
 on nx6125. Doing disable_all_gpes both in prepare to sleep and in enter
 sleep resolves the regression, while still fixing Acer notebooks.
 
 Signed-off-by: Alexey Starikovskiy [EMAIL PROTECTED]
 Signed-off-by: Len Brown [EMAIL PROTECTED]
 Signed-off-by: Ray Lee [EMAIL PROTECTED]
 ---
 
  drivers/acpi/hardware/hwsleep.c |5 +
  1 files changed, 5 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/acpi/hardware/hwsleep.c b/drivers/acpi/hardware/hwsleep.c
 index 8fa9312..c84b1fa 100644
 --- a/drivers/acpi/hardware/hwsleep.c
 +++ b/drivers/acpi/hardware/hwsleep.c
 @@ -300,6 +300,11 @@ acpi_status asmlinkage acpi_enter_sleep_state(u8 sleep_state)
  	/*
  	 * 2) Enable all wakeup GPEs
  	 */
 +	status = acpi_hw_disable_all_gpes();
 +	if (ACPI_FAILURE(status)) {
 +		return_ACPI_STATUS(status);
 +	}
 +
  	acpi_gbl_system_awake_and_running = FALSE;
  
  	status = acpi_hw_enable_all_wakeup_gpes();
 
 
 


Re: [PATCH] ACPI disabled due to DMI failure or blacklisted year should be noted, as is done with other ACPI blacklisting

2007-03-12 Thread Anthony Godshall, Ampro Computers, Inc.

On Wednesday 07 March 2007 17:00, [EMAIL PROTECTED] wrote:


The patch titled
     ACPI disabled due to DMI failure or blacklisted year should be noted, as is done with other ACPI blacklisting
has been added to the -mm tree.  Its filename is
     acpi-disabled-due-to-dmi-failure-or-blacklisted-year-should-be-noted-as-is-done-with-other-acpi-blacklisting.patch



Thank you for applying it, Andrew.


good one -- i just ran into this yesterday:-)

applied.
thanks,
-len


No, thank you.

Tony Godshall


Re: [RFC][PATCH 2/7] RSS controller core

2007-03-12 Thread Srivatsa Vaddagiri
On Tue, Mar 13, 2007 at 07:27:06AM +0530, Balbir Singh wrote:
 I am not sure what went wrong. Could you please check your mail
 client, cause it seemed to even change email address to smtp.osdl.org
 which bounced back when I wrote to you earlier.

I have a problem doing a group-reply in mutt to Herbert's mails. His
email id gets dropped from the To or Cc list. Is that his email setting?
Don't know.

-- 
Regards,
vatsa


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread Lee Revell

On 3/12/07, David Lang [EMAIL PROTECTED] wrote:

the problem comes when this isn't enough. if you have several CPU hogs on a
system, and they are all around the same priority level, how can the scheduler
know which one needs the CPU the most for good interactivity?

in some cases you may be able to directly detect that your high-priority process
is waiting for another one (tracing pipes and local sockets for example), but
what if you are waiting for several of them? (think a multimedia desktop waiting
for the sound card, CDRom, hard drive, and video all at once) which one needs
the extra CPU the most?


I'm not an expert in this area by any means but after reading this
thread the OSX solution of simply telling the kernel "I'm the GUI,
schedule me accordingly" looks increasingly attractive.  Why make the
kernel guess when we can just be explicit?

Does anyone know of a UNIX-like system that has managed to solve this
problem without hooking the GUI into the scheduler?

Lee


Re: Stracing Amanda (was: RSDL for 2.6.21-rc3- 0.29)

2007-03-12 Thread Gene Heskett
On Monday 12 March 2007, Douglas McNaught wrote:
Patrick Mau [EMAIL PROTECTED] writes:
 Why not temporarily replace /bin/tar with a shell script that does:

 #!/bin/sh
 exec strace -f -o output /bin/real.tar "$@"

You beat me to it.  :) I've done that before; it's a great suggestion.

Except that if you expect 'tar' to be invoked multiple times in a run,
you should probably use 'output.$$' for the output filename so things
don't get clobbered.

-Doug

In my case, Doug, it will get invoked 64 times, amanda does a dummy run to 
get an estimate, calculates what to do based on that output which is 32 
runs, 1 per disklist entry and I have 32, and then reruns tar with the 
appropriate level options against each individual disklist entry.

But I'm puzzled a bit: what does the double $$ do, or is it buried someplace
in the bash manpage?  It's not something I've stumbled over yet.

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
rugged, adj.:
Too heavy to lift.


Re: Stracing Amanda (was: RSDL for 2.6.21-rc3- 0.29)

2007-03-12 Thread Nish Aravamudan

On 3/12/07, Gene Heskett [EMAIL PROTECTED] wrote:

On Monday 12 March 2007, Douglas McNaught wrote:
Patrick Mau [EMAIL PROTECTED] writes:
 Why not temporarily replace /bin/tar with a shell script that does:

 #!/bin/sh
 exec strace -f -o output /bin/real.tar "$@"

You beat me to it.  :) I've done that before; it's a great suggestion.

Except that if you expect 'tar' to be invoked multiple times in a run,
you should probably use 'output.$$' for the output filename so things
don't get clobbered.

-Doug

In my case, Doug, it will get invoked 64 times, amanda does a dummy run to
get an estimate, calculates what to do based on that output which is 32
runs, 1 per disklist entry and I have 32, and then reruns tar with the
appropriate level options against each individual disklist entry.

But I'm puzzled a bit: what does the double $$ do, or is it buried someplace
in the bash manpage?  It's not something I've stumbled over yet.


buried indeed:

Special Parameters:
 ...
  $  Expands to the process ID of the shell.  In a () subshell, it
     expands to the process ID of the current shell, not the subshell.
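Here is a rough illustration of what `output.$$` buys in the strace wrapper. Processes that are alive at the same time always have distinct PIDs, so wrapper shells running concurrently cannot clobber each other's trace file. After a process exits, its PID may eventually be handed out again. The C sketch below forks several children that are all alive together and checks that their `output.<pid>` names differ. The function name is hypothetical, and the sketch says nothing about reuse after exit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 16

/*
 * Start n children that stay alive at the same moment (each blocks
 * reading a pipe until the parent closes it), build an "output.<pid>"
 * name for each, and check that the names are pairwise distinct.
 * Returns 1 if all distinct, 0 on a collision, -1 on error.
 */
static int concurrent_trace_names_unique(int n)
{
    char names[MAX_CHILDREN][32];
    pid_t pids[MAX_CHILDREN];
    int fds[2], started = 0, result = 1;

    assert(n > 0 && n <= MAX_CHILDREN);
    if (pipe(fds) != 0)
        return -1;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            result = -1;
            break;
        }
        if (pid == 0) {              /* child: wait for EOF, then exit */
            char c;
            close(fds[1]);
            while (read(fds[0], &c, 1) > 0)
                ;
            _exit(0);
        }
        pids[started] = pid;
        snprintf(names[started], sizeof names[started], "output.%ld", (long)pid);
        started++;
    }

    /* Every child started so far is still blocked, so all PIDs are live. */
    for (int i = 0; i < started && result == 1; i++)
        for (int j = i + 1; j < started; j++)
            if (strcmp(names[i], names[j]) == 0)
                result = 0;

    close(fds[1]);                   /* EOF releases the children */
    close(fds[0]);
    for (int i = 0; i < started; i++)
        waitpid(pids[i], NULL, 0);
    return result;
}
```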


Thanks,
Nish


Re: Make sure we populate the initroot filesystem late enough

2007-03-12 Thread Kumar Gala


On Mar 12, 2007, at 6:01 PM, Paul TBBle Hampson wrote:


On Thu, Mar 01, 2007 at 09:30:56AM +0900, Michael Ellerman wrote:

On Wed, 2007-02-28 at 10:13 +, David Woodhouse wrote:

On Wed, 2007-02-28 at 07:43 +0100, Benjamin Herrenschmidt wrote:
I wouldn't be that sure ... I've had problems in the past with PMU based
cpufreq... looks like flushing all caches and hard-resetting the
processor on the fly when there can be pending DMAs might be a source of
trouble... especially on CPUs that don't have working cache flush HW
assist.


I've seen it on a PowerMac3,1 (400MHz G4) where we don't have cpufreq.
I've also seen it on the latest 1.5GHz Mac Mini, and on my shinybook.
They all fall over with the latest kernel, although the shinybook only
does so immediately when booted with mem=512M. The shinybook does crash
later with new kernels though; I don't yet know why. It could be the
same thing, or it could be something different. That one seemed to
appear between Fedora's 2.6.19-1.2913 and 2.6.19-1.2914 kernels, where
we did nothing but turned CONFIG_SYSFS_DEPRECATED on.

I don't blame cpufreq. At various times I've been equally convinced that
it was due to CONFIG_KPROBES, and Linus' initrd-moving patch.


Is there any pattern to the way it dies? Or is it just randomly dying
somewhere depending on which config options you have enabled?


This is starting to sound reminiscent of a bug I chased for a while last
year on Power5, but didn't find. It was fixed on some machines by
disabling CONFIG_KEXEC, and/or other random unrelated CONFIG options.
Unfortunately it magically stopped reproducing so I never caught it :/


Hmm. The crash came back after I booted into Mac OS X and back. It was however
a different crash, I believe it was coming from the USB modules (as it would
keep going when it happened, and get another crash, which tended to scroll away
too fast for me to capture) but I believe it was still getting down into the
slab code and actually dying there.

However, reverting the reversion of
8d610dd52dd1da696e199e4b4545f33a2a5de5c6 and instead applying
the following patch:

diff -ru linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c linux-source-2.6.20/arch/powerpc/mm/init_32.c
--- linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c	2007-02-05 05:44:54.0 +1100
+++ linux-source-2.6.20/arch/powerpc/mm/init_32.c	2007-03-10 11:03:56.0 +1100
@@ -244,7 +244,8 @@
 void free_initrd_mem(unsigned long start, unsigned long end)
 {
 	if (start < end)
-		printk ("Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
+		printk ("NOT Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
+	return;
 	for (; start < end; start += PAGE_SIZE) {
 		ClearPageReserved(virt_to_page(start));
 		init_page_count(virt_to_page(start));

which if I recall correctly David Woodhouse posted to this thread,
seems to have fixed it.

I dunno if it's relevant, but my initrd.img is 13193315 bytes long,
(ie 99 bytes over 12884k) and the above logs:
NOT Freeing initrd memory: 12888k freed
which makes sense...
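A few lines of C can check the arithmetic behind "which makes sense". The sketch assumes 4 KiB pages. It also assumes the freed range starts on a page boundary and its end is rounded up to the next page, which is one reading consistent with the logged 12888k. The helper names are hypothetical.

```c
#include <assert.h>

#define SIM_PAGE_SIZE 4096UL   /* assumed 4 KiB pages */

/* Whole KiB in the image, and the bytes left over past that boundary. */
static unsigned long whole_kib(unsigned long bytes)      { return bytes / 1024; }
static unsigned long leftover_bytes(unsigned long bytes) { return bytes % 1024; }

/*
 * KiB a "(end - start) >> 10" message would report if start is page
 * aligned and end is rounded up to the next page boundary (assumed).
 */
static unsigned long freed_kib(unsigned long bytes)
{
    unsigned long pages;

    assert(bytes > 0);
    pages = (bytes + SIM_PAGE_SIZE - 1) / SIM_PAGE_SIZE;
    return (pages * SIM_PAGE_SIZE) >> 10;
}
```

For the 13193315-byte image, that is 12884 KiB plus 99 bytes. Those 99 bytes spill into a 3222nd page, and 3222 x 4 KiB = 12888 KiB.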

I of course completely failed to think to check this with the crashing
kernel; if it seems relevant I can roll back to it and get the numbers.


Have you tried 2.6.20.2? There was a significant bug in get_order()
that was deemed to be causing these issues.
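For reference, get_order() returns the smallest allocation order whose 2^order pages cover a given size. The sketch below restates those semantics in plain C, assuming 4 KiB pages. `sim_get_order()` is a hypothetical name, and the sketch reproduces neither the 2.6.20 bug nor its fix.

```c
#include <assert.h>

#define SIM_PAGE_SHIFT 12   /* assumed 4 KiB pages */

/* Smallest order such that (1 << order) pages hold `size` bytes. */
static int sim_get_order(unsigned long size)
{
    int order = 0;

    assert(size > 0);
    size = (size - 1) >> SIM_PAGE_SHIFT;   /* index of the last page needed */
    while (size) {
        order++;
        size >>= 1;
    }
    return order;
}
```

Under those assumptions, a 13193315-byte initrd would need an order-12 block (4096 pages, 16 MiB).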


- k


Re: RSDL v0.30 cpu scheduler for mainline kernels

2007-03-12 Thread Con Kolivas
On Tuesday 13 March 2007 10:46, David Miller wrote:
 From: Con Kolivas [EMAIL PROTECTED]
 Date: Mon, 12 Mar 2007 10:58:11 +1100

  http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-sched-rsdl-0.30.patch

 FWIW, this boots and seems to work well on sparc64.  Tested
 on UP SunBlade1500 and 24cpu Niagara T1000.

Very nice. Thanks for the feedback and I'm sorry you have to work with such 
lousy hardware.

-- 
-ck


RE: Question: removal of syscall macros?

2007-03-12 Thread albcamus

2006/12/14, Teunis Peters [EMAIL PROTECTED]:


Now that syscall macros have been pulled from the -mm tree, what method
is recommended to use syscalls?

(I've wasted a day grubbing through sources before giving up and copying
the old syscall macros into one key driver)

_syscall macros are used by:
ATI driver  (no choice.  I'm working with laptops)


I have the same problem as you.  Do you have any idea how to use the ATI
firegl driver with recent kernels? Thanks in advance.

Regards,
albcamus


Re: [PATCH] kthread_should_stop_check_freeze (was: Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread)

2007-03-12 Thread Srivatsa Vaddagiri
On Mon, Mar 12, 2007 at 05:45:24PM -0500, Anton Blanchard wrote:
 Then please document it _clearly_ with the kthread code somewhere. 

Should this be documented in the kernel_thread() API as well? I notice people
still use kernel_thread() in some places (e.g. rtasd.c in the powerpc arch).

 The reason I brought this up is I had no idea we had to put the freezer gunk
 in all kernel thread loops and I've been writing kernel threads for years.

I noticed that in the PowerPC code (at least for the rtas kernel thread)
here:

http://lkml.org/lkml/2007/1/9/61

That was perhaps not a serious problem because the process freezer was mostly
used for software suspend, and only platforms supporting software suspend
had to worry about it.

But now we intend to use the process freezer for CPU hotplug as well, so all
platforms wanting to support CPU hotplug had better support the process freezer!

P.S.: I believe kprobes already uses the process freezer as well.

-- 
Regards,
vatsa


Attachment Received Autoreply

2007-03-12 Thread Virus Research
Thank you for your file-sample. We will review your email and either send
you a response or forward to the appropriate contact. If you have sent us a
file which is not in a password protected zip file (password - infected)
then your sample will not be reviewed. 

__

Virus Research accepts file-samples for analysis and possible inclusion into
AV signature DAT sets. We are also prepared to answer general virus
questions. Virus Research does not handle product related issues. 

This message has been sent based upon keywords in your message.  If you have
been sent this message in error, please resend your message with the word
noauto in the subject line. 

__

Information on recent threats, along with other AVERT resources and tools,
can be found at: http://www.mcafeesecurity.com/us/security/home.asp

All product-related questions and comments can be addressed through
technical support. Contact information for Technical Support can be found
at: http://www.mcafeesecurity.com/us/contact/home.htm.

Engine and DAT updates are available at:
http://www.mcafeesecurity.com/us/downloads/updates

For instructions on submitting a sample to AVERT please see:
http://vil.nai.com/vil/submit-sample.asp

If you suspect you have a new, unknown virus and have a system where you can
do a test scan, you may first wish to try our Beta Hourly DATs to get the
latest detection available at:
http://vil.mcafeesecurity.com/vil/averttools.asp


Thanks - McAfee AVERT(tm)


Re: Stracing Amanda (was: RSDL for 2.6.21-rc3- 0.29)

2007-03-12 Thread Gene Heskett
On Monday 12 March 2007, Nish Aravamudan wrote:
On 3/12/07, Gene Heskett [EMAIL PROTECTED] wrote:
 On Monday 12 March 2007, Douglas McNaught wrote:
 Patrick Mau [EMAIL PROTECTED] writes:
  Why not temporarly replace /bin/tar with a shell script that
  does:
 
  #!/bin/sh
  exec strace -f -o output /bin/real.tar "$@"
 
 You beat me to it.  :) I've done that before; it's a great
  suggestion.
 
 Except that if you expect 'tar' to be invoked multiple times in a
  run, you should probably use 'output.$$' for the output filename so
  things don't get clobbered.
 
 -Doug

 In my case, Doug, it will get invoked 64 times: amanda does a dummy
 run to get an estimate (32 runs, one per disklist entry, and I have
 32), calculates what to do based on that output, and then reruns tar
 with the appropriate level options against each individual disklist
 entry.

 But I'm a bit puzzled: what does the double $$ do, or is it buried
 someplace in the bash manpage?  It's not something I've stumbled over
 yet.

buried indeed:

Special Parameters:
  ...
   $      Expands to the process ID of the shell.  In a () subshell, it
          expands to the process ID of the current shell, not the subshell.


Well, that's clear enough, but what of the double $$ case?  Would this 
then make a PID unique to each invocation until it finally wraps a 16 
bit value, or will the kernel reuse them because they won't all be 
running simultaneously?  Concurrency is limited by the number of unique 
'spindle' numbers on the system, to prevent, as best it can, thrashing 
a drive by having tar work on 2 (or more) separate partitions at the 
same time.  In my case 2 are possible, as /var is on a separate drive.

Thanks,
Nish



-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
Say yur prayers, yuh flea-pickin' varmint!
-- Yosemite Sam
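For reference, a minimal C sketch of what the suggested wrapper does, assuming the `output.$$` naming Doug proposed. In the bash manpage the special parameter is named `$`, and `$$` is how you reference it, so each wrapper shell substitutes its own PID into the file name before `exec` replaces it with strace. The helpers `trace_file_name`, `build_strace_argv` and `exec_traced_tar` are hypothetical names; `/bin/real.tar` is the renamed tar binary from Patrick's script. PIDs are only unique among processes alive at the same time, so over a long run a PID (and therefore a trace name) can in principle be handed out again after its earlier owner has exited.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Format "<prefix>.<pid>", e.g. "output.4242": the C equivalent of
 * what "output.$$" expands to inside the wrapper shell.
 * Returns 0 on success, -1 if buf is too small. */
int trace_file_name(char *buf, size_t len, const char *prefix, pid_t pid)
{
	int n = snprintf(buf, len, "%s.%ld", prefix, (long)pid);

	return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Fill out[] with: strace -f -o <name> /bin/real.tar <args...> NULL.
 * out must have room for nargs + 6 pointers. Each argument stays a
 * separate element, which is what the quoted "$@" preserves. */
void build_strace_argv(const char **out, const char *name,
		       int nargs, char *const args[])
{
	int i, k = 0;

	assert(nargs >= 0);
	out[k++] = "strace";
	out[k++] = "-f";
	out[k++] = "-o";
	out[k++] = name;
	out[k++] = "/bin/real.tar";
	for (i = 0; i < nargs; i++)
		out[k++] = args[i];
	out[k] = NULL;
}

/* The whole wrapper: name the trace after our own PID, then replace
 * this process with strace, as "exec" does in the shell script.
 * Only returns (with -1) if building the name or execvp() fails. */
int exec_traced_tar(int argc, char *argv[])
{
	char name[64];
	const char *args[argc + 5];	/* (argc - 1) args + 5 fixed + NULL */

	if (argc < 1 || trace_file_name(name, sizeof(name), "output", getpid()) != 0)
		return -1;
	build_strace_argv(args, name, argc - 1, argv + 1);
	execvp(args[0], (char *const *)args);
	return -1;
}
```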


3280277 - ynlg

2007-03-12 Thread Virus Research
AVERT Labs - Beaverton

Current Scan Engine Version:5100.0194

Current DAT Version:4982.

Thank you for your submission.


Analysis ID: 3280277

File Name         | Findings          | Detection             | Type  | Extra
[EMAIL PROTECTED] | current detection | w32/[EMAIL PROTECTED] | Virus | no

current detection [EMAIL PROTECTED]


The file received is infected and can be detected and removed with our
current DAT files and engine. It is recommended that you update your DAT
and engine files and scan your computer again.


If you are not seeing this with the product you are using, please speak with
technical support so that they can help you determine the cause of this
discrepancy.


To find detailed information about viruses and other malware, please review
AVERT's Virus Information Library:


http://vil.mcafeesecurity.com


In order to get the fastest possible response, you may wish to submit future
virus-samples to:


https://www.webimmune.net/default.asp


In most cases it can respond almost instantly with a solution. This may also
be the best option if you are having a problem with gateway scanners
stripping your sample submission.


If you believe your computer is infected, but are unsure which files should
be submitted to AVERT for review, please visit:


http://vil.mcafeesecurity.com/vil/submit-sample.aspx


For other virus-related information, please review the AVERT homepage at:


http://www.mcafee.com/us/threat_center/default.asp


Support -


Virus Research accepts file-samples for analysis and possible inclusion into
AV signature DAT sets. We are also prepared to answer general virus
questions. All product-related questions and comments can be addressed
through technical support and customer service, including:


* Product installation and update questions

* Product usage questions

* Specific operating system/version questions

* Assistance with detection and cleaning or removal of viruses or trojans


Use the following link to update your DAT and scan engine to the most
current version:

http://www.mcafee.com/apps/downloads/security_updates/dat.asp


Use the following links to reach online technical support for McAfee
products -

Corporate Customers:


http://www.mcafeesecurity.com/us/support/


Single User/Retail Customers:


http://www.mcafeehelp.com


Note -


Due to the prevalence of network gateway AV products, it is important that
all submissions be zipped and the zip file password-protected (password -
infected). Some products will reject an email that contains a virus that is
not sent in this way. In addition, we often receive a file that appears not
to have been infected, only to find later that the file was infected when it
left the sender and was cleaned somewhere along the line.


Regards,




McAfee AVERT tm

A division of McAfee, Inc



2.6.21rc suspend to ram regression on Lenovo X60

2007-03-12 Thread Dave Jones
I spent considerable time over the last day or so bisecting to
find out why an X60 stopped resuming sometime between 2.6.20 and current -git.
(Total lockup, black screen of death.)

The bisect log looked like this.

git-bisect start
# bad: [c8f71b01a50597e298dc3214a2f2be7b8d31170c] Linux 2.6.21-rc1
git-bisect bad c8f71b01a50597e298dc3214a2f2be7b8d31170c
# good: [fa285a3d7924a0e3782926e51f16865c5129a2f7] Linux 2.6.20
git-bisect good fa285a3d7924a0e3782926e51f16865c5129a2f7
# bad: [574009c1a895aeeb85eaab29c235d75852b09eb8] Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
git-bisect bad 574009c1a895aeeb85eaab29c235d75852b09eb8
# bad: [43187902cbfafe73ede0144166b741fb0f7d04e1] Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
git-bisect bad 43187902cbfafe73ede0144166b741fb0f7d04e1
# good: [1545085a28f226b59c243f88b82ea25393b0d63f] drm: Allow for 44 bit user-tokens (or drm_file offsets)
git-bisect good 1545085a28f226b59c243f88b82ea25393b0d63f
# good: [c96e2c92072d3e78954c961f53d8c7352f7abbd7] Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/usb-2.6
git-bisect good c96e2c92072d3e78954c961f53d8c7352f7abbd7
# good: [31c56d820e03a2fd47f81d6c826f92caf511f9ee] [POWERPC] pasemi: iommu support
git-bisect good 31c56d820e03a2fd47f81d6c826f92caf511f9ee
# bad: [78149df6d565c36675463352d0bfeb02b7a7] Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6
git-bisect bad 78149df6d565c36675463352d0bfeb02b7a7
# good: [3d9c18872fa1db5c43ab97d8cbca43775998e49c] shpchp: remove CONFIG_HOTPLUG_PCI_SHPC_POLL_EVENT_MODE
git-bisect good 3d9c18872fa1db5c43ab97d8cbca43775998e49c
# good: [88187dfa4d8bb565df762f272511d2c91e427e0d] MSI: Replace pci_msi_quirk with calls to pci_no_msi()
git-bisect good 88187dfa4d8bb565df762f272511d2c91e427e0d
# good: [866a8c87c4e51046602387953bbef76992107bcb] msi: Fix msi_remove_pci_irq_vectors.
git-bisect good 866a8c87c4e51046602387953bbef76992107bcb
# good: [f7feaca77d6ad6bcfcc88ac54e3188970448d6fe] msi: Make MSI useable more architectures
git-bisect good f7feaca77d6ad6bcfcc88ac54e3188970448d6fe
# good: [14719f325e1cd4ff757587e9a221ebaf394563ee] Revert "PCI: remove duplicate device id from ata_piix"
git-bisect good 14719f325e1cd4ff757587e9a221ebaf394563ee

which led me to a final 'bad' commit of 78149df6d565c36675463352d0bfeb02b7a7
which is a merge changeset of lots of PCI bits.
Seeing a couple of MSI changes in there, on a hunch I booted the latest tree
with pci=nomsi, and it resumed again.

Any ideas how to further debug this?
I'll try backing out individual changes from that merge tomorrow.

Dave

-- 
http://www.codemonkey.org.uk
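The bisect log above is a binary search over the commits between the known-good 2.6.20 and the known-bad 2.6.21-rc1: each test build roughly halves the remaining range. A minimal C sketch of that idea, assuming a linear history and a monotonic bug (every commit before the culprit is good, every commit from it on is bad). Real `git bisect` also has to cope with merges, which this sketch ignores; `bisect_first_bad` and `bad_from` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Test callback: returns true if the build at index idx shows the bug. */
typedef bool (*is_bad_fn)(int idx, void *ctx);

/* Binary search over a linear history of n commits, indices 0..n-1.
 * Precondition: commit 0 is known good and commit n-1 is known bad.
 * Returns the index of the first bad commit; *steps receives the
 * number of test builds that were needed. */
int bisect_first_bad(int n, is_bad_fn is_bad, void *ctx, int *steps)
{
	int good = 0, bad = n - 1;	/* invariant: good is good, bad is bad */

	assert(n >= 2);
	*steps = 0;
	while (bad - good > 1) {
		int mid = good + (bad - good) / 2;

		(*steps)++;
		if (is_bad(mid, ctx))
			bad = mid;	/* culprit is at mid or earlier */
		else
			good = mid;	/* culprit is after mid */
	}
	return bad;
}

/* Simulated history for testing: every index >= *(int *)ctx is bad. */
bool bad_from(int idx, void *ctx)
{
	return idx >= *(const int *)ctx;
}
```

With this scheme the number of builds grows only logarithmically with the number of commits in the range.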


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread Kyle Moffett

On Mar 12, 2007, at 11:26:25, Linus Torvalds wrote:
So good fairness really should involve some notion of work done  
for others. It's just not very easy to do..


Maybe extend UNIX sockets to add another passable object type analogous
to SCM_RIGHTS, in this case SCM_CPUTIME.  You would send an SCM_CPUTIME
message with a duration in monotonic real-time nanoseconds and a value
out of 100 indicating what percentage of your timeslices to give to the
receiving process for that duration.  The receiving process would be
informed of the estimated total number of nanoseconds of timeslice that
it will be given, based on the priority of the processes.  (Maybe it
could prioritize requests?)  The X libraries could then properly pass
CPU time to the X server to help with rendering their requests, and the
X server could give priority to tasks which give up more CPU time than
is needed to render their data, and penalize those which use more than
they give.  Initially, even if you don't patch the X server, you could
at least patch the X clients to give up CPU to the X server to promote
interactivity.


Cheers,
Kyle Moffett
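SCM_CPUTIME does not exist; it is the proposal above. For comparison, here is a minimal C sketch of the existing SCM_RIGHTS mechanism it would be modelled on: a file descriptor travels as ancillary (control) data attached to an ordinary sendmsg() on an AF_UNIX socket, and the kernel installs a duplicate descriptor in the receiver. A new SCM_CPUTIME type would presumably carry its duration and percentage in the same control-message slot; that part is an assumption and appears only as a comment. `send_fd` and `recv_fd` are illustrative helper names, not a library API.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket as
 * SCM_RIGHTS ancillary data. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	char byte = 'F';	/* at least one byte of normal data must go along */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr hdr;	/* forces correct alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;	/* a hypothetical SCM_CPUTIME would be a new type here */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor sent by send_fd(). Returns the new descriptor
 * installed in this process, or -1 if none arrived. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```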



Re: RSDL v0.30 cpu scheduler for mainline kernels

2007-03-12 Thread Willy Tarreau
On Tue, Mar 13, 2007 at 02:05:23PM +1100, Con Kolivas wrote:
 On Tuesday 13 March 2007 10:46, David Miller wrote:
  From: Con Kolivas [EMAIL PROTECTED]
  Date: Mon, 12 Mar 2007 10:58:11 +1100
 
   http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-sched-rsdl-0.30.patch
 
  FWIW, this boots and seems to work well on sparc64.  Tested
  on UP SunBlade1500 and 24cpu Niagara T1000.
 
 Very nice. Thanks for the feedback and I'm sorry you have to work with such 
 lousy hardware.

BTW, I don't know if you said this as a joke, but that is not necessarily
lousy hardware. Sun makes lousy hardware when they put Sparcs in PCs (ultra5,
ultra10, blade100). But their servers are generally nice, with large memory
buses and very scalable SMP architectures.

Regards,
Willy



[PATCH 1/5] statically initialize struct pid for swapper

2007-03-12 Thread sukadev


From: Sukadev Bhattiprolu [EMAIL PROTECTED]
Subject: [PATCH 1/5] statically initialize struct pid for swapper

Statically initialize a struct pid for the swapper process (pid_t == 0) and
attach it to init_task.  This is needed so that the task_pid(), task_pgrp()
and task_session() interfaces also work on the swapper process.

Signed-off-by: Sukadev Bhattiprolu [EMAIL PROTECTED]
Cc: Cedric Le Goater [EMAIL PROTECTED]
Cc: Dave Hansen [EMAIL PROTECTED]
Cc: Serge Hallyn [EMAIL PROTECTED]
Cc: Eric Biederman [EMAIL PROTECTED]
Cc: Herbert Poetzl [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Acked-by: Eric W. Biederman [EMAIL PROTECTED]
---

 include/linux/init_task.h |   27 +++
 include/linux/pid.h   |2 ++
 kernel/pid.c  |2 ++
 3 files changed, 31 insertions(+)

Index: lx26-20-mm2c/include/linux/init_task.h
===
--- lx26-20-mm2c.orig/include/linux/init_task.h 2007-02-28 15:47:44.0 -0800
+++ lx26-20-mm2c/include/linux/init_task.h  2007-02-28 15:48:07.0 -0800
@@ -96,6 +96,28 @@ extern struct group_info init_groups;
 #define INIT_PREEMPT_RCU
 #endif
 
+#define INIT_STRUCT_PID {  \
+   .count  = ATOMIC_INIT(1),   \
+   .nr = 0,\
+   /* Don't put this struct pid in pid_hash */ \
+   .pid_chain  = { .next = NULL, .pprev = NULL },  \
+   .tasks  = { \
+   { .first = &init_task.pids[PIDTYPE_PID].node }, \
+   { .first = &init_task.pids[PIDTYPE_PGID].node },\
+   { .first = &init_task.pids[PIDTYPE_SID].node }, \
+   },  \
+   .rcu= RCU_HEAD_INIT,\
+}
+
+#define INIT_PID_LINK(type)\
+{  \
+   .node = {   \
+   .next = NULL,   \
+   .pprev = &init_struct_pid.tasks[type].first,\
+   },  \
+   .pid = &init_struct_pid,\
+}
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1f (=2MB)
@@ -145,6 +167,11 @@ extern struct group_info init_groups;
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers),  \
.fs_excl= ATOMIC_INIT(0),   \
.pi_lock= SPIN_LOCK_UNLOCKED,   \
+   .pids = {   \
+   [PIDTYPE_PID]  = INIT_PID_LINK(PIDTYPE_PID),\
+   [PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),   \
+   [PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),\
+   },  \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP\
 }
Index: lx26-20-mm2c/include/linux/pid.h
===
--- lx26-20-mm2c.orig/include/linux/pid.h   2007-02-28 15:48:07.0 -0800
+++ lx26-20-mm2c/include/linux/pid.h2007-02-28 15:48:07.0 -0800
@@ -51,6 +51,8 @@ struct pid
struct rcu_head rcu;
 };
 
+extern struct pid init_struct_pid;
+
 struct pid_link
 {
struct hlist_node node;
Index: lx26-20-mm2c/kernel/pid.c
===
--- lx26-20-mm2c.orig/kernel/pid.c  2007-02-28 15:48:07.0 -0800
+++ lx26-20-mm2c/kernel/pid.c   2007-02-28 15:48:07.0 -0800
@@ -27,11 +27,13 @@
 #include <linux/bootmem.h>
 #include <linux/hash.h>
 #include <linux/pid_namespace.h>
+#include <linux/init_task.h>
 
 #define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift)
 static struct hlist_head *pid_hash;
 static int pidhash_shift;
 static struct kmem_cache *pid_cachep;
+struct pid init_struct_pid = INIT_STRUCT_PID;
 
 int pid_max = PID_MAX_DEFAULT;
 
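The INIT_STRUCT_PID / INIT_PID_LINK pair works because C lets a static initializer take the address of another object with static storage duration, even one defined later, as long as it has been declared first (the patch adds `extern struct pid init_struct_pid` to pid.h for this). A user-space sketch of the same mutual reference, using simplified stand-in types: `hnode`, `hhead`, `mini_pid`, `mini_link`, `mini_task`, `MINI_PID_LINK` and the helper functions are hypothetical, whereas the kernel uses `struct hlist_node`, `struct hlist_head`, `struct pid` and `struct pid_link`.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's hlist types. */
struct hnode { struct hnode *next, **pprev; };
struct hhead { struct hnode *first; };

enum { PT_PID, PT_PGID, PT_SID, PT_MAX };

struct mini_pid {			/* plays the role of struct pid */
	int count;
	int nr;
	struct hhead tasks[PT_MAX];	/* tasks using this pid, per type */
};

struct mini_link {			/* plays the role of struct pid_link */
	struct hnode node;
	struct mini_pid *pid;
};

struct mini_task {			/* plays the role of task_struct */
	struct mini_link pids[PT_MAX];
};

/* Declared before use so the pid's initializer can point into the task. */
extern struct mini_task init_task_sketch;

struct mini_pid init_pid_sketch = {
	.count = 1,
	.nr    = 0,			/* the swapper's pid number */
	.tasks = {
		{ .first = &init_task_sketch.pids[PT_PID].node },
		{ .first = &init_task_sketch.pids[PT_PGID].node },
		{ .first = &init_task_sketch.pids[PT_SID].node },
	},
};

/* Each link is the only entry on its list: pprev points back at the head. */
#define MINI_PID_LINK(type) {						\
	.node = { .next = NULL, .pprev = &init_pid_sketch.tasks[type].first }, \
	.pid  = &init_pid_sketch,					\
}

struct mini_task init_task_sketch = {
	.pids = {
		[PT_PID]  = MINI_PID_LINK(PT_PID),
		[PT_PGID] = MINI_PID_LINK(PT_PGID),
		[PT_SID]  = MINI_PID_LINK(PT_SID),
	},
};

/* Analogue of task_pid(): now also works on the statically built task. */
struct mini_pid *task_pid_sketch(struct mini_task *t)
{
	assert(t != NULL);
	return t->pids[PT_PID].pid;
}

/* hlist invariant: a linked node satisfies *node->pprev == node. */
int node_is_linked(const struct hnode *n)
{
	return n->pprev != NULL && *n->pprev == n;
}
```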


[PATCH 3/5] Use struct pid parameter in copy_process()

2007-03-12 Thread sukadev

From: Sukadev Bhattiprolu [EMAIL PROTECTED]
Subject: [PATCH 3/5] Use struct pid parameter in copy_process()

Modify copy_process() to take a struct pid * parameter instead of a pid_t.
This simplifies the code a bit and also avoids having to call find_pid()
to convert the pid_t to a struct pid.

Changelog: 
- Addressed Badari Pulavarty's comments and passed in init_struct_pid
  from fork_idle().
- Addressed Eric Biederman's comments: simplified this patch and moved
  removal of the likely(pid) check into a separate new patch.

Signed-off-by: Sukadev Bhattiprolu [EMAIL PROTECTED]
Cc: Cedric Le Goater [EMAIL PROTECTED]
Cc: Dave Hansen [EMAIL PROTECTED]
Cc: Serge Hallyn [EMAIL PROTECTED]
Cc: Eric Biederman [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Acked-by: Eric W. Biederman [EMAIL PROTECTED]
---
 kernel/fork.c |   11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

Index: lx26-21-rc3-mm2/kernel/fork.c
===
--- lx26-21-rc3-mm2.orig/kernel/fork.c  2007-03-12 17:16:39.0 -0700
+++ lx26-21-rc3-mm2/kernel/fork.c   2007-03-12 17:17:48.0 -0700
@@ -966,7 +966,7 @@ static struct task_struct *copy_process(
unsigned long stack_size,
int __user *parent_tidptr,
int __user *child_tidptr,
-   int pid)
+   struct pid *pid)
 {
int retval;
struct task_struct *p = NULL;
@@ -1033,7 +1033,7 @@ static struct task_struct *copy_process(
p->did_exec = 0;
delayacct_tsk_init(p);  /* Must remain after dup_task_struct() */
copy_flags(clone_flags, p);
-   p->pid = pid;
+   p->pid = pid_nr(pid);
 
INIT_LIST_HEAD(&p->children);
INIT_LIST_HEAD(&p->sibling);
@@ -1265,7 +1265,7 @@ static struct task_struct *copy_process(
list_add_tail_rcu(&p->tasks, &init_task.tasks);
__get_cpu_var(process_counts)++;
}
-   attach_pid(p, PIDTYPE_PID, find_pid(p->pid));
+   attach_pid(p, PIDTYPE_PID, pid);
nr_threads++;
}
 
@@ -1336,7 +1336,8 @@ struct task_struct * __cpuinit fork_idle
struct task_struct *task;
struct pt_regs regs;
 
-   task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL, NULL, 0);
+   task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL, NULL,
+   &init_struct_pid);
if (!IS_ERR(task))
init_idle(task, cpu);
 
@@ -1364,7 +1365,7 @@ long do_fork(unsigned long clone_flags,
return -EAGAIN;
nr = pid->nr;
 
-   p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr, nr);
+   p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr, pid);
/*
 * Do this prior waking up the new thread - the thread pointer
 * might get invalid after that point, if the thread exits quickly.
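The point of this patch is that do_fork() already holds the struct pid (it reads `pid->nr` from it), so handing that pointer straight to copy_process() saves turning it into a number and then back into a struct with find_pid(). A hypothetical user-space sketch of the before/after shapes, with a toy table standing in for the pid hash. `mini_pid`, `mini_task`, `find_pid_sketch`, `pid_nr_sketch` and the two `copy_process_*` functions are illustrative names; `pid_nr_sketch` is assumed to mirror pid_nr() in returning 0 for a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

struct mini_pid { int nr; };

struct mini_task {
	int pid;			/* numeric id, like p->pid */
	struct mini_pid *pid_struct;	/* what attach_pid() links the task to */
};

/* Toy stand-in for the pid hash; counts lookups so the saving is visible. */
static struct mini_pid pid_table_sketch[] = { { 0 }, { 1 }, { 42 } };
int lookups;

struct mini_pid *find_pid_sketch(int nr)
{
	size_t i;

	lookups++;
	for (i = 0; i < sizeof(pid_table_sketch) / sizeof(pid_table_sketch[0]); i++)
		if (pid_table_sketch[i].nr == nr)
			return &pid_table_sketch[i];
	return NULL;
}

/* NULL-safe number extraction: a missing struct pid yields 0. */
int pid_nr_sketch(const struct mini_pid *pid)
{
	return pid ? pid->nr : 0;
}

/* Before: the caller passes a number and the struct is looked up again. */
void copy_process_old(struct mini_task *p, int nr)
{
	p->pid = nr;
	p->pid_struct = find_pid_sketch(p->pid);
}

/* After: the caller passes the struct pid it already holds. */
void copy_process_new(struct mini_task *p, struct mini_pid *pid)
{
	assert(pid != NULL);
	p->pid = pid_nr_sketch(pid);
	p->pid_struct = pid;
}
```

The idle-task case corresponds to passing a statically allocated pid whose `nr` is 0, as fork_idle() does with init_struct_pid.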


[PATCH 2/5] Explicitly set pgid and sid of init process

2007-03-12 Thread sukadev


From: Sukadev Bhattiprolu [EMAIL PROTECTED]
Subject: [PATCH 2/5] Explicitly set pgid and sid of init process

Explicitly set pgid and sid of init process to 1.

Signed-off-by: Sukadev Bhattiprolu [EMAIL PROTECTED]
Cc: Cedric Le Goater [EMAIL PROTECTED]
Cc: Dave Hansen [EMAIL PROTECTED]
Cc: Serge Hallyn [EMAIL PROTECTED]
Cc: Eric Biederman [EMAIL PROTECTED]
Cc: Herbert Poetzl [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Acked-by: Eric W. Biederman [EMAIL PROTECTED]
---

 init/main.c |1 +
 1 file changed, 1 insertion(+)

Index: lx26-20-mm2c/init/main.c
===
--- lx26-20-mm2c.orig/init/main.c   2007-02-28 15:49:13.0 -0800
+++ lx26-20-mm2c/init/main.c2007-02-28 15:49:35.0 -0800
@@ -791,6 +791,7 @@ static int __init init(void * unused)
 */
init_pid_ns.child_reaper = current;
 
+   __set_special_pids(1, 1);
cad_pid = task_pid(current);
 
smp_prepare_cpus(max_cpus);

