date:20070421

Re: [kvm-devel] [GIT PULL] kvm oops fix

2007-04-21 Thread Avi Kivity

David Brown wrote:
> On 4/19/07, Avi Kivity <[EMAIL PROTECTED]> wrote:
>> Linus,
>>
>> Please pull from the 'linus' branch of
>>
>>   git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
>>
>> To get a one-liner fixing a host oops when running non-PAE guests.
>>
>> Avi Kivity (1):
>>   KVM: Fix off-by-one when writing to a nonpae guest pde
>
> Ooo I thought of something else.
> Should this be applied to the current 2.6.20.7 for the next 2.6.20.8
> release?
>

Yes.  I'll prepare a patch.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH -mm] Taskstats fix the structure members alignment issue

2007-04-21 Thread Balbir Singh


Andrew Morton wrote:

On Sat, 21 Apr 2007 18:29:21 +0530 Balbir Singh <[EMAIL PROTECTED]> wrote:


The patch adds an __attribute__((aligned(8))) to the
taskstats structure members so that 32 bit applications using taskstats
can work with a 64 bit kernel.
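
A rough illustration of why the attribute matters, using made-up field names rather than the real struct taskstats: on i386 a 64-bit integer inside a struct is only 4-byte aligned, while on x86_64 it is 8-byte aligned, so the same declaration can place a member at different offsets for a 32-bit application and a 64-bit kernel. Forcing 8-byte alignment should pin the offset on both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; not the real struct taskstats. */
struct demo_plain {
	uint16_t version;
	uint32_t exitcode;
	uint8_t  flag;
	uint64_t delay_total;	/* i386 ABI: offset 12; x86_64 ABI: offset 16 */
};

struct demo_fixed {
	uint16_t version;
	uint32_t exitcode;
	uint8_t  flag;
	uint64_t delay_total __attribute__((aligned(8)));	/* offset 16 on both */
};

/* Offset of the 64-bit member as this compiler/ABI lays it out. */
size_t plain_offset(void)
{
	return offsetof(struct demo_plain, delay_total);
}

size_t fixed_offset(void)
{
	return offsetof(struct demo_fixed, delay_total);
}

/* With the attribute, the member always lands on an 8-byte boundary. */
void check_fixed_layout(void)
{
	assert(fixed_offset() % 8 == 0);
}
```

Building this with and without -m32 and comparing plain_offset() against fixed_offset() is one way to see the mismatch the patch is meant to remove.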

But there might be 32-bit applications out there which are using the
present wrong structure?

otoh, I assume that those applications would be using taskstats.h and would
hence encounter this bug and we would have heard about it, is that correct?


Yes, correct.


otoh^2, 32-bit applications running under 32-bit kernels will presently be
functioning correctly, and your change will require that those applications
be recompiled, I think?


Yes, correct. They would be broken with this fix. We could bump up the
version TASKSTATS_VERSION to 4. Would you like a new patch with the version
bumped up?


I can do that.


Thanks




This patch looks like 2.6.20 and 2.6.21 material, but very carefully...

Yes, 2.6.20 and 2.6.21 sound correct.


OK.  I guess we have little choice but to slam it in asap, with a 2.6.20.x
backport before too many people start using the old interface.


Thanks, again!

--
Balbir Singh
Linux Technology Center
IBM, ISTL

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Mike Galbraith

On Sun, 2007-04-22 at 10:08 +1000, Con Kolivas wrote:
> On Sunday 22 April 2007 08:54, Denis Vlasenko wrote:
> > On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> > > correct. Note that Willy reniced X back to 0 so it had no relevance on
> > > his test. Also note that i pointed this change out in the -v4 CFS
> > >
> > > announcement:
> > > || Changes since -v3:
> > > ||
> > > ||  - usability fix: automatic renicing of kernel threads such as
> > > ||keventd, OOM tasks and tasks doing privileged hardware access
> > > ||(such as Xorg).
> > >
> > > i've attached it below in a standalone form, feel free to put it into
> > > SD! :)
> >
> > But X problems have nothing to do with "privileged hardware access".
> > X problems are related to priority inversions between server and client
> > processes, and "one server process - many client processes" case.
> 
> It's not a privileged hardware access reason that this code is there. This is 
> obfuscation/advertising to make it look like there is a valid reason for X 
> getting negative nice levels somehow in the kernel to make interactive 
> testing of CFS better by default.

That's not a very nice thing to say, and it has no benefit unless you
specifically want to run multiple heavy X hitting clients.

I boot with that feature disabled specifically to be able to measure
fairness in a pure environment, and it's still _much_ smoother and
snappier than any RSDL/SD kernel I ever tried.

-Mike


[ANNOUNCE] Staircase Deadline cpu scheduler version 0.45

2007-04-21 Thread Con Kolivas

A significant bugfix for SMP balancing was just posted for the staircase 
deadline cpu scheduler which improves behaviour dramatically on any SMP 
machine.

Thanks to Willy Tarreau for noticing the likely fault point.

Also requested was a version in the Makefile so this version of the patch 
adds -sd045 to the kernel version.

http://ck.kolivas.org/patches/staircase-deadline/2.6.20.7-sd-0.45.patch
http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc7-sd-0.45.patch

Incrementals from 0.44:
http://ck.kolivas.org/patches/staircase-deadline/2.6.20.7/sd-0.44-0.45.patch
http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc7/sd-0.44-0.45.patch

Renicing X to -10, while not essential, may be desirable on the desktop. 
Unlike the CFS scheduler which renices X without your intervention to 
nice -19, the SD patches do not alter nice level on their own.

See the patch just posted called 'sched: implement staircase deadline 
scheduler ymf accounting fixes' for details of the fixes.

-- 
-ck

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Al Boldi

Con Kolivas wrote:
> On Sunday 22 April 2007 02:00, Ingo Molnar wrote:
> > * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > > >   Feels even better, mouse movements are very smooth even under high
> > > >   load. I noticed that X gets reniced to -19 with this scheduler.
> > > >   I've not looked at the code yet but this looked suspicious to me.
> > > >   I've reniced it to 0 and it did not change any behaviour. Still
> > > >   very good.
> > >
> > > Looks like this code does it:
> > >
> > > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
> >
> > correct.
>
> Oh I definitely was not advocating against renicing X, I just suspect that
> virtually all the users who gave glowing reports to CFS comparing it to SD
> had no idea it had reniced X to -19 behind their back and that they were
> comparing it to SD running X at nice 0. I think had they been comparing
> CFS with X nice -19 to SD running nice -10 in this interactivity soft and
> squishy comparison land their thoughts might have been different. I missed
> it in the announcement and had to go looking in the code since Willy just
> kinda tripped over it unwittingly as well.

I tried this with X's vesa driver: reflect from the mesa-demos heavily
starves new window creation on cfs-v4 with X niced to -19.  Renicing X
to 0 removes the starvation.  On SD, X reniced to -10 works great.


Thanks!

--
Al


Re: Wrong free clusters count on FAT32

2007-04-21 Thread OGAWA Hirofumi

Andrew Morton <[EMAIL PROTECTED]> writes:

> On Sun, 22 Apr 2007 07:42:22 +0900 OGAWA Hirofumi <[EMAIL PROTECTED]> wrote:
>
>> Recent Windows doesn't update ->free_clusters correctly. The "nofree"
>> option ignores the ->free_clusters stored on FSINFO.
>
> It'd be better to avoid the new mount option if possible: fatfs is used by
> a *lot* of people who aren't particularly expert, and we want to make things
> easy for them.

Ok.

> Is there some way in which we can work out what's happened and fix it up?

It seems that recent versions of Windows changed the specification, and
the change is undocumented. Windows doesn't update ->free_clusters correctly.

Probably the best we can do is to give up the speed of statfs(2) and not
use ->free_clusters. (And, if possible, win the speed back by optimization.)

This patch doesn't use ->free_clusters by default. (Instead, it adds a
"usefree" option to force its use.)
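
For context, not trusting the FSINFO value means deriving the free count from the FAT itself. A minimal sketch, assuming an in-memory array of FAT32 entries (the function name and the array are invented for illustration; the kernel reads the FAT through its own helpers): an entry whose low 28 bits are zero marks a free cluster, and the first two entries are reserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAT32_ENTRY_MASK 0x0fffffffu	/* upper 4 bits are reserved */
#define FAT_START_ENT    2		/* entries 0 and 1 are reserved */

/*
 * Count free clusters in an in-memory copy of a FAT32 table.
 * fat[] and nr_entries are hypothetical inputs for this sketch.
 */
size_t count_free_clusters(const uint32_t *fat, size_t nr_entries)
{
	size_t free_count = 0;

	assert(fat != NULL || nr_entries == 0);

	for (size_t i = FAT_START_ENT; i < nr_entries; i++) {
		if ((fat[i] & FAT32_ENTRY_MASK) == 0)
			free_count++;
	}
	return free_count;
}
```

The scan is linear in the size of the FAT, which is the statfs(2) slowdown being traded for a correct count.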
-- 
OGAWA Hirofumi <[EMAIL PROTECTED]>


It seems that recent versions of Windows changed the specification, and
the change is undocumented. Windows doesn't update ->free_clusters correctly.

This patch doesn't use ->free_clusters by default. (Instead, it adds a
"usefree" option to force its use.)

Signed-off-by: OGAWA Hirofumi <[EMAIL PROTECTED]>
---

 fs/fat/inode.c   |   14 +++---
 include/linux/msdos_fs.h |3 ++-
 2 files changed, 13 insertions(+), 4 deletions(-)

diff -puN fs/fat/inode.c~fat_dont-use_free_clusters-for-fat32 fs/fat/inode.c
--- linux-2.6/fs/fat/inode.c~fat_dont-use_free_clusters-for-fat32	2007-04-22 13:07:05.0 +0900
+++ linux-2.6-hirofumi/fs/fat/inode.c	2007-04-22 13:18:07.0 +0900
@@ -825,6 +825,8 @@ static int fat_show_options(struct seq_f
 	}
 	if (opts->name_check != 'n')
 		seq_printf(m, ",check=%c", opts->name_check);
+	if (opts->usefree)
+		seq_puts(m, ",usefree");
 	if (opts->quiet)
 		seq_puts(m, ",quiet");
 	if (opts->showexec)
@@ -850,7 +852,7 @@ static int fat_show_options(struct seq_f
 
 enum {
 	Opt_check_n, Opt_check_r, Opt_check_s, Opt_uid, Opt_gid,
-	Opt_umask, Opt_dmask, Opt_fmask, Opt_codepage, Opt_nocase,
+	Opt_umask, Opt_dmask, Opt_fmask, Opt_codepage, Opt_usefree, Opt_nocase,
 	Opt_quiet, Opt_showexec, Opt_debug, Opt_immutable,
 	Opt_dots, Opt_nodots,
 	Opt_charset, Opt_shortname_lower, Opt_shortname_win95,
@@ -872,6 +874,7 @@ static match_table_t fat_tokens = {
 	{Opt_dmask, "dmask=%o"},
 	{Opt_fmask, "fmask=%o"},
 	{Opt_codepage, "codepage=%u"},
+	{Opt_usefree, "usefree"},
 	{Opt_nocase, "nocase"},
 	{Opt_quiet, "quiet"},
 	{Opt_showexec, "showexec"},
@@ -951,7 +954,7 @@ static int parse_options(char *options, 
 	opts->quiet = opts->showexec = opts->sys_immutable = opts->dotsOK =  0;
 	opts->utf8 = opts->unicode_xlate = 0;
 	opts->numtail = 1;
-	opts->nocase = 0;
+	opts->usefree = opts->nocase = 0;
 	*debug = 0;
 
 	if (!options)
@@ -979,6 +982,9 @@ static int parse_options(char *options, 
 		case Opt_check_n:
 			opts->name_check = 'n';
 			break;
+		case Opt_usefree:
+			opts->usefree = 1;
+			break;
 		case Opt_nocase:
 			if (!is_vfat)
 				opts->nocase = 1;
@@ -1306,7 +1312,9 @@ int fat_fill_super(struct super_block *s
 			   le32_to_cpu(fsinfo->signature2),
 			   sbi->fsinfo_sector);
 		} else {
-			sbi->free_clusters = le32_to_cpu(fsinfo->free_clusters);
+			if (sbi->options.usefree)
+				sbi->free_clusters =
+					le32_to_cpu(fsinfo->free_clusters);
 			sbi->prev_free = le32_to_cpu(fsinfo->next_cluster);
 		}
 
diff -puN include/linux/msdos_fs.h~fat_dont-use_free_clusters-for-fat32 include/linux/msdos_fs.h
--- linux-2.6/include/linux/msdos_fs.h~fat_dont-use_free_clusters-for-fat32	2007-04-22 13:07:05.0 +0900
+++ linux-2.6-hirofumi/include/linux/msdos_fs.h	2007-04-22 13:09:14.0 +0900
@@ -205,7 +205,8 @@ struct fat_mount_options {
 		 numtail:1,   /* Does first alias have a numeric '~1' type tail? */
 		 atari:1, /* Use Atari GEMDOS variation of MS-DOS fs */
 		 flush:1,	  /* write things quickly */
-		 nocase:1;	  /* Does this need case conversion? 0=need case conversion*/
+		 nocase:1,	  /* Does this need case conversion? 0=need case conversion*/
+		 usefree:1;	  /* Use free_clusters for FAT32 */
 };
 
 #define FAT_HASH_BITS	8
_

[PATCH] sched: ymf typo

2007-04-21 Thread Con Kolivas

Typo in comment, 1us not 1ms.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 kernel/sched.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.21-rc7-sd/kernel/sched.c
===
--- linux-2.6.21-rc7-sd.orig/kernel/sched.c 2007-04-22 14:22:14.0 +1000
+++ linux-2.6.21-rc7-sd/kernel/sched.c  2007-04-22 14:22:34.0 +1000
@@ -3045,7 +3045,7 @@ update_cpu_clock(struct task_struct *p, 
/*
 * Called from context_switch there should be less than one
 * jiffy worth, and not negative/overflow. There should be
-* some time banked here so use a nominal 1ms.
+* some time banked here so use a nominal 1us.
 */
if (time_diff > JIFFIES_TO_NS(1) || time_diff < 1)
time_diff = 1000;

-- 
-ck

[PATCH] sched: implement staircase deadline scheduler ymf accounting fixes

2007-04-21 Thread Con Kolivas

This causes significant improvements on SMP hardware. I don't think the kernel
should be -nicing X by itself; that should be a sysadmin choice so I won't
be including that change in the SD patches. The following change will be in
the next release of SD (v0.45).

Andrew, please apply on top of yaf-fix

---
SMP balancing broke on converting time_slice to usecs.

update_cpu_clock is unnecessarily complex and doesn't allow sub usec values.

Thanks to Willy Tarreau <[EMAIL PROTECTED]> for picking up SMP idle anomalies.
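
The simplified accounting in the diff below boils down to a clamp on the elapsed nanoseconds, followed by a division to get microseconds. A standalone sketch of that logic, with HZ fixed at 1000 purely as an assumption and with made-up function names:

```c
#include <assert.h>

#define HZ 1000				/* assumed tick rate for this sketch */
#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000L / HZ))

/* Hypothetical standalone version of the clamp in update_cpu_clock(). */
long clamp_time_diff(long time_diff, int tick)
{
	if (tick) {
		/* From scheduler_tick(): expect under two jiffies, not negative. */
		if (time_diff > JIFFIES_TO_NS(2) || time_diff < 0)
			time_diff = JIFFIES_TO_NS(1);
	} else {
		/* From context_switch: expect under one jiffy; bank a nominal 1us. */
		if (time_diff > JIFFIES_TO_NS(1) || time_diff < 1)
			time_diff = 1000;
	}
	return time_diff;
}

/* time_slice is kept in microseconds, so the charge is time_diff / 1000. */
long charge_usecs(long time_diff, int tick)
{
	long ns = clamp_time_diff(time_diff, tick);

	assert(ns >= 0);
	return ns / 1000;
}
```

Note that a valid sub-microsecond interval still charges zero microseconds to time_slice in this sketch; only sched_time would see the full nanosecond value.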

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 kernel/sched.c |   42 +-
 1 file changed, 17 insertions(+), 25 deletions(-)

Index: linux-2.6.21-rc7-sd/kernel/sched.c
===
--- linux-2.6.21-rc7-sd.orig/kernel/sched.c 2007-04-21 22:50:31.0 +1000
+++ linux-2.6.21-rc7-sd/kernel/sched.c  2007-04-22 13:29:29.0 +1000
@@ -88,12 +88,10 @@ unsigned long long __attribute__((weak))
 #define SCHED_PRIO(p)  ((p)+MAX_RT_PRIO)
 
 /* Some helpers for converting to/from various scales.*/
-#define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
 #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
+#define MS_TO_NS(TIME)		((TIME) * 1000000)
 #define MS_TO_US(TIME)		((TIME) * 1000)
-/* Can return 0 */
-#define MS_TO_JIFFIES(TIME)	((TIME) * HZ / 1000)
-#define JIFFIES_TO_MS(TIME)	((TIME) * 1000 / HZ)
+#define US_TO_MS(TIME)		((TIME) / 1000)
 
 #define TASK_PREEMPTS_CURR(p, curr)((p)->prio < (curr)->prio)
 
@@ -876,29 +874,28 @@ static void requeue_task(struct task_str
 
 /*
  * task_timeslice - the total duration a task can run during one major
- * rotation. Returns value in jiffies.
+ * rotation. Returns value in milliseconds as the smallest value can be 1.
  */
-static inline int task_timeslice(struct task_struct *p)
+static int task_timeslice(struct task_struct *p)
 {
-   int slice;
+   int slice = p->quota;   /* quota is in us */
 
-   slice = NS_TO_JIFFIES(p->quota);
if (!rt_task(p))
slice += (PRIO_RANGE - 1 - TASK_USER_PRIO(p)) * slice;
-   return slice;
+   return US_TO_MS(slice);
 }
 
 /*
  * Assume: static_prio_timeslice(NICE_TO_PRIO(0)) == DEF_TIMESLICE
  * If static_prio_timeslice() is ever changed to break this assumption then
- * this code will need modification
+ * this code will need modification. Scaled as multiples of milliseconds.
  */
 #define TIME_SLICE_NICE_ZERO DEF_TIMESLICE
 #define LOAD_WEIGHT(lp) \
(((lp) * SCHED_LOAD_SCALE) / TIME_SLICE_NICE_ZERO)
 #define TASK_LOAD_WEIGHT(p)LOAD_WEIGHT(task_timeslice(p))
 #define RTPRIO_TO_LOAD_WEIGHT(rp)  \
-   (LOAD_WEIGHT((MS_TO_JIFFIES(rr_interval) + 20 + (rp))))
+   (LOAD_WEIGHT((rr_interval + 20 + (rp))))
 
 static void set_load_weight(struct task_struct *p)
 {
@@ -3035,32 +3032,27 @@ static void
 update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now,
 int tick)
 {
-   cputime64_t time_diff = now - p->last_ran;
-   const unsigned int min_diff = 1000;
-   int us_time_diff;
+   long time_diff = now - p->last_ran;
 
if (tick) {
/*
 * Called from scheduler_tick() there should be less than two
 * jiffies worth, and not negative/overflow.
 */
-   if (time_diff > JIFFIES_TO_NS(2) || time_diff < min_diff)
+   if (time_diff > JIFFIES_TO_NS(2) || time_diff < 0)
time_diff = JIFFIES_TO_NS(1);
} else {
/*
 * Called from context_switch there should be less than one
-* jiffy worth, and not negative/overflowed. In the case when
-* sched_clock fails to return high resolution values this
-* also ensures at least 1 min_diff gets banked.
+* jiffy worth, and not negative/overflow. There should be
+* some time banked here so use a nominal 1ms.
 */
-   if (time_diff > JIFFIES_TO_NS(1) || time_diff < min_diff)
-   time_diff = min_diff;
+   if (time_diff > JIFFIES_TO_NS(1) || time_diff < 1)
+   time_diff = 1000;
}
/* time_slice accounting is done in usecs to avoid overflow on 32bit */
-   us_time_diff = time_diff;
-   us_time_diff /= 1000;
if (p != rq->idle && p->policy != SCHED_FIFO)
-   p->time_slice -= us_time_diff;
+   p->time_slice -= time_diff / 1000;
p->sched_time += time_diff;
p->last_ran = rq->most_recent_timestamp = now;
 }
@@ -4636,8 +4628,8 @@ long sys_sched_rr_get_interval(pid_t pid
if (retval)
goto out_unlock;
 
-   jiffies_to_timespec(p->policy == SCHED_FIFO ?
-   0 : task_timeslice(p), );
+   t =

Re: Wrong free clusters count on FAT32

2007-04-21 Thread Andrew Morton

On Sun, 22 Apr 2007 07:42:22 +0900 OGAWA Hirofumi <[EMAIL PROTECTED]> wrote:

> Recent Windows doesn't update ->free_clusters correctly. The "nofree"
> option ignores the ->free_clusters stored on FSINFO.

It'd be better to avoid the new mount option if possible: fatfs is used by
a *lot* of people who aren't particularly expert, and we want to make things
easy for them.

Is there some way in which we can work out what's happened and fix it up?

Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Nick Piggin


Nick Piggin wrote:
> Rik van Riel wrote:
>> Andrew Morton wrote:
>>> On Fri, 20 Apr 2007 17:38:06 -0400
>>> Rik van Riel <[EMAIL PROTECTED]> wrote:
>>>> Andrew Morton wrote:
>>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>>
>>>>> - Nick's patch also will help this problem.  It could be that your patch
>>>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>>>
>>>>>   It could well be that the combination of the two is even better, but it
>>>>>   would be nice to firm that up a bit.
>>>>
>>>> I'll test that.
>>>
>>> Thanks.
>>
>> Well, good news.
>>
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches, there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/second for each combination:
>>
>> threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem
>>       1       610         609                596                    545
>>       2      1032        1136               1196                   1200
>>       4      1070        1128               2014                   2024
>>       8      1000        1088               1665                   2087
>>      16       779        1073               1310                   1999
>
> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?
>
> The strange thing with your madv_free kernel is that it doesn't
> help single-threaded performance at all. So that work to avoid
> zeroing the new page is not a win at all there (maybe due to the
> cache effects I was worried about?).
>
> However MADV_FREE does improve scalability, which is interesting.
> The most likely reason I can see why that may be the case is that
> it avoids mmap_sem when faulting pages back in (I doubt it is due
> to avoiding the page allocator, but maybe?).
>
> So where is the down_write coming from in this workload, I wonder?
> Heap management? What syscalls?
>
> x86_64's rwsems are crap under heavy parallelism (even read-only),
> as I fixed in my recent generic rwsems patch. I don't expect MySQL
> to be such a mmap_sem microbenchmark, but I wonder how much this
> would help?
>
> What if we ran the private futexes patch to further cut down
> mmap_sem contention?


Hmm, without the MADV_FREE patch, I wonder if it isn't doing something
silly like read-faulting in a ZERO_PAGE then write faulting a new page
straight afterwards.. I'll have to try a few tests.

--
SUSE Labs, Novell Inc.

Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Nick Piggin


Rik van Riel wrote:
> Andrew Morton wrote:
>> On Fri, 20 Apr 2007 17:38:06 -0400
>> Rik van Riel <[EMAIL PROTECTED]> wrote:
>>> Andrew Morton wrote:
>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>
>>>> - Nick's patch also will help this problem.  It could be that your patch
>>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>>
>>>>   It could well be that the combination of the two is even better, but it
>>>>   would be nice to firm that up a bit.
>>>
>>> I'll test that.
>>
>> Thanks.
>
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/second for each combination:
>
> threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem
>       1       610         609                596                    545
>       2      1032        1136               1196                   1200
>       4      1070        1128               2014                   2024
>       8      1000        1088               1665                   2087
>      16       779        1073               1310                   1999



Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?

The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all. So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).

However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?

x86_64's rwsems are crap under heavy parallelism (even read-only),
as I fixed in my recent generic rwsems patch. I don't expect MySQL
to be such a mmap_sem microbenchmark, but I wonder how much this
would help?

What if we ran the private futexes patch to further cut down
mmap_sem contention?

--
SUSE Labs, Novell Inc.

[PATCH] v9fs: don't use primary fid when removing file

2007-04-21 Thread Latchesar Ionkov


v9fs_remove uses v9fs_fid_lookup (which also locks the fid) to get the primary
fid associated with the dentry and destroys the v9fs_fid struct after removing
the file. If another process called v9fs_fid_lookup on the same dentry, it may
wait indefinitely for the fid's lock (as the struct is freed).

This patch changes v9fs_remove to use a cloned fid, so the primary fid is
not locked and freed.

Signed-off-by: Latchesar Ionkov <[EMAIL PROTECTED]>

---
commit ca1a80584fc3211dac158492173467d4f87a27ac
tree 787de07bd6d24bdcc9907f90d9085dcd774b2ea4
parent 0f851021c0f91e5073fa89f26b5ac68e23df8e11
author Latchesar Ionkov <[EMAIL PROTECTED]> Sat, 21 Apr 2007 13:37:15 -0600
committer Latchesar Ionkov <[EMAIL PROTECTED]> Sat, 21 Apr 2007 13:37:15 -0600

fs/9p/vfs_inode.c |2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 124a085..b01b0a4 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -415,7 +415,7 @@ static int v9fs_remove(struct inode *dir, struct dentry *file, int rmdir)
 	file_inode = file->d_inode;
 	sb = file_inode->i_sb;
 	v9ses = v9fs_inode2v9ses(file_inode);
-	v9fid = v9fs_fid_lookup(file);
+	v9fid = v9fs_fid_clone(file);
 	if(IS_ERR(v9fid))
 		return PTR_ERR(v9fid);

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Gene Heskett

On Saturday 21 April 2007, Con Kolivas wrote:
>On Sunday 22 April 2007 04:17, Gene Heskett wrote:
>> More first impressions of sd-0.44 vs CFS-v4
>
>Thanks Gene.
>
>> CFS-v4 is quite smooth in terms of the users experience but after
>> prolonged observations approaching 24 hours, it appears to choke the cpu
>> hog off a bit even when the system has nothing else to do.  My amanda runs
>> went from 1 to 1.5 hours depending on how much time it took gzip to handle
>> the amount of data tar handed it, up to about 165m & change, or nearly 3
>> hours pretty consistently over 5 runs.
>>
>> sd-0.44 so far seems to be handling the same load (there's a backup running
>> right now) fairly well also, and possibly there's a bit more snap to the
>> system now.  A switch to screen 1 from this screen 8, and the loading of
>> that screen image, which is the Cassini shot of saturn from the backside,
>> the one showing that teeny dot to the left of Saturn that is actually us,
>> took 10 seconds with the stock 2.6.21-rc7, 3 seconds with the best of
>> Ingo's patches, and now with Con's latest, is 1 second flat. Another
>> screen however is 4 seconds, so maybe that first screen had been looked at
>> since I rebooted. However, amanda is still getting estimates so gzip
>> hasn't put a tiewrap around the kernel's neck just yet.
>>
>> Some minutes later, gzip is smunching /usr/src, and the machine doesn't
>> even know it's running, as sd-0.44 isn't giving more than 75% to gzip,
>> and probably averaging less than 50%. And it scared me a bit as it started
>> out at not over 5% for the first minute or so.  Running in the 70's now
>> according to gkrellm, with an occasional blip to 95%.  And the machine
>> generally feels good.
>>
>> I had previously given CFS-v4 a 95 score but that was before I saw the
>> general slowdown, and I believe my first impression of this one is also a
>> 95.  This on a scale of the best one of the earlier CFS patches being 100,
>> and stock 2.6.21-rc7 gets a 0.0.  This scheduler seems to be giving gzip
>> ever more cpu as time progresses, and the cpu is warming up quite nicely,
>> from about 132F idling to 149.9F now.  And my keyboard is still alive and
>> well.
>
>I'm not sure how much weight to put on what you see as the measured cpu
> usage. I have a feeling it's being wrongly reported in SD currently.
> Concentrate more on the actual progress and behaviour of things as you've
> already done.
>
>> Generally speaking, Con, I believe this one is also a keeper.  And we'll
>> see how long a backup run takes.

It looks as if it could have been 10 minutes quicker according to amplot, but 
that's entirely within the expected variations that amanda's scheduler might 
do to it.  But the one that just finished, running under CFS-v5 was only 
1h:47m, not including the verify run.  The previous backup using sd-0.44, 
took 2h:28m for a similar but not identical operation according to amplot.  
That's a big enough diff to be an indicator I believe, but without knowing 
how much of that time was burned by gzip, it's an apples and oranges compare.
We'll see if it repeats, I coded 'catchup' to do 2 in a row.

>Great thanks for feedback.

You're quite welcome, Con.

ATM I'm doing the same thing again but booted to a CFS-v5 delta that Ingo sent 
me privately,  and except for the kmail lag/freezes everything is cool except 
the cpu, it managed to hit 150.7F during the height of one of the gzip -best 
smunching operations.  I believe the /dev/hdd writes are cranked well up from 
the earlier CFS patches also.  Unforch, this isn't something that's been
coded into amplot, so I'm stuck watching the hdd display in gkrellm and 
making SWAG's.  And we all know what they are worth.  I've made a lot of them 
in my 72 years, and my track record, with some glaring exceptions like my 2nd 
wife that I won't bore you with the details of, has been fairly decent. :)

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
You will be run over by a beer truck.

Re: 2.6.20.7 locking up hard on boot

2007-04-21 Thread Len Brown

On Saturday 21 April 2007 06:54, Marcos Pinto wrote:
> It took me several hours, but I just got done combing things over with
> bisect as Greg requested.  This is what git spit out as the problem
> patch in the end:
> 
> 7639e962234c76031d1ddf436def7fd9602be560 is first bad commit
> commit 7639e962234c76031d1ddf436def7fd9602be560
> Author: Jan Beulich <[EMAIL PROTECTED]>
> Date:   Tue Mar 13 14:04:11 2007 -0400
> 
> adjust legacy IDE resource setting (v2)
> 
> adjust legacy IDE resource setting (v2)
> 
> The change to force legacy mode IDE channels' resources to fixed non-zero
> values confuses (at least some versions of) X, because the values reported
> by the kernel and those readable from PCI config space aren't consistent
> anymore.  Therefore, this patch arranges for the respective BARs to also
> get updated if possible.
> 
> Signed-off-by: Jan Beulich <[EMAIL PROTECTED]>
> Acked-by: Alan Cox <[EMAIL PROTECTED]>
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> Signed-off-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
> Cc: Chuck Ebbert <[EMAIL PROTECTED]>
> Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>
> 
> :040000 040000 d4ee6822208dc3e205bfc92fd30121e7894e63a9
> 5155044aa75f0d2671e7f5081f5b2999f24034bd M  drivers
> bisect run success
> 

Looks like others are seeing failures due to this patch also:
http://bugzilla.kernel.org/show_bug.cgi?id=7562

-Len


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Con Kolivas

On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> 2) SD-0.44
>
>Feels good, but becomes jerky at moderately high loads. I've started
>64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
>always responds correctly but under X, mouse jumps quite a bit and
>typing in xterm or even text console feels slightly jerky. The CPU is
>not completely used, and the load varies a lot (see below). However,
>the load is shared equally between all 64 ocbench, and they do not
>deviate even after 4000 iterations. X uses less than 1% CPU during
>those tests.

Found it. I broke SMP balancing again so there is serious scope for 
improvement on SMP hardware. That explains the huge load variations. Expect 
yet another fix soon, which should improve behaviour further :)

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Linux 2.6.21-rc7 - ACPI issues? (Namespace lookup failure, AE_NOT_FOUND)

2007-04-21 Thread Len Brown

On Saturday 21 April 2007 10:03, Sunil Naidu wrote:
> Hello,
> 
> I did compile 2.6.21-rc7 for a P-III machine. Here is the ACPI part in
> the dmesg:-
> 
> ACPI Error (psargs-0355): [PRSE] Namespace lookup failure, AE_NOT_FOUND
> ACPI Error (psparse-0537): Method parse/execution failed
> [\_SB_.LNKE._PRS] (Node dfd63f40), AE_NOT_FOUND
> ACPI Exception (pci_link-0180): AE_NOT_FOUND, Evaluating _PRS [20060707]
> ACPI Error (psargs-0355): [PRSF] Namespace lookup failure, AE_NOT_FOUND
> ACPI Error (psparse-0537): Method parse/execution failed
> [\_SB_.LNKF._PRS] (Node dfd63ea0), AE_NOT_FOUND
> ACPI Exception (pci_link-0180): AE_NOT_FOUND, Evaluating _PRS [20060707]
> ACPI Error (psargs-0355): [PRSG] Namespace lookup failure, AE_NOT_FOUND
> ACPI Error (psparse-0537): Method parse/execution failed
> [\_SB_.LNKG._PRS] (Node dfd63e00), AE_NOT_FOUND
> ACPI Exception (pci_link-0180): AE_NOT_FOUND, Evaluating _PRS [20060707]
> ACPI Error (psargs-0355): [PRSH] Namespace lookup failure, AE_NOT_FOUND
> ACPI Error (psparse-0537): Method parse/execution failed
> [\_SB_.LNKH._PRS] (Node c147d75c), AE_NOT_FOUND
> ACPI Exception (pci_link-0180): AE_NOT_FOUND, Evaluating _PRS [20060707]
> 
> I tried with few configurations (config) to solve the problem, am not
> sure what's causing this failure. Any hint?


This is an AML run-time error from a PCI Interrupt Link
trying to evaluate its "Possible Resource Settings" (_PRS) --
i.e. the IRQs that a programmable interrupt link can be set to.

Please open up a bug report here:
http://bugzilla.kernel.org/enter_bug.cgi?product=ACPI

For 2.6.20.stable and the latest 2.6.21, please
build with CONFIG_ACPI_DEBUG=y, and
attach the complete output from dmesg -s64000
and paste the /proc/interrupts.

Also, please attach the output from acpidump
and lspci -vv taken from either boot.

thanks,
-Len

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ulrich Drepper


On 4/21/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> And how the hell do you imagine you'd even *know* what thread holds the
> futex?

We know this in most cases.  This is information recorded, for
instance, in the mutex data structure.  You might have missed my "the
interface must be extended" part.  This means the PID of the owning
thread will have to be passed down.  For PI mutexes this is not
necessary since the kernel already has access to the information.

> The whole point of the "f" part of the mutex is that it's fast, and we
> never see the non-contended case in the kernel.

See above.  Believe me, I know how futexes work.  But I also know what
additional information we collect.  For mutexes and in part for
rwlocks we know which thread owns the sync object.  In that case we
can easily provide the kernel with the information.
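
To make the point about recorded ownership concrete, here is a minimal
user-space sketch, assuming a lock word that holds the owner's thread ID.
The struct and function names are hypothetical and are not glibc's actual
mutex layout. The uncontended acquire is a single compare-and-swap with no
system call, yet a waiter that is about to block can still read the owner
and could pass it along in an extended wait call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space lock word: 0 means unlocked, any other value
 * is the thread ID of the current owner. */
struct owned_lock {
	_Atomic uint32_t owner_tid;
};

/* Uncontended fast path: one compare-and-swap and no system call, so
 * the kernel never sees this acquisition. */
static bool owned_lock_try(struct owned_lock *l, uint32_t self_tid)
{
	uint32_t expected = 0;

	assert(self_tid != 0);	/* 0 is reserved for "unlocked" */
	return atomic_compare_exchange_strong(&l->owner_tid, &expected,
					      self_tid);
}

/* What a waiter could hand to the kernel before blocking: the thread ID
 * currently recorded as owner, or 0 if the lock is free. */
static uint32_t owned_lock_holder(struct owned_lock *l)
{
	return atomic_load(&l->owner_tid);
}

/* Release succeeds only for the recorded owner. */
static bool owned_lock_release(struct owned_lock *l, uint32_t self_tid)
{
	uint32_t expected = self_tid;

	return atomic_compare_exchange_strong(&l->owner_tid, &expected, 0);
}
```

A thread whose owned_lock_try() fails would call owned_lock_holder() and
supply that value with its wait request. How the kernel interface would
accept it is exactly the extension under discussion and is not shown here.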

Re: [mmc] alternative TI FM MMC/SD driver for 2.6.21-rc7

2007-04-21 Thread Alex Dubov

> Finally, tifm_sd module needs to be manually inserted.

By the way, the driver emits a custom uevent when a new card is detected. I was
going to look at this some day in the future, but if you want to mess a little
with hotplug scripts the issue can be easily solved.

As I already said before, many of the complications exist because this is a
universal adapter, and memorystick support is quite near in the queue. A good
hotplug script will, therefore, look at the "TIFM_CARD_TYPE" event var and load
the appropriate media driver.
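
As a rough illustration of that lookup, here is a sketch of the mapping a
hotplug helper could use. Only the tifm_sd module name comes from this
thread. The "SD" and "MS" value spellings and the tifm_ms module name are
assumptions made for the example and should be checked against the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map a TIFM_CARD_TYPE value to the module that should be loaded.
 * The value spellings and the "tifm_ms" name are assumptions. */
static const char *tifm_module_for(const char *card_type)
{
	static const struct {
		const char *type;
		const char *module;
	} map[] = {
		{ "SD", "tifm_sd" },
		{ "MS", "tifm_ms" },	/* hypothetical memorystick driver */
	};
	size_t i;

	assert(card_type != NULL);
	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (strcmp(card_type, map[i].type) == 0)
			return map[i].module;
	return NULL;	/* unknown media type: load nothing */
}
```

A helper would read the variable with getenv("TIFM_CARD_TYPE"), pass it
through this function, and hand any non-NULL result to modprobe.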



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Con Kolivas

On Sunday 22 April 2007 04:17, Gene Heskett wrote:
> More first impressions of sd-0.44 vs CFS-v4

Thanks Gene.
>
> CFS-v4 is quite smooth in terms of the users experience but after prolonged
> observations approaching 24 hours, it appears to choke the cpu hog off a
> bit even when the system has nothing else to do.  My amanda runs went from
> 1 to 1.5 hours depending on how much time it took gzip to handle the amount
> of data tar handed it, up to about 165m & change, or nearly 3 hours pretty
> consistently over 5 runs.
>
> sd-0.44 so far seems to be handling the same load (theres a backup running
> right now) fairly well also, and possibly theres a bit more snap to the
> system now.  A switch to screen 1 from this screen 8, and the loading of
> that screen image, which is the Cassini shot of saturn from the backside,
> the one showing that teeny dot to the left of Saturn that is actually us,
> took 10 seconds with the stock 2.6.21-rc7, 3 seconds with the best of
> Ingo's patches, and now with Con's latest, is 1 second flat. Another screen
> however is 4 seconds, so maybe that first screen had been looked at since I
> rebooted. However, amanda is still getting estimates so gzip hasn't put a
> tiewrap around the kernels neck just yet.
>
> Some minutes later, gzip is smunching /usr/src, and the machine doesn't
> even know it's running as sd-0.44 isn't giving more than 75% to gzip,
> and probably averaging less than 50%. And it scared me a bit as it started
> out at not over 5% for the first minute or so.  Running in the 70's now
> according to gkrellm, with an occasional blip to 95%.  And the machine
> generally feels good.
>
> I had previously given CFS-v4 a 95 score but that was before I saw the
> general slowdown, and I believe my first impression of this one is also a
> 95.  This on a scale of the best one of the earlier CFS patches being 100,
> and stock 2.6.21-rc7 gets a 0.0.  This scheduler seems to be giving gzip
> ever more cpu as time progresses, and the cpu is warming up quite nicely,
> from about 132F idling to 149.9F now.  And my keyboard is still alive and
> well.

I'm not sure how much weight to put on what you see as the measured cpu usage. 
I have a feeling it's being wrongly reported in SD currently. Concentrate 
more on the actual progress and behaviour of things as you've already done.

> Generally speaking, Con, I believe this one is also a keeper.  And we'll
> see how long a backup run takes.

Great thanks for feedback.

-- 
-ck

Re: [patch] CFS scheduler, v4

2007-04-21 Thread Gene Heskett

On Saturday 21 April 2007, S.Çağlar Onur wrote:
>On Sat, 21 Apr 2007, Gene Heskett wrote:
>> This one is another keeper IMO, or as we are fond of saying around here,
>> its good enough for the girls I go with.  If this isn't the best one so
>> far, its very very close and I'm getting pickier.  kmail is the only thing
>> that's lagging, and that's just kmail, which I believe is single threaded.
>
>Add +1 for kmail lags (by the way mines are freezes instead of lags, cause i
>cannot use konsole etc. while these happens)
>
>Cheers

Yes, you are correct: the composer in particular, or the response to the + key
for next message, will freeze for the second or two that kmail is sorting and
storing incoming mail.  This is a major problem for users of dialup on an auto
basis, because it's frozen for much of the time it takes the much slower modem
communications to complete.  Compare that to a dsl circuit, where one can have
fetchmail doing the sucking and handing it off to procmail for treatment by
spamassassin and its ilk before finally storing the incoming mail
in /var/spool/mail/gene.  kmail sees none of that background activity at all.
They actually run asynchronously here.

kmail then picks that up and sorts it to the correct kmail folder, and this
does cause the lag/freeze while it's doing that.

This latter lag/freeze is all I see, but for those that are using kmail to 
directly access their ISP's mailserver(s), this lag/freeze isn't a 1 second 
freeze, but a 10-30 second freeze, and that is truly a cast iron bitch 
version of a PITA.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
VICARIOUSLY experience some reason to LIVE!!

Re: [patch 7/8] allow unprivileged mounts

2007-04-21 Thread Shaya Potter


Andrew Morton wrote:
> On Fri, 20 Apr 2007 12:25:39 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
>
>> Define a new fs flag FS_SAFE, which denotes, that unprivileged
>> mounting of this filesystem may not constitute a security problem.
>>
>> Since most filesystems haven't been designed with unprivileged
>> mounting in mind, a thorough audit is needed before setting this flag.
>
> Practically speaking, is there any realistic likelihood that any filesystem
> apart from FUSE will ever use this?

Would it be interesting to support mounting of external file systems (be
it USB, NFS or whatever) in a way that automatically forces it to ignore
suid and devices (which are already mount time options)?   The question
I guess is, how much do you gain over a setuid program (hack?) that can
handle this?


Re: [PATCH 2/8] Kconfig: unwanted menus for s390.

2007-04-21 Thread Arnd Bergmann

On Sunday 22 April 2007, Arnd Bergmann wrote:
> I would prefer to not have 'depends on !S390' but rather 'depends on MMIO',
> because that is what really drives stuff like IPMI: they expect the device
> to be reachable through the use of ioremap or inX/outX instructions, which
> don't exist on s390.
> 
> While it's unlikely that another architecture has the same restriction,
> it expresses much clearer what you mean.
> 
> In drivers/Kconfig, you can then simply add a
> 
> config MMIO
> def_bool !S390

I just saw that we already have an option like that, with a slightly different
name.

arch/s390/Kconfig contains

config NO_IOMEM
	def_bool y

and lib/Kconfig contains

config HAS_IOMEM
	boolean
	depends on !NO_IOMEM
	default y

You should probably just use one of these two to disable any driver that
uses ioremap or similar.

Arnd <><

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Con Kolivas

On Sunday 22 April 2007 08:54, Denis Vlasenko wrote:
> On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> > correct. Note that Willy reniced X back to 0 so it had no relevance on
> > his test. Also note that i pointed this change out in the -v4 CFS
> >
> > announcement:
> > || Changes since -v3:
> > ||
> > ||  - usability fix: automatic renicing of kernel threads such as
> > ||keventd, OOM tasks and tasks doing privileged hardware access
> > ||(such as Xorg).
> >
> > i've attached it below in a standalone form, feel free to put it into
> > SD! :)
>
> But X problems have nothing to do with "privileged hardware access".
> X problems are related to priority inversions between server and client
> processes, and "one server process - many client processes" case.

It's not a privileged hardware access reason that this code is there. This is 
obfuscation/advertising to make it look like there is a valid reason for X 
getting negative nice levels somehow in the kernel to make interactive 
testing of CFS better by default.

-- 
-ck

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Con Kolivas

On Sunday 22 April 2007 02:00, Ingo Molnar wrote:
> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > >   Feels even better, mouse movements are very smooth even under high
> > >   load. I noticed that X gets reniced to -19 with this scheduler.
> > >   I've not looked at the code yet but this looked suspicious to me.
> > >   I've reniced it to 0 and it did not change any behaviour. Still
> > >   very good.
> >
> > Looks like this code does it:
> >
> > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
>
> correct. 

Oh I definitely was not advocating against renicing X, I just suspect that 
virtually all the users who gave glowing reports to CFS comparing it to SD 
had no idea it had reniced X to -19 behind their back and that they were 
comparing it to SD running X at nice 0. I think had they been comparing CFS 
with X nice -19 to SD running nice -10 in this interactivity soft and squishy 
comparison land their thoughts might have been different. I missed it in the 
announcement and had to go looking in the code since Willy just kinda tripped 
over it unwittingly as well.

> Note that Willy reniced X back to 0 so it had no relevance on 
> his test.

Oh yes I did notice that, but since the array swap is the remaining longest 
deadline in SD which would cause noticeable jerks, renicing X on SD by 
default would make the experience very different since reniced tasks do much 
better over array swaps compared to non niced tasks. I really should go and 
make the whole thing one circular list and blow away the array swap (if I can 
figure out how to do it). 

> Also note that i pointed this change out in the -v4 CFS 
>
> announcement:
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as
> ||keventd, OOM tasks and tasks doing privileged hardware access
> ||(such as Xorg).

Reading the changelog in the gloss-over fashion that I unfortunately did, even 
I missed it. 

> i've attached it below in a standalone form, feel free to put it into
> SD! :)

Hmm well I have tried my very best to do all the changes without 
changing "policy" as much as possible since that trips over so many emotive 
issues that no one can agree on, and I don't have a strong opinion on this as 
I thought it would be better for it to be a config option for X in userspace 
instead. Either way it needs to be turned on/off by admin and doing it by 
default in the kernel is... not universally accepted as good. What else 
accesses ioports that can get privileged nice levels? Does this make it 
relatively exploitable just by poking an ioport?

>   Ingo
>
> ---
>  arch/i386/kernel/ioport.c   |   13 ++---
>  arch/x86_64/kernel/ioport.c |8 ++--
>  drivers/block/loop.c|5 -
>  include/linux/sched.h   |7 +++
>  kernel/sched.c  |   40

Thanks for the patch. I'll consider it. Since end users are testing this in 
fuzzy interactivity land I may simply be forced to do this just for 
comparisons to be meaningful between CFS and SD otherwise they're not really 
comparing them on a level playing field. I had almost given up SD for dead 
meat with all the momentum CFS had gained... until recently.

-- 
-ck

RE: [ck] [ANNOUNCE] Staircase Deadline cpu scheduler version 0.44

2007-04-21 Thread Fortier,Vincent [Montreal]

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Con Kolivas
> Sent: April 21, 2007 03:57
> 
> A significant bugfix for forking tasks was just posted, so 
> here is an updated version of the staircase deadline cpu 
> scheduler. This may cause noticeable behavioural improvements 
> under certain workloads (such as compiling software with make).
> 
> Thanks to Al Boldi for making me check the fork code!
> 
> http://ck.kolivas.org/patches/staircase-deadline/2.6.20.7-sd-0.44.patch
> http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc7-sd-0.44.patch
> 
> Incrementals in
> http://ck.kolivas.org/patches/staircase-deadline/2.6.20.7/
> http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc7/
> 
> Renicing X to -10, while not essential, is preferable.
> 
> See the patch just posted called 'sched: implement staircase 
> scheduler yaf fix' for full changelog.

Kernels for FC6 x86_64 & i686 using SD 0.44 and latest source (2944) now 
available at http://linux-dev.qc.ec.gc.ca

Note that backport patches for 2.6.18 & 2.6.19 kernels are also available.

> 
> --
> -ck
> 

Again, nice work CK!

-vin

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Linus Torvalds

On Sat, 21 Apr 2007, Ulrich Drepper wrote:
> 
> If you do this, and it has been requested many a times, then please
> generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
> call is issued the remaining time in the slot should be given to the
> thread currently owning the futex.

And how the hell do you imagine you'd even *know* what thread holds the 
futex?

The whole point of the "f" part of the mutex is that it's fast, and we 
never see the non-contended case in the kernel. 

So we know who *blocks*, but we don't know who actually didn't block.

Linus

[RFCPT] CONFIG_DEBUG_MEMCPY: memcpy() and overlapping areas

2007-04-21 Thread Alexey Dobriyan

Overlapping areas are a job for memmove(), not memcpy().

CONFIG_DEBUG_MEMCPY prints the pointers and length in question, as well as a
backtrace, to the logs. It should also help if one of the 3 arguments got
corrupted somehow and you were lucky enough to hit the check.

So far it has booted successfully on one x86_64 box and spat out a couple of
test overlapping memcpy() backtraces, which means it's ready.
---

 arch/x86_64/lib/memmove.c   |8 
 include/asm-x86_64/string.h |9 +
 lib/Kconfig.debug   |6 ++
 3 files changed, 23 insertions(+)

--- a/arch/x86_64/lib/memmove.c
+++ b/arch/x86_64/lib/memmove.c
@@ -9,7 +9,15 @@
 void *memmove(void * dest,const void *src,size_t count)
 {
 	if (dest < src) { 
+#ifdef CONFIG_DEBUG_MEMCPY
+		char *d = dest;
+		const char *s = src;
+		while (count--)
+			*d++ = *s++;
+		return dest;
+#else
 		return memcpy(dest,src,count);
+#endif
 	} else {
 		char *p = (char *) dest + count;
 		char *s = (char *) src + count;
--- a/include/asm-x86_64/string.h
+++ b/include/asm-x86_64/string.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 /* Written 2002 by Andi Kleen */ 
 
 /* Only used for special circumstances. Stolen from i386/string.h */ 
@@ -32,6 +34,13 @@ return (to);
 extern void *__memcpy(void *to, const void *from, size_t len); 
 static inline void *memcpy(void *dst, const void *src, size_t len)
 {
+#ifdef CONFIG_DEBUG_MEMCPY
+	if ((src < dst && src + len > dst) ||
+	    (src > dst && dst + len > src)) {
+		printk("memcpy(0x%p, 0x%p, %zu);\n", dst, src, len);
+		WARN_ON(1);
+	}
+#endif
 	if (__builtin_constant_p(len) && len >= 64)
 		return __memcpy(dst, src, len);
 	else
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -355,6 +355,12 @@ config DEBUG_LIST
 
  If unsure, say N.
 
+config DEBUG_MEMCPY
+	bool "Debug memcpy() function calls"
+	depends on DEBUG_KERNEL
+	help
+	  Enable checking for memcpy() on overlapping areas.
+
 config FRAME_POINTER
 	bool "Compile the kernel with frame pointers"
 	depends on DEBUG_KERNEL && (X86 || CRIS || M68K || M68KNOMMU || FRV || UML || S390 || AVR32 || SUPERH)
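
The overlap test in the memcpy() hunk can also be tried in user space. The
following is only a sketch under stated assumptions: copy_ranges_overlap()
and checked_memcpy() are names made up for illustration, addresses are
compared as integers instead of relying on arithmetic on void pointers, and
assert() stands in for the printk() plus WARN_ON(1) of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if [dst, dst+len) and [src, src+len) share at least one byte.
 * The distance between the two start addresses is compared with len,
 * which avoids overflow in "start + len". As in the patch, dst == src
 * is not flagged. */
static bool copy_ranges_overlap(const void *dst, const void *src, size_t len)
{
	uintptr_t d = (uintptr_t)dst;
	uintptr_t s = (uintptr_t)src;

	return (s < d && d - s < len) || (s > d && s - d < len);
}

/* memcpy() wrapper that aborts on overlapping ranges instead of
 * logging a warning. */
static void *checked_memcpy(void *dst, const void *src, size_t len)
{
	assert(!copy_ranges_overlap(dst, src, len));
	return memcpy(dst, src, len);
}
```

Two ranges of length len overlap exactly when their start addresses are
less than len bytes apart, so adjacent buffers and zero-length copies pass
the check.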


Re: unable to run busybox /sbin/int

2007-04-21 Thread Denis Vlasenko

Hi Tom.

On Thursday 19 April 2007 21:00, Tom Strader wrote:
> This is the final output from my kernel as I try to launch busybox
> (/sbin/init is linked to /bin/busybox)
> As it launches the kernel looks for libraries which do not exist (not
> sure why), but it appears to find /lib/libcrypt.so.1 and /lib/libc.so.6
> but the system does not output after that.  I can press keys on the
> keyboard and they are echoed to the screen, I can also use the control
> characters C-c, C-s, C-q, and so on and I see kernel messages indicating
> the uart_flush_buffer(0) is being called but busybox does not appear to
> start.  Here is my kernel output, any suggestions would help. Thanks.

Ok, here we go again.

Does a "hello, world" program work as init, and do you see its
output? (init=/path/to/hello_world)

If no: what is your console, serial I think? How do you specify
it on kernel command line?

If yes: does init=/bin/sh work?
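
In case it helps, a minimal sketch of such a test program follows. The
HELLO_INIT_MAIN guard and the function name are made up for this example.
Linking it statically (for instance with gcc -static -DHELLO_INIT_MAIN)
takes the shared libraries out of the picture, and it uses write(2)
directly so nothing depends on stdio buffering.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Write the banner to fd. Returns the number of bytes written, or -1
 * on error. */
static ssize_t hello_banner(int fd)
{
	static const char msg[] = "hello, world\n";

	assert(fd >= 0);
	return write(fd, msg, strlen(msg));
}

#ifdef HELLO_INIT_MAIN
/* Entry point when the file is built as a stand-alone init. */
int main(void)
{
	hello_banner(STDOUT_FILENO);
	for (;;)
		pause();	/* init must never exit, or the kernel panics */
}
#endif
```

If nothing shows up with this as init, the console setup is the first
suspect rather than busybox.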
--
vda

Re: Loud "pop" coming from hard drive on reboot

2007-04-21 Thread Peter Zijlstra

On Wed, 2007-04-18 at 17:27 -0400, Chuck Ebbert wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > On Wednesday 18 April 2007, Chuck Ebbert wrote:
> >> Mark Lord wrote:
> >>> Mark Lord wrote:
>  With the patch applied, I don't see *any* new activity in those
>  S.M.A.R.T.
>  attributes over multiple hibernates (Linux "suspend-to-disk").
> >>> Scratch that -- operator failure.  ;)
> >>> The patch makes no difference over hibernates in the SMART logs.
> >>>
> >>> It's still logging extra Power-Off_Retract_Count pegs,
> >>> which it DID NOT USED TO DO not so long ago.
> >>>
> >> Just to add to the fun, my problems are happening with the "old"
> >> IDE drivers...
> > 
> > The issue you are experiencing results in the same problem (disk doing
> > power off retract) but it has a totally different root cause - your notebook
> > loses power on reboot.  It is actually a hardware problem and as you have
> > reported the same problem is present when using "the other" OS.
> > 
> 
> My "power off retract count" increases whether I do a halt/poweroff or
> a reboot. The only difference is the volume of the noise.
> 
> And I just noticed my "seek error rate" is increasing.
> 
> /me plans purchase of another drive, definitely not Seagate...
> 
> > I think that the issue needs to be fixed (by detecting affected notebook(s)
> > using DMI?) in Linux PM handling and not in IDE subsystem because:
> >
> > * there may be some other hardware devices affected by the power loss
> >   (== they require shutdown sequence)
> >
> > * the same problem will bite if somebody decides to use libata (FC7?)
> 
> Yeah, this needs fixing too. I've been playing with another notebook and
> the power does stay on during reboot, so I wonder how widespread the
> problem is?

/me too

Thinkpad T23, with a ST980815A

Ticks every few seconds, but seems to mostly go away with 
  hdparm -B255 /dev/sda1

but I have an increasing seek error rate as well. I got the ST disk
because thinkwiki suggested it.

I suspect this problem killed the previous disk in this laptop.


Re: [PATCH 7/8] Kconfig: silicon backplane dependency.

2007-04-21 Thread Arnd Bergmann

On Friday 20 April 2007, Martin Schwidefsky wrote:
> 
> From: Martin Schwidefsky <[EMAIL PROTECTED]>
> 
> Make the "Sonics Silicon Backplane" menu dependent on the two buses
> it can be found on.
> Goes on top of git-wireless.patch.
> 
> Cc: Michael Buesch <[EMAIL PROTECTED]>
> Cc: John W. Linville <[EMAIL PROTECTED]>
> Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]>
> ---
> 
>  drivers/ssb/Kconfig |    1 +
>  1 files changed, 1 insertion(+)
> 
> diff -urpN linux-2.6/drivers/ssb/Kconfig linux-2.6-patched/drivers/ssb/Kconfig
> --- linux-2.6/drivers/ssb/Kconfig   2007-04-19 15:24:40.0 +0200
> +++ linux-2.6-patched/drivers/ssb/Kconfig   2007-04-19 15:55:44.0 +0200
> @@ -1,4 +1,5 @@
>  menu "Sonics Silicon Backplane"
> +   depends on PCI || PCMCIA

No, this doesn't look right. There are other devices that come with
SiliconBackplane but are not PCI or PCMCIA style devices.

I'd make this 'depends on MMIO' as well if you add that option.

Arnd <><

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread William Lee Irwin III

On 4/21/07, Kyle Moffett <[EMAIL PROTECTED]> wrote:
>> It might be nice if it was possible to actively contribute your CPU
>> time to a child process.  For example:
>> int sched_donate(pid_t pid, struct timeval *time, int percentage);

On Sat, Apr 21, 2007 at 12:49:52PM -0700, Ulrich Drepper wrote:
> If you do this, and it has been requested many a times, then please
> generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
> call is issued the remaining time in the slot should be given to the
> thread currently owning the futex.  For non-PI futexes this needs an
> extension of the interface but I would be up for that.  It can have
> big benefits on the throughput of an application.

It's encouraging to hear support for a more full-featured API (or, for
that matter, any response at all) on this front.


-- wli

Re: [PATCH 3/8] Kconfig: unwanted config options for s390.

2007-04-21 Thread Arnd Bergmann

On Friday 20 April 2007, Martin Schwidefsky wrote:

> diff -urpN linux-2.6/drivers/char/Kconfig linux-2.6-patched/drivers/char/Kconfig
> --- linux-2.6/drivers/char/Kconfig 2007-04-19 15:49:51.0 +0200
> +++ linux-2.6-patched/drivers/char/Kconfig 2007-04-19 15:50:50.0 +0200
> @@ -6,6 +6,7 @@ menu "Character devices"
>  
>  config VT
>   bool "Virtual terminal" if EMBEDDED
> + depends on !S390
>   select INPUT
>   default y if !VIOCONS
>   ---help---

ok

> @@ -81,6 +82,7 @@ config VT_HW_CONSOLE_BINDING
>  
>  config SERIAL_NONSTANDARD
>   bool "Non-standard serial port support"
> + depends on !S390
>   ---help---
> Say Y here if you have any non-standard serial boards -- boards
> which aren't supported using the standard "dumb" serial driver.

depends on MMIO

> @@ -774,7 +776,7 @@ config NVRAM
>  
>  config RTC
>   tristate "Enhanced Real Time Clock Support"
> - depends on !PPC && !PARISC && !IA64 && !M68K && (!SPARC || PCI) && !FRV && !ARM && !SUPERH
> + depends on !PPC && !PARISC && !IA64 && !M68K && (!SPARC || PCI) && !FRV && !ARM && !SUPERH && !S390
>   ---help---
> If you say Y here and create a character special file /dev/rtc with
> major number 10 and minor number 135 using mknod ("man mknod"), you
> @@ -822,7 +824,7 @@ config SGI_IP27_RTC
>  
>  config GEN_RTC
>   tristate "Generic /dev/rtc emulation"
> - depends on RTC!=y && !IA64 && !ARM && !M32R && !SPARC && !FRV
> + depends on RTC!=y && !IA64 && !ARM && !M32R && !SPARC && !FRV && !S390
>   ---help---
> If you say Y here and create a character special file /dev/rtc with
> major number 10 and minor number 135 using mknod ("man mknod"), you

ok.

This one is bad in general and should probably be a select from the
architecture, but that should not stop you from adding another architecture...

> @@ -878,6 +880,7 @@ config DTLK
>  
>  config R3964
>   tristate "Siemens R3964 line discipline"
> + depends on !S390
>   ---help---
> This driver allows synchronous communication with devices using the
> Siemens R3964 packet protocol. Unless you are dealing with special

Does it build? I don't see a point in disabling this one just because there are
no users. Most architectures also don't have users for this one, but it
doesn't hurt to be able to build it using allyesconfig.

Arnd <><

Re: [PATCH 2/8] Kconfig: unwanted menus for s390.

2007-04-21 Thread Arnd Bergmann

On Friday 20 April 2007, Martin Schwidefsky wrote:
> diff -urpN linux-2.6/drivers/char/ipmi/Kconfig linux-2.6-patched/drivers/char/ipmi/Kconfig
> --- linux-2.6/drivers/char/ipmi/Kconfig 2007-02-04 19:44:54.0 +0100
> +++ linux-2.6-patched/drivers/char/ipmi/Kconfig 2007-04-19 15:49:55.0 +0200
> @@ -3,6 +3,8 @@
>  #
>  
>  menu "IPMI"
> +   depends on !S390
> +
>  config IPMI_HANDLER
>         tristate 'IPMI top-level message handler'
>         help

I think I made this comment the last time we discussed this topic, but don't
remember the exact outcome.

I would prefer to not have 'depends on !S390' but rather 'depends on MMIO',
because that is what really drives stuff like IPMI: they expect the device
to be reachable through the use of ioremap or inX/outX instructions, which
don't exist on s390.

While it's unlikely that another architecture has the same restriction,
it expresses much more clearly what you mean.

In drivers/Kconfig, you can then simply add a

config MMIO
	def_bool !S390

There are a few exceptions though, that I think should not depend on MMIO:

> --- linux-2.6/drivers/dma/Kconfig 2007-04-19 15:24:33.0 +0200
> +++ linux-2.6-patched/drivers/dma/Kconfig 2007-04-19 15:49:55.0 +0200
> @@ -3,6 +3,7 @@
>  #
>  
>  menu "DMA Engine support"
> + depends on !S390
>  
>  config DMA_ENGINE
>   bool "Support for DMA engines"

I'd leave the menu enabled. If the DMA engine infrastructure becomes more widely
used, you may want to add an implementation for s390 using millicoded
instructions like xor-string or copy-page.

> diff -urpN linux-2.6/drivers/input/Kconfig linux-2.6-patched/drivers/input/Kconfig
> --- linux-2.6/drivers/input/Kconfig   2007-02-04 19:44:54.0 +0100
> +++ linux-2.6-patched/drivers/input/Kconfig   2007-04-19 15:49:55.0 +0200
> @@ -3,6 +3,7 @@
>  #
>  
>  menu "Input device support"
> + depends on !S390
>  
>  config INPUT
>   tristate "Generic input layer (needed for keyboard, mouse, ...)" if 
> EMBEDDED

Probably leave this as !S390. One could imagine channel-attached input devices
or the idea of interpreting a terminal as an input device, but no driver
currently does and probably never will.

> diff -urpN linux-2.6/drivers/isdn/Kconfig linux-2.6-patched/drivers/isdn/Kconfig
> --- linux-2.6/drivers/isdn/Kconfig 2007-02-04 19:44:54.0 +0100
> +++ linux-2.6-patched/drivers/isdn/Kconfig 2007-04-19 15:49:55.0 +0200
> @@ -3,6 +3,7 @@
>  #
>  
>  menu "ISDN subsystem"
> + depends on !S390
>  
>  config ISDN
>   tristate "ISDN support"

Same here, actually there was an IBM 2216 ISDN adapter with channel attachment,
but I don't think anybody wants to add a driver for that one.

> diff -urpN linux-2.6/drivers/misc/Kconfig linux-2.6-patched/drivers/misc/Kconfig
> --- linux-2.6/drivers/misc/Kconfig 2007-04-19 15:24:35.0 +0200
> +++ linux-2.6-patched/drivers/misc/Kconfig 2007-04-19 15:49:55.0 +0200
> @@ -3,6 +3,7 @@
>  #
>  
>  menu "Misc devices"
> + depends on !S390
>  
>  config IBM_ASM
>   tristate "Device driver for IBM RSA service processor"

Maybe just leave the menu open, all drivers in it are already depending on PCI
or similar and someone might add a driver that does work on s390 here.

> diff -urpN linux-2.6/drivers/net/phy/Kconfig linux-2.6-patched/drivers/net/phy/Kconfig
> --- linux-2.6/drivers/net/phy/Kconfig 2007-02-04 19:44:54.0 +0100
> +++ linux-2.6-patched/drivers/net/phy/Kconfig 2007-04-19 15:49:55.0 +0200
> @@ -3,6 +3,7 @@
>  #
>  
>  menu "PHY device support"
> + depends on !S390
>  
>  config PHYLIB
>   tristate "PHY Device support and infrastructure"

Also depends on !S390, not MMIO. A future network adapter might give you access
to the phy device through other means than MMIO.

> diff -urpN linux-2.6/drivers/rtc/Kconfig linux-2.6-patched/drivers/rtc/Kconfig
> --- linux-2.6/drivers/rtc/Kconfig 2007-04-19 15:24:39.0 +0200
> +++ linux-2.6-patched/drivers/rtc/Kconfig 2007-04-19 15:49:55.0 +0200
> @@ -3,6 +3,7 @@
>  #
>  
>  menu "Real Time Clock"
> + depends on !S390
>  
>  config RTC_LIB
>   tristate

Applications might actually want to use the RTC interface to access the system time
or get accurate timers, but the rtc drivers are all very dependent on either MMIO
or I2C. Not sure what would be best here.

Arnd <><

Re: 2.6.21-rc7: HPET enabled freeze my machine at boot

2007-04-21 Thread Guilherme M. Schroeder


John,

Boot ok with clocksource=acpi_pm and HPET enabled.
Any clue?

john stultz wrote:

On 4/19/07, guilherme <[EMAIL PROTECTED]> wrote:

Hi,

If i enable "High Resolution Timer Support", my machine stops here at 
boot:


Clocksource tsc unstable (delta = -297340790165 ns)
Time: hpet clocksource has been installed.

If i disable HPET, it boots fine.


Hmmm.. What happens if you boot w/ clocksource=acpi_pm ?




Re: gtod/clocksource/clockevents documentation

2007-04-21 Thread David Brownell

On Saturday 21 April 2007, Remy Bohmer wrote:
> Hello All,
> 
> I need to implement a gtod/clocksource/clockevents implementation for
> the Atmel ARM AT91SAM9261 CPU, and I am looking for some kernel
> (interface) documentation about these mechanisms.
> 
> I already investigated the 'examples'/implementations of other
> architectures in the kernel, but that did not really help me... I also
> looked at the patch for the AT91RM9200 CPU posted by David Brownell...

You know, I started that partly because I couldn't find enough
info to satisfy my curiosity about how the "new dyntick" was
expected to work ... ;)

> I hacked something which now makes the RT kernel to boot, but the
> time-of-day warps several minutes per second, back and forward in
> time... ;-)

Maybe you've got the scaling wrong on the clocksources?  The
concerns with the shift/mult stuff were particularly opaque;
some sanity checks might be worth adding, it's easy enough
to provide wrong values with exactly those failure modes, and
it's not immediately obvious what "good" values may be.
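
To make the shift/mult point concrete, here is a minimal userspace sketch of the
scaling, ns = (cycles * mult) >> shift, together with the kind of sanity check
meant above. The helper names hz_to_mult(), cyc_to_ns() and mult_shift_sane()
and the 0.1% tolerance are assumptions made for illustration; only the formula
itself and the rounding (which mirrors what the kernel's clocksource_hz2mult()
computes) come from the clocksource code.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Derive mult for a counter running at hz, rounding to nearest. */
static uint32_t hz_to_mult(uint32_t hz, uint32_t shift)
{
	uint64_t tmp;

	assert(hz != 0 && shift < 32);
	tmp = NSEC_PER_SEC << shift;
	tmp += hz / 2;
	/* Truncates silently if shift is too large for this hz. */
	return (uint32_t)(tmp / hz);
}

/* The conversion the timekeeping core applies to a cycle delta. */
static uint64_t cyc_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}

/*
 * Hypothetical sanity check: one second's worth of cycles should
 * convert back to one second, give or take 0.1%.
 */
static int mult_shift_sane(uint32_t hz, uint32_t mult, uint32_t shift)
{
	uint64_t ns = cyc_to_ns(hz, mult, shift);
	uint64_t err = ns > NSEC_PER_SEC ? ns - NSEC_PER_SEC
					 : NSEC_PER_SEC - ns;

	return err <= NSEC_PER_SEC / 1000;
}
```

With a 32768 Hz counter and shift 10 the check passes; pairing the same mult
with shift 11, or picking shift 22 so that mult no longer fits in 32 bits,
makes it fail, and a bad pair like that would make the clock run at the wrong rate.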

> So, Can anybody please give me pointer to some useful documentation
> and/or examples?

The best I found was www.kernel.org, current GIT trees.  There
were some old OLS papers too.

One thing I found lacking was the simple statement up front that
clocksources are just fixed-rate free running counters; and that
clockevents are just dual-mode (PIT or oneshot) timers.

> The reason behind this initiative is to make the RT-preempt patch
> compile and work on this architecture. Currently the GENERIC_TIME is
> not set, which results in several unresolved externals at compile
> time, and I believe the best way to fix this is to implement proper
> gtod/clocksource/clockevents to allow the enabling of GENERIC_TIME

Try the appended patch ... the clocksource should work for you,
giving GENERIC_TIME with maybe a bit of tweaking.

- Dave

=   CUT HERE
This defines a simple library to allocate at91 timer/counter blocks,
insulating drivers from CPU differences.  Making that work also implies
defining TC clocks for each processor, and handing a table of TC block
descriptors to that library.

It also defines a simple TC based clocksource.  Any available TC block
can be used as a higher precision clocksource than one based on the 32KHz
slow clock.  Its timebase is MCLK/8, usually 5-10 MHz.  The way it works
is to combine two 16-bit counters to get a 32-bit clocksource.

INCOMPLETE:
(a) The clocksource is currently not efficient for suspend/resume;
it doesn't have suspend or resume methods to shut down the TC
clocks it uses.

(b) AVR32 uses this same timer/counter block, albeit with different
input clocks and prescalers ... the library should work for
those platforms too, but  has AT91-isms.

(c) Recent versions of AVR32 code stopped using the architectural
counter for the clocksource, and switched over to the TC block.
Similar code should become fully sharable with at least the
AT91sam926 chips.

Note that AT91rm9200 chips don't really have a need for this other than
the additional precision; their "system timer" module is more powerful
than the corresponding AT91sam926 logic, and can fully support the new
style dynamic tick framework.  (Although the timer IRQ path becomes so
long that DBGU consistently drops multiple character for things like
uparrow keys at 38400 baud...)

Signed-off-by: David Brownell <[EMAIL PROTECTED]>
---
 arch/arm/mach-at91/Kconfig   |   16 ++
 arch/arm/mach-at91/Makefile  |1 
 arch/arm/mach-at91/at91rm9200.c  |   36 ++
 arch/arm/mach-at91/at91sam9260.c |   36 ++
 arch/arm/mach-at91/at91sam9261.c |   27 
 arch/arm/mach-at91/at91sam9263.c |   27 
 arch/arm/mach-at91/generic.h |4 
 arch/arm/mach-at91/tclib.c   |  220 +++
 include/linux/atmel_tc.h |  187 +
 9 files changed, 554 insertions(+)

--- at91.orig/arch/arm/mach-at91/Kconfig    2007-03-08 20:17:01.0 -0800
+++ at91/arch/arm/mach-at91/Kconfig 2007-03-11 21:58:10.0 -0700
@@ -171,6 +171,22 @@ config AT91_PROGRAMMABLE_CLOCKS
  Select this if you need to program one or more of the PCK0..PCK3
  programmable clock outputs.

+config ATMEL_TCLIB
+   bool "Timer/Counter Library"
+   help
+ Select this if you want a library to allocate the Timer/Counter
+ blocks found on many Atmel processors.  This facilitates using
+ these modules despite processor differences.
+
+config AT91_TC_CLOCKSOURCE
+   bool "Timer/Counter Clocksource"
+   depends on ATMEL_TCLIB && ARCH_AT91
+   help
+ Select this to get a higher precision clocksource, with a
+ multi-MHz PLL as the base rather than a 32 Khz oscillator.
+ This prevents other drivers from using that TC block (which
+ is currently rare), and

Re: problem with

2007-04-21 Thread Robert Hancock


liangbowen wrote:

maybe you've misunderstood my meaning.  I mean the whole  header file has only  4 lines of code in total:
#ifndef I386_SEMAPHORE_H
#define I386_SEMAPHORE_H
#include 
#endif

it's supposed to have more codes than that. like
struct semaphore {
int count;
int waking;
int lock ;  /* to make waking testing atomic */
struct wait_queue * wait;
};
and the down(), up() functions.

but I can't see any of those codes, not even the #ifdef __KERNEL__
macro.



It does in the kernel version of that header. The userspace version of 
that header has everything inside #ifdef __KERNEL__ stripped out.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Denis Vlasenko

On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> correct. Note that Willy reniced X back to 0 so it had no relevance on 
> his test. Also note that i pointed this change out in the -v4 CFS 
> announcement:
> 
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as 
> ||keventd, OOM tasks and tasks doing privileged hardware access
> ||(such as Xorg).
> 
> i've attached it below in a standalone form, feel free to put it into 
> SD! :)

But X problems have nothing to do with "privileged hardware access".
X problems are related to priority inversions between server and client
processes, and "one server process - many client processes" case.

I think the synchronous nature of Xlib (clients cannot fire-and-forget
their commands to X server, with Xlib each command waits for ACK
from server) also adds some amount of pain.
--
vda

Re: Sleep during spinlock in TPM driver

2007-04-21 Thread Andrew Morton

On Fri, 20 Apr 2007 18:11:10 -0400 "David Kyle" <[EMAIL PROTECTED]> wrote:

> I've been working with the TPM driver, and I found that if I opened,
> used, then closed the TPM char device very frequently, I would get a
> kernel BUG message saying that the kernel tried to sleep while holding
> a spinlock.  I think I've isolated the problem to this function, in
> drivers/char/tpm/tpm.c:
> 
> int tpm_release(struct inode *inode, struct file *file)
> {
> struct tpm_chip *chip = file->private_data;
> spin_lock(&driver_lock);
> file->private_data = NULL;
> chip->num_opens--;
> del_singleshot_timer_sync(&chip->user_read_timer);
> flush_scheduled_work();
> atomic_set(&chip->data_pending, 0);
> put_device(chip->dev);
> kfree(chip->data_buffer);
> spin_unlock(&driver_lock);
> return 0;
> }
> EXPORT_SYMBOL_GPL(tpm_release);
> 
> I believe that flush_scheduled_work can sleep, correct?  Does anyone
> know why this function is called while the spinlock is held?
> 

yup, that's a bug.  It's not immediately clear to me what driver_lock is
protecting.  Some global things, some per-device things, it appears.

A suitable fix might be to make driver_lock a mutex.
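
To illustrate why the lock type matters, here is a toy userspace model of the
sleep-in-atomic-context check. It is not kernel code: atomic_depth, sleep_bugs
and all model_* names are made up for this sketch, and the model_mutex_*
helpers are deliberately empty because a sleeping lock does not put the caller
into atomic context.

```c
/* Toy model: >0 while a (modeled) spinlock is held. */
static int atomic_depth;
/* Number of sleeping calls attempted while atomic. */
static int sleep_bugs;

static void model_spin_lock(void)    { atomic_depth++; }
static void model_spin_unlock(void)  { atomic_depth--; }
static void model_mutex_lock(void)   { /* may sleep, not atomic */ }
static void model_mutex_unlock(void) { }

/* Stand-in for flush_scheduled_work(): anything that may sleep. */
static void model_may_sleep_op(void)
{
	if (atomic_depth > 0)
		sleep_bugs++;
}

/* Shape of the reported code path: sleeping call under a spinlock. */
static int release_under_spinlock(void)
{
	sleep_bugs = 0;
	model_spin_lock();
	model_may_sleep_op();
	model_spin_unlock();
	return sleep_bugs;
}

/* Same path with the lock modeled as a mutex. */
static int release_under_mutex(void)
{
	sleep_bugs = 0;
	model_mutex_lock();
	model_may_sleep_op();
	model_mutex_unlock();
	return sleep_bugs;
}
```

In this model the spinlock variant records one violation and the mutex variant
records none. Whether a mutex is appropriate in the real driver depends on
whether driver_lock is ever taken from atomic context, which the sketch does
not address.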

Re: [PATCH 1/8] Kconfig: refine depends statements.

2007-04-21 Thread Arnd Bergmann

On Friday 20 April 2007, Martin Schwidefsky wrote:
> diff -urpN linux-2.6/drivers/auxdisplay/Kconfig linux-2.6-patched/drivers/auxdisplay/Kconfig
> --- linux-2.6/drivers/auxdisplay/Kconfig   2007-04-19 15:23:55.0 +0200
> +++ linux-2.6-patched/drivers/auxdisplay/Kconfig   2007-04-19 15:49:17.0 +0200
> @@ -6,6 +6,7 @@
>  #
>  
>  menu "Auxiliary Display support"
> +   depends on PARPORT_PC
>  
>  config KS0108
> tristate "KS0108 LCD Controller"

I would guess that this actually depends on PARPORT, not PARPORT_PC.

The rest of this patch looks good.

Arnd <><

Re: Wrong free clusters count on FAT32

2007-04-21 Thread OGAWA Hirofumi

DervishD <[EMAIL PROTECTED]> writes:

>  * Juergen Beisert <[EMAIL PROTECTED]> dixit:
>> On Thursday 19 April 2007 10:57, DervishD wrote:
>> > I have a portable device with a FAT32 formatted hard disk in it, and
>> > everytime I delete a file in the device *using the device itself to
>> > do it* the device increases its count of free space and if I plug
>> > the device in a Windows system, Windows agrees on the free space.
>> > Linux doesn't. Linux believes that the files are still there
>> > ocuppying space, and I have to run fsck.vfat to fix the problem.
>> 
>> As I remember: It needs a large amount of time to calculate the free
>> space on a big FAT32 system.
>
> Big fat truth, I'm afraid. The thing is that I thought that Linux
> did that from time to time to update the count. Obviously, doing it for
> every statfs call would be very expensive :((
>
>> So the last free sector count is also stored. When mounting this
>> filesystem you don't need to walk through the whole FAT to calculate
>> the available space, you can use this "cached" value instead. And this
>> cached value seems not to be updated in your portable device.
>
> It doesn't, certainly, but Windows doesn't care. Moreover, the
> device doesn't seem to recalculate the value on every run (unless it
> does it lightning fast!), so maybe the number is stored elsewhere (the
> count can be stored in many places as far as I've read, but I don't know
> the details).
>
> A mount option to force walking the FAT and getting the real info
> could be interesting. That way, it will be only done for certain devices
> (small disks, for example).

Yes. It seems that Windows does not update ->free_clusters correctly.
I think a mount option is a good solution for now. What do you think about
the attached patch?
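
For readers following the thread, this is roughly what "walking the FAT" means,
as a self-contained userspace sketch. It assumes the FAT has already been read
into memory in host byte order; fat32_count_free(), fat32_free_clusters() and
sample_fat are names invented for the illustration and are not how fs/fat
implements it. An FSINFO value of 0xFFFFFFFF is treated as "unknown".

```c
#include <stdint.h>

#define FAT32_ENTRY_MASK 0x0FFFFFFFu	/* top 4 bits are reserved */
#define FAT_START_ENT    2		/* entries 0 and 1 are reserved */

/* Walk the whole table and count entries that mark a free cluster. */
static uint32_t fat32_count_free(const uint32_t *fat, uint32_t total_clusters)
{
	uint32_t i, free_count = 0;

	for (i = FAT_START_ENT; i < total_clusters + FAT_START_ENT; i++)
		if ((fat[i] & FAT32_ENTRY_MASK) == 0)
			free_count++;
	return free_count;
}

/*
 * Trust the cached FSINFO count unless it is unknown, implausible,
 * or the (modeled) "nofree" option asks us to ignore it.
 */
static uint32_t fat32_free_clusters(const uint32_t *fat,
				    uint32_t total_clusters,
				    uint32_t fsinfo_free, int nofree)
{
	if (nofree || fsinfo_free == 0xFFFFFFFFu ||
	    fsinfo_free > total_clusters)
		return fat32_count_free(fat, total_clusters);
	return fsinfo_free;
}

/* Example table: 6 data clusters (entries 2..7), 3 of them free. */
static const uint32_t sample_fat[8] = {
	0x0FFFFFF8u, 0x0FFFFFFFu,	/* reserved entries */
	0x00000003u, 0x0FFFFFFFu,	/* a two-cluster file */
	0x00000000u, 0x00000000u,	/* free */
	0xF0000000u,			/* free: only reserved bits set */
	0x0FFFFFFFu,			/* a one-cluster file */
};
```

With a stale cached count of 1, the sketch returns 1 by default and 3 once the
cached value is ignored, which is the difference a stale FSINFO count makes.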
-- 
OGAWA Hirofumi <[EMAIL PROTECTED]>


Recent Windows doesn't update ->free_clusters correctly. The "nofree"
option ignores the ->free_clusters stored on FSINFO.

Signed-off-by: OGAWA Hirofumi <[EMAIL PROTECTED]>
---

 fs/fat/inode.c   |   13 ++---
 include/linux/msdos_fs.h |3 ++-
 2 files changed, 12 insertions(+), 4 deletions(-)

diff -puN fs/fat/inode.c~fat_ignore_free_clusters fs/fat/inode.c
--- linux-2.6/fs/fat/inode.c~fat_ignore_free_clusters	2007-04-22 07:30:13.0 +0900
+++ linux-2.6-hirofumi/fs/fat/inode.c	2007-04-22 07:30:13.0 +0900
@@ -825,6 +825,8 @@ static int fat_show_options(struct seq_f
 	}
 	if (opts->name_check != 'n')
 		seq_printf(m, ",check=%c", opts->name_check);
+	if (opts->nofree)
+		seq_puts(m, ",nofree");
 	if (opts->quiet)
 		seq_puts(m, ",quiet");
 	if (opts->showexec)
@@ -850,7 +852,7 @@ static int fat_show_options(struct seq_f
 
 enum {
 	Opt_check_n, Opt_check_r, Opt_check_s, Opt_uid, Opt_gid,
-	Opt_umask, Opt_dmask, Opt_fmask, Opt_codepage, Opt_nocase,
+	Opt_umask, Opt_dmask, Opt_fmask, Opt_codepage, Opt_nofree, Opt_nocase,
 	Opt_quiet, Opt_showexec, Opt_debug, Opt_immutable,
 	Opt_dots, Opt_nodots,
 	Opt_charset, Opt_shortname_lower, Opt_shortname_win95,
@@ -872,6 +874,7 @@ static match_table_t fat_tokens = {
 	{Opt_dmask, "dmask=%o"},
 	{Opt_fmask, "fmask=%o"},
 	{Opt_codepage, "codepage=%u"},
+	{Opt_nofree, "nofree"},
 	{Opt_nocase, "nocase"},
 	{Opt_quiet, "quiet"},
 	{Opt_showexec, "showexec"},
@@ -951,7 +954,7 @@ static int parse_options(char *options, 
 	opts->quiet = opts->showexec = opts->sys_immutable = opts->dotsOK =  0;
 	opts->utf8 = opts->unicode_xlate = 0;
 	opts->numtail = 1;
-	opts->nocase = 0;
+	opts->nofree = opts->nocase = 0;
 	*debug = 0;
 
 	if (!options)
@@ -979,6 +982,9 @@ static int parse_options(char *options, 
 		case Opt_check_n:
 			opts->name_check = 'n';
 			break;
+		case Opt_nofree:
+			opts->nofree = 1;
+			break;
 		case Opt_nocase:
 			if (!is_vfat)
 				opts->nocase = 1;
@@ -1352,7 +1358,8 @@ int fat_fill_super(struct super_block *s
 
 	sbi->max_cluster = total_clusters + FAT_START_ENT;
 	/* check the free_clusters, it's not necessarily correct */
-	if (sbi->free_clusters != -1 && sbi->free_clusters > total_clusters)
+	if ((sbi->free_clusters != -1 && sbi->free_clusters > total_clusters) ||
+	sbi->options.nofree)
 		sbi->free_clusters = -1;
 	/* check the prev_free, it's not necessarily correct */
 	sbi->prev_free %= sbi->max_cluster;
diff -puN include/linux/msdos_fs.h~fat_ignore_free_clusters include/linux/msdos_fs.h
--- linux-2.6/include/linux/msdos_fs.h~fat_ignore_free_clusters	2007-04-22 07:30:13.0 +0900
+++ linux-2.6-hirofumi/include/linux/msdos_fs.h	2007-04-22 07:30:13.0 +0900
@@ -205,7 +205,8 @@ struct fat_mount_options {
 		 numtail:1,   /* Does first alias have a numeric '~1' type tail? */
 		 atari:1, /* Use Atari GEMDOS variation of MS-DOS fs */
 		 flush:1,	  /* write things quickly */
-		 nocase:1;	  /* Does this need case conversion? 0=need case conversion*/
+		 nocase:1,	  /* Does this need case conversion? 0=need case conversion*/
+		 nofree:1;	  /* Does use free_clusters */
 };
 
 #define

Re: [RFC 0/8] Variable Order Page Cache

2007-04-21 Thread Andrew Morton

On Fri, 20 Apr 2007 17:48:18 +1000 David Chinner <[EMAIL PROTECTED]> wrote:

> Agreed - I was talking about a quick way to hack a real filesystem
> in to the VM to start exercising the new VM code without needing to
> implement compound page support down the whole I/O stack. 

Yes.  The whole point of this work is to speed stuff up, so I'd encourage
people to first work on getting some minimal scruffy prototype in place -
whatever is needed to be able to start running performance tests.

Then we can take a look at the numbers (and the types of machines and
workloads upon which they are based) and decide whether it looks like
there's any point in proceeding with a full-on implementation.

And as part of that decision-making process we should take a detailed look
at the performance of the existing code and see if there are other ways in
which it might be acceptably sped up.

Because right now we're assuming that larger pages are the only way in
which acceptable performance may be obtained.  But that has not been proved.

Re: Fault Injection issues: stacktrace x86_64 and failslab NUMA

2007-04-21 Thread Andi Kleen

> There is no FRAME_POINTER support in the x86_64 code.  Apparently it was
> removed when the unwind code was added but now that has been removed as
> well!

That's because no assembly code in x86-64 knows anything about frame pointers.
You will always get truncated traces if any assembly code is in the trace.
That's just dangerous.

> Digging around in the archives it looks like Andi Kleen knew this was an
> issue with the x86_64 fallback stacktrace code.  Now that there is no

No I didn't.

> unwind code to even attempt to avoid the problem what should be done?
> How about:
> 
> (1) Make it clear the Fault Injection with STACKTRACE on x86_64 is at
> best "Russian Roulette" -- maybe a !X86_64 in Kconfig.debug?
> 
> (2) Introduce FRAME_POINTER support back into the x86_64 code.  This is
> what Fault Injection really wants.
> 
> (3) Keep the saved stack address entries array out of sight of the
> fallback save_stack_trace() code.  Lockdep does this by storing it in
> static space but this requires locking which would be ugly for Fault
> Injection.  Another option is to mask the saved addresses so they fail
> the __kernel_text_address() test but fail_stacktrace() uses the same
> mask to make its comparisons.  There's still the problem of avoiding
> kernel text addresses stored on the stack by other code (that is, other
> than the expected stack chain uses).

Some hack in (3) would be probably best, otherwise (1).
At some point I hope we can get the dwarf2 unwinder back, then
the problem should be also solved. But then you would need to force
the dwarf2 unwinder with fault injection on, but that shouldn't
be a problem.

-Andi

Re: [PATCH] Remove "obsolete" label from ISDN4Linux (v3)

2007-04-21 Thread David Miller

From: Alan Cox <[EMAIL PROTECTED]>
Date: Sat, 21 Apr 2007 21:58:44 +0100

> On Sat, 21 Apr 2007 15:07:51 +0200
> Tilman Schmidt <[EMAIL PROTECTED]> wrote:
> 
> > From: Tilman Schmidt <[EMAIL PROTECTED]>
> > 
> > The "obsolete" label on the ISDN_I4L Kconfig option is not, and
> > has never been, accurate. It has already prompted repeated attempts
> > to remove actively used functionality from the kernel without a
> > working replacement. This patch removes the incorrect label and
> > corrects the accompanying help text.
> > 
> > Signed-off-by: Tilman Schmidt <[EMAIL PROTECTED]>
> 
> Nak-by: Alan Cox <[EMAIL PROTECTED]>
> 
> If it isn't obsolete then fix the code to use the newer APIs as its about
> to end up && BROKEN let alone Obsolete. Make yourself maintainer and go
> for it.

This is my opinion too.

There is zero work being done on that subsystem to freshen it up
and make it current in any way.

Lack of a working replacement is not an argument for anything.

[PATCH] Add missing USRobotics Wireless Adapter (Model 5423) id into zd1211rw

2007-04-21 Thread S.Çağlar Onur

[Forgot to CC LKML, Sorry!]

Hi;

The USRobotics Wireless Adapter (Model 5423) also works well with the current
zd1211rw driver (I have tested 2.6.18, 2.6.20 and 2.6.21-rc7). I know -mm and
Daniel's tree have a newer version of this driver (I think with more features,
like a rate estimator), but maybe you could consider this one as .21 material
instead of waiting for the full wireless-dev merge?

I'm not sure a "Signed-off-by" is needed, because the -mm version already has
that id, but in case it is, here it is.

Signed-off-by: S.Çağlar Onur <[EMAIL PROTECTED]>

diff --git a/drivers/net/wireless/zd1211rw/zd_usb.c b/drivers/net/wireless/zd1211rw/zd_usb.c
index aac8a1c..edaaad2 100644
--- a/drivers/net/wireless/zd1211rw/zd_usb.c
+++ b/drivers/net/wireless/zd1211rw/zd_usb.c
@@ -62,6 +62,7 @@ static struct usb_device_id usb_ids[] = {
{ USB_DEVICE(0x0471, 0x1236), .driver_info = DEVICE_ZD1211B },
{ USB_DEVICE(0x13b1, 0x0024), .driver_info = DEVICE_ZD1211B },
{ USB_DEVICE(0x0586, 0x340f), .driver_info = DEVICE_ZD1211B },
+   { USB_DEVICE(0x0baf, 0x0121), .driver_info = DEVICE_ZD1211B },
/* "Driverless" devices that need ejecting */
{ USB_DEVICE(0x0ace, 0x2011), .driver_info = DEVICE_INSTALLER },
{}


Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



Re: Fault Injection issues: stacktrace x86_64 and failslab NUMA

2007-04-21 Thread Scott Porter

On Fri, 2007-04-20 at 13:55 -0500, Scott Porter wrote:
> I'm attempting to use Fault Injection stacktrace filtering on an x86_64
> platform (see config details below) and finding problems:
> 
> (1) Apparently stacktrace on x86_64 isn't always reliable but the fault
> injection code path to save a stack trace looks *completely* unreliable
> 

An update after a closer look at the x86_64 stacktrace code:

There is no FRAME_POINTER support in the x86_64 code.  Apparently it was
removed when the unwind code was added but now that has been removed as
well!

Anyways, the current dump_trace() code scans the *entire* process-level
kernel stack looking for anything resembling a kernel text address.
Guess where Fault Injection saves the stacktrace entries it is going to
compare -- on the same kernel stack!  So, the code is tripping over
itself such that once Fault Injection with stacktrace has been enabled
for a while the stack is littered with saved kernel text addresses.
This causes false positives in fail_stacktrace() when matching stale
addresses from prior call chains as well as missed matches due to the
buffer filling up with these stale entries.  Not to mention the fact
that dump_stack() happily reports all these stale addresses in the trace
back!

Digging around in the archives it looks like Andi Kleen knew this was an
issue with the x86_64 fallback stacktrace code.  Now that there is no
unwind code to even attempt to avoid the problem what should be done?
How about:

(1) Make it clear the Fault Injection with STACKTRACE on x86_64 is at
best "Russian Roulette" -- maybe a !X86_64 in Kconfig.debug?

(2) Introduce FRAME_POINTER support back into the x86_64 code.  This is
what Fault Injection really wants.

(3) Keep the saved stack address entries array out of sight of the
fallback save_stack_trace() code.  Lockdep does this by storing it in
static space but this requires locking which would be ugly for Fault
Injection.  Another option is to mask the saved addresses so they fail
the __kernel_text_address() test but fail_stacktrace() uses the same
mask to make its comparisons.  There's still the problem of avoiding
kernel text addresses stored on the stack by other code (that is, other
than the expected stack chain uses).
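
The masking idea in (3) can be sketched in userspace as follows. The text range
constants, STASH_MASK, and every function name here are hypothetical; the real
check is __kernel_text_address() and the real scan is the fallback stack
walker. The sketch only shows that a stashed entry stops looking like a text
address while range comparisons still work when the bounds are masked the
same way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel text range, for illustration only. */
#define MODEL_TEXT_START 0xffffffff80000000ULL
#define MODEL_TEXT_END   0xffffffff80400000ULL
/*
 * Flipping the top bit moves every text address out of the text
 * range and keeps the relative order of stashed values.
 */
#define STASH_MASK       0x8000000000000000ULL

/* Stand-in for the kernel text address test. */
static int looks_like_text(uint64_t word)
{
	return word >= MODEL_TEXT_START && word < MODEL_TEXT_END;
}

/* Mask an address before saving it on the stack. */
static uint64_t stash(uint64_t addr)
{
	return addr ^ STASH_MASK;
}

/* Range filter on a stashed entry: mask the bounds the same way. */
static int stashed_in_range(uint64_t stashed, uint64_t start, uint64_t end)
{
	return stashed >= stash(start) && stashed < stash(end);
}

/* Model of the fallback scan: count words that look like text. */
static size_t count_text_words(const uint64_t *stack, size_t n)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (looks_like_text(stack[i]))
			hits++;
	return hits;
}

static const uint64_t sample_stack[4] = {
	0x0000000000001234ULL,	/* plain data */
	0xffffffff80001000ULL,	/* live return address */
	0x7fffffff80002000ULL,	/* stashed copy of 0xffffffff80002000 */
	0xffffffff80003000ULL,	/* live return address */
};
```

Scanning sample_stack finds only the two live return addresses; the stashed
copy is skipped. As noted in (3), this does nothing about text addresses left
on the stack by other code.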

Comments?

- Scott

P.S. The good news is if/when the unwind code is ready to merge back in
there is a testcase ready and waiting -- just enable Fault Injection
with failslab and the stacks will get unwound on every kmalloc call!


Re: [PATCH RFD] alternative kobject release wait mechanism

2007-04-21 Thread Alan Stern

On Fri, 20 Apr 2007, Greg KH wrote:

> > Greg, do you know of anything in particular that depends on a kobjects not 
> > being released before their children are released?
> 
> Yes, the whole driver model :)

But anything in particular?  Looking through the source code, I see 
kobj->parent gets used mainly by kobject_get_path() and not by much else.

Looking some more, kobject_get_path() is used for kobject renaming,
uevent handling, and a little bit in the input core.  None of these things 
should try to access a kobject after it has been del()ed.  After all, it's 
no longer present in the filesystem so it doesn't _have_ a path.

So I don't see any immediate problem.  A quick boot with my patch applied,
during which I installed and removed various modules and hot-pluggable
devices, didn't cause anything strange to happen.

> When adding a new device, we always grab a reference to the parent
> device so it can not go away before we do.
> 
> Look at the last kobject_put(parent); in kobject_cleanup() which ensures
> this.

Yes, I know.  The question is: What (if anything) is wrong with the parent
going away first?  So long as the parent remains present while the child
is _registered_, who will care if the parent is deallocated after the
child is unregistered but before it is released?

And if there is any code which does care, don't you think we should be
able to change it so that it doesn't?

> Ick, no, I think this used to be the way things worked, but bad things
> would end up happening, so we fixed it up to be the way things are
> today.  Read the comments for the changelog for this file for details.
> 
> Specifically, look at commit 10921a8f1305b8ec97794941db78b825db5839bc
> in the history.git repo which is almost exactly what you are proposing
> to be reverted...

Yes, it is.  I had a little trouble finding it; the search facility in the
gitweb system at git.kernel.org doesn't seem to work right.  Who should
I complain to about that?

Anyway, the patch itself is available at

http://marc.info/?l=linux-kernel&m=107116644617624&w=2

Here's what the changelog comment says:

It fixes a kobject bug where the parent could be deleted before the
child object, causing all sorts of badness later when we clean up the
child object.  It's been acked by Pat.

Not terribly explicit.  As far as I can tell, cleaning up the child object
doesn't do much except to kfree() a few items and call the ktype's
release() routine.  For a struct device, the release() routine merely
calls dev->release(), or dev->type->release(), or
dev->class->dev_release() as the case may be.  None of these should try to
access the device's parent unless they made special arrangements to
acquire a reference to it beforehand.  Which is what we're trying to
eliminate -- that's what immediate detach means.

The change was made back in December 2003, for 2.6.0-test11.  Since then
the driver-model core and its users have evolved an awful lot.  Perhaps
reverting it now won't hurt anything.

Alan Stern


Re: [patch 7/8] allow unprivileged mounts

2007-04-21 Thread Eric W. Biederman

Andi Kleen <[EMAIL PROTECTED]> writes:

> Andrew Morton <[EMAIL PROTECTED]> writes:
>
>> On Fri, 20 Apr 2007 12:25:39 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
>> 
>> > Define a new fs flag FS_SAFE, which denotes, that unprivileged
>> > mounting of this filesystem may not constitute a security problem.
>> > 
>> > Since most filesystems haven't been designed with unprivileged
>> > mounting in mind, a thorough audit is needed before setting this flag.
>> 
>> Practically speaking, is there any realistic likelihood that any filesystem
>> apart from FUSE will ever use this?
>
> If it worked for mount --bind for any fs I could see uses of this.  I haven't
> thought through the security implications though, so it might not work.

Binding a directory that you have access to in another way is essentially
the same thing as a symlink.  So there are no real security implications
there.  The only problem case I can think of is removable media that you
want to remove but someone has made a bind mount to.  But that is
essentially the same case as opening a file so there are no new
real issues.  Although our diagnostic tools will likely fall behind
for a bit.

We handle the security implications by assigning an owner to all mounts
and only allowing you to add additional mounts on top of a mount you
already own.

If you have the right capabilities you can create a mount owned by
another user.

For a new mount if you don't have the appropriate capabilities nodev
and nosuid will be forced.
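
As a rough userspace sketch of the rules just described (owner check on the
target mount, forced flags for unprivileged callers): struct model_mount, the
MODEL_* flag values and both function names are invented for illustration and
are not the interface from the patch series. Treating "capable" as a bypass of
the ownership check is also an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical flag values, not the kernel's MNT_* constants. */
#define MODEL_NOSUID 0x1u
#define MODEL_NODEV  0x2u

struct model_mount {
	unsigned int owner_uid;	/* user who owns this mount */
	unsigned int flags;
};

/* You may add a mount only on top of a mount you already own. */
static bool may_mount_on(const struct model_mount *target,
			 unsigned int uid, bool capable)
{
	return capable || target->owner_uid == uid;
}

/* Without the right capabilities, nodev and nosuid are forced. */
static unsigned int new_mount_flags(unsigned int requested, bool capable)
{
	return capable ? requested
		       : (requested | MODEL_NOSUID | MODEL_NODEV);
}

static const struct model_mount sample_target = {
	.owner_uid = 1000,
	.flags = 0,
};
```

Here uid 1000 may mount on top of sample_target, uid 1001 may not unless it is
capable, and an unprivileged request with no flags comes back with both
MODEL_NOSUID and MODEL_NODEV set.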

Initial super block creation is a lot more delicate so we need the
FS_SAFE flag, to know that the kernel is prepared to deal with the
crazy things that a hostile user space is prepared to do.

Eric

[PATCH -mm 2/2] microcode: use suspend-related CPU hotplug notifications

2007-04-21 Thread Rafael J. Wysocki

From: Rafael J. Wysocki <[EMAIL PROTECTED]>

Make the microcode driver use the suspend-related CPU hotplug notifications
to handle the CPU hotplug events occurring during system-wide suspend and
resume transitions.  Remove the global variable suspend_cpu_hotplug previously
used for this purpose.

Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
---
 arch/i386/kernel/microcode.c |   62 ---
 kernel/cpu.c |   10 --
 2 files changed, 36 insertions(+), 36 deletions(-)

Index: linux-2.6.21-rc6-mm1/arch/i386/kernel/microcode.c
===
--- linux-2.6.21-rc6-mm1.orig/arch/i386/kernel/microcode.c  2007-04-21 10:16:36.0 +0200
+++ linux-2.6.21-rc6-mm1/arch/i386/kernel/microcode.c   2007-04-21 11:03:30.0 +0200
@@ -567,7 +567,7 @@ static int cpu_request_microcode(int cpu
return error;
 }
 
-static int apply_microcode_on_cpu(int cpu)
+static int apply_microcode_check_cpu(int cpu)
 {
struct cpuinfo_x86 *c = cpu_data + cpu;
struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
@@ -575,8 +575,9 @@ static int apply_microcode_on_cpu(int cp
unsigned int val[2];
int err = 0;
 
+   /* Check if the microcode is available */
if (!uci->mc)
-   return -EINVAL;
+   return 0;
 
old = current->cpus_allowed;
set_cpus_allowed(current, cpumask_of_cpu(cpu));
@@ -614,7 +615,7 @@ static int apply_microcode_on_cpu(int cp
return err;
 }
 
-static void microcode_init_cpu(int cpu)
+static void microcode_init_cpu(int cpu, int resume)
 {
cpumask_t old;
struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
@@ -624,8 +625,7 @@ static void microcode_init_cpu(int cpu)
set_cpus_allowed(current, cpumask_of_cpu(cpu));
mutex_lock(_mutex);
collect_cpu_info(cpu);
-   if (uci->valid && system_state == SYSTEM_RUNNING &&
-   !suspend_cpu_hotplug)
+   if (uci->valid && system_state == SYSTEM_RUNNING && !resume)
cpu_request_microcode(cpu);
mutex_unlock(_mutex);
set_cpus_allowed(current, old);
@@ -702,7 +702,7 @@ static struct attribute_group mc_attr_gr
.name = "microcode",
 };
 
-static int mc_sysdev_add(struct sys_device *sys_dev)
+static int __mc_sysdev_add(struct sys_device *sys_dev, int resume)
 {
int err, cpu = sys_dev->id;
struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
@@ -711,39 +711,31 @@ static int mc_sysdev_add(struct sys_devi
return 0;
 
pr_debug("Microcode:CPU %d added\n", cpu);
-   /* If suspend_cpu_hotplug is set, the system is resuming and we should
-* use the data from before the suspend.
-*/
-   if (suspend_cpu_hotplug) {
-   err = apply_microcode_on_cpu(cpu);
-   if (err)
-   microcode_fini_cpu(cpu);
-   }
-   if (!uci->valid)
-   memset(uci, 0, sizeof(*uci));
+   memset(uci, 0, sizeof(*uci));
 
err = sysfs_create_group(_dev->kobj, _attr_group);
if (err)
return err;
 
-   if (!uci->valid)
-   microcode_init_cpu(cpu);
+   microcode_init_cpu(cpu, resume);
 
return 0;
 }
 
+static int mc_sysdev_add(struct sys_device *sys_dev)
+{
+   return __mc_sysdev_add(sys_dev, 0);
+}
+
 static int mc_sysdev_remove(struct sys_device *sys_dev)
 {
int cpu = sys_dev->id;
 
if (!cpu_online(cpu))
return 0;
+
pr_debug("Microcode:CPU %d removed\n", cpu);
-   /* If suspend_cpu_hotplug is set, the system is suspending and we should
-* keep the microcode in memory for the resume.
-*/
-   if (!suspend_cpu_hotplug)
-   microcode_fini_cpu(cpu);
+   microcode_fini_cpu(cpu);
sysfs_remove_group(_dev->kobj, _attr_group);
return 0;
 }
@@ -774,16 +766,34 @@ mc_cpu_callback(struct notifier_block *n
 
sys_dev = get_cpu_sysdev(cpu);
switch (action) {
+   case CPU_UP_CANCELED_FROZEN:
+   /* The CPU refused to come up during a system resume */
+   microcode_fini_cpu(cpu);
+   break;
case CPU_ONLINE:
-   case CPU_ONLINE_FROZEN:
case CPU_DOWN_FAILED:
-   case CPU_DOWN_FAILED_FROZEN:
mc_sysdev_add(sys_dev);
break;
+   case CPU_ONLINE_FROZEN:
+   /* System-wide resume is in progress, try to apply microcode */
+   if (apply_microcode_check_cpu(cpu)) {
+   /* The application of microcode failed */
+   microcode_fini_cpu(cpu);
+   __mc_sysdev_add(sys_dev, 1);
+   break;
+   }
+   case CPU_DOWN_FAILED_FROZEN:
+   if (sysfs_create_group(_dev->kobj, _attr_group))
+   printk(KERN_ERR "Microcode: Failed to create

Re: [RFC PATCH 1/3] x86: use defined names for all CPU feature flags

2007-04-21 Thread Alan Cox

> --- 2.6.21-rc7-d390.orig/arch/x86_64/kernel/setup.c
> +++ 2.6.21-rc7-d390/arch/x86_64/kernel/setup.c
> @@ -576,7 +576,7 @@ static void __cpuinit init_amd(struct cp
>  
>   /* Bit 31 in normal CPUID used for nonstandard 3DNow ID;
>  3DNow is IDd by bit 31 in extended CPUID (1*32+31) anyway */
> - clear_bit(0*32+31, >x86_capability);
> + clear_bit(X86_FEATURE_PBE, >x86_capability);

And this is more clear why ?

> +#define X86_FEATURE_PBE  (0*32+31) /* PBE */

For these platforms it isn't "PBE", it's 3DNow

Alan

[PATCH -mm 0/2] Add suspend-related notifications for CPU hotplug

2007-04-21 Thread Rafael J. Wysocki

[Sorry for the duplicates, I forgot to add the LKML to the CC list]

Hi,

The following two patches are intended to deal with the problem that some
CPU hotplug notifiers misbehave when they are called after tasks have been
frozen.

The first of them introduces special notifications that should allow subsystems
to distinguish normal CPU hotplug events from the ones that occur during
a suspend/resume, and the second makes the microcode driver actually use them.

Greetings,
Rafael


-- 
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King


[PATCH -mm 1/2] Add suspend-related notifications for CPU hotplug

2007-04-21 Thread Rafael J. Wysocki

From: Rafael J. Wysocki <[EMAIL PROTECTED]>

Since nonboot CPUs are now disabled after tasks and devices have been frozen
and the CPU hotplug infrastructure is used for this purpose, we need special
CPU hotplug notifications that will help the CPU-hotplug-aware subsystems
distinguish normal CPU hotplug events from CPU hotplug events related to a
system-wide suspend or resume operation in progress.  This patch introduces
such notifications and causes them to be used during suspend and resume
transitions.  It also changes all of the CPU-hotplug-aware subsystems to take
these notifications into consideration (for now they are handled in the same
way as the corresponding "normal" ones).
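
A minimal user-space sketch of how such flagged notifications compose may help here. CPU_TASKS_FROZEN (0x0010) is taken from the patch below; the base event values are assumed from include/linux/notifier.h of that era, and the sketch_* helpers are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Base event values are assumptions; the FROZEN flag matches the patch. */
#define CPU_ONLINE        0x0002
#define CPU_UP_PREPARE    0x0003
#define CPU_DEAD          0x0007
#define CPU_TASKS_FROZEN  0x0010
#define CPU_ONLINE_FROZEN (CPU_ONLINE | CPU_TASKS_FROZEN)
#define CPU_DEAD_FROZEN   (CPU_DEAD | CPU_TASKS_FROZEN)

/* Build the suspend-related variant of a plain hotplug event. */
static unsigned long sketch_frozen_variant(unsigned long base)
{
	assert((base & CPU_TASKS_FROZEN) == 0);	/* must be a plain event */
	return base | CPU_TASKS_FROZEN;
}

/* Was the event raised while tasks were frozen? */
static bool sketch_tasks_frozen(unsigned long action)
{
	return (action & CPU_TASKS_FROZEN) != 0;
}

/* Strip the flag to recover the underlying event. */
static unsigned long sketch_base_event(unsigned long action)
{
	return action & ~(unsigned long)CPU_TASKS_FROZEN;
}

/* A subsystem that does not care about suspend lists both variants. */
static int sketch_online_delta(unsigned long action)
{
	switch (action) {
	case CPU_ONLINE:
	case CPU_ONLINE_FROZEN:
		return 1;
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		return -1;
	default:
		return 0;
	}
}
```

Because the flag occupies a bit that none of the base events use, a notifier can either add the extra case labels, as most subsystems touched by this patch do, or mask the flag off and inspect it separately.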

Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
---
 Documentation/cpu-hotplug.txt |9 ++-
 arch/i386/kernel/cpu/intel_cacheinfo.c|2 +
 arch/i386/kernel/cpu/mcheck/therm_throt.c |2 +
 arch/i386/kernel/cpuid.c  |2 +
 arch/i386/kernel/microcode.c  |3 ++
 arch/i386/kernel/msr.c|2 +
 arch/ia64/kernel/err_inject.c |2 +
 arch/ia64/kernel/palinfo.c|2 +
 arch/ia64/kernel/salinfo.c|2 +
 arch/ia64/kernel/topology.c   |2 +
 arch/powerpc/kernel/sysfs.c   |2 +
 arch/powerpc/mm/numa.c|3 ++
 arch/s390/appldata/appldata_base.c|2 +
 arch/s390/kernel/smp.c|2 +
 arch/x86_64/kernel/mce.c  |2 +
 arch/x86_64/kernel/mce_amd.c  |2 +
 arch/x86_64/kernel/vsyscall.c |2 -
 block/ll_rw_blk.c |2 -
 drivers/base/topology.c   |3 ++
 drivers/cpufreq/cpufreq.c |3 ++
 drivers/cpufreq/cpufreq_stats.c   |2 +
 drivers/cpuidle/cpuidle.c |4 +++
 drivers/hwmon/coretemp.c  |2 +
 drivers/infiniband/hw/ehca/ehca_irq.c |6 +
 drivers/kvm/kvm_main.c|3 ++
 fs/buffer.c   |2 -
 fs/xfs/xfs_mount.c|3 ++
 include/linux/notifier.h  |   12 ++
 kernel/cpu.c  |   34 +++---
 kernel/hrtimer.c  |2 +
 kernel/profile.c  |4 +++
 kernel/rcupdate.c |2 +
 kernel/relay.c|2 +
 kernel/sched.c|   10 
 kernel/softirq.c  |4 +++
 kernel/softlockup.c   |4 +++
 kernel/timer.c|2 +
 kernel/workqueue.c|5 
 lib/radix-tree.c  |2 -
 lib/statistic.c   |3 ++
 mm/page_alloc.c   |5 +++-
 mm/slab.c |6 +
 mm/slub.c |2 +
 mm/swap.c |2 -
 mm/vmscan.c   |2 -
 mm/vmstat.c   |3 ++
 net/core/dev.c|2 -
 net/core/flow.c   |2 -
 net/iucv/iucv.c   |6 +
 49 files changed, 162 insertions(+), 27 deletions(-)

Index: linux-2.6.21-rc6-mm1/include/linux/notifier.h
===
--- linux-2.6.21-rc6-mm1.orig/include/linux/notifier.h  2007-04-09 15:24:25.0 +0200
+++ linux-2.6.21-rc6-mm1/include/linux/notifier.h   2007-04-17 22:53:57.0 +0200
@@ -197,5 +197,17 @@ extern int __srcu_notifier_call_chain(st
 #define CPU_LOCK_ACQUIRE   0x0008 /* Acquire all hotcpu locks */
 #define CPU_LOCK_RELEASE   0x0009 /* Release all hotcpu locks */
 
+/* Used for CPU hotplug events occurring while tasks are frozen due to a suspend
+ * operation in progress
+ */
+#define CPU_TASKS_FROZEN   0x0010
+
+#define CPU_ONLINE_FROZEN  (CPU_ONLINE | CPU_TASKS_FROZEN)
+#define CPU_UP_PREPARE_FROZEN  (CPU_UP_PREPARE | CPU_TASKS_FROZEN)
+#define CPU_UP_CANCELED_FROZEN (CPU_UP_CANCELED | CPU_TASKS_FROZEN)
+#define CPU_DOWN_PREPARE_FROZEN(CPU_DOWN_PREPARE | CPU_TASKS_FROZEN)
+#define CPU_DOWN_FAILED_FROZEN (CPU_DOWN_FAILED | CPU_TASKS_FROZEN)
+#define CPU_DEAD_FROZEN(CPU_DEAD | CPU_TASKS_FROZEN)
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_NOTIFIER_H */
Index: linux-2.6.21-rc6-mm1/kernel/cpu.c
===
--- linux-2.6.21-rc6-mm1.orig/kernel/cpu.c  2007-04-09 15:24:25.0 +0200
+++ linux-2.6.21-rc6-mm1/kernel/cpu.c   2007-04-17 22:55:02.0 +0200
@@ -120,12 +120,13 @@ static int take_cpu_down(void *unused)
 }
 
 /* Requires cpu_add_remove_lock to be held */
-static int _cpu_down(unsigned int

gtod/clocksource/clockevents documentation

2007-04-21 Thread Remy Bohmer


Hello All,

I need to implement a gtod/clocksource/clockevents implementation for
the Atmel ARM AT91SAM9261 CPU, and I am looking for some kernel
(interface) documentation about these mechanisms.

I already investigated the 'examples'/implementations of other
architectures in the kernel, but that did not really help me... I also
looked at the patch for the AT91RM9200 CPU posted by David Brownell...
I hacked something which now makes the RT kernel boot, but the
time-of-day warps several minutes per second, back and forth in
time... ;-)

So, can anybody please give me a pointer to some useful documentation
and/or examples?

The reason behind this initiative is to make the RT-preempt patch
compile and work on this architecture. Currently the GENERIC_TIME is
not set, which results in several unresolved externals at compile
time, and I believe the best way to fix this is to implement proper
gtod/clocksource/clockevents to allow the enabling of GENERIC_TIME.

Kind Regards,

Remy Böhmer

Re: [patch 7/8] allow unprivileged mounts

2007-04-21 Thread Andi Kleen

Andrew Morton <[EMAIL PROTECTED]> writes:

> On Fri, 20 Apr 2007 12:25:39 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> 
> > Define a new fs flag FS_SAFE, which denotes, that unprivileged
> > mounting of this filesystem may not constitute a security problem.
> > 
> > Since most filesystems haven't been designed with unprivileged
> > mounting in mind, a thorough audit is needed before setting this flag.
> 
> Practically speaking, is there any realistic likelihood that any filesystem
> apart from FUSE will ever use this?

If it worked for mount --bind for any fs I could see uses of this.  I haven't
thought through the security implications though, so it might not work.

-Andi

Re: [PATCH] utilities: add helper functions for safe 64-bit integer operations as 32-bit halves

2007-04-21 Thread Andrew Morton

On Sat, 21 Apr 2007 22:01:47 +0100 Alan Cox <[EMAIL PROTECTED]> wrote:

> > > +#define lower_32_bits(n) (sizeof(n) == 8 ? (u32)(n) : (n))
> > 
> > n&0x would be simpler.
> > 
> > Do we actually have any call for this?
> 
> The only case for all of this we care about is sector_t, which is one
> type, with specific properties (eg always being positive). The rest is
> over-engineering. Call it sector_upper32(), do it the simple way, and stop
> trying to solve a problem we don't have.

James said we have the same problem with dma_addr_t.

Re: [patch 7/8] allow unprivileged mounts

2007-04-21 Thread Eric W. Biederman

Jan Engelhardt <[EMAIL PROTECTED]> writes:

> On Apr 21 2007 10:57, Eric W. Biederman wrote:
>>
>>> tmpfs!
>>
>>tmpfs is a possible problem because it can consume lots of ram/swap. 
>>Which is why it has limits on the amount of space it can consume. 
>
> Users can gobble up all RAM and swap already today. (Unless they are
> confined into an rlimit, which, in most systems, is not the case.)
> And in case /dev/shm exists, they can already fill it without running
> into an rlimit early.

There are systems that care about rlimits and there is a strong intersection
between caring about rlimits and user mounts.  Although I do agree that
it looks like we have gotten lazy with the default mount options for
/dev/shm.

Going a little farther any filesystem that is safe to put on a usb
stick and mount automatically should ultimately be safe for unprivileged
mounts as well.

So it looks to me like ultimately most of the common filesystems will actually
be safe for non-privileged mounting.

Regardless this looks like an important discussion as soon as we have the
glitches out of the non-privileged mount code.

Eric

Re: [PATCH] utilities: add helper functions for safe 64-bit integer operations as 32-bit halves

2007-04-21 Thread Alan Cox

> > +#define lower_32_bits(n) (sizeof(n) == 8 ? (u32)(n) : (n))
> 
> n&0x would be simpler.
> 
> Do we actually have any call for this?

The only case for all of this we care about is sector_t, which is one
type, with specific properties (eg always being positive). The rest is
over-engineering. Call it sector_upper32(), do it the simple way, and stop
trying to solve a problem we don't have.
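
For readers following the thread, here is a small user-space sketch of splitting an unsigned value into 32-bit halves. The double 16-bit shift is the usual way to avoid shifting a 32-bit type by its full width, which C leaves undefined. The SKETCH_* macros and sketch_* functions are illustrative names, not the helpers from the patch under discussion, and the sketch assumes unsigned operands.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper half of an unsigned value.  Two shifts by 16 stay well defined
 * even when (n) is only 32 bits wide, where a single shift by 32 would not.
 */
#define SKETCH_UPPER_32(n) ((uint32_t)(((n) >> 16) >> 16))

/* Lower half: conversion to uint32_t keeps the low 32 bits. */
#define SKETCH_LOWER_32(n) ((uint32_t)(n))

/* Rebuild a 64-bit value from its two halves. */
static uint64_t sketch_join_32(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Split and rejoin; the assert documents the round-trip invariant. */
static uint64_t sketch_roundtrip(uint64_t n)
{
	uint64_t joined = sketch_join_32(SKETCH_UPPER_32(n), SKETCH_LOWER_32(n));

	assert(joined == n);
	return joined;
}
```

For a type that is only 32 bits wide the upper half is simply zero, which is the behaviour a caller filling in a split hardware register would want.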

Re: [PATCH] Remove "obsolete" label from ISDN4Linux (v3)

2007-04-21 Thread Alan Cox

On Sat, 21 Apr 2007 15:07:51 +0200
Tilman Schmidt <[EMAIL PROTECTED]> wrote:

> From: Tilman Schmidt <[EMAIL PROTECTED]>
> 
> The "obsolete" label on the ISDN_I4L Kconfig option is not, and
> has never been, accurate. It has already prompted repeated attempts
> to remove actively used functionality from the kernel without a
> working replacement. This patch removes the incorrect label and
> corrects the accompanying help text.
> 
> Signed-off-by: Tilman Schmidt <[EMAIL PROTECTED]>

Nak-by: Alan Cox <[EMAIL PROTECTED]>

If it isn't obsolete then fix the code to use the newer APIs, as it's about
to end up && BROKEN, let alone Obsolete. Make yourself maintainer and go
for it.

Re: [patch] CFS scheduler, v4

2007-04-21 Thread S.Çağlar Onur

On Sat, 21 Apr 2007, Gene Heskett wrote:
> This one is another keeper IMO, or as we are fond of saying around here,
> it's good enough for the girls I go with.  If this isn't the best one so
> far, it's very very close and I'm getting pickier.  kmail is the only thing
> that's lagging, and that's just kmail, which I believe is single threaded. 

Add +1 for kmail lags (by the way, mine are freezes rather than lags, because I
cannot use konsole etc. while they happen).

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



Re: [patch] CFS scheduler, v4

2007-04-21 Thread S.Çağlar Onur

Hi Ingo;

On Fri, 20 Apr 2007, Ingo Molnar wrote:
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome,

I tried hard and found another problem for you :)

With Linus's current git + CFSv4, as soon as I start a guest in VirtualBox [1],
the system enters the following loop:

- Whole system freeze ~5 secs.
- System works well ~5 secs
- Whole system freeze ~5 secs.

Again, mainline has no issues, but it's 100% reproducible under CFSv4.

The following "ps aux" and "top" outputs were grabbed while the system was
working well; I cannot do anything else while the system freezes except move
the mouse for fun :P

[EMAIL PROTECTED]> ps aux
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root 1  0.0  0.0  0 0 ?Ss   22:26   0:00 init [3]
root 2  0.0  0.0  0 0 ?S22:26   0:00 [migration/0]
root 3  0.3  0.0  0 0 ?SN   22:26   0:11 [ksoftirqd/0]
root 4  0.0  0.0  0 0 ?S<   22:26   0:00 [events/0]
root 5  0.0  0.0  0 0 ?S<   22:26   0:00 [khelper]
root 6  0.0  0.0  0 0 ?S<   22:26   0:00 [kthread]
root26  0.0  0.0  0 0 ?S<   22:26   0:00 [kblockd/0]
root27  0.0  0.0  0 0 ?S<   22:26   0:00 [kacpid]
root   124  0.0  0.0  0 0 ?S<   22:26   0:00 [kseriod]
root   137  0.0  0.0  0 0 ?S<   22:26   0:00 [kapmd]
root   145  0.0  0.0  0 0 ?S22:26   0:00 [pdflush]
root   146  0.0  0.0  0 0 ?S22:26   0:00 [pdflush]
root   147  0.0  0.0  0 0 ?S<   22:26   0:00 [kswapd0]
root   148  0.0  0.0  0 0 ?S<   22:26   0:00 [aio/0]
root   802  0.0  0.0  0 0 ?S<   22:26   0:00 [kpsmoused]
root   844  0.0  0.0  0 0 ?S<   22:26   0:00 [ata/0]
root   845  0.0  0.0  0 0 ?S<   22:26   0:00 [ata_aux]
root   856  0.0  0.0  0 0 ?S<   22:26   0:01 [scsi_eh_0]
root   857  0.0  0.0  0 0 ?S<   22:26   0:00 [scsi_eh_1]
root   869  0.0  0.0  0 0 ?S<   22:26   0:00 [ksuspend_usbd]
root   872  0.0  0.0  0 0 ?S<   22:26   0:00 [khubd]
root   919  0.0  0.0  0 0 ?S<   22:26   0:00 [khpsbpkt]
root   927  0.0  0.0  0 0 ?S<   22:26   0:00 [knodemgrd_0]
root   982  0.0  0.0  0 0 ?S<   22:26   0:00 [xfslogd/0]
root   983  0.0  0.0  0 0 ?S<   22:26   0:00 [xfsdatad/0]
root   985  0.0  0.0  0 0 ?S<   22:26   0:00 [xfsbufd]
root   986  0.0  0.0  0 0 ?S<   22:26   0:00 [xfssyncd]
root  1050  0.0  0.0  0 0 ?S top
top - 23:24:27 up 58 min,  3 users,  load average: 13.59, 13.00, 7.75
Tasks: 102 total,   1 running, 101 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.4%us,  0.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.2%hi,  0.0%si,  0.0%st
Mem:   2067672k total,  1935300k used,   132372k free,  288k buffers
Swap:  2096440k total,0k used,  2096440k free,   961236k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 6684 caglar20   0  625m 570m  15m S 99.2 28.3   5:56.17 VirtualBox
  844 root   1 -19 000 S  0.2  0.0   0:00.56 ata/0
 6731 caglar20   0  2312 1124  856 R  0.2  0.1   0:00.04 top
1 root  20   0  1608  556  484 S  0.0  0.0   0:00.92 init
2 root  RT   0 000 S  0.0  0.0   0:00.00 migration/0
3 root  39  19 000 S  0.0  0.0   0:11.85 ksoftirqd/0
4 root   1 -19 000 S  0.0  0.0   0:00.07 events/0
5 root   1 -19 000 S  0.0  0.0   0:00.02 khelper
6 root   1 -19 000 S  0.0  0.0   0:00.00 kthread
   26 root   1 -19 000 S  0.0  0.0   0:00.04 kblockd/0
   27 root   1 -19 000 S  0.0  0.0   0:00.00 kacpid
  124 root   1 -19 000 S  0.0  0.0   0:00.00 kseriod
  137 root   1 -19 000 S  0.0  0.0   0:00.00 kapmd
  145 root  20   0 000 S  0.0  0.0   0:00.00 pdflush
  146 root  20   0 000 S  0.0  0.0   0:00.27 pdflush
  147 root   1 -19 000 S  0.0  0.0   0:00.00 kswapd0
  148 root   1 -19 000 S  0.0  0.0   0:00.00 aio/0
  802 root   1 -19 000 S  0.0  0.0   0:00.00 kpsmoused
  845 root   1 -19 000 S  0.0  0.0   0:00.00 ata_aux
  856 root   1 -19 000 D  0.0  0.0   0:01.04 scsi_eh_0
  857 root   1 -19 000 S  0.0  0.0   0:00.00 scsi_eh_1
  869 root   1 -19 000 S  0.0  0.0   0:00.00 ksuspend_usbd
  872 root   1 -19 000 S  0.0  0.0   0:00.00 khubd
  919 root   1 -19 000 S  0.0  0.0   0:00.00 khpsbpkt
  927 root   1 -19 000 S  0.0  0.0   0:00.00

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-21 Thread Miklos Szeredi

> > The other deadlock, in throttle_vm_writeout() is still to be solved.
> 
> Let's go back to the original changelog:
> 
> Author: marcelo.tosatti 
> Date:   Tue Mar 8 17:25:19 2005 +
> 
> [PATCH] vm: pageout throttling
> 
> With silly pageout testcases it is possible to place huge amounts of memory
> under I/O.  With a large request queue (CFQ uses 8192 requests) it is
> possible to place _all_ memory under I/O at the same time.
> 
> This means that all memory is pinned and unreclaimable and the VM gets
> upset and goes oom.
> 
> The patch limits the amount of memory which is under pageout writeout to be
> a little more than the amount of memory at which balance_dirty_pages()
> callers will synchronously throttle.
> 
> This means that heavy pageout activity can starve heavy writeback activity
> completely, but heavy writeback activity will not cause starvation of
> pageout.  Because we don't want a simple `dd' to be causing excessive
> latencies in page reclaim.
> 
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
> 
> (A good one!  I wrote it ;))
> 
> 
> I believe that the combination of dirty-page-tracking and its calls to
> balance_dirty_pages() mean that we can now never get more than dirty_ratio
> of memory into the dirty-or-writeback condition.
> 
> The vm scanner can convert dirty pages into clean, under-writeback pages,
> but it cannot increase the total of dirty+writeback.

What about swapout?  That can increase the number of writeback pages,
without decreasing the number of dirty pages, no?

Miklos

Re: [PATCH] ia64 sn xpc: Convert to use kthread API.

2007-04-21 Thread Eric W. Biederman

Robin Holt <[EMAIL PROTECTED]> writes:

> I think this was originally coded with daemonize to avoid issues with
> reaping children.  Dean Nelson can correct me if I am wrong.  I assume
> this patch is going in as part of the set which will make these threads
> clear themselves from the children list and if that is the case, I can
> see no issues.

One of my earlier patches guarantees that kthreadd will have pid == 2.

daemonize actually explicitly reparents to init so using daemonize and
kernel_thread provides no help at all with respect to scaling.  It in
fact guarantees you will be on init's list of child processes.

The work to enhance wait is a little tricky and it conflicts with the
utrace patches, which makes it hard to pursue at the moment.

I'm actually sorting out kthread stop so I can complete the pid namespace.
But since all kthreads are children of kthreadd this helps in a small
way with the scaling issue.

Eric

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-21 Thread Peter Zijlstra

On Sat, 2007-04-21 at 14:15 +0200, Peter Zijlstra wrote:
> > > > +/*
> > > > + * maximal error of a stat counter.
> > > > + */
> > > > +static inline unsigned long bdi_stat_delta(void)
> > > > +{
> > > > +#ifdef CONFIG_SMP
> > > > +   return NR_CPUS * FBC_BATCH;
> > > 
> > > This is enormously wrong for CONFIG_NR_CPUS=1024 on a 2-way.
> 
> Right, I knew about that but, uhm.
> 
> I wanted to make that num_online_cpus(), and install a hotplug notifier
> to fold the percpu delta back into the total on cpu offline.
> 
> But I have to look into doing that hotplug notifier stuff.

Something like this should do I think, I just looked at other hotplug
code and imitated the pattern.

I assumed CONFIG_HOTPLUG_CPU requires CONFIG_SMP, I didn't actually try
that one :-)

---

In order to estimate the per stat counter error more accurately, using
num_online_cpus() instead of NR_CPUS, install a cpu hotplug notifier
(when cpu hotplug is enabled) that flushes whatever percpu delta was
present into the total on cpu unplug.
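
The idea can be modelled in user space as follows. This is a simplified sketch, not the kernel's percpu_counter: it has no locking, SKETCH_BATCH merely stands in for FBC_BATCH, the CPU count is a made-up constant, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NCPUS 4	/* hypothetical number of possible CPUs */
#define SKETCH_BATCH 8	/* stands in for FBC_BATCH */

struct sketch_counter {
	int64_t count;			/* global, approximately correct total */
	int32_t percpu[SKETCH_NCPUS];	/* per-CPU deltas not yet folded in */
};

/* Add `amount` on behalf of `cpu`; spill into the total once a batch is full. */
static void sketch_mod(struct sketch_counter *c, int cpu, int32_t amount)
{
	assert(cpu >= 0 && cpu < SKETCH_NCPUS);

	int32_t local = c->percpu[cpu] + amount;

	if (local >= SKETCH_BATCH || local <= -SKETCH_BATCH) {
		c->count += local;
		local = 0;
	}
	c->percpu[cpu] = local;
}

/* On CPU unplug: move whatever delta the CPU still holds into the total. */
static void sketch_fold(struct sketch_counter *c, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NCPUS);

	c->count += c->percpu[cpu];
	c->percpu[cpu] = 0;
}

/* Worst-case gap between c->count and the exact sum for `online` CPUs. */
static int64_t sketch_max_error(int online)
{
	return (int64_t)online * SKETCH_BATCH;
}
```

Once an offlined CPU's delta has been folded, that CPU can no longer contribute to the error, which is what makes a bound based on the number of online CPUs valid.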

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/backing-dev.h|6 -
 include/linux/percpu_counter.h |1 
 lib/percpu_counter.c   |   11 +
 mm/backing-dev.c   |   47 +
 4 files changed, 64 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-21 21:32:49.0 +0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-21 21:33:28.0 +0200
@@ -51,6 +51,10 @@ struct backing_dev_info {
spinlock_t lock;/* protect the cycle count */
unsigned long cycles;   /* writeout cycles */
int dirty_exceeded;
+
+#ifdef CONFIG_HOTPLUG_CPU
+   struct notifier_block hotplug_nb;
+#endif
 };
 
 void bdi_init(struct backing_dev_info *bdi);
@@ -137,7 +141,7 @@ static inline s64 bdi_stat_sum(struct ba
 static inline unsigned long bdi_stat_delta(void)
 {
 #ifdef CONFIG_SMP
-   return NR_CPUS * FBC_BATCH;
+   return num_online_cpus() * FBC_BATCH;
 #else
return 1UL;
 #endif
Index: linux-2.6/include/linux/percpu_counter.h
===
--- linux-2.6.orig/include/linux/percpu_counter.h   2007-04-21 21:32:49.0 +0200
+++ linux-2.6/include/linux/percpu_counter.h    2007-04-21 21:33:17.0 +0200
@@ -38,6 +38,7 @@ static inline void percpu_counter_destro
 void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
 void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
+void percpu_counter_fold(struct percpu_counter *fbx, int cpu);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
Index: linux-2.6/lib/percpu_counter.c
===
--- linux-2.6.orig/lib/percpu_counter.c 2007-04-21 21:32:49.0 +0200
+++ linux-2.6/lib/percpu_counter.c  2007-04-21 21:33:17.0 +0200
@@ -72,3 +72,14 @@ s64 percpu_counter_sum(struct percpu_cou
return ret < 0 ? 0 : ret;
 }
 EXPORT_SYMBOL(percpu_counter_sum);
+
+void percpu_counter_fold(struct percpu_counter *fbc, int cpu)
+{
+   s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+   if (*pcount) {
+   spin_lock(&fbc->lock);
+   fbc->count += *pcount;
+   *pcount = 0;
+   spin_unlock(&fbc->lock);
+   }
+}
Index: linux-2.6/mm/backing-dev.c
===
--- linux-2.6.orig/mm/backing-dev.c 2007-04-21 21:32:49.0 +0200
+++ linux-2.6/mm/backing-dev.c  2007-04-21 21:34:47.0 +0200
@@ -4,6 +4,49 @@
 #include 
 #include 
 #include 
+#include 
+
+#ifdef CONFIG_HOTPLUG_CPU
+static int bdi_stat_fold(struct notifier_block *nb,
+   unsigned long action, void *hcpu)
+{
+   struct backing_dev_info *bdi =
+   container_of(nb, struct backing_dev_info, hotplug_nb);
+   unsigned long flags;
+   int cpu = (unsigned long)hcpu;
+   int i;
+
+   if (action == CPU_DEAD) {
+   local_irq_save(flags);
+   for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+   percpu_counter_fold(&bdi->bdi_stat[i], cpu);
+   local_irq_restore(flags);
+   }
+   return NOTIFY_OK;
+}
+
+static void bdi_init_hotplug(struct backing_dev_info *bdi)
+{
+   bdi->hotplug_nb = (struct notifier_block){
+   .notifier_call = bdi_stat_fold,
+   .priority = 0,
+   };
+   register_hotcpu_notifier(&bdi->hotplug_nb);
+}
+
+static void bdi_destroy_hotplug(struct backing_dev_info *bdi)
+{
+   unregister_hotcpu_notifier(&bdi->hotplug_nb);
+}
+#else
+static void bdi_init_hotplug(struct backing_dev_info *bdi)
+{
+}
+
+static void bdi_destroy_hotplug(struct backing_dev_info *bdi)
+{
+}

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ulrich Drepper


On 4/21/07, Kyle Moffett <[EMAIL PROTECTED]> wrote:

It might be nice if it was possible to actively contribute your CPU
time to a child process.  For example:
int sched_donate(pid_t pid, struct timeval *time, int percentage);


If you do this, and it has been requested many times, then please
generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
call is issued the remaining time in the slot should be given to the
thread currently owning the futex.  For non-PI futexes this needs an
extension of the interface but I would be up for that.  It can have
big benefits on the throughput of an application.

Re: [PATCH] nfs lockd reclaimer: Convert to kthread API

2007-04-21 Thread Eric W. Biederman

Trond Myklebust <[EMAIL PROTECTED]> writes:

> On Thu, 2007-04-19 at 14:40 -0700, Andrew Morton wrote:
>> Using signals to communicate with kernel threads is fairly unpleasant, IMO.
>> We have much simpler, faster and more idiomatic ways of communicating
>> between threads in-kernel and there are better ways in which userspace can
>> communicate with the kernel - system calls, for example...
>> 
>> So I think generally any move which gets us away from using signals in
>> kernel threads is moving in a good direction.
>
> I have yet to see a proposal which did. Eric's patch was eliminating
> signals in kernel threads that used them without proposing any
> replacement mechanism or showing that he had plans to do so. That is a
> good reason for a veto.

Possibly I just hadn't looked close enough.  The signals looked like
a redundant mechanism.

>> > > With pid namespaces all kernel threads will disappear so how do
>> > > we cope with the problem when the sysadmin can not see the kernel
>> > > threads?
>> > 
>> > Then you have a usability problem. How does the sysadmin reboot the
>> > system if there is no way to shut down the processes that are hanging on
>> > an unresponsive filesystem?
>> 
>> Where's the hang?  A user process is stuck on h_rwsem?
>> 
>> If so, would it be appropriate to convert the user process to use
>> down_foo_interruptible(), so that the operator can just kill the user
>> process as expected, rather than having to futz around killing kernel
>> threads?
>
> If an NFS server reboots, then the locks held by user processes on the
> client need to be re-established when it comes up again. Otherwise,
> the processes that thought they were holding locks will suddenly fail.
> This recovery job is currently done by a kernel thread.
>
> The question is then what to do if the server crashes again while the
> kernel thread is re-establishing the locks. Particularly if it never
> comes back again.
> Currently, the administrator can intervene by killing anything that has
> open files on that volume and kill the recovery kernel thread.
> You'll also note that lockd_down(), nfsd_down() etc all use signals to
> inform lockd(), nfsd() etc that they should be shutting down. Since the
> reclaimer thread is started by the lockd() thread using CLONE_SIGHAND,
> this means that we also automatically kill any lingering recovery
> threads whenever we shutdown lockd().

Maybe I'm missing something but I think you are referring to the semantics
of do_group_exit in the presence of CLONE_THREAD.  All that sharing a
sighand should do is cause the sharing of the signal handlers, so that
allow_signal and disallow_signal act on a group of threads instead
of a single thread.  I don't recall CLONE_SIGHAND having any
other effects.

> These mechanisms need to be replaced _before_ we start shooting down
> sigallow() etc in the kernel.

Reasonable if these mechanisms are not redundant.

Thinking it through: because everything having to do with nfs mounting and
unmounting is behind the privileged mount operation, this is not going to
become an issue until we start allowing unprivileged nfs mounts, since
we cannot delegate control of nfs mount and unmount operations until then.

Since signals do not pose an immediate barrier to forward progress like
daemonize and kernel_thread do, we can leave things as is until we can
sort this out.

Eric

Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-21 Thread Ulrich Drepper

On 4/21/07, Andreas Gruenbacher <[EMAIL PROTECTED]> wrote:

> What I described is a supported feature, nothing more and nothing less. It's
> also relatively easy to handle this case correctly in glibc, e.g., [...]

This is only useful if the requirement of an ordered /proc/mounts is
part of the kernel ABI.  I.e., until somebody specifies (in the
sources, in kernel docs, I don't care where exactly) that the entries
in /proc/mounts appear in the order in which the mounts happened this
change is not better than the current code.  I have never found such
an assurance.

> This is what POSIX says about statvfs / fstatvfs:
> > It is unspecified whether all members of the statvfs structure have
> > meaningful values on all file systems.

Sure, just like POSIX in many other places leaves things unspecified.
This does not change the fact that we do a good job now.  You try to
use this wording to excuse the fact that you want to make the results
worse than they are now.  That's *not* the intent of this wording.

> In my opinion, the advantage of not reporting bogus pathnames in /proc/mounts
> by far outweighs the problems it sometimes causes for fstatvfs().

Hell no.  It is never acceptable to deliberately break compatibility.
Effects of bugs on the ABI might change but this is not the case here.
This is an interface which forever displayed the information in this
form and it is correct and useful.

Re: Linux 2.6.21-rc7 - ACPI issues?

2007-04-21 Thread Andrew Morton

On Sat, 21 Apr 2007 19:33:52 +0530 "Sunil Naidu" <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> I did compile 2.6.21-rc7 for a P-III machine. Here is the ACPI part in
> the dmesg:-
> 
> ACPI Error (psargs-0355): [PRSE] Namespace lookup failure, AE_NOT_FOUND
> ACPI Error (psparse-0537): Method parse/execution failed
> [\_SB_.LNKE._PRS] (Node dfd63f40), AE_NOT_FOUND
> ACPI Exception (pci_link-0180): AE_NOT_FOUND, Evaluating _PRS [20060707]
> ACPI Error (psargs-0355): [PRSF] Namespace lookup failure, AE_NOT_FOUND
> ACPI Error (psparse-0537): Method parse/execution failed
> [\_SB_.LNKF._PRS] (Node dfd63ea0), AE_NOT_FOUND
> ACPI Exception (pci_link-0180): AE_NOT_FOUND, Evaluating _PRS [20060707]
> ACPI Error (psargs-0355): [PRSG] Namespace lookup failure, AE_NOT_FOUND
> ACPI Error (psparse-0537): Method parse/execution failed
> [\_SB_.LNKG._PRS] (Node dfd63e00), AE_NOT_FOUND
> ACPI Exception (pci_link-0180): AE_NOT_FOUND, Evaluating _PRS [20060707]
> ACPI Error (psargs-0355): [PRSH] Namespace lookup failure, AE_NOT_FOUND
> ACPI Error (psparse-0537): Method parse/execution failed
> [\_SB_.LNKH._PRS] (Node c147d75c), AE_NOT_FOUND
> ACPI Exception (pci_link-0180): AE_NOT_FOUND, Evaluating _PRS [20060707]
> 
> I tried with few configurations (config) to solve the problem, am not
> sure what's causing this failure. Any hint?
> 

(added linux-acpi)

Are any other problems observable due to this?


Re: Linux 2.6.20.7 - Hard Disk rumbling?

2007-04-21 Thread Andrew Morton

On Sat, 21 Apr 2007 19:28:30 +0530 "Sunil Naidu" <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> I am facing a strange problem with an old 1.2 GHz P-III machine with
> a 10 GB disk (used as a dedicated web server, later retired out of
> service!).
> 
> Out of interest to implement some wireless solution (experiment), I
> did compile 2.6.20.7 for my requirement. Strangely, I did observe:-
> 
> hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> hda: drive_cmd: error=0x04 { DriveStatusError }
> ide: failed opcode was: 0xb0
> 
> What might be the problem?
> 

So the machine is OK with an earlier kernel version?  If so, which version?

Please send the full log from attempting to boot 2.6.20.7 on that machine.  This
might require netconsole or serial console.

Thanks.
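
For reference, the status and error bytes in the quoted hda log can be decoded against the standard ATA register bit layout. The sketch below is only an illustration: the enum and function names are invented here and are not taken from the kernel's IDE code. Status 0x51 decodes to the same three names the kernel printed, and error 0x04 is the "command aborted" bit. Opcode 0xb0 is the ATA SMART command, so one possible reading is that the drive simply rejects a SMART request, but that is an assumption until the full log is available.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* ATA status register bits (constant names are this sketch's own). */
enum {
    ATA_ST_ERR  = 0x01,  /* an error occurred; see the error register */
    ATA_ST_DRQ  = 0x08,  /* data request */
    ATA_ST_DSC  = 0x10,  /* seek complete */
    ATA_ST_DF   = 0x20,  /* device fault */
    ATA_ST_DRDY = 0x40,  /* drive ready */
    ATA_ST_BSY  = 0x80   /* busy */
};

/* Error register bit 2: the drive aborted the command. */
enum { ATA_ER_ABRT = 0x04 };

/* Write the names of the set status bits into buf, space separated. */
static void decode_status(unsigned char status, char *buf, size_t len)
{
    static const struct { unsigned char bit; const char *name; } bits[] = {
        { ATA_ST_BSY, "Busy" }, { ATA_ST_DRDY, "DriveReady" },
        { ATA_ST_DF, "DeviceFault" }, { ATA_ST_DSC, "SeekComplete" },
        { ATA_ST_DRQ, "DataRequest" }, { ATA_ST_ERR, "Error" },
    };
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
        if (!(status & bits[i].bit))
            continue;
        if (buf[0] != '\0')
            strncat(buf, " ", len - strlen(buf) - 1);
        strncat(buf, bits[i].name, len - strlen(buf) - 1);
    }
}
```

Calling decode_status(0x51, buf, sizeof(buf)) fills buf with the bit names in the order they appear in the log line.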

Re: [PATCH] reiserfs: fix xattr root locking/refcount bug

2007-04-21 Thread Andrew Morton

On Sat, 21 Apr 2007 11:26:31 -0400 Jeff Mahoney <[EMAIL PROTECTED]> wrote:

>  The listxattr() and getxattr() operations are only protected by a read
>  lock. As a result, if either of these operations run in parallel, a race
>  condition exists where the xattr_root will end up being cached twice,
>  which results in the leaking of a reference and a BUG() on umount.
> 
>  This patch refactors get_xa_root(), __get_xa_root(), and
>  create_xa_root(), into one get_xa_root() function that takes
>  the appropriate locking around the entire critical section.

Great, thanks.

Now we need to work out the timing.  Our options are to shove
it into 2.6.21 immediately, or to give it a run in 2.6.22-rc1 then
backport into 2.6.21.x.

What is everyone's confidence level?
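
The race described in the patch summary is the usual check-then-cache problem: two callers both see an empty cache, and both cache (and reference) a root. A userspace sketch of the "one function, one lock around the whole critical section" shape follows. It is not the reiserfs patch: struct sb_info, struct xa_root, lookup_root() and the use of a pthread mutex are all assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the superblock and the cached xattr root. */
struct xa_root { int refcount; };

struct sb_info {
    pthread_mutex_t lock;        /* protects xattr_root */
    struct xa_root *xattr_root;  /* cached root, NULL until first lookup */
    int lookups;                 /* how many times the slow path ran */
};

/* Slow path: stands in for the directory lookup or creation. */
static struct xa_root *lookup_root(struct sb_info *sb)
{
    struct xa_root *root = calloc(1, sizeof(*root));

    if (root)
        root->refcount = 1;      /* reference held by the cache */
    sb->lookups++;
    return root;
}

/*
 * Check and populate the cache under one lock, so two callers can
 * never both see NULL and both cache a root.  Returns the root with
 * an extra reference for the caller, or NULL on allocation failure.
 */
static struct xa_root *get_xa_root(struct sb_info *sb)
{
    struct xa_root *root;

    pthread_mutex_lock(&sb->lock);
    if (!sb->xattr_root)
        sb->xattr_root = lookup_root(sb);
    root = sb->xattr_root;
    if (root)
        root->refcount++;
    pthread_mutex_unlock(&sb->lock);
    return root;
}
```

With the check and the assignment inside the same locked region, a second caller always finds the cached pointer, so the slow path runs once and the reference counts stay balanced.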

Re: [PATCH -mm] Taskstats fix the structure members alignment issue

2007-04-21 Thread Andrew Morton

On Sat, 21 Apr 2007 18:29:21 +0530 Balbir Singh <[EMAIL PROTECTED]> wrote:

> >> The patch adds an __attribute__((aligned(8))) to the
> >> taskstats structure members so that 32 bit applications using taskstats
> >> can work with a 64 bit kernel.
> > 
> > But there might be 32-bit applications out there which are using the
> > present wrong structure?
> > 
> > otoh, I assume that those applications would be using taskstats.h and would
> > hence encounter this bug and we would have heard about it, is that correct?
> > 
> 
> Yes, correct.
> 
> > otoh^2, 32-bit applications running under 32-bit kernels will presently be
> > functioning correctly, and your change will require that those applications
> > be recompiled, I think?
> > 
> 
> Yes, correct. They would be broken with this fix. We could  bump up the
> version TASKSTATS_VERSION to 4. Would you like a new patch the version
> bumped up?

I can do that.

> > 
> > This patch looks like 2.6.20 and 2.6.21 material, but very carefully...
> 
> Yes, 2.6.20 and 2.6.21 sound correct.

OK.  I guess we have little choice but to slam it in asap, with a 2.6.20.x
backport before too many people start using the old interface.
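
The layout problem behind this thread can be seen with a two-member structure. On i386 a 64-bit integer inside a struct only needs 4-byte alignment, while x86_64 requires 8, so the same declaration yields different offsets for 32-bit and 64-bit builds. The sketch below uses a made-up record, not the real struct taskstats, and relies on the GCC/Clang attribute syntax quoted in the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record, not the real struct taskstats.  On i386
 * "delay" lands at offset 4, on x86_64 at offset 8, so a 32-bit
 * program and a 64-bit kernel disagree about the layout.
 */
struct stats_plain {
    uint16_t version;
    uint64_t delay;
};

/* Forcing 8-byte alignment gives both ABIs the same layout. */
struct stats_fixed {
    uint16_t version;
    uint64_t delay __attribute__((aligned(8)));
};
```

offsetof(struct stats_fixed, delay) is 8 regardless of the target, which is also why existing 32-bit binaries built against the old header see a changed layout and need a recompile, as discussed above.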

Re: [PATCH 04/10] lib: percpu_counter_mod64

2007-04-21 Thread Peter Zijlstra

On Sat, 2007-04-21 at 12:21 -0700, Andrew Morton wrote:
> On Sat, 21 Apr 2007 13:02:26 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > > > +   cpu = get_cpu();
> > > > +   pcount = per_cpu_ptr(fbc->counters, cpu);
> > > > +   count = *pcount + amount;
> > > > +   if (count >= FBC_BATCH || count <= -FBC_BATCH) {
> > > > +   spin_lock(&fbc->lock);
> > > > +   fbc->count += count;
> > > > +   *pcount = 0;
> > > > +   spin_unlock(&fbc->lock);
> > > > +   } else {
> > > > +   *pcount = count;
> > > > +   }
> > > > +   put_cpu();
> > > > +}
> > > > +EXPORT_SYMBOL(percpu_counter_mod64);
> > > 
> > > Bloaty.  Surely we won't be needing this on 32-bit kernels?  Even monster
> > > PAE has only 64,000,000 pages and won't be using deltas of more than 4
> > > gigapages?
> > > 
> > >  > > suspects
> > > another changelog bug>
> > 
> > Yeah, /me chastises himself for that...
> > 
> > This is because percpu_counter is s64 instead of the native long; I need
> > to halve the counter at some point (bdi_writeout_norm) and do that by
> > subtracting half the current value.
> 
> ah, the mysterious bdi_writeout_norm().
> 
> I don't think it's possible to precisely halve a percpu_counter - there has
> to be some error involved.  I guess that's acceptable within the
> inscrutable bdi_writeout_norm().
> 
> otoh, there's a chance that the attempt to halve the counter will take the
> counter negative, due to races.  Does the elusive bdi_writeout_norm()
> handle that?  If not, it should.  If it does, then there should be comments
> around the places where this is being handled, because it is subtle, and
> unobvious, and others might break it by accident.

The counter it is halving is only ever incremented, so we might be off a
little, but only to the safe side.

I shall do the comment thing along with all the other missing
comments :-)
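
A single-threaded userspace model may help show why halving a batched counter is inexact, and why the error falls on the safe side when the counter is only ever incremented. This is a sketch under assumptions: NR_CPUS and the FBC_BATCH value are picked for the example, the CPU index is passed in instead of coming from get_cpu(), and pc_read(), pc_sum() and pc_halve() are names invented here.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS   4     /* assumption for this model */
#define FBC_BATCH 32    /* assumption; the patch's value is not shown */

struct pc_counter {
    int64_t count;              /* global value, lock-protected in the kernel */
    int64_t counters[NR_CPUS];  /* per-CPU deltas not yet folded in */
};

/* Model of percpu_counter_mod64(); cpu is passed in explicitly. */
static void pc_mod64(struct pc_counter *fbc, int cpu, int64_t amount)
{
    int64_t count = fbc->counters[cpu] + amount;

    if (count >= FBC_BATCH || count <= -FBC_BATCH) {
        fbc->count += count;    /* done under fbc->lock in the kernel */
        fbc->counters[cpu] = 0;
    } else {
        fbc->counters[cpu] = count;
    }
}

/* Cheap read: ignores pending per-CPU deltas. */
static int64_t pc_read(const struct pc_counter *fbc)
{
    return fbc->count;
}

/* Exact value: global count plus every pending delta. */
static int64_t pc_sum(const struct pc_counter *fbc)
{
    int64_t sum = fbc->count;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        sum += fbc->counters[cpu];
    return sum;
}

/* Halve by subtracting half of the cheap reading. */
static void pc_halve(struct pc_counter *fbc, int cpu)
{
    pc_mod64(fbc, cpu, -(pc_read(fbc) / 2));
}
```

After 200 single increments spread round-robin over the four CPUs, the exact sum is 200 but the cheap reading is 128, so pc_halve() subtracts 64 and leaves 136. The counter ends up above the true half and cannot go negative in this increment-only case.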


Re: [PATCH 04/10] lib: percpu_counter_mod64

2007-04-21 Thread Andrew Morton

On Sat, 21 Apr 2007 13:02:26 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> > > + cpu = get_cpu();
> > > + pcount = per_cpu_ptr(fbc->counters, cpu);
> > > + count = *pcount + amount;
> > > + if (count >= FBC_BATCH || count <= -FBC_BATCH) {
> > > + spin_lock(&fbc->lock);
> > > + fbc->count += count;
> > > + *pcount = 0;
> > > + spin_unlock(&fbc->lock);
> > > + } else {
> > > + *pcount = count;
> > > + }
> > > + put_cpu();
> > > +}
> > > +EXPORT_SYMBOL(percpu_counter_mod64);
> > 
> > Bloaty.  Surely we won't be needing this on 32-bit kernels?  Even monster
> > PAE has only 64,000,000 pages and won't be using deltas of more than 4
> > gigapages?
> > 
> >  > another changelog bug>
> 
> Yeah, /me chastises himself for that...
> 
> This is because percpu_counter is s64 instead of the native long; I need
> to halve the counter at some point (bdi_writeout_norm) and do that by
> subtracting half the current value.

ah, the mysterious bdi_writeout_norm().

I don't think it's possible to precisely halve a percpu_counter - there has
to be some error involved.  I guess that's acceptable within the
inscrutable bdi_writeout_norm().

otoh, there's a chance that the attempt to halve the counter will take the
counter negative, due to races.  Does the elusive bdi_writeout_norm()
handle that?  If not, it should.  If it does, then there should be comments
around the places where this is being handled, because it is subtle, and
unobvious, and others might break it by accident.

> If percpu_counter_mod is limited to s32 this might not always work
> (although in practice it might just fit).

Re: [PATCH] nfs lockd reclaimer: Convert to kthread API

2007-04-21 Thread Eric W. Biederman

Dave Hansen <[EMAIL PROTECTED]> writes:

> On Thu, 2007-04-19 at 17:19 -0400, Trond Myklebust wrote:
>> > With pid namespaces all kernel threads will disappear so how do
>> > we cope with the problem when the sysadmin can not see the kernel
>> > threads?
>
> Do they actually always disappear, or do we keep them in the
> init_pid_namespace?

In the init pid namespace but not in any of its children.

Eric

Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-21 Thread Andreas Gruenbacher

On Friday 20 April 2007 21:17, Ulrich Drepper wrote:
> On 4/20/07, Andreas Gruenbacher <[EMAIL PROTECTED]> wrote:
> > The code also seems to stop at the first matching mount point. You can
> > have the same device mounted on the same mount point multiple times but
> > with different mount options, e.g., [...]
> 
> You can unfortunately do many stupid things.  That's the user's
> problem.  The point is that everything works fine in an environment
> which does not have such bogus mounts.

What I described is a supported feature, nothing more and nothing less. It's 
also relatively easy to handle this case correctly in glibc, e.g.,

--- a/sysdeps/unix/sysv/linux/internal_statvfs.c
+++ b/sysdeps/unix/sysv/linux/internal_statvfs.c
@@ -139,6 +139,7 @@ __statvfs_getflags (const char *name, in
  char *cp = mntbuf.mnt_opts;
  char *opt;

+ result = 0;
  while ((opt = strsep (&cp, ",")) != NULL)
if (strcmp (opt, "ro") == 0)
  result |= ST_RDONLY;
@@ -157,9 +158,10 @@ __statvfs_getflags (const char *name, in
else if (strcmp (opt, "nodiratime") == 0)
  result |= ST_NODIRATIME;

- /* We can stop looking for more entries.  */
  success = true;
- break;
+ /* Don't stop looking: the same device may be mounted several
+times with different options; in that case, the last entry
+is the topmost mount.  */
}
}
   /* Maybe the kernel names for the filesystems changed or the

> > I gave a chroot example that showed that in the current implementation,
> > you can get pretty random clashes between mounts; there are other cases
> > with lazy unmounts as well. 
> 
> Irrelevant as well.  If you create chroot problems it's your problem.

There is no way to avoid these problems with chroots; it's not that anybody 
creates stupid problems on purpose.

The approach I'm proposing fixes these problems. It has a small disadvantage 
for statvfs / fstatvfs in some situations, which is due to the fact that the 
kernel doesn't offer a direct interface for querying the mount options of a 
file descriptor or path, and so glibc has to resort to messing 
with /proc/mounts. I don't see a nice way of fixing this without introducing 
[f]statvfs syscalls right now.

This is what POSIX says about statvfs / fstatvfs:
> It is unspecified whether all members of the statvfs structure have
> meaningful values on all file systems.

In my opinion, the advantage of not reporting bogus pathnames in /proc/mounts 
by far outweighs the problems it sometimes causes for fstatvfs(). Anyone
relying on the information obtained from statvfs / fstatvfs is making false 
assumptions anyway, and in "normal setups" as you called them, nothing 
changes for fstatvfs and statvfs.

Andreas
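
The behavioural change in the glibc diff above (reset the result on every match and keep scanning, so the last matching entry wins) can be sketched in isolation. Everything here is a simplification made for illustration: the flag constants, struct mount_entry and both function names are invented, matching is done on the mount directory rather than on the device number, and the idea that the last entry is the topmost mount depends on the /proc/mounts ordering that is under dispute in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flag bits for this sketch only; glibc uses the ST_* constants. */
enum { F_RDONLY = 1, F_NOSUID = 2, F_NODEV = 4, F_NOEXEC = 8, F_NOATIME = 16 };

/* One parsed line of /proc/mounts (hypothetical, reduced to two fields). */
struct mount_entry {
    const char *dir;   /* mount point */
    const char *opts;  /* comma-separated option string */
};

/* Turn "ro,nosuid,..." into flag bits without modifying the string. */
static int parse_opts(const char *opts)
{
    static const struct { const char *name; int flag; } known[] = {
        { "ro", F_RDONLY }, { "nosuid", F_NOSUID }, { "nodev", F_NODEV },
        { "noexec", F_NOEXEC }, { "noatime", F_NOATIME },
    };
    int result = 0;

    while (*opts != '\0') {
        size_t len = strcspn(opts, ",");
        size_t i;

        for (i = 0; i < sizeof(known) / sizeof(known[0]); i++)
            if (strlen(known[i].name) == len &&
                memcmp(opts, known[i].name, len) == 0)
                result |= known[i].flag;
        opts += len;
        if (*opts == ',')
            opts++;
    }
    return result;
}

/*
 * Scan every entry instead of stopping at the first match: the result
 * is replaced on each match, so the last matching entry decides the
 * flags.  Returns -1 if no entry matches dir.
 */
static int flags_for(const struct mount_entry *tab, size_t n, const char *dir)
{
    int result = -1;
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(tab[i].dir, dir) == 0)
            result = parse_opts(tab[i].opts);
    return result;
}
```

With a table that lists "/mnt" twice, first as "rw,nosuid" and later as "ro,noexec", flags_for() reports the read-only, noexec flags of the later entry.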

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ingo Molnar


* Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> > i've attached it below in a standalone form, feel free to put it 
> > into SD! :)
> 
> Assume X went crazy (lacking any statistics, I make the unproven 
> statement that this happens more often than kthreads going berserk), 
> then having it niced with minus something is not too nice.

i've not experienced a 'runaway X' personally, at most it would crash or 
lock up ;) The value is boot-time and sysctl configurable as well back 
to 0.

Ingo

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Kyle Moffett


On Apr 21, 2007, at 12:42:41, William Lee Irwin III wrote:

On Sat, 21 Apr 2007, Willy Tarreau wrote:
If you remember, with 50/50, I noticed some difficulties to fork  
many processes. I think that during a fork(), the parent has a  
higher probability of forking other processes than the child. So  
at least, we should use something like 67/33 or 75/25 for parent/ 
child.


On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:

It would be even better to simply have the rule:
 - child gets almost no points at startup
 - but when a parent does a "waitpid()" call and blocks, it will spread
   out its points to the children (the "vfork()" blocking is another case
   that is really the same).
This is a very special kind of "priority inversion" logic: you give
higher priority to the things you wait for. Not because of holding any
locks, but simply because a blocking waitpid really is a damn big hint
that "ok, the child now works for the parent".


An in-kernel scheduler API might help. void yield_to(struct  
task_struct *)?


A userspace API might be nice, too. e.g. int sched_yield_to(pid_t).


It might be nice if it was possible to actively contribute your CPU  
time to a child process.  For example:

int sched_donate(pid_t pid, struct timeval *time, int percentage);

Maybe a way to pass CPU time over a UNIX socket (analogous to  
SCM_RIGHTS), along with information on what process/user passed it.
That would make it possible to really fix X properly on a local  
system.  You could make the X client library pass CPU time to the X  
server whenever it requests a CPU-intensive rendering operation.   
Ordinarily X would nice all of its client service threads to +10, but  
when a client passes CPU time to its thread over the socket, then its  
service thread temporarily gets the scheduling properties of the  
client.  I'm not a scheduler guru, but that's what makes the most  
sense from an application-programmer point of view.


Cheers,
Kyle Moffett


[PATCH] ieee1394: more help in Kconfig

2007-04-21 Thread Stefan Richter

  - s/Device Drivers/Controllers/
  - clarify who needs pcilynx
  - don't recommend Y for raw1394; M is typically used

Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>
---
 drivers/ieee1394/Kconfig |   20 
 1 file changed, 12 insertions(+), 8 deletions(-)

Index: linux/drivers/ieee1394/Kconfig
===
--- linux.orig/drivers/ieee1394/Kconfig
+++ linux/drivers/ieee1394/Kconfig
@@ -36,7 +36,7 @@ config IEEE1394_VERBOSEDEBUG
  Say Y if you really want or need the debugging output, everyone
  else says N.
 
-comment "Device Drivers"
+comment "Controllers"
depends on IEEE1394
 
 comment "Texas Instruments PCILynx requires I2C"
@@ -54,6 +54,10 @@ config IEEE1394_PCILYNX
  To compile this driver as a module, say M here: the
  module will be called pcilynx.
 
+ Only some old and now very rare PCI and CardBus cards and
+ PowerMacs G3 B contain the PCILynx controller.  Therefore
+ almost everybody can say N here.
+
 config IEEE1394_OHCI1394
tristate "OHCI-1394 support"
depends on PCI && IEEE1394
@@ -67,7 +71,7 @@ config IEEE1394_OHCI1394
  To compile this driver as a module, say M here: the
  module will be called ohci1394.
 
-comment "Protocol Drivers"
+comment "Protocols"
depends on IEEE1394
 
 config IEEE1394_VIDEO1394
@@ -136,12 +140,12 @@ config IEEE1394_RAWIO
tristate "Raw IEEE1394 I/O support"
depends on IEEE1394
help
- Say Y here if you want support for the raw device. This is generally
- a good idea, so you should say Y here. The raw device enables
- direct communication of user programs with the IEEE 1394 bus and
- thus with the attached peripherals.
+ This option adds support for the raw1394 device file which enables
+ direct communication of user programs with the IEEE 1394 bus and thus
+ with the attached peripherals.  Almost all application programs which
+ access FireWire require this option.
 
- To compile this driver as a module, say M here: the
- module will be called raw1394.
+ To compile this driver as a module, say M here: the module will be
+ called raw1394.
 
 endmenu

-- 
Stefan Richter
-=-=-=== -=-- =-=-=
http://arcgraph.de/sr/


Re: Kernel 2.6.20+ and rt2500/rt2570 problem

2007-04-21 Thread Ben Collins

On Sat, 2007-04-21 at 16:18 +0200, Wiktor Wandachowicz wrote:
> Recently I tried newer releases of Linux distributions, specifically
> Sabayon Linux 3.3 and Ubuntu 7.04. Both suffer from problems with
> rt2500/rt2570 modules not being able to associate with access points.
> There are numerous problems reported on Ubuntu Launchpad, Gentoo Forums
> and Sabayon Forums.
> 
> Please, Linux kernel developers, try to help out the guys at:
> http://rt2x00.serialmonkey.com/
> 
> They look pretty clueless and don't know exactly what to do.
> It seems that advances in kernel to Wireless Extensions 19 broke
> their efforts somewhat.
> 
> If possible, speed up the inclusion process of the driver into mainline
> kernel, so users won't have to be forced to do a hardware upgrade.
> Additionally, this problem prevents my machines from being upgraded from
> Ubuntu 6.10 (kernel 2.6.17) to Ubuntu 7.04 (kernel 2.6.20 I believe).


There are two modules in Ubuntu for this chipset, the rt2500 (default)
and rt2500{pci,usb}. You could try blacklisting rt2500 and use
rt2500pci. See if that helps.

-- 
Ubuntu:http://www.ubuntu.com/
Linux1394: http://www.linux1394.org/


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Gene Heskett

On Saturday 21 April 2007, Willy Tarreau wrote:
>Hi Ingo, Hi Con,
>
>I promised to perform some tests on your code. I'm short in time right now,
>but I observed behaviours that should be commented on.
>
>1) machine : dual athlon 1533 MHz, 1G RAM, kernel 2.6.21-rc7 + either
> scheduler Test:  ./ocbench -R 25 -S 75 -x 8 -y 8
>   ocbench: http://linux.1wt.eu/sched/
>
>2) SD-0.44
>
>   Feels good, but becomes jerky at moderately high loads. I've started
>   64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
>   always responds correctly but under X, mouse jumps quite a bit and
>   typing in xterm or even text console feels slightly jerky. The CPU is
>   not completely used, and the load varies a lot (see below). However,
>   the load is shared equally between all 64 ocbench, and they do not
>   deviate even after 4000 iterations. X uses less than 1% CPU during
>   those tests.
>
>   Here's the vmstat output :
>
>[EMAIL PROTECTED]:~$ vmstat 1
>   procs  memory  swap  io system 
> cpu r  b  w   swpd   free   buff  cache   si   sobibo   incs us
> sy id 0  0  0  0 919856   6648  577880022 24   148
> 31 49 20 0  0  0  0 919856   6648  5778800 0 02  
> 285 32 50 19 28  0  0  0 919836   6648  5778800 0 0   
> 0   331 24 40 36 64  0  0  0 919836   6648  5778800 0 0
>1   618 23 40 37 65  0  0  0 919836   6648  5778800 0   
>  00   571 21 36 43 35  0  0  0 919836   6648  5778800 0
> 03   382 32 50 18 2  0  0  0 919836   6648  5778800
> 0 00   308 37 61  2 8  0  0  0 919836   6648  5778800  
>   0 01   533 36 65  0 32  0  0  0 919768   6648  577880   
> 0 0 0   93   706 33 62  5 62  0  0  0 919712   6648  577880
>0 0 0   65   617 32 54 13 63  0  0  0 919712   6648  57788  
>  00 0 01   569 28 48 23 40  0  0  0 919712   6648 
> 5778800 0 00   427 26 50 24 4  0  0  0 919712  
> 6648  5778800 0 01   382 29 48 23 4  0  0  0 919712
>   6648  5778800 0 00   383 34 65  0 14  0  0  0
> 919712   6648  5778800 0 01   769 39 61  0 40  0  0
>  0 919712   6648  5778800 0 00   384 37 52 11 54  0  0 
> 0 919712   6648  5778800 0 01   715 31 60  8 58  0 
> 2  0 919712   6648  5778800 0 01   611 34 65  0 41 
> 0  0  0 919712   6648  5778800 0 0   19   395 28 45 27
> 0  0  0  0 919712   6648  5778800 0 0   31   421 23 32
> 45 0  0  0  0 919712   6648  5778800 0 0   31   328 34
> 44 22 29  0  0  0 919712   6648  5778800 0 0   34   369
> 32 43 25 65  0  0  0 919712   6648  5778800 0 0   31  
> 410 24 35 40 47  0  1  0 919712   6648  5778800 0 0  
> 42   538 25 39 35
>
>3) CFS-v4
>
>  Feels even better, mouse movements are very smooth even under high load.
>  I noticed that X gets reniced to -19 with this scheduler. I've not looked
>  at the code yet but this looked suspicious to me. I've reniced it to 0 and
>  it did not change any behaviour. Still very good. The 64 ocbench share
>  equal CPU time and show exact same progress after 2000 iterations. The CPU
>  load is more smoothly spread according to vmstat, and there's no idle (see
>  below). BUT I now think it was wrong to let new processes start with no
>  timeslice at all, because it can take tens of seconds to start a new
> process when only 64 ocbench are there. Simply starting "killall ocbench"
> takes about 10 seconds. On a smaller machine (VIA C3-533), it took me more
> than one minute to do "su -", even from console, so that's not X. BTW, X
> uses less than 1% CPU during those tests.
>
>[EMAIL PROTECTED]:~$ vmstat 1
>   procs  memory  swap  io system 
> cpu r  b  w   swpd   free   buff  cache   si   sobibo   incs us
> sy id 12  0  2  0 922120   6532  5754000   29929   31   386
> 17 27 57 12  0  2  0 922096   6532  5755600 0 01  
> 776 37 63  0 14  0  2  0 922096   6532  5755600 0 0   
> 1   782 35 65  0 13  0  1  0 922096   6532  5755600 0 0
>0   782 38 62  0 14  0  1  0 922096   6532  5755600 0   
>  01   782 36 64  0 13  0  1  0 922096   6532  5755600 0
> 02   785 38 62  0 13  0  1  0 922096   6532  5755600   
>  0 01   774 35 65  0 14  0  1  0 922096   6532  5755600
> 0 00   784 36 64  0 13  0  1  0 922096   6532  575560  
>  0 0 01   767 37 63  0 13  0  1  0 922096   6532  57556   
> 00 0 01   785 41 59  0 14  0  1  0

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ulrich Drepper


On 4/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

> on a simple 'ls' command:
>
>   21310 clone(child_stack=0,  ...) = 21399
>   ...
>   21399 execve("/bin/ls",
>   ...
>   21310 waitpid(-1,
>
> the PID is -1 so we dont actually know which task we are waiting for.


That's a special case.  Most programs don't do this.  In fact, in
multi-threaded code you better never do it since such an unqualified
wait might catch the child another thread waits for (particularly bad
if one thread uses system()).

And even in the case of bash, we probably can change the code to use a
qualified wait in case there are no other children.  This is known at
any time and I expect that most of the time there are no background
processes.  At least in shell scripts.
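
The difference between the unqualified wait in the strace excerpt and the qualified wait suggested above comes down to the first argument of waitpid(). A minimal sketch, assuming a helper named run_and_wait() that is invented for this example and a child that exits immediately where a shell would call execve():

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with "code" and reap exactly that child.
 * Returns the child's exit status, or -1 on failure.
 */
static int run_and_wait(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* child: stands in for execve() */

    /* Qualified wait: names the child, unlike waitpid(-1, ...). */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the parent passes the child's PID, the kernel knows which task the caller is blocked on, and another thread's child cannot be reaped by accident.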

Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Rik van Riel


Hugh Dickins wrote:
> On Fri, 20 Apr 2007, Rik van Riel wrote:
>> Andrew Morton wrote:
>>>   I do go on about that.  But we're adding page flags at about one per
>>>   year, and when we run out we're screwed - we'll need to grow the
>>>   pageframe.
>>
>> If you want, I can take a look at folding this into the
>> ->mapping pointer.  I can guarantee you it won't be
>> pretty, though :)
>
> Please don't.  If we're going to stuff another pageflag into there,
> let it be PageSwapCache the natural partner of PageAnon, rather than
> whatever our latest pageflag happens to be.


I looked at doing what Andrew wanted, and it did indeed not
look like the right thing to do.  The locking on page->mapping
is the kind of locking we want to avoid during zap_page_range
and in the pageout code.

I like your suggestion better.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: [PATCH] cxacru: Add Documentation file

2007-04-21 Thread Duncan Sands

Hi Simon,

> +named device points to the USB interface device's directory which contains
> +several sysfs attribute files for retriving device statistics:

retrieving

Ciao,

Duncan.

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Jan Engelhardt


On Apr 21 2007 18:00, Ingo Molnar wrote:
>* Con Kolivas <[EMAIL PROTECTED]> wrote:
>
>> >   Feels even better, mouse movements are very smooth even under high 
>> >   load. I noticed that X gets reniced to -19 with this scheduler. 
>> >   I've not looked at the code yet but this looked suspicious to me. 
>> >   I've reniced it to 0 and it did not change any behaviour. Still 
>> >   very good.
>> 
>> Looks like this code does it:
>> 
>> +int sysctl_sched_privileged_nice_level __read_mostly = -19;
>
>correct. Note that Willy reniced X back to 0 so it had no relevance on 
>his test. Also note that i pointed this change out in the -v4 CFS 
>announcement:
>
>|| Changes since -v3:
>||
>||  - usability fix: automatic renicing of kernel threads such as 
>||keventd, OOM tasks and tasks doing privileged hardware access
>||(such as Xorg).
>
>i've attached it below in a standalone form, feel free to put it into 
>SD! :)

Assume X went crazy (lacking any statistics, I make the unproven
statement that this happens more often than kthreads going berserk),
then having it niced with minus something is not too nice.



Jan
-- 

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Geert Bosch



On Apr 21, 2007, at 12:18, Willy Tarreau wrote:
> Also, I believe that (in shells), most forked processes do not even
> consume a full timeslice (eg: $(uname -n) is very fast). This means that
> assigning them with a shorter one will not hurt them while preserving the
> shell's performance against CPU hogs.

On a fast machine, during regression testing of GCC, I've noticed we
create an average of 500 processes per second during an hour or so. There
are other work loads like this. So, most processes start, execute and
complete in 2ms.

How does fairness work in a situation like this?

  -Geert

Re: [patch 7/8] allow unprivileged mounts

2007-04-21 Thread Jan Engelhardt


On Apr 21 2007 10:57, Eric W. Biederman wrote:
>
>> tmpfs!
>
>tmpfs is a possible problem because it can consume lots of ram/swap. 
>Which is why it has limits on the amount of space it can consume. 

Users can gobble up all RAM and swap already today. (Unless they are
confined into an rlimit, which, in most systems, is not the case.)
And in case /dev/shm exists, they can already fill it without running
into an rlimit early.

>Those are set as mount options as I recall.  Which means that we
>would need to do something different with respect to limits before
>tmpfs could become safe for an untrusted user to mount.
>
>Still it's close.


Jan
-- 

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Willy Tarreau

On Sat, Apr 21, 2007 at 06:53:47PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > It would be even better to simply have the rule:
> >  - child gets almost no points at startup
> >  - but when a parent does a "waitpid()" call and blocks, it will spread 
> >    out its points to the children (the "vfork()" blocking is another case
> >    that is really the same).
> >
> > This is a very special kind of "priority inversion" logic: you give
> > higher priority to the things you wait for. Not because of holding any
> > locks, but simply because a blocking waitpid really is a damn big hint
> > that "ok, the child now works for the parent".
> 
> yeah. One problem i can see with the implementation of this though is 
> that shells typically do nonspecific waits - for example bash does this 
> on a simple 'ls' command:
> 
>   21310 clone(child_stack=0,  ...) = 21399
>   ...
>   21399 execve("/bin/ls", 
>   ...
>   21310 waitpid(-1, 
> 
> the PID is -1 so we dont actually know which task we are waiting for. We 
> could use the first entry from the p->children list, but that looks too 
> specific of a hack to me. It should catch most of the 
> synchronous-helper-task cases though.

The last one should be more appropriate IMHO. If you waitpid(), it's very
likely that you're waiting for the result of the very last fork().
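
To make the two candidate choices for a waitpid(-1) concrete, here is a small
userspace sketch. It is not kernel code: the struct and function names are
invented for illustration, and the toy child list is assumed to be kept in
fork order (oldest first).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a task; not the kernel's task_struct. */
struct toy_task {
    int pid;
    struct toy_task *next_sibling;   /* children linked in fork order */
};

struct toy_parent {
    struct toy_task *children;       /* oldest child first (assumption) */
};

/* Option 1: take the first entry of the children list (the oldest child). */
struct toy_task *pick_first_child(const struct toy_parent *p)
{
    assert(p != NULL);
    return p->children;
}

/* Option 2: take the most recently forked child (the tail of the list). */
struct toy_task *pick_last_child(const struct toy_parent *p)
{
    struct toy_task *t;

    assert(p != NULL);
    t = p->children;
    while (t != NULL && t->next_sibling != NULL)
        t = t->next_sibling;
    return t;
}

/* Who gets the boost for waitpid(pid)?  A specific pid is looked up
 * directly; for pid == -1 we can only guess, here with option 2. */
struct toy_task *boost_target(const struct toy_parent *p, int pid)
{
    struct toy_task *t;

    assert(p != NULL);
    if (pid == -1)
        return pick_last_child(p);
    for (t = p->children; t != NULL; t = t->next_sibling)
        if (t->pid == pid)
            return t;
    return NULL;                     /* not one of our children */
}
```

For a shell that has just forked a single helper both options pick the same
task; they only differ once several children are outstanding.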

Willy


Re: [patch 7/8] allow unprivileged mounts

2007-04-21 Thread Eric W. Biederman

Jan Engelhardt <[EMAIL PROTECTED]> writes:

> On Apr 21 2007 08:10, Eric W. Biederman wrote:
>>>
>>>> Define a new fs flag FS_SAFE, which denotes, that unprivileged
>>>> mounting of this filesystem may not constitute a security problem.
>>>>
>>>> Since most filesystems haven't been designed with unprivileged
>>>> mounting in mind, a thorough audit is needed before setting this flag.
>>>
>>> Practically speaking, is there any realistic likelihood that any filesystem
>>> apart from FUSE will ever use this?
>>
>>Also potentially some of the kernel virtual filesystems.  /proc should
>>be safe already.  If you don't have any kind of backing store this problem
>>gets easier.
>
> tmpfs!

tmpfs is a possible problem because it can consume lots of ram/swap.  Which
is why it has limits on the amount of space it can consume.  Those are set as
mount options as I recall.  Which means that we would need to do something
different with respect to limits before tmpfs could become safe for
an untrusted user to mount.

Still it's close.
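
For reference, a minimal sketch of how those limits are passed today, assuming
the usual tmpfs "size=" and "nr_inodes=" mount options. The helper names are
made up for this example. The point is that the numbers come from whoever
issues the mount, so an untrusted mounter could simply pick large ones.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>

/* Build a tmpfs option string such as "size=64m,nr_inodes=1024".
 * Returns 0 on success, -1 if the buffer is too small. */
int tmpfs_limit_opts(char *buf, size_t len, unsigned size_mb, unsigned nr_inodes)
{
    int n;

    assert(buf != NULL && len > 0);
    n = snprintf(buf, len, "size=%um,nr_inodes=%u", size_mb, nr_inodes);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Mount a limited tmpfs on 'target'.  This needs mount privileges today;
 * the limits are chosen entirely by the caller. */
int mount_limited_tmpfs(const char *target, unsigned size_mb, unsigned nr_inodes)
{
    char opts[64];

    if (tmpfs_limit_opts(opts, sizeof(opts), size_mb, nr_inodes) != 0)
        return -1;
    return mount("tmpfs", target, "tmpfs", 0, opts);
}
```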

Eric

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Willy Tarreau

On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
> 
> 
> On Sat, 21 Apr 2007, Willy Tarreau wrote:
> > 
> > If you remember, with 50/50, I noticed some difficulties to fork many
> > processes. I think that during a fork(), the parent has a higher probability
> > of forking other processes than the child. So at least, we should use
> > something like 67/33 or 75/25 for parent/child.
> 
> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread 
>    out its points to the children (the "vfork()" blocking is another case
>    that is really the same).
>
> This is a very special kind of "priority inversion" logic: you give higher
> priority to the things you wait for. Not because of holding any locks, but
> simply because a blocking waitpid really is a damn big hint that "ok, the
> child now works for the parent".

I like this idea a lot. I don't know if it can be applied to pipes and unix
sockets, but it's clearly a way of saying "hurry up, I'm waiting for you"
which seems natural with inter-process communications. Also, if we can do
this on unix sockets, it would help a lot with X !

Willy


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread 
>    out its points to the children (the "vfork()" blocking is another case
>    that is really the same).
>
> This is a very special kind of "priority inversion" logic: you give
> higher priority to the things you wait for. Not because of holding any
> locks, but simply because a blocking waitpid really is a damn big hint
> that "ok, the child now works for the parent".

yeah. One problem i can see with the implementation of this though is 
that shells typically do nonspecific waits - for example bash does this 
on a simple 'ls' command:

  21310 clone(child_stack=0,  ...) = 21399
  ...
  21399 execve("/bin/ls", 
  ...
  21310 waitpid(-1, 

the PID is -1 so we don't actually know which task we are waiting for. We 
could use the first entry from the p->children list, but that looks too 
specific of a hack to me. It should catch most of the 
synchronous-helper-task cases though.

Ingo

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread William Lee Irwin III

On Sat, 21 Apr 2007, Willy Tarreau wrote:
>> If you remember, with 50/50, I noticed some difficulties to fork many
>> processes. I think that during a fork(), the parent has a higher probability
>> of forking other processes than the child. So at least, we should use
>> something like 67/33 or 75/25 for parent/child.

On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread 
>    out its points to the children (the "vfork()" blocking is another case
>    that is really the same).
> This is a very special kind of "priority inversion" logic: you give higher
> priority to the things you wait for. Not because of holding any locks, but
> simply because a blocking waitpid really is a damn big hint that "ok, the
> child now works for the parent".

An in-kernel scheduler API might help. void yield_to(struct task_struct *)?

A userspace API might be nice, too. e.g. int sched_yield_to(pid_t).


-- wli

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread William Lee Irwin III

On Sat, Apr 21, 2007 at 06:00:08PM +0200, Ingo Molnar wrote:
>  arch/i386/kernel/ioport.c   |   13 ++---
>  arch/x86_64/kernel/ioport.c |    8 ++--
>  drivers/block/loop.c        |    5 -
>  include/linux/sched.h       |    7 +++
>  kernel/sched.c              |   40 
>  kernel/workqueue.c          |    2 +-
>  mm/oom_kill.c               |    4 +++-
>  7 files changed, 71 insertions(+), 8 deletions(-)

Yum. I'm going to see what this does for glxgears (I presume it's a
screensaver) on my dual G5 driving a 42" wall-mounted TV for a display. ;)

More seriously, there should be more portable ways of doing this. I
suspect even someone using fbdev on i386/x86-64 might be left out here.

-- wli

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Linus Torvalds



On Sat, 21 Apr 2007, Willy Tarreau wrote:
> 
> If you remember, with 50/50, I noticed some difficulties to fork many
> processes. I think that during a fork(), the parent has a higher probability
> of forking other processes than the child. So at least, we should use
> something like 67/33 or 75/25 for parent/child.

It would be even better to simply have the rule:
 - child gets almost no points at startup
 - but when a parent does a "waitpid()" call and blocks, it will spread 
   out its points to the children (the "vfork()" blocking is another case
   that is really the same).

This is a very special kind of "priority inversion" logic: you give higher
priority to the things you wait for. Not because of holding any locks, but
simply because a blocking waitpid really is a damn big hint that "ok, the
child now works for the parent".
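
A toy model of that rule, as a sketch only. "Points" are an abstract credit
here, the starting amount of 1 is arbitrary, and taking the child's starting
point out of the parent's pool is an assumption of this example rather than
part of the rule above. All names are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: "points" stand for whatever credit the scheduler tracks. */
struct pt_task {
    long points;
};

#define CHILD_START_POINTS 1    /* "almost no points"; arbitrary value */

/* fork(): the child starts with a token amount taken from the parent. */
void toy_fork(struct pt_task *parent, struct pt_task *child)
{
    long give;

    assert(parent != NULL && child != NULL);
    give = parent->points < CHILD_START_POINTS ? parent->points
                                               : CHILD_START_POINTS;
    parent->points -= give;
    child->points = give;
}

/* Blocking waitpid(): the parent spreads its points evenly over the
 * children it waits for; any remainder stays with the parent. */
void toy_wait_donate(struct pt_task *parent, struct pt_task **children, int n)
{
    long share;
    int i;

    assert(parent != NULL && children != NULL && n > 0);
    share = parent->points / n;
    for (i = 0; i < n; i++)
        children[i]->points += share;
    parent->points -= share * n;
}
```

With a parent holding 100 points and two children, each child starts at 1 and
ends up with 50 once the parent blocks, leaving the parent with 0 until the
children exit.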

Linus

Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-21 Thread Ulrich Drepper


On 4/21/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:

> But the Linux MADV_DONTNEED does throw away
> data from a PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those
> changes are discarded, and a subsequent access will revert to zeroes
> or the underlying mapped file.  Been like that since before 2.4.0.


I didn't say it changed.  I'm just saying that there is a hole in the
current implementation: it does not allow POSIX_MADV_DONTNEED to be
implemented as anything but a no-op.  The POSIX_MADV_DONTNEED behavior
is useful, and IMO something should be added to allow implementing it.
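
The Linux behaviour under discussion is easy to observe from userspace. A
minimal sketch, assuming a Linux system (the function name is invented): it
dirties a private anonymous page, calls madvise(MADV_DONTNEED), and reads the
page back.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the first byte of a private anonymous page after the page was
 * dirtied and then passed to madvise(MADV_DONTNEED); -1 on error. */
int byte_after_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    int value;

    assert(page > 0);
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, (size_t)page);           /* dirty the page */
    if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    value = p[0];            /* on Linux the written data is gone */
    munmap(p, (size_t)page);
    return value;
}
```

On Linux this returns 0 rather than 0xAA, i.e. the data is discarded and not
merely marked as unlikely to be needed soon.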

Re: [patch] CFS scheduler, v3

2007-04-21 Thread William Lee Irwin III

* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> Suppose a table of nice weights like the following is tuned via 
>> /proc/:
>> nice  weight    nice  weight
>>  -20  21           0  1
>>   -1  2           19  0.0476
>> Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> ok, thanks for thinking about it. I have changed the nice weight in 
> CFSv5-to-be so that it defaults to something pretty close to your 
> suggestion: the ratio between a nice 0 loop and a nice 19 loop is now 
> set to about 2%. (This is something that users requested for some time, the 
> default ~5% is a tad high when running reniced SETI jobs, etc.)

Okay. Maybe what I suggested is too weak vs. too strong. I didn't
actually have it in mind as a proposal for general use, but maybe it is
good for such. I had more in mind tunability in general, but it's all
good. I'd think some curve gentler in intermediate nice levels and
stronger at the tails might be better.

On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> the actual percentage scales almost directly with the nice offset 
> granularity value, but if this should be exposed to users at all, i 
> agree that it would be better to directly expose this as some sort of 
> 'ratio between nice 0 and nice 19 tasks', right? Or some other, more 
> finegrained metric. Percentile is too coarse i think, and using 0.1% 
> units isn't intuitive enough i think. The sysctl handler would then 
> transform that 'human readable' sysctl value into the appropriate 
> internal nice-offset-granularity value (or whatever mechanism the 
> implementation ends up using).

I vaguely liked specifying the full table, but maybe it's too much
for a real user interface.

4-digit or 5-digit fixed point decimal sounds reasonable.
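
A sketch of the 1/(n+1), 1-n rule together with a 4-digit fixed-point
encoding (function names invented, units of 1/10000 chosen for the example).
Note that the formula as written gives 0.05 for nice 19, whereas the table
entry 0.0476 corresponds to 1/21.

```c
#include <assert.h>

/* Weight for a nice level per the rule quoted above:
 * 1/(n+1) for n >= 0, and 1-n for n < 0. */
double nice_weight(int nice)
{
    assert(nice >= -20 && nice <= 19);
    if (nice >= 0)
        return 1.0 / (nice + 1);
    return 1.0 - nice;
}

/* The same weight as 4-digit fixed point (units of 1/10000), rounded
 * to the nearest unit. */
long nice_weight_fixed(int nice)
{
    return (long)(nice_weight(nice) * 10000.0 + 0.5);
}
```

In this encoding nice 0 is 10000, nice -20 is 210000 and nice 19 is 500, so a
plain integer sysctl would be enough to carry the value.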

On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> I'd not do this as a per-nice-level thing but as a single value that 
> rescales the whole nice level range at once. That's a lot less easy to 
> misconfigure and we've got enough nice levels for users to pick from 
> almost arbitrarily, as long as they have the ability to influence the 
> max.
> does this sound mostly OK to you?

For the most part, yes. I've been mostly looking at how effectively
the prioritization algorithms work. I'll be wrapping up writing a
testcase to measure all this soon. The basic idea is to take the
weights as inputs somehow and then check to see that they're honored.

What's appropriate for end-users is a very different thing from what
might be appropriate for me. I won't have trouble fiddling with the
code, so please do design around what the best interface for end-users
might be.

-- wli

Re: [RFC][PATCH] reiserfs vs BKL

2007-04-21 Thread Peter Zijlstra

On Sat, 2007-04-21 at 12:14 -0400, Jeff Mahoney wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Peter Zijlstra wrote:
> > Replace all the lock_kernel() instances with reiserfs_write_lock(sb),
> > and make that use an actual per super-block mutex instead of
> > lock_kernel().
> > 
> > This should make reiserfs safe from PREEMPT_BKL=n, since it seems to
> > rely on being able to schedule. Also, it removes the dependency on the
> > BKL, and thereby is not prone to cause prio inversion with remaining BKL
> > users (notably tty).
> > 
> > Compile tested only, since I didn't dare boot it.
> 
> NACK.

darn, I suspected it would be wrong :-/, it was too easy to be right.

> Believe me, I would *love* to nuke the BKL from reiserfs, but a search
> and replace of this nature is just wrong. reiserfs_write_lock() using
> the BKL isn't an accident - it depends on its nesting properties. If you
> did try to boot this kernel, you'd deadlock pretty quickly.

Right, I see. 

> This one has been on my TODO list for a long time. Interestingly, I've
> been doing reiserfs xattr development recently using 2.6.21-rc7-git2,
> and I'm not seeing any of these messages.

Yeah, that would be pretty close to what I ran.

> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set
> # CONFIG_PREEMPT_BKL is not set

I have CONFIG_PREEMPT=y, but seeing how one of those points came from a
cond_resched() within a lock_kernel() section, I'm not seeing how you
don't get these.

Ah well, I really do hope you come up with something some day; for me it's
time to go change filesystems - bah, the pain...


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Willy Tarreau

On Sat, Apr 21, 2007 at 05:46:14PM +0200, Ingo Molnar wrote:
> 
> * Willy Tarreau <[EMAIL PROTECTED]> wrote:
> 
> > I promised to perform some tests on your code. I'm short in time right 
> > now, but I observed behaviours that should be commented on.
> 
> thanks for the feedback!
> 
> > 3) CFS-v4
> > 
> >   Feels even better, mouse movements are very smooth even under high 
> >   load. I noticed that X gets reniced to -19 with this scheduler. I've 
> >   not looked at the code yet but this looked suspicious to me. I've 
> >   reniced it to 0 and it did not change any behaviour. Still very 
> >   good. The 64 ocbench share equal CPU time and show exact same 
> >   progress after 2000 iterations. The CPU load is more smoothly spread 
> >   according to vmstat, and there's no idle (see below). BUT I now 
> >   think it was wrong to let new processes start with no timeslice at 
> >   all, because it can take tens of seconds to start a new process when 
> >   only 64 ocbench are there. [...]
> 
> ok, i'll modify that portion and add back the 50%/50% parent/child CPU 
> time sharing approach again. (which CFS had in -v1) That should not 
> change the rest of your test and should improve the task startup 
> characteristics.

If you remember, with 50/50, I noticed some difficulties to fork many
processes. I think that during a fork(), the parent has a higher probability
of forking other processes than the child. So at least, we should use
something like 67/33 or 75/25 for parent/child.
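
To put rough numbers on this, a toy sketch (names invented; it assumes the
parent forks repeatedly without its slice being refilled in between, which is
a simplification):

```c
#include <assert.h>
#include <stddef.h>

/* Split the parent's remaining timeslice at fork().
 * parent_pct is the share the parent keeps, e.g. 50, 67 or 75. */
void split_timeslice(unsigned *parent_slice, unsigned *child_slice,
                     unsigned parent_pct)
{
    unsigned total;

    assert(parent_slice != NULL && child_slice != NULL && parent_pct <= 100);
    total = *parent_slice;
    *parent_slice = total * parent_pct / 100;
    *child_slice = total - *parent_slice;    /* the rest goes to the child */
}

/* Number of back-to-back forks after which the parent's slice first
 * drops below min_slice (capped at 1000 iterations). */
unsigned forks_until_starved(unsigned slice, unsigned parent_pct,
                             unsigned min_slice)
{
    unsigned child, n = 0;

    while (slice >= min_slice && n < 1000) {
        split_timeslice(&slice, &child, parent_pct);
        n++;
    }
    return n;
}
```

Starting from a slice of 100 units with a floor of 10, the parent gets 4 forks
at 50/50, 6 at 67/33 and 8 at 75/25 before it runs dry in this model.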

There are many shell-scripts out there doing a lot of fork(), and it should
be reasonable to let them keep some CPU to continue to work.

Also, I believe that (in shells), most forked processes do not even consume
a full timeslice (eg: $(uname -n) is very fast). This means that assigning
them with a shorter one will not hurt them while preserving the shell's
performance against CPU hogs.

Willy


Re: [d_path 3/7] Add d_namespace_path() to compute namespace relative pathnames

2007-04-21 Thread Andreas Gruenbacher

On Saturday 21 April 2007 14:57, Tetsuo Handa wrote:
> So, you may want customized version of d_namespace_path()?

No. d_namespace_path() returns valid pathnames, just like d_path() does. 
Whatever quoting needed can be added to the resulting pathname.

Andreas

Re: [RFC][PATCH] reiserfs vs BKL

2007-04-21 Thread Jeff Mahoney

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Peter Zijlstra wrote:
> Replace all the lock_kernel() instances with reiserfs_write_lock(sb),
> and make that use an actual per super-block mutex instead of
> lock_kernel().
> 
> This should make reiserfs safe from PREEMPT_BKL=n, since it seems to
> rely on being able to schedule. Also, it removes the dependency on the
> BKL, and thereby is not prone to cause prio inversion with remaining BKL
> users (notably tty).
> 
> Compile tested only, since I didn't dare boot it.

NACK.

Believe me, I would *love* to nuke the BKL from reiserfs, but a search
and replace of this nature is just wrong. reiserfs_write_lock() using
the BKL isn't an accident - it depends on its nesting properties. If you
did try to boot this kernel, you'd deadlock pretty quickly.
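
The nesting issue can be shown with a toy single-threaded model. This is not
kernel code and all names are invented: one lock counts nested acquisitions
by its holder, the other does not, and the same outer/inner call pattern is
run against both.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of two lock flavours. */
struct nest_lock  { int depth; };   /* nestable: counts acquisitions */
struct plain_lock { int held;  };   /* mutex-like: no nesting */

/* A nested acquire by the holder always succeeds. */
void nest_acquire(struct nest_lock *l)
{
    assert(l != NULL);
    l->depth++;
}

void nest_release(struct nest_lock *l)
{
    assert(l != NULL && l->depth > 0);
    l->depth--;
}

/* Returns 1 on success, 0 if the lock is already held.  A real mutex
 * would sleep forever here when the holder is the caller itself. */
int plain_try_acquire(struct plain_lock *l)
{
    assert(l != NULL);
    if (l->held)
        return 0;
    l->held = 1;
    return 1;
}

void plain_release(struct plain_lock *l)
{
    assert(l != NULL && l->held);
    l->held = 0;
}

/* An outer operation whose inner helper takes the same lock again. */
int nested_path_with_nestable(struct nest_lock *l)
{
    nest_acquire(l);          /* outer */
    nest_acquire(l);          /* inner helper: fine, depth is now 2 */
    nest_release(l);
    nest_release(l);
    return l->depth == 0;     /* 1: completed and fully released */
}

int nested_path_with_plain(struct plain_lock *l)
{
    int inner_ok;

    if (!plain_try_acquire(l))            /* outer */
        return 0;
    inner_ok = plain_try_acquire(l);      /* inner helper: cannot proceed */
    plain_release(l);
    return inner_ok;                      /* 0: this is the deadlock case */
}
```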

This one has been on my TODO list for a long time. Interestingly, I've
been doing reiserfs xattr development recently using 2.6.21-rc7-git2,
and I'm not seeing any of these messages.

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFGKjhxLPWxlyuTD7IRAgCkAJ95MUehySJUUjBzl1ldr7BxESmQQACgjnkw
73KpJaH2G725AMJeWD02Arg=
=PaIV
-----END PGP SIGNATURE-----

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Willy Tarreau

On Sat, Apr 21, 2007 at 06:00:08PM +0200, Ingo Molnar wrote:
> 
> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> 
> > >   Feels even better, mouse movements are very smooth even under high 
> > >   load. I noticed that X gets reniced to -19 with this scheduler. 
> > >   I've not looked at the code yet but this looked suspicious to me. 
> > >   I've reniced it to 0 and it did not change any behaviour. Still 
> > >   very good.
> > 
> > Looks like this code does it:
> > 
> > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
> 
> correct. Note that Willy reniced X back to 0 so it had no relevance on 
> his test.

Anyway, my X was mostly unused (below 1% CPU), which was my intent when
replacing glxgears by ocbench. We have not settled yet about how to handle
the special case for X. Let's at least try to get the best schedulers without
this problem, then see how to make them behave the best taking X into account.

> Also note that i pointed this change out in the -v4 CFS 
> announcement:
> 
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as 
> ||keventd, OOM tasks and tasks doing privileged hardware access
> ||(such as Xorg).
> 
> i've attached it below in a standalone form, feel free to put it into 
> SD! :)

Con, I think it could be a good idea since you recommend to renice X with
SD. Most of the problem users are facing with renicing X is that they need
to change their configs or scripts. If the kernel can reliably detect X and
handle it differently, why not do it ?

It makes me think that this hint might be used to set some flags in the task
struct in order to apply different processing than just renicing. It is indeed
possible that nice is not the best solution and that something else would be
even better (eg: longer timeslices, but not changing priority in the queues).
Just an idea anyway.

OK, back to work ;-)
Willy


Re: [Devel] Re: [PATCH] bluetooth bnep: Convert to kthread API.

2007-04-21 Thread Satyam Sharma

Hello,

On 4/20/07, Cedric Le Goater <[EMAIL PROTECTED]> wrote:

Cedric Le Goater wrote:
> Andrew Morton wrote:
>> On Thu, 19 Apr 2007 01:58:51 -0600
>> "Eric W. Biederman" <[EMAIL PROTECTED]> wrote:
>>>
>>>
>>> +   task = kthread_run(bnep_session, s, "kbnepd %s", dev->name);
>> It's unusual to have a kernel thread which has a space in its name.  That
>> could trip up insufficiently-defensive userspace tools.

But all kernel threads are supposed to be only *in-kernel*
implementation details. Isn't a userspace tool whose behaviour relies
on the existence (or even the knowledge of the existence) of any
kernel thread *broken by design*?

> but we can't just change it, can we ? it could be used by a user space tool
> to check if the thread is running.

Yes, so although userspace shouldn't be bothering with kernel threads
in the first place, that does not mean that such tools do not exist.
So we'll have to live with this (unfortunate) naming for some time,
till we can get rid of it later.
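
For tools that do read thread names, a defensive way to pull the name out of
a /proc/<pid>/stat line is to take everything between the first '(' and the
last ')' instead of splitting on whitespace. A sketch (the function name and
the sample lines used to exercise it are made up):

```c
#include <assert.h>
#include <string.h>

/* Extract the command name from a /proc/<pid>/stat line such as
 * "1234 (kbnepd bnep0) S 2 ...".  The name sits between the first '('
 * and the LAST ')', so spaces inside it are harmless.
 * Returns 0 on success, -1 on malformed input or a too-small buffer. */
int stat_comm(const char *line, char *out, size_t outlen)
{
    const char *open, *close;
    size_t n;

    assert(line != NULL && out != NULL);
    open = strchr(line, '(');
    close = strrchr(line, ')');
    if (open == NULL || close == NULL || close < open)
        return -1;
    n = (size_t)(close - open - 1);
    if (n + 1 > outlen)                  /* need room for the NUL */
        return -1;
    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}
```

A parser that does sscanf("%d %s %c", ...) on the same line would stop at the
space and misread every following field.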

Which is similar to the habit of some kernel threads in there that
actually *do* want to export the knowledge of their existence (and
even a signals-based interface!) to userspace. Eric did receive some
nacks on his patches that tried to remove the signals business from
kernel threads on this account, but perhaps that too is something that
we could get rid of later (hopefully by that time those using signals
in kernel threads would have realized their folly and shifted to
something else :-)

Satyam
