Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-24 Thread Linas Vepstas
On Tue, Apr 24, 2007 at 11:38:53AM +1000, Benjamin Herrenschmidt wrote:
> > The only reason for using threads here is to get the error recovery
> > out of an interrupt context (where errors may be detected), and then,
> > an hour later, decrement a counter (which is how we limit these to 
> > 6 per hour). Thread reaping is "trivial", the thread just exits
> > after an hour.
> 
> In addition, it should be a thread and not done from within keventd
> because :
> 
>  - It can take a long time (well, relatively but still too long for a
> work queue)

Uhh, 15 or 20 seconds even. That's a long time by any kernel standard.

>  - The driver callbacks might need to use keventd or do flush_workqueue
> to synchronize with their own workqueues when doing an internal
> recovery.
> 
> > Since these events are rare, I've no particular concern about
> > performance or resource consumption. The current code seems 
> > to work just fine. :-)
> 
> I think moving to kthreads is cleaner (just a wrapper around kernel
> threads that mostly simplifies dealing with reaping them) and I agree
> with Christoph that it would be nice to be able to "fire off" kthreads
> from interrupt context.. in many cases, we abuse work queues for things
> that should really be done from kthreads instead (basically anything that
> takes more than a couple hundred microsecs or so).

It would be nice to have threads that can be "fired off" from an
interrupt context.  That would simplify the EEH code slightly 
(removing a few dozen lines of code that do this bounce).

I presume that various device drivers might find this useful as well.

--linas

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cfq: get rid of cfqq hash

2007-04-24 Thread Jens Axboe
On Tue, Apr 24 2007, Jens Axboe wrote:
> On Tue, Apr 24 2007, Vasily Tarasov wrote:
> > From: Vasily Tarasov <[EMAIL PROTECTED]>
> > 
> > > The cfq hash is no longer necessary.  We can always get the cfqq from the
> > > io context.  The cfq_get_io_context_noalloc() function is introduced
> > > because we don't want to allocate a cic on merging and checking may_queue.
> > > In order to identify the sync queue we've used hash key = CFQ_KEY_ASYNC.
> > > Since the hash is eliminated we need another criterion: a sync flag for
> > > the queue is added.
> > > In all places where we dig in rb_tree we're in current context, so no
> > > additional locking is required.
> > 
> > > Advantages of this patch: no additional memory for the hash, no seeking in
> > > the hash, and the code is cleaner. It is now necessary to look up the cic
> > > in the per-ioc rbtree, but that is faster:
> > - most processes work only with few devices
> > - most systems have only few block devices
> > - it is a rb-tree
> 
> Vasily, this is still not against the CFQ branch, I get tons of rejects:
> 
> [EMAIL PROTECTED]:/src/linux-2.6-block $ patch -p1 --dry-run < ~/foo
> [...]
> 10 out of 27 hunks FAILED -- saving rejects to file
> block/cfq-iosched.c.rej
> 
> If you don't want to use the git tree, then just grab
> 
> http://brick.kernel.dk/snaps/cfq-update-20070424
> 
> and apply it to 2.6.21-rc7-gitX (latest) and provide a diff against
> that. Thanks!

I merged it myself, care to double check it? I'll do some testing on it
tomorrow, and integrate if I'm happy with it.

Have you done any multi disk testing? scsi_debug can be quite handy for
such things, testing thousands of io_contexts and disks. Just be sure to
use delay=1 and fake_rw=1.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8093733..e42c09b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -9,7 +9,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -38,14 +37,6 @@ static int cfq_slice_idle = HZ / 125;
 
 #define CFQ_SLICE_SCALE	(5)
 
-#define CFQ_KEY_ASYNC  (0)
-
-/*
- * for the hash of cfqq inside the cfqd
- */
-#define CFQ_QHASH_SHIFT	6
-#define CFQ_QHASH_ENTRIES  (1 << CFQ_QHASH_SHIFT)
-
 #define RQ_CIC(rq) ((struct cfq_io_context*)(rq)->elevator_private)
 #define RQ_CFQQ(rq)((rq)->elevator_private2)
 
@@ -62,8 +53,6 @@ static struct completion *ioc_gone;
 #define ASYNC  (0)
 #define SYNC   (1)
 
-#define cfq_cfqq_sync(cfqq)((cfqq)->key != CFQ_KEY_ASYNC)
-
 #define sample_valid(samples)  ((samples) > 80)
 
 /*
@@ -90,11 +79,6 @@ struct cfq_data {
struct cfq_rb_root service_tree;
unsigned int busy_queues;
 
-   /*
-* cfqq lookup hash
-*/
-   struct hlist_head *cfq_hash;
-
int rq_in_driver;
int sync_flight;
int hw_tag;
@@ -138,10 +122,6 @@ struct cfq_queue {
atomic_t ref;
/* parent cfq_data */
struct cfq_data *cfqd;
-   /* cfqq lookup hash */
-   struct hlist_node cfq_hash;
-   /* hash key */
-   unsigned int key;
/* service_tree member */
struct rb_node rb_node;
/* service_tree key */
@@ -186,6 +166,7 @@ enum cfqq_state_flags {
CFQ_CFQQ_FLAG_prio_changed, /* task priority has changed */
CFQ_CFQQ_FLAG_queue_new,/* queue never been serviced */
CFQ_CFQQ_FLAG_slice_new,/* no requests dispatched in slice */
+   CFQ_CFQQ_FLAG_sync, /* synchronous queue */
 };
 
 #define CFQ_CFQQ_FNS(name) \
@@ -212,11 +193,14 @@ CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
 CFQ_CFQQ_FNS(queue_new);
 CFQ_CFQQ_FNS(slice_new);
+CFQ_CFQQ_FNS(sync);
 #undef CFQ_CFQQ_FNS
 
-static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *, unsigned int, unsigned short);
+static struct cfq_io_context * cfq_get_io_context_noalloc(struct cfq_data *,
+ struct task_struct *);
 static void cfq_dispatch_insert(request_queue_t *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, unsigned int, struct task_struct *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
+  struct task_struct *, gfp_t);
 
 /*
  * scheduler run of queue, if there are requests pending and no one in the
@@ -235,17 +219,6 @@ static int cfq_queue_empty(request_queue_t *q)
return !cfqd->busy_queues;
 }
 
-static inline pid_t cfq_queue_pid(struct task_struct *task, int rw, int is_sync)
-{
-   /*
-* Use the per-process queue, for read requests and syncronous writes
-*/
-   if (!(rw & REQ_RW) || is_sync)
-   return task->pid;
-
-

Re: SOME STUFF ABOUT REISER4

2007-04-24 Thread Eric M. Hopper
On Mon, 2007-04-23 at 23:17 -0700, [EMAIL PROTECTED] wrote:
> On Sun, 22 Apr 2007 19:00:46 -0700, "Eric Hopper"
> <[EMAIL PROTECTED]> said:
> 
> > I know that this whole effort has been put in disarray by the
> > prosecution of Hans Reiser, but I'm curious as to its status. Is
> > Reiser4 going to be going into the Linus kernel anytime soon? Is there
> > somewhere I should be looking to find this out without wasting bandwidth
> > here?
> 
> There was a thread the other day, that talked about Reiser4.
> 
> It took a while but I have found it (actually two)
> 
> http://lkml.org/lkml/2007/4/5/360
> http://lkml.org/lkml/2007/4/9/4
> 
> You may want to check them out.

I did.  That whole thread is some guy spouting off a ludicrous Bonnie++
benchmark showing that compressing long strings of 0s results in things
taking up very little space and being very fast.

Such things will produce lots of flames and no useful information
whatsoever as is evinced by the half conspiracy theory, half truth the
thread degenerated into in the second message you linked to.

*sigh*,
-- 
Eric Hopper (http://www.omnifarious.org/~hopper/)




Re: [patch v2] Fixes and cleanups for earlyprintk aka boot console.

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 16:04:16 +0200 Gerd Hoffmann <[EMAIL PROTECTED]> wrote:

> > I get this, across netconsole:
> > 
> > [17179569.184000] console handover: boot [earlyvga_f_0] -> real [tty0]
> > 
> > wanna take a look at why there's cruft in bootconsole->name please?
> 
> -EFULL ;)
> 
> "earlyvga" is 8 chars.  struct console->name is char[8].  No space left 
> for the trailing '\0', the cruft comes from the next field (write 
> function pointer).  Obviously nobody ever printed the early console 
> names before.

doh.

> Hmm.  We can make the names shorter.  We can make the name field longer 
> (probably 16, it ends up taking that much anyway due to alignment, at 
> least on 64bit).  This looks best to me.  We could also use 
> printk("%.8s",name) to make printk stop after 8 chars, but I somehow 
> don't like hardcoding the length like this ...
> 

yup, making it 16 sounds simplest.  I'll do the patch, thanks.
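
The overflow discussed above can be reproduced in user space. The following is only a minimal sketch: `struct demo_console8` and `struct demo_console16` are hypothetical stand-ins for the name field and are not the kernel's struct console. It shows that "earlyvga" does not fit an 8-byte field once the terminator is counted, and that a precision taken from the field size (rather than a hardcoded 8) stops the formatter at the end of the array.

```c
#include <stdio.h>
#include <string.h>

#define OLD_NAME_LEN 8
#define NEW_NAME_LEN 16

/* Hypothetical stand-ins for the name field; not the kernel's struct console. */
struct demo_console8  { char name[OLD_NAME_LEN]; };
struct demo_console16 { char name[NEW_NAME_LEN]; };

/* Returns 1 if name, including its trailing '\0', fits in field_size bytes. */
int name_fits(const char *name, size_t field_size)
{
	return strlen(name) + 1 <= field_size;
}

/*
 * Format a possibly unterminated 8-byte name.  The precision bounds how
 * many bytes are read, so no terminator is required inside the field.
 */
int format_name_bounded(const struct demo_console8 *con, char *out, size_t out_size)
{
	return snprintf(out, out_size, "%.*s", (int)sizeof(con->name), con->name);
}
```

Widening the field, as agreed above, avoids the need for the bounded format altogether; the bounded variant is shown only because it was mentioned as an alternative.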


Re: Getting the new RxRPC patches upstream

2007-04-24 Thread Oleg Nesterov
On 04/24, David Howells wrote:
>
> Oleg Nesterov <[EMAIL PROTECTED]> wrote:
> 
> > Great. I'll send the s/del_timer_sync/del_timer/ patch.
> 
> I didn't say I necessarily agreed that this was a good idea.  I just meant
> that I agree that it will waste CPU.  You must still audit all uses of
> cancel_delayed_work().

Sure, I'll grep for cancel_delayed_work(). But unless I missed something,
this change should be completely transparent for all users. Otherwise, it
is buggy.

> > Aha, now I see what you mean. However, why is the code above better than
> > 
> > cancel_delayed_work(&afs_server_reaper);
> > schedule_delayed_work(&afs_server_reaper, 0);
> > 
> > ? (I assume we already changed cancel_delayed_work() to use del_timer).
> 
> Because calling schedule_delayed_work() is a waste of CPU if the timer expiry
> handler is currently running at this time as *that* is going to also schedule
> the delayed work item.

Yes. But otoh, try_to_del_timer_sync() is a waste of CPU compared to
del_timer() when the timer is not pending.

> > 1: lock_timer_base(), return -1, skip schedule_delayed_work().
> >
> > 2: check timer_pending(), return 0, call schedule_delayed_work(),
> >return immediately because test_and_set_bit(WORK_STRUCT_PENDING)
> >fails.
> 
> I don't see what you're illustrating here.  Are these meant to be two steps in
> a single process?  Or are they two alternate steps?

two alternate steps.

1 means
	if (try_to_cancel_delayed_work())
		schedule_delayed_work();

2 means
	cancel_delayed_work();
	schedule_delayed_work();

> > So I still don't think try_to_del_timer_sync() can help in this particular
> > case.
> 
> It permits us to avoid the test_and_set_bit() under some circumstances.

Yes. But lock_timer_base() is more costly.

> > To some extent, try_to_cancel_delayed_work is
> > 
> > int try_to_cancel_delayed_work(dwork)
> > {
> > 	ret = cancel_delayed_work(dwork);
> > 	if (!ret && work_pending(&dwork->work))
> > 		ret = -1;
> > 	return ret;
> > }
> > 
> > iow, work_pending() looks like a more "precise" indication that work->func()
> > is going to run soon.
> 
> Ah, but the timer routine may try to set the work item pending flag *after*
> the work_pending() check you have here.

No, delayed_work_timer_fn() doesn't set the _PENDING flag.

>  Furthermore, it would be better to avoid
> the work_pending() check entirely because that check involves interacting with
> atomic ops done on other CPUs.

Sure, the implementation of try_to_cancel_delayed_work() above is just for
illustration. I don't think we need try_to_cancel_delayed_work() at all.

>try_to_del_timer_sync() returning -1 tells us
> without a shadow of a doubt that the work item is either scheduled now or will
> be scheduled very shortly, thus allowing us to avoid having to do it ourself.

First, this is a very unlikely event, delayed_work_timer_fn() is very fast unless
interrupted.

The _PENDING flag won't be cleared until this work is executed by run_workqueue().
In general, work_pending() after del_timer() is imho a better way to avoid the
unneeded schedule_delayed_work().
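
The role of the pending bit in the argument above can be modelled in user space. This is a rough sketch under stated assumptions: `struct model_work`, `model_schedule()` and `model_run()` are hypothetical names, and the only thing modelled is the test-and-set on a pending flag that makes a second schedule attempt return immediately. Timers, workqueues and locking are deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model; only the "pending" bit is represented. */
struct model_work {
	atomic_bool pending;	/* stands in for WORK_STRUCT_PENDING */
	int queued;		/* how many times the item was really queued */
	int executed;		/* how many times the handler ran */
};

void model_init(struct model_work *w)
{
	atomic_init(&w->pending, false);
	w->queued = 0;
	w->executed = 0;
}

/* Returns true if the item was queued, false if it was already pending. */
bool model_schedule(struct model_work *w)
{
	/* atomic_exchange returns the previous value: true means "already set" */
	if (atomic_exchange(&w->pending, true))
		return false;
	w->queued++;
	return true;
}

/* Worker side of the model: the pending flag is cleared when the item runs. */
void model_run(struct model_work *w)
{
	atomic_store(&w->pending, false);
	w->executed++;
}
```

In this model a cancel-then-schedule sequence on an item that is already pending costs one failed atomic exchange, which is the cost being weighed against lock_timer_base() in the discussion.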

But again, I can't understand the win for that particular case.

Oleg.



Re: [OOPS 2.6.21-rc7-mm1] kernel BUG at fs/sysfs/inode.c:272 (sysfs_drop_dentry)

2007-04-24 Thread Andrew Morton
On Wed, 25 Apr 2007 01:33:59 +0900 Tejun Heo <[EMAIL PROTECTED]> wrote:

> Vincent Vanackere wrote:
> > Hi,
> > 
> > I'm getting the following oops at boot with the latest -mm kernel :
> > ---
> > kernel BUG at fs/sysfs/inode.c:272
> 
> Known problem.  Working on it.  Thanks.
> 

If it had been known by me I wouldn't have released known-buggy code to
people who need to be able to test other people's code too. (argh)



Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-24 Thread Linas Vepstas
On Thu, Apr 19, 2007 at 01:58:45AM -0600, Eric W. Biederman wrote:
> From: Eric W. Biederman <[EMAIL PROTECTED]>
> 
> This patch modifies the startup of eehd to use kthread_run
> not a combination of kernel_thread and daemonize.  Making
> the code slightly simpler and more maintainable.

For the patch that touched arch/powerpc/platforms/pseries/eeh_event.c,
I ran a variety of tests, and couldn't see/find/evoke any adverse
effects, so ..

Acked-by: Linas Vepstas <[EMAIL PROTECTED]>

> The second question is whether this is the right implementation.
> kthread_create already works by using a workqueue to create the thread
> and then waits for it.  If we really want to support creating threads
> asynchronously on demand we should have a proper API in kthread.c for
> this instead of spreading workqueues.

Yes, exactly; all I really want is to start a thread from an
interrupt context, and pass a structure to it.  This is pretty much
all that arch/powerpc/platforms/pseries/eeh_event.c is trying to do,
and little else.

--linas



Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Hugh Dickins wrote:

> I've not yet looked at the patch under discussion, but this remark
> prompts me...  a couple of days ago I got very worried by the various
> hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c,
> and wondered how those would work out if someone has a blockdev mmap'ed.

Hmmm. These are not that critical given that 32 bit NUMA systems are a bit
rare. And if a page is in the wrong area then it can be bounced before I/O
is performed on it.
 
> (If vma->vm_file is non-NULL, we can be sure vma->vm_file->f_mapping
> is non-NULL, can't we?  Some common code assumes that, some does not:
> I've avoided cargo-cult safety below, but don't let me make it unsafe.)
> 
> 
> Is there a problem with page migration to HIGHMEM, if pages were
> mapped from a GFP_USER block device?  I failed to demonstrate any
> problem, but here's a quick fix if needed.

If a page is migrated into a different zone and then a write attempt is
made on it, then the page will be bounced into a zone that the device
can write from.


Re: [patch] CFS scheduler, v3

2007-04-24 Thread Christoph Lameter
On Fri, 20 Apr 2007, Siddha, Suresh B wrote:

> > Last I checked it was workload-dependent, but there were things that
> > hammer it. I mostly know of the remote wakeup issue, but there could
> > be other things besides wakeups that do it, too.
> 
> remote wakeup was the main issue and the 0.5% improvement was seen
> on a two node platform. Aligning it reduces the number of remote
> cachelines that needs to be touched as part of this wakeup.

.5% is usually in the noise ratio. Are you consistently seeing an 
improvement or is that sporadic?



Re: [patch] CFS scheduler, v3

2007-04-24 Thread Siddha, Suresh B
On Tue, Apr 24, 2007 at 10:39:48AM -0700, Christoph Lameter wrote:
> On Fri, 20 Apr 2007, Siddha, Suresh B wrote:
> 
> > > Last I checked it was workload-dependent, but there were things that
> > > hammer it. I mostly know of the remote wakeup issue, but there could
> > > be other things besides wakeups that do it, too.
> > 
> > remote wakeup was the main issue and the 0.5% improvement was seen
> > on a two node platform. Aligning it reduces the number of remote
> > cachelines that needs to be touched as part of this wakeup.
> 
> .5% is usually in the noise ratio. Are you consistently seeing an 
> improvement or is that sporadic?

No. This is consistent. I am waiting for the perf data on a much much bigger
NUMA box.

Anyhow, this is a straightforward optimization and needs to be done. Do you
have any specific concerns?

thanks,
suresh


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Hugh Dickins wrote:

> I've not yet looked at the patch under discussion, but this remark
> prompts me...  a couple of days ago I got very worried by the various
> hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c,
> and wondered how those would work out if someone has a blockdev mmap'ed.

I hope you are not confused by the fact that memory policies are only
ever applied to one zone on a node. This is either HIGHMEM or NORMAL. 
There is no memory policy support for other than the highest zone.



Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Hugh Dickins
On Tue, 24 Apr 2007, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Hugh Dickins wrote:
> 
> > I've not yet looked at the patch under discussion, but this remark
> > prompts me...  a couple of days ago I got very worried by the various
> > hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c,
> > and wondered how those would work out if someone has a blockdev mmap'ed.
> 
> Hmmm. These are not that critical given that 32 bit NUMA systems are a bit
> rare.

That's true.  And everybody but the owners of those systems wish
fervently that they didn't exist ;)

> And if a page is in the wrong area then it can be bounced before I/O 
> is performed on it.

I think that much is also true, but not where the problem lies.
Isn't the problem that filesystems using these block devices
expect their metadata to be accessible without kmap calls?

Hugh


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Willy Tarreau
On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
> On Tuesday 24 April 2007, Ingo Molnar wrote:
> >* David Lang <[EMAIL PROTECTED]> wrote:
> >> > (Btw., to protect against such mishaps in the future i have changed
> >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
> >> > change real-time tasks to SCHED_OTHER, but to also renice negative
> >> > nice levels back to 0 - this will show up in -v6. That way you'd
> >> > only have had to hit SysRq-N to get the system out of the wedge.)
> >>
> >> if you are trying to unwedge a system it may be a good idea to renice
> >> all tasks to 0, it could be that a task at +19 is holding a lock that
> >> something else is waiting for.
> >
> >Yeah, that's possible too, but +19 tasks are getting a small but
> >guaranteed share of the CPU so eventually it ought to release it. It's
> >still a possibility, but i think i'll wait for a specific incident to
> >happen first, and then react to that incident :-)
> >
> > Ingo
> 
> In the instance I created, even the SysRq+b was ignored, and ISTR that's
> supposed to initiate a reboot, is it not?  So it was well and truly wedged.

On many machines I use this on, I have to release Alt while still holding B.
Don't know why, but it works like this.

Willy



[PATCH] use mutex instead of semaphore in RocketPort driver

2007-04-24 Thread Matthias Kaehlcke
The RocketPort driver uses a semaphore as a mutex. Use the mutex API
instead of the (binary) semaphore.

Signed-off-by: Matthias Kaehlcke <[EMAIL PROTECTED]>

--

diff --git a/drivers/char/rocket.c b/drivers/char/rocket.c
index 76357c8..faa5dd5 100644
--- a/drivers/char/rocket.c
+++ b/drivers/char/rocket.c
@@ -702,7 +702,7 @@ static void init_r_port(int board, int aiop, int chan, struct pci_dev *pci_dev)
}
}
spin_lock_init(&info->slock);
-   sema_init(&info->write_sem, 1);
+   mutex_init(&info->write_mtx);
rp_table[line] = info;
if (pci_dev)
tty_register_device(rocket_driver, line, &pci_dev->dev);
@@ -1662,7 +1662,7 @@ static void rp_put_char(struct tty_struct *tty, unsigned char ch)
return;
 
/*  Grab the port write semaphore, locking out other processes that try to write to this port */
-   down(&info->write_sem);
+   mutex_lock(&info->write_mtx);
 
 #ifdef ROCKET_DEBUG_WRITE
printk(KERN_INFO "rp_put_char %c...", ch);
@@ -1684,7 +1684,7 @@ static void rp_put_char(struct tty_struct *tty, unsigned char ch)
info->xmit_fifo_room--;
}
spin_unlock_irqrestore(&info->slock, flags);
-   up(&info->write_sem);
+   mutex_unlock(&info->write_mtx);
 }
 
 /*
@@ -1706,7 +1706,7 @@ static int rp_write(struct tty_struct *tty,
if (count <= 0 || rocket_paranoia_check(info, "rp_write"))
return 0;
 
-   down_interruptible(&info->write_sem);
+   mutex_lock_interruptible(&info->write_mtx);
 
 #ifdef ROCKET_DEBUG_WRITE
printk(KERN_INFO "rp_write %d chars...", count);
@@ -1777,7 +1777,7 @@ end:
wake_up_interruptible(&tty->poll_wait);
 #endif
}
-   up(&info->write_sem);
+   mutex_unlock(&info->write_mtx);
return retval;
 }
 
diff --git a/drivers/char/rocket_int.h b/drivers/char/rocket_int.h
index 3a8bcc8..04bcf61 100644
--- a/drivers/char/rocket_int.h
+++ b/drivers/char/rocket_int.h
@@ -1171,7 +1171,7 @@ struct r_port {
struct wait_queue *close_wait;
 #endif
spinlock_t slock;
-   struct semaphore write_sem;
+   struct mutex write_mtx;
 };
 
 #define RPORT_MAGIC 0x525001


-- 
Matthias Kaehlcke
Linux Application Developer
Barcelona


Usually when people are sad, they don't do anything. They just cry over
their condition. But when they get angry, they bring about a change
(Malcolm X)

using free software / Debian GNU/Linux | http://debian.org
gpg --keyserver pgp.mit.edu --recv-keys 47D8E5D4


Re: [patch] CFS scheduler, v3

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Siddha, Suresh B wrote:

> > .5% is usually in the noise ratio. Are you consistently seeing an 
> > improvement or is that sporadic?
> 
> No. This is consistent. I am waiting for the perf data on a much much bigger
> NUMA box.
> 
> Anyhow, this is a straightforward optimization and needs to be done. Do you
> have any specific concerns?

Yes, there should not be contention on per cpu data in principle. The
point of per cpu data is for the cpu to have access to contention free
cachelines.

If the data is contended then it should be moved out of per cpu data and
properly placed to minimize contention. Otherwise we will get into cacheline
aliases (__read_mostly in per cpu??) etc etc in the per cpu areas.



Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Hugh Dickins wrote:

> > And if a page is in the wrong area then it can be bounced before I/O 
> > is performed on it.
> 
> I think that much is also true, but not where the problem lies.
> Isn't the problem that filesystems using these block devices
> expect their metadata to be accessible without kmap calls?

Metadata is not movable nor subject to memory policies. It will never be 
mapped into a process space.




Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
> It seems fairly sensitive to .config settings.  See
> http://userweb.kernel.org/~akpm/config-sony.txt
>   

I haven't tried your config yet, but I haven't managed to reproduce it
by playing with the usual suspects in my config (SMP, PREEMPT).  Any
idea about which config changes make the difference?

Hm, is it caused by using sched_clock() to generate the printk
timestamps while generating the lock test output?

J


Re: [OOPS 2.6.21-rc7-mm1] kernel BUG at fs/sysfs/inode.c:272 (sysfs_drop_dentry)

2007-04-24 Thread Tejun Heo
Andrew Morton wrote:
> On Wed, 25 Apr 2007 01:33:59 +0900 Tejun Heo <[EMAIL PROTECTED]> wrote:
> 
>> Vincent Vanackere wrote:
>>> Hi,
>>>
>>> I'm getting the following oops at boot with the latest -mm kernel :
>>> ---
>>> kernel BUG at fs/sysfs/inode.c:272
>> Known problem.  Working on it.  Thanks.
>>
> 
> If it had been known by me I wouldn't have released known-buggy code to
> people who need to be able to test other people's code too. (argh)
> 

It's the problem Cornelia reported in the thread where the patch was posted.
Okay, somehow it's missing lkml on cc list.  I included everybody
related but missed lkml.  A lot of screw ups lately.  My apologies.
Currently, known problems are...

1. new sysfs_drop_dentry() not allowing drop of parent when it's not
empty.  Reported by Cornelia and this thread reports the same problem.
I'm working on it now.

2. using sysfs_dirent pointer as sysfs inode number doesn't work because
ino_t is smaller than ulong on some archs.  It's also broken for 32bit
apps on 64bit archs.  Updated patch just submitted for review.

3. Sysfs rename doesn't work.  I broke it during the first series of
sysfs updates.  Will fix in a few days.
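
Item 2 above can be illustrated with a small user-space sketch. It assumes a 32-bit inode number type and 64-bit object addresses; `model_ino32_t`, `ino_from_address()` and `ino_collides()` are hypothetical names chosen for this example, and the addresses used with them are made up. The point is only that truncating an address to a narrower type lets two distinct objects end up with the same number.

```c
#include <stdint.h>

/* Model of a 32-bit inode number type; the real ino_t width is arch-dependent. */
typedef uint32_t model_ino32_t;

/* Derive an "inode number" from an object address by plain truncation. */
model_ino32_t ino_from_address(uint64_t addr)
{
	return (model_ino32_t)addr;
}

/* Returns 1 if two distinct addresses collapse to the same 32-bit number. */
int ino_collides(uint64_t a, uint64_t b)
{
	return a != b && ino_from_address(a) == ino_from_address(b);
}
```

Two addresses that differ only above bit 31 collide under this scheme, which is why a pointer-derived number is not safe wherever the inode number type is narrower than a pointer.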

Again, sorry for all the troubles.  I'll be more careful.

-- 
tejun


Re: [patch] CFS scheduler, v3

2007-04-24 Thread Siddha, Suresh B
On Tue, Apr 24, 2007 at 10:47:45AM -0700, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > Anyhow, this is a straightforward optimization and needs to be done. Do you
> > have any specific concerns?
> 
> Yes, there should not be contention on per cpu data in principle. The
> point of per cpu data is for the cpu to have access to contention free
> cachelines.
>
> If the data is contended then it should be moved out of per cpu data and
> properly placed to minimize contention. Otherwise we will get into cacheline
> aliases (__read_mostly in per cpu??) etc etc in the per cpu areas.

yes, we were planning to move this to a different percpu section, where
all the elements in this new section will be cacheline aligned (both
at the start as well as the end)
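
What "aligned at the start as well as the end" means can be sketched in plain C11. This is an illustration only: the 64-byte line size, `struct model_percpu_item` and `model_items` are assumptions made for the example, not the kernel's per-cpu machinery. Aligning the element to the line size also pads it, because sizeof is always a multiple of the alignment, so neighbouring elements never share a cache line.

```c
#include <stdalign.h>
#include <stddef.h>

#define MODEL_CACHELINE 64	/* assumed line size; the real value is CPU-specific */

/*
 * Hypothetical per-CPU element.  The alignment fixes the start of the
 * element; the implied tail padding fixes the end.
 */
struct model_percpu_item {
	alignas(MODEL_CACHELINE) unsigned long counter;
};

/* One slot per CPU in this model. */
struct model_percpu_item model_items[4];

/* Index of the cache line, relative to the array base, that holds slot i. */
size_t model_line_of(size_t i)
{
	return (size_t)((const char *)&model_items[i] - (const char *)model_items)
		/ MODEL_CACHELINE;
}
```

The cost of this layout is memory: each element occupies a full line even when it holds a single word.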

thanks,
suresh


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 18:45:03 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote:

> On Tue, 24 Apr 2007, Christoph Lameter wrote:
> > On Tue, 24 Apr 2007, Hugh Dickins wrote:
> > 
> > > I've not yet looked at the patch under discussion, but this remark
> > > prompts me...  a couple of days ago I got very worried by the various
> > > hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c,
> > > and wondered how those would work out if someone has a blockdev mmap'ed.
> > 
> > > Hmmm. These are not that critical given that 32 bit NUMA systems are a bit
> > > rare.
> 
> That's true.  And everybody but the owners of those systems wish
> fervently that they didn't exist ;)
> 
> > And if a page is in the wrong area then it can be bounced before I/O 
> > is performed on it.
> 
> I think that much is also true, but not where the problem lies.
> Isn't the problem that filesystems using these block devices
> expect their metadata to be accessible without kmap calls?
> 

yup.  wherever we dereference buffer_head.b_data we're touching
page_address(buffer_head.b_page) without kmapping.


Re: [PATCH] eHCA: Add "Modify Port" verb

2007-04-24 Thread Roland Dreier
Looks good, applied for 2.6.22.


Support for Zilog or Infineon hardware?

2007-04-24 Thread Stuart MacDonald
We have been investigating Linux support for the following chips:

Zilog     Z16C30 (also known as the "USC" or just 16C30)
Zilog     Z16C32 (also known as the "IUSC" or just 16C32)
          http://www.zilog.com/products/family.asp?fam=200

Infineon  "SEROCCO" family
          SAB/SAF-20532 ("Serial Optimized Communications Controller")
          SAB/SAF-20542 ("Serial Optimized Communications Controller with DMA")
          http://www.infineon.com/cgi-bin/ifx/portal/ep/channelView.do?channelId=-65034&channelPage=%2Fep%2Fchannel%2FleafNote.jsp&pageTypeId=17099

I grepped 2.6.20 and 2.4.34 extensively, and found
drivers/char/synclink*.c which appear to use the 16C32. Google turned
up http://www.evolware.org/chri/serocco/index.html for the Infineon
chips, but it's an out-of-tree driver.

Is anyone aware of any other Linux drivers for those chips?

..Stu



Re: [PATCH] use mutex instead of semaphore in RocketPort driver

2007-04-24 Thread Oliver Neukum
Am Dienstag, 24. April 2007 19:49 schrieb Matthias Kaehlcke:
> @@ -1706,7 +1706,7 @@ static int rp_write(struct tty_struct *tty,
> if (count <= 0 || rocket_paranoia_check(info, "rp_write"))
> return 0;
>  
> -   down_interruptible(&info->write_sem);
> +   mutex_lock_interruptible(&info->write_mtx);

This is a bug. It is also present in the current code, but nevertheless
it is a bug. If you use an interruptible lock, you must be ready to deal
with interrupts, which are ignored by this code.

Regards
Oliver
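
The remark about interruptible locks can be sketched in user space: an interruptible lock may return without having taken the lock, so the caller has to check the return value and bail out. This is only a sketch under stated assumptions, not the RocketPort code. `fake_mutex`, `fake_lock_interruptible()`, `fake_unlock()`, `write_unchecked()` and `write_checked()` are hypothetical stand-ins. The one thing taken from the kernel is the return convention of mutex_lock_interruptible(): 0 on success, -EINTR when a signal arrives while sleeping.

```c
#include <errno.h>

/* Hypothetical user-space stand-in for a kernel mutex. */
struct fake_mutex {
	int locked;         /* 1 while the lock is held */
	int signal_pending; /* simulates a signal arriving during the wait */
};

/* Stand-in for mutex_lock_interruptible(): returns 0 on success and
 * -EINTR if the wait was interrupted. On -EINTR the lock is NOT held. */
static int fake_lock_interruptible(struct fake_mutex *m)
{
	if (m->signal_pending)
		return -EINTR;
	m->locked = 1;
	return 0;
}

static void fake_unlock(struct fake_mutex *m)
{
	m->locked = 0;
}

/* Shape of the code under review: the return value is ignored, so the
 * critical section may run without the lock. Returns 1 if that happened. */
static int write_unchecked(struct fake_mutex *m)
{
	int ran_unlocked;

	(void)fake_lock_interruptible(m);
	ran_unlocked = !m->locked; /* the critical section would run here */
	fake_unlock(m);
	return ran_unlocked;
}

/* Checked shape: give up if the lock was not taken. */
static int write_checked(struct fake_mutex *m, int count)
{
	int ret = fake_lock_interruptible(m);

	if (ret)
		return ret; /* a driver would commonly return -ERESTARTSYS here */
	/* critical section: the lock is held at this point */
	fake_unlock(m);
	return count;
}
```

With `signal_pending` set, `write_checked()` returns -EINTR without entering the critical section, while `write_unchecked()` reports that it ran unprotected.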


[git patches] net driver fixes

2007-04-24 Thread Jeff Garzik

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git 
upstream-linus

to receive the following updates:

 drivers/net/depca.c   |3 +-
 drivers/net/hamradio/baycom_ser_fdx.c |6 +++-
 drivers/net/sis900.c  |   44 +++--
 drivers/usb/net/pegasus.c |   17 +---
 drivers/usb/net/pegasus.h |3 +-
 5 files changed, 40 insertions(+), 33 deletions(-)

Andrea Righi (1):
  [netdrvr] depca: handle platform_device_add() failure

Andrew Morton (1):
  drivers/net/hamradio/baycom_ser_fdx build fix

Dan Williams (1):
  usb-net/pegasus: fix pegasus carrier detection

Neil Horman (1):
  sis900: Allocate rx replacement buffer before rx operation

diff --git a/drivers/net/depca.c b/drivers/net/depca.c
index 5113eef..f3807aa 100644
--- a/drivers/net/depca.c
+++ b/drivers/net/depca.c
@@ -1491,8 +1491,9 @@ static void __init depca_platform_probe (void)
depca_io_ports[i].device = pldev;
 
if (platform_device_add(pldev)) {
-   platform_device_put(pldev);
depca_io_ports[i].device = NULL;
+   pldev->dev.platform_data = NULL;
+   platform_device_put(pldev);
continue;
}
 
diff --git a/drivers/net/hamradio/baycom_ser_fdx.c b/drivers/net/hamradio/baycom_ser_fdx.c
index 59214e7..30baf6e 100644
--- a/drivers/net/hamradio/baycom_ser_fdx.c
+++ b/drivers/net/hamradio/baycom_ser_fdx.c
@@ -75,12 +75,14 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
 
+#include 
+#include 
+#include 
+
 /* - */
 
 #define BAYCOM_DEBUG
diff --git a/drivers/net/sis900.c b/drivers/net/sis900.c
index b3750f2..b2a3b19 100644
--- a/drivers/net/sis900.c
+++ b/drivers/net/sis900.c
@@ -1755,6 +1755,24 @@ static int sis900_rx(struct net_device *net_dev)
} else {
struct sk_buff * skb;
 
+   pci_unmap_single(sis_priv->pci_dev,
+   sis_priv->rx_ring[entry].bufptr, RX_BUF_SIZE,
+   PCI_DMA_FROMDEVICE);
+
+   /* refill the Rx buffer, what if there is not enought
+* memory for new socket buffer ?? */
+   if ((skb = dev_alloc_skb(RX_BUF_SIZE)) == NULL) {
+   /*
+* Not enough memory to refill the buffer
+* so we need to recycle the old one so
+* as to avoid creating a memory hole
+* in the rx ring
+*/
+   skb = sis_priv->rx_skbuff[entry];
+   sis_priv->stats.rx_dropped++;
+   goto refill_rx_ring;
+   }   
+
/* This situation should never happen, but due to
   some unknow bugs, it is possible that
   we are working on NULL sk_buff :-( */
@@ -1768,9 +1786,6 @@ static int sis900_rx(struct net_device *net_dev)
break;
}
 
-   pci_unmap_single(sis_priv->pci_dev,
-   sis_priv->rx_ring[entry].bufptr, RX_BUF_SIZE,
-   PCI_DMA_FROMDEVICE);
/* give the socket buffer to upper layers */
skb = sis_priv->rx_skbuff[entry];
skb_put(skb, rx_size);
@@ -1783,33 +1798,14 @@ static int sis900_rx(struct net_device *net_dev)
net_dev->last_rx = jiffies;
sis_priv->stats.rx_bytes += rx_size;
sis_priv->stats.rx_packets++;
-
-   /* refill the Rx buffer, what if there is not enought
-* memory for new socket buffer ?? */
-   if ((skb = dev_alloc_skb(RX_BUF_SIZE)) == NULL) {
-   /* not enough memory for skbuff, this makes a
-* "hole" on the buffer ring, it is not clear
-* how the hardware will react to this kind
-* of degenerated buffer */
-   if (netif_msg_rx_status(sis_priv))
-   printk(KERN_INFO "%s: Memory squeeze,"
-   "deferring packet.\n",
-   net_dev->name);
-   sis_priv->rx_skbuff[entry] = NULL;
-   /* reset buffer descriptor state */
-   sis

Re: [patch] CFS scheduler, v3

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Siddha, Suresh B wrote:

> On Tue, Apr 24, 2007 at 10:47:45AM -0700, Christoph Lameter wrote:
> > On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > > Anyhow, this is a straight forward optimization and needs to be done. Do you
> > > have any specific concerns?
> > 
> > Yes there should not be contention on per cpu data in principle. The 
> > point of per cpu data is for the cpu to have access to contention free 
> > cachelines.
> > 
> > If the data is contended then it should be moved out of per cpu data and
> > properly placed to minimize contention. Otherwise we will get into cacheline
> > aliases (__read_mostly in per cpu??) etc etc in the per cpu areas.
> 
> yes, we were planning to move this to a different percpu section, where
> all the elements in this new section will be cacheline aligned (both
> at the start, as well as end)

I would not call this a per cpu area. It is used by multiple cpus it 
seems. But for 0.5%? on what benchmark? Is it really worth it?
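
What "cacheline aligned at both the start and the end" means for a single element can be sketched in user-space C11. This is not the kernel's per-cpu machinery: the 64-byte line size, `struct line_owned`, `slots` and `same_cacheline()` are all assumptions made for illustration.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 64 /* assumed line size; the real value is CPU-specific */

/* Aligning the member raises the struct's alignment to 64 and pads its size
 * to a multiple of 64, so the element starts and ends on a line boundary. */
struct line_owned {
	alignas(CACHELINE_BYTES) unsigned long value;
};

/* One slot per (hypothetical) CPU; no two slots share a cache line. */
static struct line_owned slots[4];

/* Returns 1 if the two addresses fall into the same cache line. */
static int same_cacheline(const void *a, const void *b)
{
	return (uintptr_t)a / CACHELINE_BYTES == (uintptr_t)b / CACHELINE_BYTES;
}
```

The cost is the padding: each `unsigned long` now occupies a full line, which is part of why the question of whether the gain justifies it is a fair one.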



Re: cpufreq default governor

2007-04-24 Thread Ian E. Morgan

On 24/04/07, William Heimbigner <[EMAIL PROTECTED]> wrote:

On Tue, 24 Apr 2007, Michal Piotrowski wrote:
> On 24/04/07, William Heimbigner <[EMAIL PROTECTED]> wrote:
>>  On Tue, 24 Apr 2007, Michal Piotrowski wrote:
>>
>> >  On 24/04/07, William Heimbigner <[EMAIL PROTECTED]> wrote:
>> > >   On Tue, 24 Apr 2007, Michal Piotrowski wrote:
>> > >
>> > > >   Hi William,
>> > > >
>> > > >   On 24/04/07, William Heimbigner <[EMAIL PROTECTED]> wrote:
>> > > > >Question: is there some reason that kconfig does not allow for
>> > > > >default governors of conservative/ondemand/powersave?
>> > > >
>> > > >   Performance?
>> > > >
>> > > > >I'm not aware of any reason why one of those governors could not
>> > > > >be used as default.
>> > > >
>> > > >   My hardware doesn't work properly with ondemand governor. I hear
>> > > >   strange noises when frequency is changed.
>> > > >
>> > >
>> > >   That doesn't mean it isn't working, though.
>> >
>> >  I didn't say that cpufreq ondemand is broken. It's a hardware problem.
>> >
>> > >   I hear weird noises if the cpu
>> > >   is clocked anywhere from 333MHz to 1GHz (sounds like R2-D2 beeping
>> > >   noises in ultra high pitch?)
>> >
>> >  Yes, something like that.
>>
>>  Is it actually "not working" though, even at the hardware level?
>
> It works, but for me these sounds are very weird ;)
>
>>  To my
>>  knowledge those noises are normal, and aren't even signs of a hardware
>>  problem. I believe it is the natural result of changing frequencies at any
>>  time. If you change frequencies, especially in the low end of available
>>  frequencies, you should hear a very brief noise. A governor such as
>>  ondemand, which is rapidly switching the frequency from say, 333 MHz to
>>  2.66 GHz, is likely to make this much more noticeable.
>
> Ok, it might be normal behavior. I might be wrong, but IMO users
> prefer speed and no strange sounds as default setting.

I agree! My suggestion, however, is that if they do want a different
governor as the default, they can choose one.

There are some cases in which this could be very useful. A couple examples
would be the processor with poor cooling that overheats easily, or a
laptop with a poor battery.

However, on second thought with regards to Kconfig, would it be feasible
to have performance always be the default, unless a
"cpufreqgov=conservative" argument was specified on the command line?

This would be less susceptible to users complaining that their cpu is
chirping all of a sudden.


I'm all for the ability to set the default to whatever governor the
user wants. I _always_ run my laptops with the ondemand governor, my
Pentium M-based PVR runs with ondemand too, and only my old P4 box
doesn't because it's pointless. If you're running servers that
_aren't_ going to be idle most of the time, then by all means set your
default to performance, or just don't enable cpufreq at all, but give
the rest of us the option.

Particularly with laptops, I've always wanted the kernel to boot and
immediately slow the CPU down, even if all I do is boot into single
user mode, or even bypass init altogether. This will give best battery
life and coolest operation out of the box without having to rely on
userland whatsoever.

I had an old laptop a while back that _would_ overheat and shutdown
within a couple of minutes, even though idle, if booted to single user
mode because the cpu freq wasn't slowed down.

--
Ian Morgan


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Andrew Morton wrote:

> > I think that much is also true, but not where the problem lies.
> > Isn't the problem that filesystems using these block devices
> > expect their metadata to be accessible without kmap calls?
> > 
> 
> yup.  wherever we dereference buffer_head.b_data we're touching
> page_address(buffer_head.b_page) without kmapping.

Yes but before we get there we will bounce pagecache pages into an area 
where we do not need kmap.



Re: [OOPS 2.6.21-rc7-mm1] kernel BUG at fs/sysfs/inode.c:272 (sysfs_drop_dentry)

2007-04-24 Thread Andrew Morton
On Wed, 25 Apr 2007 02:51:43 +0900 Tejun Heo <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Wed, 25 Apr 2007 01:33:59 +0900 Tejun Heo <[EMAIL PROTECTED]> wrote:
> > 
> >> Vincent Vanackere wrote:
> >>> Hi,
> >>>
> >>> I'm getting the following oops at boot with the latest -mm kernel :
> >>> ---
> >>> kernel BUG at fs/sysfs/inode.c:272
> >> Known problem.  Working on it.  Thanks.
> >>
> > 
> > If it had been known by me I wouldn't have released known-buggy code to
> > people who need to be able to test other people's code too. (argh)
> > 
> 
> It's the problem Cornelia reported in the thread the patch was posted.

Is there a workaround?  What might happen if we just delete that BUG_ON()?


Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 10:51:35 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> 
wrote:

> Andrew Morton wrote:
> > It seems fairly sensitive to .config settings.  See
> > http://userweb.kernel.org/~akpm/config-sony.txt
> >   
> 
> I haven't tried your config yet, but I haven't managed to reproduce it
> by playing with the usual suspects in my config (SMP, PREEMPT).  Any
> idea about which config changes make the difference?

I said that because the damn thing went away when I was hunting it down
because I lost the config and was unable to remember the right combination
of debug settings.  Fortunately it later came back so I took care to
preserve the config.

> Hm, is it caused by using sched_clock() to generate the printk
> timestamps while generating the lock test output?

Conceivably.  What does that locking API test do?

I was using printk timestamps and netconsole at the time.


[PATCH 0/2] Fix two boot problems related to ZONE_MOVABLE sizing

2007-04-24 Thread Mel Gorman
Following this mail are two fixes for boot problems related to
ZONE_MOVABLE. They are fixes for memory partitioning where kernelcore=
is used and are unrelated to grouping pages by mobility.

The first patch moves kernelcore= parsing to common code. This avoids an
infinite loop that can occur when booting on IA64. As a side-effect,
it extends support of kernelcore= to all architectures that use
architecture-independent zone-sizing.

The second patch aligns ZONE_MOVABLE correctly. The bootmem allocator makes
assumptions on the alignment of zones. This can cause pages to be placed
on the freelists for the wrong zone resulting in a BUG() later. Aligning
ZONE_MOVABLE avoids the problem.

They have been successfully boot-tested with and without kernelcore=
specified on x86_64, ppc64 and IA64 (where the bug was first triggered).
-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


[PATCH 1/2] Handle kernelcore= boot parameter in common code to avoid boot problem on IA64

2007-04-24 Thread Mel Gorman

When the "kernelcore" boot option is specified, the kernel can't boot on ia64
because of an infinite loop.  In addition, the parsing code can be handled
in an architecture-independent manner.

This patch uses common code to handle the kernelcore= parameter.
It is only available to architectures that support arch-independent
zone-sizing (i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures
will ignore the boot parameter.

This effectively removes the following arch-specific patches:

ia64-specify-amount-of-kernel-memory-at-boot-time.patch
ppc-and-powerpc-specify-amount-of-kernel-memory-at-boot-time.patch
x86_64-specify-amount-of-kernel-memory-at-boot-time.patch
x86-specify-amount-of-kernel-memory-at-boot-time.patch


Signed-off-by: Yasunori Goto <[EMAIL PROTECTED]>
Acked-by: Mel Gorman <[EMAIL PROTECTED]>
Acked-by: Andy Whitcroft <[EMAIL PROTECTED]>
---

 arch/i386/kernel/setup.c   |1 -
 arch/ia64/kernel/efi.c |2 --
 arch/powerpc/kernel/prom.c |1 -
 arch/ppc/mm/init.c |2 --
 arch/x86_64/kernel/e820.c  |1 -
 include/linux/mm.h |1 -
 mm/page_alloc.c|4 +++-
 7 files changed, 3 insertions(+), 9 deletions(-)

Index: kernelcore/arch/ia64/kernel/efi.c
===
--- kernelcore.orig/arch/ia64/kernel/efi.c  2007-04-24 15:09:37.0 +0900
+++ kernelcore/arch/ia64/kernel/efi.c   2007-04-24 15:25:22.0 +0900
@@ -423,8 +423,6 @@ efi_init (void)
mem_limit = memparse(cp + 4, &cp);
} else if (memcmp(cp, "max_addr=", 9) == 0) {
max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
-   } else if (memcmp(cp, "kernelcore=",11) == 0) {
-   cmdline_parse_kernelcore(cp+11);
} else if (memcmp(cp, "min_addr=", 9) == 0) {
min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
} else {
Index: kernelcore/arch/i386/kernel/setup.c
===
--- kernelcore.orig/arch/i386/kernel/setup.c  2007-04-24 15:29:20.0 +0900
+++ kernelcore/arch/i386/kernel/setup.c 2007-04-24 15:29:39.0 +0900
@@ -195,7 +195,6 @@ static int __init parse_mem(char *arg)
return 0;
 }
 early_param("mem", parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 #ifdef CONFIG_PROC_VMCORE
 /* elfcorehdr= specifies the location of elf core header
Index: kernelcore/arch/powerpc/kernel/prom.c
===
--- kernelcore.orig/arch/powerpc/kernel/prom.c  2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/powerpc/kernel/prom.c   2007-04-24 15:30:25.0 +0900
@@ -431,7 +431,6 @@ static int __init early_parse_mem(char *
return 0;
 }
 early_param("mem", early_parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 /*
  * The device tree may be allocated below our memory limit, or inside the
Index: kernelcore/arch/ppc/mm/init.c
===
--- kernelcore.orig/arch/ppc/mm/init.c  2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/ppc/mm/init.c   2007-04-24 15:30:56.0 +0900
@@ -214,8 +214,6 @@ void MMU_setup(void)
}
 }
 
-early_param("kernelcore", cmdline_parse_kernelcore);
-
 /*
  * MMU_init sets up the basic memory mappings for the kernel,
  * including both RAM and possibly some I/O regions,
Index: kernelcore/arch/x86_64/kernel/e820.c
===
--- kernelcore.orig/arch/x86_64/kernel/e820.c   2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/x86_64/kernel/e820.c  2007-04-24 15:34:02.0 +0900
@@ -604,7 +604,6 @@ static int __init parse_memopt(char *p)
return 0;
 } 
 early_param("mem", parse_memopt);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 static int userdef __initdata;
 
Index: kernelcore/include/linux/mm.h
===
--- kernelcore.orig/include/linux/mm.h  2007-04-24 15:09:37.0 +0900
+++ kernelcore/include/linux/mm.h   2007-04-24 15:35:52.0 +0900
@@ -1051,7 +1051,6 @@ extern unsigned long find_max_pfn_with_a
 extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
-extern int cmdline_parse_kernelcore(char *p);
 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
 extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
Index: kernelcore/mm/page_alloc.c
===
--- kernelcore.orig/mm/page_alloc.c 2007-04-24 15:09:37.0 +0900
+++ kernelcore/mm/page_alloc.c  2007-04-24 16:00:21.0 +0900
@@ -3728,6 +3728,9 @@ in

[PATCH 2/2] Align ZONE_MOVABLE to a MAX_ORDER_NR_PAGES boundary

2007-04-24 Thread Mel Gorman

The boot memory allocator makes assumptions on the alignment of zone
boundaries even though the buddy allocator has no requirements on the
alignment of zones. This may cause boot problems in situations where
ZONE_MOVABLE is populated because the bootmem allocator assumes zones are
at least order-log2(BITS_PER_LONG) aligned. As the two potential users
(huge pages and memory hot-remove) of ZONE_MOVABLE would prefer a higher
alignment, this patch aligns the start of the zone instead of fixing the
different assumptions made by the bootmem allocator.

This patch rounds the start of ZONE_MOVABLE in each node to a
MAX_ORDER_NR_PAGES boundary. If the rounding pushes the start of ZONE_MOVABLE
above the end of the node then the zone will contain no memory and will not
be used at runtime. The value is rounded up instead of down as it is
better to have the kernel-portion of memory larger than requested instead
of smaller. The impact is that the kernel-usable portion of memory becomes a
minimum guarantee instead of the exact size requested by the user.


Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
Acked-by: Andy Whitcroft <[EMAIL PROTECTED]>
---

 page_alloc.c |5 +
 1 files changed, 5 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c
--- linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c  2007-04-24 09:38:30.0 +0100
+++ linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c   2007-04-24 11:15:40.0 +0100
@@ -3642,6 +3642,11 @@ restart:
usable_nodes--;
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
+   
+   /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
+   for (nid = 0; nid < MAX_NUMNODES; nid++)
+   zone_movable_pfn[nid] =
+   roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 }
 
 /**
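
The rounding described in the changelog can be tried out in user space. The sketch below is only an illustration under stated assumptions: `roundup()` is written out with the usual integer arithmetic, `MAX_ORDER_NR_PAGES` is assumed to be 1024 (the value for MAX_ORDER = 11; configurations differ), and `align_zone_movable()` is a hypothetical helper that mirrors the loop added by the patch.

```c
/* Integer round-up of x to the next multiple of y (y > 0). */
#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

/* Assumed value: MAX_ORDER = 11 gives 1 << (11 - 1) pages. */
#define MAX_ORDER_NR_PAGES 1024UL

/* Hypothetical helper mirroring the loop in the patch: round the start
 * of ZONE_MOVABLE on every node up to a MAX_ORDER_NR_PAGES boundary. */
static void align_zone_movable(unsigned long *zone_movable_pfn, int nr_nodes)
{
	int nid;

	for (nid = 0; nid < nr_nodes; nid++)
		zone_movable_pfn[nid] =
			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
}
```

A start pfn of 1000 moves up to 1024 and 1025 moves up to 2048, so the kernel portion only ever grows; a pfn of 0 stays 0 and an already aligned value is left alone.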


Re: [PATCH] ehea: fix for dlpar and sysfs entries

2007-04-24 Thread Jeff Garzik

Jan-Bernd Themann wrote:

This patch includes:
- dlpar fix:
	certain resources may only be allocated when first
	logical port is available, and must be removed when
	last logical port has been removed

- sysfs entries:
	create symbolic link from each logical port to ehea driver

Signed-off-by: Jan-Bernd Themann <[EMAIL PROTECTED]>


What Arnd said... this should be two patches




Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Li, Tong N
On Mon, 2007-04-23 at 18:57 -0700, Bill Huey wrote: 
> On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> > I don't know if we've discussed this or not. Since both CFS and SD claim
> > to be fair, I'd like to hear more opinions on the fairness aspect of
> > these designs. In areas such as OS, networking, and real-time, fairness,
> > and its more general form, proportional fairness, are well-defined
> > terms. In fact, perfect fairness is not feasible since it requires all
> > runnable threads to be running simultaneously and scheduled with
> > infinitesimally small quanta (like a fluid system). So to evaluate if a
> 
> Unfortunately, fairness is rather non-formal in this context and probably
> isn't strictly desirable given how hack much of Linux userspace is. Until
> there's a method of doing directed yields, like what Will has prescribed
> a kind of allotment to thread doing work for another a completely strict
> mechanism, it is probably problematic with regards to corner cases.
> 
> X for example is largely non-thread safe. Until they can get their xcb
> framework in place and addition thread infrastructure to do hand off
> properly, it's going to be difficult schedule for it. It's well known to
> be problematic.

I agree. I just think calling the designs "perfectly" or "completely"
fair is too strong. It might cause unnecessary confusion that
overshadows the actual merits of these designs. If we were to evaluate
specifically the fairness aspect of a design, then I'd suggest defining
it more formally.

> You announced your scheduler without CCing any of the relevant people here
> (and risk being completely ignored in lkml traffic):
> 
>   http://lkml.org/lkml/2007/4/20/286
> 
> What is your opinion of both CFS and SDL ? How can your work be useful
> to either scheduler mentioned or to the Linux kernel on its own ?

I like SD for its simplicity. My concern with CFS is the RB tree
structure. Log(n) seems high to me given the fact that we had an O(1)
scheduler. Many algorithms achieve strong fairness guarantees at the
cost of log(n) time. Thus, I tend to think, if log(n) is acceptable, we
might want also to look at other algorithms (e.g., start-time first)
with better fairness properties and see if they could be extended to be
general purpose.

> > I understand that via experiments we can show a design is reasonably
> > fair in the common case, but IMHO, to claim that a design is fair, there
> > needs to be some kind of formal analysis on the fairness bound, and this
> > bound should be proven to be constant. Even if the bound is not
> > constant, at least this analysis can help us better understand and
> > predict the degree of fairness that users would experience (e.g., would
> > the system be less fair if the number of threads increases? What happens
> > if a large number of threads dynamically join and leave the system?).
> 
> Will has been thinking about this, but you have to also consider the
> practicalities of your approach versus Con's and Ingo's.

I consider my work an approach to extend an existing scheduler to
support proportional fairness. I see many proportional-share designs are
lacking things such as good interactive support that Linux does well.
This is why I designed it on top of the existing scheduler so that it
can leverage things such as dynamic priorities. Regardless of the
underlying scheduler, SD or CFS, I think the algorithm I used would
still apply and thus we can extend the scheduler similarly.

> I'm all for things like proportional scheduling and the extensions
> needed to do it properly. It would be highly relevant to some version
> of the -rt patch if not that patch directly.

I'd love it to be considered for part of the -rt patch. I'm new to 
this, so would you please let me know what to do?

Thanks,

  tong


Re: SMP lockup in virtualized environment

2007-04-24 Thread Jeremy Fitzhardinge
LAPLACE Cyprien wrote:
> I wonder how the guest domain can be denied timer interrupts for such a
> long time ? The only reason I see is that the guest domain is not
> scheduled at all (host domain or another higher priority guest running).
>
> Now in SMP host and guest, what happens if a guest CPU is not scheduled
> for a while ?
>   

I think this mostly happens when you're doing an inherently
time-stealing thing, like pausing vcpus or suspend/resume.  But I guess
it could happen with a low-prio domain/vcpu on a busy system.

> An example: in kernel/pid.c:alloc_pid(), if one of the guest CPUs is
> descheduled when holding the pidmap_lock, what happens to the other
> guest CPUs who want to alloc/free pids ? Are they blocked too ?
>   

Yep; it's a problem.  If a vcpu is holding locks and not running, then
everyone else will be prevented from taking those locks.  If you have a
very busy system, then presumably all the vcpus will be similarly loaded
and the fact that it takes a long time to get locks just means you're
trying to run too many vcpus for your hardware's capacity.

The other problem is if you've got two vcpus sharing one real cpu, and
vcpu A spins while waiting for vcpu B to release a lock, and vcpu B is
waiting for A to get off the physical CPU.  I don't know how often this
is a real problem in practice.

J



Re: [patch 1/7] libata: check for AN support

2007-04-24 Thread Olivier Galibert
On Tue, Apr 24, 2007 at 08:49:04AM -0700, Kristen Carlson Accardi wrote:
> On Tue, 24 Apr 2007 12:23:04 +0200
> Olivier Galibert <[EMAIL PROTECTED]> wrote:
> 
> > Sorry for replying to Alan's reply, I missed the original mail.
> > 
> > > > +#define ata_id_has_AN(id)  \
> > > > +   ((id[76] && (~id[76])) & ((id)[78] & (1 << 5)))
> > 
> > (a && ~a) & (b & 32)
> > 
> > I don't think that does what you think it does, because at that point
> > it's a funny way to write 0 ((0 or 1) binary-and (0 or 32)).
> > 
> > I'm not even sure what it is you want.  If for the first part you
> > wanted (id[76] != 0x00 && id[76] != 0xff), please write just that,
> > thanks :-)
> > 
> >   OG.
> > 
> 
> From the serial ata spec, we have:
> 
> 13.2.1.18 Word 78: Serial ATA features supported
> If Word 76 is not 0000h or FFFFh, Word 78 reports the optional features
> supported by the device.  Support for this word is optional and if not
> supported the word shall be zero indicating the device has no support for new
> Serial ATA capabilities.
> 
> so, basically yes, I'm really testing to make sure that word 76 isn't 0 or all
> ones, then using that value & with the value of the bit in word 78 to determine AN
> support - if you think this is really obfuscated, I've got no problem
> changing it - there's obviously many ways to mess around with bits.

& is not &&, so right now it's really incorrect.  1 & 32 is 0.

((id)[76] != 0x0000 && (id)[76] != 0xffff && ((id)[78] & (1 << 5)))

The implicit typing of id looks dangerous to me, but you're not the
one who has started it.

  OG.
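
The difference between & and && in this macro can be checked with a small user-space sketch. It assumes 16-bit identify words and that the two constants in the suggested form are 0x0000 and 0xffff. The `_posted` and `_fixed` suffixes and the `an_posted()` / `an_fixed()` wrappers are hypothetical names for the comparison, not kernel identifiers.

```c
#include <stdint.h>

/* Macro as posted in the patch under review: the left operand of the
 * bitwise & is a logical result (0 or 1), the right operand is 0 or 32,
 * so the whole expression is always 0. */
#define ata_id_has_AN_posted(id) \
	((id[76] && (~id[76])) & ((id)[78] & (1 << 5)))

/* Suggested form: explicit comparisons joined with logical &&. */
#define ata_id_has_AN_fixed(id) \
	((id)[76] != 0x0000 && (id)[76] != 0xffff && ((id)[78] & (1 << 5)))

/* Thin wrappers so both variants can be called on the same data. */
static int an_posted(const uint16_t *id)
{
	return ata_id_has_AN_posted(id);
}

static int an_fixed(const uint16_t *id)
{
	return ata_id_has_AN_fixed(id);
}
```

For a device with a valid word 76 and bit 5 set in word 78, `an_posted()` still returns 0 while `an_fixed()` returns 1.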


Re: [patch] CFS scheduler, v3

2007-04-24 Thread Siddha, Suresh B
On Tue, Apr 24, 2007 at 10:55:45AM -0700, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > yes, we were planning to move this to a different percpu section, where
> > all the elements in this new section will be cacheline aligned(both
> > at the start, aswell as end)
> 
> I would not call this a per cpu area. It is used by multiple cpus it 
> seems. 

not decided about the name yet. but the area is allocated per cpu and yes,
the data can be referred by other cpus.

> But for 0.5%? on what benchmark? Is it really worth it?

famous database workload :)

I don't think the new section will be added for this 0.5%. This is a straight
fwd optimization, and anyone can plug into this new section in future.

thanks,
suresh


Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
> I said that because the damn thing went away when I was hunting it down
> because I lost the config and was unable to remember the right combination
> of debug settings.  Fortunately it later came back so I took care to
> preserve the config.
>   

sched_clock doesn't *do* anything except flap interrupts. Oh, wait, have
you got Andi's bugfixed version of the sched_clock patch?  The first
version did a local_save_flags rather than a local_irq_save.

>> Hm, is it caused by using sched_clock() to generate the printk
>> timestamps while generating the lock test output?
>> 
>
> Conceivably.  What does that locking API test do?
>   

Didn't make a difference here.  Building your config now.

> I was using printk timestamps and netconsole at the time.
>   

Ah, great, now you're going to make me setup netconsole...

J



Re: [PATCH] use mutex instead of semaphore in RocketPort driver

2007-04-24 Thread Matthias Kaehlcke
El Tue, Apr 24, 2007 at 07:53:04PM +0200 Oliver Neukum ha dit:

> Am Dienstag, 24. April 2007 19:49 schrieb Matthias Kaehlcke:
> > @@ -1706,7 +1706,7 @@ static int rp_write(struct tty_struct *tty,
> > if (count <= 0 || rocket_paranoia_check(info, "rp_write"))
> > return 0;
> >  
> > -   down_interruptible(&info->write_sem);
> > +   mutex_lock_interruptible(&info->write_mtx);
> 
> This is a bug. It is also present in the current code, but nevertheless
> it is a bug. If you use an interruptible lock, you must be ready to deal
> with interrupts, which are ignored by this code.

i fear i don't have the experience/knowledge to fix this bug, thanks
for your remark. 

i'm a bit confused now about the interruptible locks, i thought using
them means that the process will be woken up when receiving a
signal. what role do interrupts play when using interruptible locks?

-- 
Matthias Kaehlcke
Linux Application Developer
Barcelona

 Freedom is like the morning. There are those who wait for it asleep,
   but there are those who stay awake and walk through the night to reach it
(Subcomandante Marcos)

using free software / Debian GNU/Linux | http://debian.org
gpg --keyserver pgp.mit.edu --recv-keys 47D8E5D4
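
On the question of what "interruptible" means here: an interruptible lock can return without the lock held when the sleeping task is interrupted by a signal, so the caller has to check the return value and bail out. The following is a userspace model only, not the RocketPort driver. `struct mutex`, `mutex_lock_interruptible()` and `mutex_unlock()` are stand-in stubs with kernel-like names so that the sketch compiles outside the kernel, `signal_pending` is a made-up field that simulates a pending signal, `struct port_model` and `rp_write_model()` are hypothetical, and `ERESTARTSYS` is defined locally because userspace `<errno.h>` does not provide it.

```c
#include <errno.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512	/* kernel-internal errno; defined here for the model only */
#endif

/* Stand-in for the kernel's struct mutex (model only). */
struct mutex {
	int locked;
	int signal_pending;	/* hypothetical: simulates a signal arriving while waiting */
};

/* Model: returns 0 with the lock held, or -EINTR without the lock. */
static int mutex_lock_interruptible(struct mutex *m)
{
	if (m->signal_pending)
		return -EINTR;
	m->locked = 1;
	return 0;
}

static void mutex_unlock(struct mutex *m)
{
	m->locked = 0;
}

/* Hypothetical, much-reduced port state. */
struct port_model {
	struct mutex write_mtx;
	int bytes_written;
};

/* Hypothetical write path that does not ignore the lock's return value. */
static int rp_write_model(struct port_model *info, int count)
{
	if (count <= 0)
		return 0;

	/* Without this check the code below would run unlocked after a signal. */
	if (mutex_lock_interruptible(&info->write_mtx))
		return -ERESTARTSYS;

	info->bytes_written += count;
	mutex_unlock(&info->write_mtx);
	return count;
}
```

If a path cannot sensibly back out on a signal, the alternative is to use the non-interruptible lock instead of ignoring the return value.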


Re: NFS: Unable to handle kernel NULL pointer dereference at nfs_set_page_dirty+0xd/0x5d in 2.6.21rc7-git6

2007-04-24 Thread Andi Kleen

> 
> Does this patch fix it?

Didn't hit that particular oops with that anymore in several LTP runs and
your patch applied, but got this data corruption:

doio(rwtest04) (20172) 19:01:03
-
*** DATA COMPARISON ERROR ***
check_file(/tmp/ltp-14369/mm-sync-20156, 12754624, 23879, R:20172:bigfoot:doio*,
 21, 0) failed

Comparison fd is 5, with open flags 0
Corrupt regions follow - unprintable chars are represented as '.'
-
corrupt bytes starting at file offset 12754624
1st 32 expected bytes:  R:20172:bigfoot:doio*R:20172:big
1st 32 actual bytes:

Request number 622
  fd 4 is file /tmp/ltp-14369/mm-sync-20156 - open flags are 010002 O_RD
WR,O_SYNC,
  write done at file offset 12754624 - pattern is R (0122)
  number of requests is 1, strides per request is 1
  i/o byte count = 23879
  memory alignment is unaligned

syscall:  mmap-write(NULL, 1280, PROT_WRITE, MAP_SHARED, 4, 0)
file is mmaped to: 0x2b982c7ca000
file-mem=0x2b982d3f3ec0, length=23879, buffer=0x52d543


doio(rwtest04) (20169) 19:01:03
-
(parent) pid 20172 exited because of data compare errors


To reproduce: run runltplite.sh of LTP over NFS a few times

-Andi



Re: Getting the new RxRPC patches upstream

2007-04-24 Thread David Howells
Oleg Nesterov <[EMAIL PROTECTED]> wrote:

> Sure, I'll grep for cancel_delayed_work(). But unless I missed something,
> this change should be completely transparent for all users. Otherwise, it
> is buggy.

I guess you will have to make sure that cancel_delayed_work() is always
followed by a flush of the workqueue, otherwise you might get this situation:

	CPU 0				CPU 1
	===============================	===============================

	cancel_delayed_work(x) == 0	-->delayed_work_timer_fn(x)
	kfree(x);			-->do_IRQ()
	y = kmalloc(); // reuses x
					<--do_IRQ()
					__queue_work(x)
					--- OOPS ---

That's my main concern.  If you are certain that can't happen, then fair
enough.

Note that although you can call cancel_delayed_work() from within a work item
handler, you can't then follow it up with a flush as it's very likely to
deadlock.
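
The rule argued for above can be modelled in a few lines. This is a userspace sketch only, under stated assumptions: `struct dwork_model` and the `*_model()` functions are hypothetical stand-ins rather than the kernel workqueue API, and the model collapses the timer handler and the queued work item into a single `handler_running` flag. It shows why a cancel that returns 0 has to be followed by a flush before the object is freed.

```c
#include <stdbool.h>

/* Hypothetical model of a delayed work item (not the kernel structure). */
struct dwork_model {
	bool timer_pending;	/* timer armed, has not fired yet */
	bool handler_running;	/* timer fired; handler may still touch the item */
};

/* Model of cancel: returns 1 if the pending timer was removed, else 0. */
static int cancel_delayed_work_model(struct dwork_model *w)
{
	if (w->timer_pending) {
		w->timer_pending = false;
		return 1;
	}
	return 0;	/* too late: the handler may already be running */
}

/* Model of a flush: returns once any running handler has finished. */
static void flush_model(struct dwork_model *w)
{
	w->handler_running = false;
}

/* Returns true if freeing the item afterwards would be safe in this model. */
static bool teardown_is_safe(struct dwork_model *w, bool flush_after_failed_cancel)
{
	if (!cancel_delayed_work_model(w) && flush_after_failed_cancel)
		flush_model(w);
	return !w->timer_pending && !w->handler_running;
}
```

In the model, skipping the flush after a failed cancel leaves `handler_running` set, which corresponds to the `__queue_work(x)` on freed memory in the diagram.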

> > Because calling schedule_delayed_work() is a waste of CPU if the timer
> > expiry handler is currently running at this time as *that* is going to
> > also schedule the delayed work item.
> 
> Yes. But otoh, try_to_del_timer_sync() is a waste of CPU compared to
> del_timer(), when the timer is not pending.

I suppose that's true.  As previously stated, my main objection to del_timer()
is the fact that it doesn't tell you if the timer expiry function is still
running.

Can you show me a patch illustrating exactly how you want to change
cancel_delayed_work()?  I can't remember whether you've done so already, but
if you have, I can't find it.  Is it basically this?:

 static inline int cancel_delayed_work(struct delayed_work *work)
 {
 	int ret;
 
-	ret = del_timer_sync(&work->timer);
+	ret = del_timer(&work->timer);
 	if (ret)
 		work_release(&work->work);
 	return ret;
 }

I was thinking this situation might be a problem:

	CPU 0				CPU 1
	===============================	===============================

	cancel_delayed_work(x) == 0	-->delayed_work_timer_fn(x)
	schedule_delayed_work(x,0)	-->do_IRQ()

	x->work()
					<--do_IRQ()
					__queue_work(x)

But it won't, will it?

> > Ah, but the timer routine may try to set the work item pending flag
> > *after* the work_pending() check you have here.
> 
> No, delayed_work_timer_fn() doesn't set the _PENDING flag.

Good point.  I don't think that's a problem because cancel_delayed_work()
won't clear the pending flag if it didn't remove a timer.

> First, this is very unlikely event, delayed_work_timer_fn() is very fast
> unless interrupted.

Yeah, I guess so.


Okay, you've convinced me, I think - provided you consider the case I
outlined at the top of this email.

If you give me a patch to alter cancel_delayed_work(), I'll substitute it for
mine and use that instead.  Dave Miller will just have to live with that
patch being there:-)

David


2.6.21-rc7-mm1: BUG at init_sched_clock()

2007-04-24 Thread Alexey Dobriyan
...
CPU1: Thermal monitoring enabled (TM2)
Intel(R) Core(TM)2 CPU  6400  @ 2.13GHz stepping 02
checking TSC synchronization [CPU#0 -> CPU#1]: passed.
Brought up 2 CPUs
migration_cost=
BUG: at arch/x86_64/kernel/../../i386/kernel/sched-clock.c:175 init_sched_clock()

Call Trace:
 [] show_trace+0x34/0x4f
 [] dump_stack+0x12/0x17
 [] init_sched_clock+0x59/0x8a
 [] kernel_init+0x167/0x2dc
 [] child_rip+0xa/0x12

It didn't happen in 2.6.21-rc6-mm1 nor in mainline. And migration cost was 15.
However, box boots to the end as usual. /proc/cpuinfo shows 2 CPUs.

[reboots box]
Now migration cost is 1 and same trace.

--
Should-be-relevant config options:

CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_MCORE2=y
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_HT=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_SMP=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_NR_CPUS=2
CONFIG_HPET_TIMER=y
CONFIG_HZ=100
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_DETECT_SOFTLOCKUP=y

--
Linux version 2.6.21-rc7-mm1 ([EMAIL PROTECTED]) (gcc version 4.1.1 (Gentoo 4.1.1-r3)) #1 SMP PREEMPT Wed Apr 25 01:58:43 MSD 2007
Command line: root=/dev/sda2 [EMAIL PROTECTED]/eth0,[EMAIL PROTECTED]/00:80:48:45:EC:73 ignore_loglevel
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009fc00 (usable)
 BIOS-e820: 0009fc00 - 000a (reserved)
 BIOS-e820: 000e4000 - 0010 (reserved)
 BIOS-e820: 0010 - 7ff9 (usable)
 BIOS-e820: 7ff9 - 7ff9e000 (ACPI data)
 BIOS-e820: 7ff9e000 - 7ffe (ACPI NVS)
 BIOS-e820: 7ffe - 8000 (reserved)
 BIOS-e820: fee0 - fee01000 (reserved)
 BIOS-e820: ffb0 - 0001 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524176) 1 entries of 256 used
end_pfn_map = 1048576
DMI 2.4 present.
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524176) 1 entries of 256 used
sizeof(struct page) = 56
Zone PFN ranges:
  DMA 0 -> 4096
  DMA324096 ->  1048576
  Normal1048576 ->  1048576
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
0:0 ->  159
0:  256 ->   524176
On node 0 totalpages: 524079
Node 0 memmap at 0x81000100 size 29360128 first pfn 0x81000100
  DMA zone: 56 pages used for memmap
  DMA zone: 1636 pages reserved
  DMA zone: 2307 pages, LIFO batch:0
  DMA32 zone: 7110 pages used for memmap
  DMA32 zone: 512970 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap
Intel MultiProcessor Specification v1.4
MPTABLE: OEM ID: P5B  MPTABLE: Product ID:  MPTABLE: APIC at: 0xFEE0
Processor #0 (Bootup-CPU)
Processor #1
I/O APIC #2 at 0xFEC0.
Setting APIC routing to flat
Processors: 2
Allocating PCI resources starting at 8800 (gap: 8000:7ee0)
PERCPU: Allocating 24832 bytes of per cpu data
Built 1 zonelists, mobility grouping on.  Total pages: 515277
Kernel command line: root=/dev/sda2 [EMAIL PROTECTED]/eth0,[EMAIL 
PROTECTED]/00:80:48:45:EC:73 ignore_loglevel
netconsole: local port 6665
netconsole: local IP 10.10.0.42
netconsole: interface eth0
netconsole: remote port 9353
netconsole: remote IP 10.10.0.1
netconsole: remote ethernet address 00:80:48:45:ec:73
debug: ignoring loglevel setting.
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
time.c: Detected 2135.087 MHz processor.
Console: colour VGA+ 80x25
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:8
... MAX_LOCK_DEPTH:  30
... MAX_LOCKDEP_KEYS:2048
... CLASSHASH_SIZE:   1024
... MAX_LOCKDEP_ENTRIES: 8192
... MAX_LOCKDEP_CHAINS:  16384
... CHAINHASH_SIZE:  8192
 memory used by lock dependency info: 1648 kB
 per task-struct memory footprint: 1680 bytes
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Memory: 2057452k/2096704k available (1421k kernel code, 38652k reserved, 918k 
data, 148k init)
SLUB: General Slabs=18, HW alignment=64, Processors=2, Nodes=1
Calibrating delay using timer specific routine.. 4273.12 BogoMIPS (lpj=21365632)
Mount-cache hash table entries: 256
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
using mwait in idle threads.
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU0: Thermal monitoring enabled (TM2)
Freeing SMP alternatives: 16k freed
ExtINT not setup in hardware but reported by MP table
Using local APIC timer interrupts.
result 16680360
Detected 16.680 MH

rsdl v46 report,numbers,comments

2007-04-24 Thread Mike Mattie
Hello,

0. intro

I am very happy to report that v46 of RSDL is subjectively much better than
v42. As you (Con Kolivas) might remember from a previous mail, I was
experimenting with using nice levels effectively. I have refined these
levels to this layout:

-2  : clock (ntpd)
-1  : syslog,sshd,X
0   : command; default for shells
1   : audacious (audio), xfce window manager (with compositor on)
2   : emacs (SCHED_OTHER), desktop/window manager infrastructure (dbus),
      ssh-agent, bind (batch scheduled)
3   : desktop applications (mail, xchat, openoffice)
5   : spamd, batch scheduled compiles/test-suites.
10  : cron jobs

1. Some numbers

My machine is a particularly tough case, I think: a uni-processor Athlon XP
3000+ (involuntary pre-empt) with a software RAID5 on PATA drives. I load it
heavily with compiles/test-suites, and I am very sensitive to audio glitches.

here are some stats for idle:

---load-avg--- --memory-usage- total-cpu-usage interrupts--- ---system--
_1m_ _5m_ 15m_|_used _buff _cach _free|usr sys idl wai hiq siq|__17_ __18_ __20_|_int_ _csw_
 0.2  0.2  0.2| 170M   15M  309M 6560k|  2   1  94   4   0   0|   1 7   150 | 238   208 
 0.2  0.2  0.2| 170M   15M  309M 6568k|  1   0  99   0   0   0|   0 0 0 |  7655 
 0.2  0.2  0.2| 170M   15M  309M 6568k|  0   1  99   0   0   0|   0 0 0 |  7547 
 0.2  0.2  0.2| 170M   15M  309M 6624k|  4   0  96   0   0   0|   0 0 0 |  7537 
 0.2  0.2  0.2| 170M   15M  309M 6624k|  1   0  99   0   0   0|   0 0 0 |  7536 

here are some stats for music playing:

---load-avg--- --memory-usage- total-cpu-usage interrupts--- ---system--
_1m_ _5m_ 15m_|_used _buff _cach _free|usr sys idl wai hiq siq|__17_ __18_ __20_|_int_ _csw_
 0.9  0.4  0.2| 175M   15M  305M 5652k|  2   1  94   4   0   0|   1 7   150 | 238   210 
 0.9  0.4  0.2| 175M   15M  305M 5652k| 10   1  89   0   0   0|   0 3   989 |1068  1510 
 0.9  0.4  0.2| 175M   15M  305M 5592k| 13   0  87   0   0   0|   0 3  1013 |1093  1565 
 0.9  0.4  0.2| 175M   15M  304M 6300k| 11   1  88   0   0   0|   0 3  1000 |1078  1496 
 0.9  0.4  0.2| 175M   15M  305M 6300k| 13   0  87   0   0   0|   0 3  1006 |1084  1509 
 0.8  0.4  0.2| 175M   15M  305M 6180k| 13   1  86   0   0   0|   0 3  1000 |1078  1524 
 0.8  0.4  0.2| 175M   15M  305M 6060k| 12   1  87   0   0   0|   0 3  1000 |1078  1564 

The context switches are high, but so are the interrupts (USB 2.0 Audigy NX)

To see how effective these nice levels were, I decided to play with
rr_interval, on the theory that with priorities strictly enforced and used
aggressively, a longer time-slice would not cause audio delay. So far that
theory is holding. All of these numbers are with rr_interval = 20, and I
have fewer audio problems than with any previous kernel/tuning setup.

That is very impressive.

As far as batch loading goes, I tried a kernel compile. These numbers look
nice for RSDL, but there are some caveats:

kernel compile , CFS v3                     : make  756.83s user 89.37s system 58% cpu 24:08.21 total
kernel compile , v46 rr_interval = default  : make  754.66s user 89.74s system 59% cpu 23:35.38 total
kernel compile , v46 rr_interval = 20       : make  682.83s user 84.34s system 73% cpu 17:29.57 total

1. The system was noisy. I did this intentionally. My typical load is a
   mixture of desktop/compile. All three numbers were generated while
   listening to music, reading docs/web/news, using emacs etc. With each of
   the compiles I tried running a visualization plugin (ProjectM inside
   audacious) for a minute or so.

   This skews the numbers for comparison, but I was looking for an
   impression that was based off a *real* work-load.

   I would like to add as well that before RSDL the mainline scheduler
   failed completely at running ProjectM even when it was the only
   application on the desktop. (It stalled for seconds with a rock steady
   period.)

2. All of these ran nice 5 sched: BATCH

3. I have the xfce compositor turned on, using the transparency.

4. compiled on software RAID 5 (md) -> dev mapper -> lvm2 -> ext3, 4 drives,
   write-cache disabled, external 512 MB flash drive for an external journal,
   commit=15, journal=data

From the caveats above, especially the deep stack for the block layer, plus
meeting audio deadlines while sharing an interrupt with the journal drive
(arghh), this is very impressive system behavior for me.

Here are the stats for doing a kernel compile with audacious running, plus
mail, editor etc.

---load-avg--- --memory-usage- total-cpu-usage interrupts--- ---system--
_1m_ _5m_ 15m_|_used _buff _cach _free|usr sys idl wai hiq siq|__17_ __18_ __20_|_int_ _csw_
 1.31  0.8| 198M   22M  269M   11M|  3   1  92   4   0   0|   1 7   199 | 287   348 
 1.31  0.8| 204M   22M  269M 6072k| 79  12   0   9   0   0|   0 7  1003 |1087 

Re: [PATCH] ib_core: Add missing device link to class device

2007-04-24 Thread Roland Dreier
 > I had a look at the kernel code -- currently, all device drivers except
 > ehca do this by themselves:

 > So I think it makes a lot of sense to put the class_dev.dev assignment
 > into generic ib_core code instead of repeating it in all the drivers.
 > The respective lines could move out of the drivers in the future but
 > won't hurt anyone until then.

Actually I think we should delete the duplicate code now while merging
this.  So I queued this up for 2.6.22:

commit f19c8d7cbe3153d68f0a559afd02f66655310238
Author: Joachim Fenkes <[EMAIL PROTECTED]>
Date:   Mon Apr 23 18:20:27 2007 +0200

IB: Set class_dev->dev in core for nice device symlink

All RDMA drivers except ehca set class_dev->dev to their dma_device
value (ehca leaves this unset).  dma_device is the only value that
makes any sense, so move this assignment to core/sysfs.c.  This reduces
the duplicated code in the rest of the drivers and gives ehca a nice
/sys/class/infiniband/ehcaX/device symlink.

Signed-off-by: Joachim Fenkes <[EMAIL PROTECTED]>
Signed-off-by: Roland Dreier <[EMAIL PROTECTED]>

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 000c086..08c299e 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -683,6 +683,7 @@ int ib_device_register_sysfs(struct ib_device *device)
 
class_dev->class  = &ib_class;
class_dev->class_data = device;
+   class_dev->dev= device->dma_device;
strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE);
 
INIT_LIST_HEAD(&device->port_list);
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index fef9727..607c09b 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -796,7 +796,6 @@ int c2_register_device(struct c2_dev *dev)
memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6);
dev->ibdev.phys_port_cnt = 1;
dev->ibdev.dma_device = &dev->pcidev->dev;
-   dev->ibdev.class_dev.dev = &dev->pcidev->dev;
dev->ibdev.query_device = c2_query_device;
dev->ibdev.query_port = c2_query_port;
dev->ibdev.modify_port = c2_modify_port;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 24e0df0..af28a31 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -1108,7 +1108,6 @@ int iwch_register_device(struct iwch_dev *dev)
memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC));
dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports;
dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
-   dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev);
dev->ibdev.query_device = iwch_query_device;
dev->ibdev.query_port = iwch_query_port;
dev->ibdev.modify_port = iwch_modify_port;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index f5604b8..18c6df2 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -1559,7 +1559,6 @@ int ipath_register_ib_device(struct ipath_devdata *dd)
dev->node_type = RDMA_NODE_IB_CA;
dev->phys_port_cnt = 1;
dev->dma_device = &dd->pcidev->dev;
-   dev->class_dev.dev = dev->dma_device;
dev->query_device = ipath_query_device;
dev->modify_device = ipath_modify_device;
dev->query_port = ipath_query_port;
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 0725ad7..47e6fd4 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1293,7 +1293,6 @@ int mthca_register_device(struct mthca_dev *dev)
dev->ib_dev.node_type= RDMA_NODE_IB_CA;
dev->ib_dev.phys_port_cnt= dev->limits.num_ports;
dev->ib_dev.dma_device   = &dev->pdev->dev;
-   dev->ib_dev.class_dev.dev= &dev->pdev->dev;
dev->ib_dev.query_device = mthca_query_device;
dev->ib_dev.query_port   = mthca_query_port;
dev->ib_dev.modify_device= mthca_modify_device;


2.6.21-rc7-mm1 BUG at kernel/sched-clock.c:175 init_sched_clock()

2007-04-24 Thread Berck E. Nash
Kernel panic on boot, console output attached.

[14316256.221707] Linux version 2.6.21-rc7-mm1 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP Tue Apr 24 12:09:02 MDT 2007
[14316256.221707] Command line: root=/dev/sdc1 ro console=tty0 console=ttyS0,115200n8 BOOT_IMAGE=vmlinuz
[14316256.221707] BIOS-provided physical RAM map:
[14316256.221707]  BIOS-e820:  - 0009fc00 (usable)
[14316256.221707]  BIOS-e820: 0009fc00 - 000a (reserved)
[14316256.221707]  BIOS-e820: 000e4000 - 0010 (reserved)
[14316256.221707]  BIOS-e820: 0010 - 3ff8 (usable)
[14316256.221707]  BIOS-e820: 3ff8 - 3ff8e000 (ACPI data)
[14316256.221707]  BIOS-e820: 3ff8e000 - 3ffe (ACPI NVS)
[14316256.221707]  BIOS-e820: 3ffe - 4000 (reserved)
[14316256.221707]  BIOS-e820: ffb0 - 0001 (reserved)
[14316256.221707] end_pfn_map = 1048576
[14316256.221707] DMI 2.4 present.
[14316256.221707] ACPI: RSDP 000FAF20, 0024 (r2 ACPIAM)
[14316256.221707] ACPI: XSDT 3FF80100, 0064 (r1 NEC  1000724 MSFT   97)
[14316256.221707] ACPI: FACP 3FF80290, 00F4 (r3 A_M_I_ OEMFACP   1000724 MSFT   97)
[14316256.221707] ACPI: DSDT 3FF80590, 9560 (r1  A0543 A05430000 INTL 20060113)
[14316256.221707] ACPI: FACS 3FF8E000, 0040
[14316256.221707] ACPI: APIC 3FF80390, 0080 (r1 A_M_I_ OEMAPIC   1000724 MSFT   97)
[14316256.221707] ACPI: SLIC 3FF80410, 0176 (r1 NEC  1000724 MSFT   97)
[14316256.221707] ACPI: OEMB 3FF8E040, 0066 (r1 A_M_I_ AMI_OEM   1000724 MSFT   97)
[14316256.221707] ACPI: HPET 3FF89AF0, 0038 (r1 A_M_I_ OEMHPET   1000724 MSFT   97)
[14316256.221707] ACPI: MCFG 3FF89B30, 003C (r1 A_M_I_ OEMMCFG   1000724 MSFT   97)
[14316256.221707] ACPI: SSDT 3FF8E0B0, 01C6 (r1AMI   CPU1PM1 INTL 20060113)
[14316256.221707] ACPI: SSDT 3FF8E280, 013A (r1AMI   CPU2PM1 INTL 20060113)
[14316256.221707] Zone PFN ranges:
[14316256.221707]   DMA 0 -> 4096
[14316256.221707]   DMA324096 ->  1048576
[14316256.221707]   Normal1048576 ->  1048576
[14316256.221707] Movable zone start PFN for each node
[14316256.221707] early_node_map[2] active PFN ranges
[14316256.221707] 0:0 ->  159
[14316256.221707] 0:  256 ->   262016
[14316256.221707] ACPI: PM-Timer IO Port: 0x808
[14316256.221707] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[14316256.221707] Processor #0 (Bootup-CPU)
[14316256.221707] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
[14316256.221707] Processor #1
[14316256.221707] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
[14316256.221707] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
[14316256.221707] ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
[14316256.221707] IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
[14316256.221707] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[14316256.221707] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[14316256.221707] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[14316256.221707] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[14316256.221707] Setting APIC routing to flat
[14316256.221707] ACPI: HPET id: 0x8086a201 base: 0xfed0
[14316256.221707] Using ACPI (MADT) for SMP configuration information
[14316256.221707] Allocating PCI resources starting at 5000 (gap: 
4000:bfb0)
[14316256.221707] PERCPU: Allocating 32896 bytes of per cpu data
[14316256.221707] Built 1 zonelists, mobility grouping on.  Total pages: 257236
[14316256.221707] Kernel command line: root=/dev/sdc1 ro console=tty0 
console=ttyS0,115200n8 BOOT_IMAGE=vmlinuz
[14316256.221707] Initializing CPU#0
[14316256.221707] PID hash table entries: 4096 (order: 12, 32768 bytes)
[14316256.221707] time.c: Detected 2564.902 MHz processor.
[14316256.231707] Console: colour VGA+ 80x25
[14316256.235040] Dentry cache hash table entries: 131072 (order: 8, 1048576 
bytes)
[14316256.235040] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
[14316256.235040] Checking aperture...
[14316256.241707] Memory: 1026888k/1048064k available (2355k kernel code, 
20620k reserved, 1490k data, 212k init)
[14316256.325040] Calibrating delay using timer specific routine.. 5134.38 
BogoMIPS (lpj=8554219)
[14316256.325040] Mount-cache hash table entries: 256
[14316256.325040] CPU: L1 I cache: 32K, L1 D cache: 32K
[14316256.325040] CPU: L2 cache: 2048K
[14316256.325040] using mwait in idle threads.
[14316256.325040] CPU: Physical Processor ID: 0
[14316256.325040] CPU: Processor Core ID: 0
[14316256.325040] CPU0: Thermal monitoring enabled (TM2)
[14316256.325040] Freeing SMP alternatives: 26k freed
[14316256.325040] ACPI: Core revision 20070126
[14316256.361707] Using local APIC timer interrupts.
[14316256.361707] result 22900897
[14316256.361707] Detecte

Re: [patch 1/7] libata: check for AN support

2007-04-24 Thread Kristen Carlson Accardi
On Tue, 24 Apr 2007 20:05:52 +0200
Olivier Galibert <[EMAIL PROTECTED]> wrote:

> On Tue, Apr 24, 2007 at 08:49:04AM -0700, Kristen Carlson Accardi wrote:
> > On Tue, 24 Apr 2007 12:23:04 +0200
> > Olivier Galibert <[EMAIL PROTECTED]> wrote:
> > 
> > > Sorry for replying to Alan's reply, I missed the original mail.
> > > 
> > > > > +#define ata_id_has_AN(id)\
> > > > > + ((id[76] && (~id[76])) & ((id)[78] & (1 << 5)))
> > > 
> > > (a && ~a) & (b & 32)
> > > 
> > > I don't think that does what you think it does, because at that point
> > > it's a funny way to write 0 ((0 or 1) binary-and (0 or 32)).
> > > 
> > > I'm not even sure what it is you want.  If for the first part you
> > > wanted (id[76] != 0x00 && id[76] != 0xff), please write just that,
> > > thanks :-)
> > > 
> > >   OG.
> > > 
> > 
> > From the serial ata spec, we have:
> > 
> > 13.2.1.18 Word 78: Serial ATA features supported
> > If Word 76 is not h or h, Word 78 reports the optional features
> > supported by the device.  Support for this word is optional and if not
> > supported the word shall be zero indicating the device has no support
> > for new Serial ATA capabilities.
> > 
> > so, basically yes, I'm really testing to make sure that word 76 isn't 0
> > or all ones, then using that value & with the value of the bit in word 78
> > to determine AN support - if you think this is really obfuscated, I've
> > got no problem changing it - there's obviously many ways to mess around
> > with bits.
> 
> & is not &&, so right now it's really incorrect.  1 & 32 is 0.

ah - ok, gotcha, thanks.

> 
> ((id)[76] != 0x && (id)[76] != 0x && ((id)[78] & (1 << 5)))
> 
> The implicit typing of id looks dangerous to me, but you're not the
> one who has started it.
> 
>   OG.
> 
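
The difference between the two expressions is easy to check outside the kernel. The sketch below is an illustration, not the libata code: the macro and function names with `_first`/`_fixed` suffixes are hypothetical, and the constants `0x0000` and `0xffff` are an assumption about the values that were lost from the quoted text (a word that is neither all zeros nor all ones).

```c
#include <stdint.h>

/*
 * First version from the thread: (a && ~a) evaluates to 0 or 1 and
 * (b & (1 << 5)) evaluates to 0 or 32, so the bitwise AND of the two
 * is always 0, whatever the IDENTIFY data says.
 */
#define ata_id_has_AN_first(id) \
	(((id)[76] && (~(id)[76])) & ((id)[78] & (1 << 5)))

/*
 * Corrected form using logical AND throughout.  0x0000 and 0xffff are
 * assumed values for "word 76 is not valid".
 */
#define ata_id_has_AN_fixed(id) \
	((id)[76] != 0x0000 && (id)[76] != 0xffff && ((id)[78] & (1 << 5)))

/* Thin wrappers so the macros can be called with a typed argument. */
static int an_first(const uint16_t *id)
{
	return ata_id_has_AN_first(id);
}

static int an_fixed(const uint16_t *id)
{
	return ata_id_has_AN_fixed(id);
}
```

The wrappers take a `const uint16_t *`, which also addresses the remark about the implicit typing of `id`: the macro itself accepts anything that can be indexed.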


Re: [PATCH] cfq: get rid of cfqq hash

2007-04-24 Thread Jens Axboe
On Tue, Apr 24 2007, Jens Axboe wrote:
> - if (key != CFQ_KEY_ASYNC)
> + if (!is_sync)
>   cfq_mark_cfqq_idle_window(cfqq);
> + else
> + cfq_mark_cfqq_sync(cfqq);

Woops, should be

if (is_sync) {
cfq_mark_cfqq_idle_window(cfqq);
cfq_mark_cfqq_sync(cfqq);
}

of course.
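
To make the correction concrete, here is a small model of the two variants. It is only a sketch: the flag constants and function names are hypothetical stand-ins for the `cfq_mark_cfqq_*()` helpers, not the scheduler code itself.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for the cfqq state flags. */
enum {
	MODEL_FLAG_IDLE_WINDOW = 1 << 0,
	MODEL_FLAG_SYNC        = 1 << 1,
};

/* As first posted: async queues get the idle window, sync queues do not. */
static unsigned int flags_as_posted(bool is_sync)
{
	unsigned int flags = 0;

	if (!is_sync)
		flags |= MODEL_FLAG_IDLE_WINDOW;
	else
		flags |= MODEL_FLAG_SYNC;
	return flags;
}

/* As corrected: only sync queues are marked, and they get both flags. */
static unsigned int flags_corrected(bool is_sync)
{
	unsigned int flags = 0;

	if (is_sync) {
		flags |= MODEL_FLAG_IDLE_WINDOW;
		flags |= MODEL_FLAG_SYNC;
	}
	return flags;
}
```

The corrected variant matches the removed `key != CFQ_KEY_ASYNC` test, which enabled the idle window for sync queues only.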

> +static struct cfq_io_context *
> +cfq_get_io_context_noalloc(struct cfq_data *cfqd, struct task_struct *tsk)
> +{
> + struct cfq_io_context *cic = NULL;
> + struct io_context *ioc;
> +
> + ioc = tsk->io_context;
> + if (ioc)
> + cic = cfq_cic_rb_lookup(cfqd, ioc);
> +
> + return cic;
> +}

I'll change that to just call cfq_cic_rb_lookup(), returning NULL for
NULL ioc.

-- 
Jens Axboe



Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 11:16:09 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > I said that because the damn thing went away when I was hunting it down
> > because I lost the config and was unable to remember the right combination
> > of debug settings.  Fortunately it later came back so I took care to
> > preserve the config.
> >   
> 
> sched_clock doesn't *do* anything except flap interrupts.

Well, it _is_ mysterious.

Did you try to locate the code which failed?  I got lost in macros and
include files, and gave up very very easily.  Stop hiding, Ingo.

> Oh, wait, have
> you got Andi's bugfixed version of the sched_clock patch?  The first
> version did a local_save_flags rather than a local_irq_save.

I have whatever I pulled from firstfloor over the weekend.  It's in
rc7-mm1.  No, it doesn't use local_save_flags.

> >> Hm, is it caused by using sched_clock() to generate the printk
> >> timestamps while generating the lock test output?
> >> 
> >
> > Conceivably.  What does that locking API test do?
> >   
> 
> Didn't make a difference here.  Building your config now.
> 
> > I was using printk timestamps and netconsole at the time.
> >   
> 
> Ah, great, now you're going to make me setup netconsole...
> 

That's a doddle.

On test system, boot with

netconsole=@/eth0,@/

On workstation:

sudo netcat -u -l -p  | tee -a ~/.log/log-


Re: [PATCH] cfq: get rid of cfqq hash

2007-04-24 Thread Jens Axboe
On Tue, Apr 24 2007, Jens Axboe wrote:
> On Tue, Apr 24 2007, Jens Axboe wrote:
> > -   if (key != CFQ_KEY_ASYNC)
> > +   if (!is_sync)
> > cfq_mark_cfqq_idle_window(cfqq);
> > +   else
> > +   cfq_mark_cfqq_sync(cfqq);
> 
> Woops, should be
> 
> if (is_sync) {
> cfq_mark_cfqq_idle_window(cfqq);
> cfq_mark_cfqq_sync(cfqq);
> }
> 
> of course.
> 
> > +static struct cfq_io_context *
> > +cfq_get_io_context_noalloc(struct cfq_data *cfqd, struct task_struct *tsk)
> > +{
> > +   struct cfq_io_context *cic = NULL;
> > +   struct io_context *ioc;
> > +
> > +   ioc = tsk->io_context;
> > +   if (ioc)
> > +   cic = cfq_cic_rb_lookup(cfqd, ioc);
> > +
> > +   return cic;
> > +}
> 
> I'll change that to just call cfq_cic_rb_lookup(), returning NULL for
> NULL ioc.

Updated patch below.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8093733..b92e6b2 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -9,7 +9,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -38,14 +37,6 @@ static int cfq_slice_idle = HZ / 125;
 
 #define CFQ_SLICE_SCALE(5)
 
-#define CFQ_KEY_ASYNC  (0)
-
-/*
- * for the hash of cfqq inside the cfqd
- */
-#define CFQ_QHASH_SHIFT6
-#define CFQ_QHASH_ENTRIES  (1 << CFQ_QHASH_SHIFT)
-
 #define RQ_CIC(rq) ((struct cfq_io_context*)(rq)->elevator_private)
 #define RQ_CFQQ(rq)((rq)->elevator_private2)
 
@@ -62,8 +53,6 @@ static struct completion *ioc_gone;
 #define ASYNC  (0)
 #define SYNC   (1)
 
-#define cfq_cfqq_sync(cfqq)((cfqq)->key != CFQ_KEY_ASYNC)
-
 #define sample_valid(samples)  ((samples) > 80)
 
 /*
@@ -90,11 +79,6 @@ struct cfq_data {
struct cfq_rb_root service_tree;
unsigned int busy_queues;
 
-   /*
-* cfqq lookup hash
-*/
-   struct hlist_head *cfq_hash;
-
int rq_in_driver;
int sync_flight;
int hw_tag;
@@ -138,10 +122,6 @@ struct cfq_queue {
atomic_t ref;
/* parent cfq_data */
struct cfq_data *cfqd;
-   /* cfqq lookup hash */
-   struct hlist_node cfq_hash;
-   /* hash key */
-   unsigned int key;
/* service_tree member */
struct rb_node rb_node;
/* service_tree key */
@@ -186,6 +166,7 @@ enum cfqq_state_flags {
CFQ_CFQQ_FLAG_prio_changed, /* task priority has changed */
CFQ_CFQQ_FLAG_queue_new,/* queue never been serviced */
CFQ_CFQQ_FLAG_slice_new,/* no requests dispatched in slice */
+   CFQ_CFQQ_FLAG_sync, /* synchronous queue */
 };
 
 #define CFQ_CFQQ_FNS(name) \
@@ -212,11 +193,14 @@ CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
 CFQ_CFQQ_FNS(queue_new);
 CFQ_CFQQ_FNS(slice_new);
+CFQ_CFQQ_FNS(sync);
 #undef CFQ_CFQQ_FNS
 
-static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *, unsigned int, unsigned short);
 static void cfq_dispatch_insert(request_queue_t *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, unsigned int, struct task_struct *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
+  struct task_struct *, gfp_t);
+static struct cfq_io_context *cfq_cic_rb_lookup(struct cfq_data *,
+   struct io_context *);
 
 /*
  * scheduler run of queue, if there are requests pending and no one in the
@@ -235,17 +219,6 @@ static int cfq_queue_empty(request_queue_t *q)
return !cfqd->busy_queues;
 }
 
-static inline pid_t cfq_queue_pid(struct task_struct *task, int rw, int is_sync)
-{
-   /*
-* Use the per-process queue, for read requests and syncronous writes
-*/
-   if (!(rw & REQ_RW) || is_sync)
-   return task->pid;
-
-   return CFQ_KEY_ASYNC;
-}
-
 /*
  * Scale schedule slice based on io priority. Use the sync time slice only
  * if a queue is marked sync and has sync io queued. A sync queue with async
@@ -602,10 +575,14 @@ static struct request *
 cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 {
struct task_struct *tsk = current;
-   pid_t key = cfq_queue_pid(tsk, bio_data_dir(bio), bio_sync(bio));
+   struct cfq_io_context *cic;
struct cfq_queue *cfqq;
 
-   cfqq = cfq_find_cfq_hash(cfqd, key, tsk->ioprio);
+   cic = cfq_cic_rb_lookup(cfqd, tsk->io_context);
+   if (!cic)
+   return NULL;
+
+   cfqq =  cic->cfqq[bio_sync(bio)];
if (cfqq) {
sector_t sector = bio->bi_sector + bio_sectors(bio);
 
@@ -699,9 +676,8 @@ static int cfq_allow_merge(request_queue_t *q, struct request *rq,
   struct bio *bio)
 {
struct cfq_data *cfqd = q->elevator->elevator_data;
-   const int rw = bio

Re: 2.6.21-rc7-mm1: BUG at init_sched_clock()

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 22:23:58 +0400 Alexey Dobriyan <[EMAIL PROTECTED]> wrote:

>   ...
> CPU1: Thermal monitoring enabled (TM2)
> Intel(R) Core(TM)2 CPU  6400  @ 2.13GHz stepping 02
> checking TSC synchronization [CPU#0 -> CPU#1]: passed.
> Brought up 2 CPUs
> migration_cost=
> BUG: at arch/x86_64/kernel/../../i386/kernel/sched-clock.c:175 
> init_sched_clock()
> 
> Call Trace:
>  [] show_trace+0x34/0x4f
>  [] dump_stack+0x12/0x17
>  [] init_sched_clock+0x59/0x8a
>  [] kernel_init+0x167/0x2dc
>  [] child_rip+0xa/0x12
> 
> It didn't happen in 2.6.21-rc6-mm1 nor in mainline.

It seems to expect that init_sched_clock() will be called before SMP
bringup, but init_sched_clock() should _always_ be called _after_ SMP
bringup - it's core_initcall.  Confused.

> And migration cost was 15.
> However, box boots to the end as usual. /proc/cpuinfo shows 2 CPUs.
> 
> [reboots box]
> Now migration cost is 1 and same trace.

That's a consequence of the same thing: we need a working sched_clock() to
calibrate the migration costs, but init_sched_clock() hasn't run yet.


> --
> Should-be-relevant config options:
> 
> CONFIG_X86_64=y
> CONFIG_64BIT=y
> CONFIG_X86=y
> CONFIG_GENERIC_TIME=y
> CONFIG_GENERIC_TIME_VSYSCALL=y
> CONFIG_MCORE2=y
> CONFIG_X86_TSC=y
> CONFIG_X86_GOOD_APIC=y
> CONFIG_X86_HT=y
> CONFIG_X86_IO_APIC=y
> CONFIG_X86_LOCAL_APIC=y
> CONFIG_SMP=y
> CONFIG_SCHED_MC=y
> CONFIG_PREEMPT=y
> CONFIG_PREEMPT_BKL=y
> CONFIG_NR_CPUS=2
> CONFIG_HPET_TIMER=y
> CONFIG_HZ=100
> CONFIG_TRACE_IRQFLAGS_SUPPORT=y
> CONFIG_DETECT_SOFTLOCKUP=y

CONFIG_SCHED_SMT and CONFIG_SCHED_MC might be significant here.

Anyway, I'll cc Andi and run away.
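
The ordering problem described above can be sketched in userspace. This is
a toy model under stated assumptions, not the kernel code: toy_sched_clock(),
toy_init_sched_clock() and toy_calibrate() are hypothetical names, and the
behaviour "returns 0 until initialised" is an assumption made only for
illustration. It shows why a calibration that runs before the clock has been
set up cannot measure anything useful.

```c
#include <stdbool.h>

/* Toy model only: all names here are hypothetical, not kernel symbols. */
static bool clock_ready;          /* set by toy_init_sched_clock() */
static unsigned long long ticks;  /* stands in for a hardware counter */

/* Advance the fake hardware counter, as real work would. */
static void toy_do_work(unsigned long long cost)
{
	ticks += cost;
}

/* Assumed behaviour: reads as 0 until the clock has been initialised. */
static unsigned long long toy_sched_clock(void)
{
	return clock_ready ? ticks : 0;
}

static void toy_init_sched_clock(void)
{
	clock_ready = true;
}

/* Measure how long a fixed piece of work takes, in clock units. */
static unsigned long long toy_calibrate(unsigned long long cost)
{
	unsigned long long start = toy_sched_clock();

	toy_do_work(cost);
	return toy_sched_clock() - start;
}
```

In this model, calling toy_calibrate() before toy_init_sched_clock() always
yields 0, whatever the real cost of the work; calling it afterwards yields
the cost that was actually spent.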



[PATCH] use mutex instead of semaphore in SBPCD driver

2007-04-24 Thread Matthias Kaehlcke
The SBPCD driver uses a semaphore as a mutex. Use the mutex API
instead of the (binary) semaphore.

Signed-off-by: Matthias Kaehlcke <[EMAIL PROTECTED]>

--

diff --git a/drivers/cdrom/sbpcd.c b/drivers/cdrom/sbpcd.c
index a1283b1..5c6a8d3 100644
--- a/drivers/cdrom/sbpcd.c
+++ b/drivers/cdrom/sbpcd.c
@@ -602,7 +602,7 @@ static u_char xa_tail_buf[CD_XA_TAIL];
 static volatile u_char busy_data;
 static volatile u_char busy_audio; /* true semaphores would be safer */
 #endif /* OLD_BUSY */ 
-static DECLARE_MUTEX(ioctl_read_sem);
+static DEFINE_MUTEX(ioctl_read_mtx);
 static u_long timeout;
 static volatile u_char timed_out_delay;
 static volatile u_char timed_out_data;
@@ -834,7 +834,7 @@ static void sbp_sleep(u_int time)
sti();
 }
 /*==*/
-#define RETURN_UP(rc) {up(&ioctl_read_sem); return(rc);}
+#define RETURN_UNLOCK(rc) {mutex_unlock(&ioctl_read_mtx); return(rc);}
 /*==*/
 /*
  *  convert logical_block_address to m-s-f_number (3 bytes only)
@@ -4174,7 +4174,7 @@ static int sbpcd_audio_ioctl(struct cdrom_device_info *cdi, u_int cmd,
msg(DBG_INF, "ioctl: bad device: %s\n", cdi->name);
return (-ENXIO); /* no such drive */
}
-   down(&ioctl_read_sem);
+   mutex_lock(&ioctl_read_mtx);
if (p != current_drive)
switch_drive(p);

@@ -4193,19 +4193,19 @@ static int sbpcd_audio_ioctl(struct cdrom_device_info *cdi, u_int cmd,
case audio_playing:
if (famL_drive) i=cc_ReadSubQ();
else i=cc_Pause_Resume(1);
-   if (i<0) RETURN_UP(-EIO);
+   if (i<0) RETURN_UNLOCK(-EIO);
if (famL_drive) i=cc_Pause_Resume(1);
else i=cc_ReadSubQ();
-   if (i<0) RETURN_UP(-EIO);
+   if (i<0) RETURN_UNLOCK(-EIO);

current_drive->pos_audio_start=current_drive->SubQ_run_tot;
current_drive->audio_state=audio_pausing;
-   RETURN_UP(0);
+   RETURN_UNLOCK(0);
case audio_pausing:
i=cc_Seek(current_drive->pos_audio_start,1);
-   if (i<0) RETURN_UP(-EIO);
-   RETURN_UP(0);
+   if (i<0) RETURN_UNLOCK(-EIO);
+   RETURN_UNLOCK(0);
default:
-   RETURN_UP(-EINVAL);
+   RETURN_UNLOCK(-EINVAL);
}
 
case CDROMRESUME: /* resume paused audio play */
@@ -4213,26 +4213,26 @@ static int sbpcd_audio_ioctl(struct cdrom_device_info *cdi, u_int cmd,
/* resume playing audio tracks when a previous PLAY AUDIO call has  */
/* been paused with a PAUSE command. */
/* It will resume playing from the location saved in SubQ_run_tot.  */
-   if (current_drive->audio_state!=audio_pausing) RETURN_UP(-EINVAL);
+   if (current_drive->audio_state!=audio_pausing) RETURN_UNLOCK(-EINVAL);
if (famL_drive)
i=cc_PlayAudio(current_drive->pos_audio_start,
   current_drive->pos_audio_end);
else i=cc_Pause_Resume(3);
-   if (i<0) RETURN_UP(-EIO);
+   if (i<0) RETURN_UNLOCK(-EIO);
current_drive->audio_state=audio_playing;
-   RETURN_UP(0);
+   RETURN_UNLOCK(0);
 
case CDROMPLAYMSF:
msg(DBG_IOC,"ioctl: CDROMPLAYMSF entered.\n");
 #ifdef SAFE_MIXED
-   if (current_drive->has_data>1) RETURN_UP(-EBUSY);
+   if (current_drive->has_data>1) RETURN_UNLOCK(-EBUSY);
 #endif /* SAFE_MIXED */
if (current_drive->audio_state==audio_playing)
{
i=cc_Pause_Resume(1);
-   if (i<0) RETURN_UP(-EIO);
+   if (i<0) RETURN_UNLOCK(-EIO);
i=cc_ReadSubQ();
-   if (i<0) RETURN_UP(-EIO);
+   if (i<0) RETURN_UNLOCK(-EIO);

current_drive->pos_audio_start=current_drive->SubQ_run_tot;
i=cc_Seek(current_drive->pos_audio_start,1);
}
@@ -4252,30 +4252,30 @@ static int sbpcd_audio_ioctl(struct cdrom_device_info *cdi, u_int cmd,
msg(DBG_INF,"ioctl: cc_PlayAudio returns %d\n",i);
DriveReset();
current_drive->audio_state=0;
-   RETURN_UP(-EIO);
+   RETURN_UNLOCK(-EIO);
}
current_drive->audio_state=audio_playing;
-   RETURN_UP(0);
+   
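
The unlock-and-return pattern used throughout the patch can be sketched in
userspace with a pthread mutex standing in for the kernel mutex. Everything
named toy_* or TOY_* below is hypothetical and only mirrors the shape of
RETURN_UNLOCK(); it is not the driver's code.

```c
#include <errno.h>
#include <pthread.h>

/* Userspace stand-in for the driver's ioctl lock (hypothetical name). */
static pthread_mutex_t toy_ioctl_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Same shape as the patch's RETURN_UNLOCK: drop the lock, then return. */
#define TOY_RETURN_UNLOCK(rc) { pthread_mutex_unlock(&toy_ioctl_mtx); return (rc); }

/* Hypothetical handler: every exit path must release the mutex. */
static int toy_audio_ioctl(int state, int hw_result)
{
	pthread_mutex_lock(&toy_ioctl_mtx);

	if (state != 1)
		TOY_RETURN_UNLOCK(-EINVAL);
	if (hw_result < 0)
		TOY_RETURN_UNLOCK(-EIO);
	TOY_RETURN_UNLOCK(0);
}
```

A brace-only macro like this is fine after an if without an else, which is
how it is used here. The more common form wraps the body in
do { ... } while (0) so that it also behaves as a single statement in front
of an else.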

Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Hugh Dickins
On Tue, 24 Apr 2007, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Hugh Dickins wrote:
> 
> > I've not yet looked at the patch under discussion, but this remark
> > prompts me...  a couple of days ago I got very worried by the various
> > hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c,
> > and wondered how those would work out if someone has a blockdev mmap'ed.
> 
> I hope you are not confused by the fact that memory policies are only
> ever applied to one zone on a node. This is either HIGHMEM or NORMAL. 
> There is no memory policy support for other than the highest zone.

I was certainly ignorant of that; but I'm not convinced it eliminates
the potential issue.  For a start, sys_move_pages seems not to involve
mempolicies at all - I don't see what prevents it migrating blockdev
pages away from the only node which has NORMAL memory.

> Metadata is not movable nor subject to memory policies.
> It will never be mapped into a process space.

Not as metadata, no.  But someone (let's hope only root, though I may
be wrong on that) can map any part of the block device into userspace.

> > yup.  wherever we dereference buffer_head.b_data we're touching
> > page_address(buffer_head.b_page) without kmapping.
> 
> Yes but before we get there we will bounce pagecache pages into an area 
> where we do not need kmap.

Again, I'm not convinced: bouncing gets done for the I/O,
but where is it done to meet the filesystem's expectations?

On the other hand, as I said, I've seen no problem myself in practice.
However, if there is no problem, why do block devices demand GFP_USER?

Hugh


Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work

2007-04-24 Thread Oleg Nesterov
On 04/24, Jarek Poplawski wrote:
>
> This looks fine. Of course, it requires to remove some debugging
> currently done with _PENDING flag

For example?

>and it's hard to estimate this
> all before you do more, but it should be more foreseeable than
> current way. But the races with _PENDING could be really "funny"
> without locking it everywhere.

Please see the patch below. Do you see any problems? I'll send it
when I have time to re-read the code and write some tests. I still
hope we can find a way to avoid the change in run_workqueue()...

Note that cancel_rearming_delayed_work() can now handle works which
re-arm themselves via queue_work(), not only queue_delayed_work().

Note also that we can change cancel_work_sync() so that it can deal with
self-rearming work_structs.

> BTW - are a few locks more a real
> problem, while serving a "sleeping" path? And I don't think there
> is any reason to hurry... 

Sorry, could you clarify what you mean?

> > > Yes, but currently you cannot behave like this e.g. with
> > > "rearming" work.
> > 
> > Why?
> 
> OK, it's not impossible, but needs some bothering: if I simply
> set some flag and my work function exits before rearming -
> cancel_rearming_delayed_work can loop.

Yes sure. I meant "after we fix the problems you pointed out".

Oleg.

--- OLD/kernel/workqueue.c~1_CRDW   2007-04-13 17:43:23.0 +0400
+++ OLD/kernel/workqueue.c  2007-04-24 22:41:15.0 +0400
@@ -242,11 +242,11 @@ static void run_workqueue(struct cpu_wor
work_func_t f = work->func;
 
cwq->current_work = work;
-   list_del_init(cwq->worklist.next);
+   list_del_init(&work->entry);
+   work_clear_pending(work);
spin_unlock_irq(&cwq->lock);
 
BUG_ON(get_wq_data(work) != cwq);
-   work_clear_pending(work);
f(work);
 
if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
@@ -398,6 +398,16 @@ static void wait_on_work(struct cpu_work
wait_for_completion(&barr.done);
 }
 
+static void needs_a_good_name(struct workqueue_struct *wq,
+   struct work_struct *work)
+{
+   const cpumask_t *cpu_map = wq_cpu_map(wq);
+   int cpu;
+
+   for_each_cpu_mask(cpu, *cpu_map)
+   wait_on_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+}
+
 /**
  * cancel_work_sync - block until a work_struct's callback has terminated
  * @work: the work which is to be flushed
@@ -414,9 +424,6 @@ static void wait_on_work(struct cpu_work
 void cancel_work_sync(struct work_struct *work)
 {
struct cpu_workqueue_struct *cwq;
-   struct workqueue_struct *wq;
-   const cpumask_t *cpu_map;
-   int cpu;
 
might_sleep();
 
@@ -434,15 +441,10 @@ void cancel_work_sync(struct work_struct
work_clear_pending(work);
spin_unlock_irq(&cwq->lock);
 
-   wq = cwq->wq;
-   cpu_map = wq_cpu_map(wq);
-
-   for_each_cpu_mask(cpu, *cpu_map)
-   wait_on_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+   needs_a_good_name(cwq->wq, work);
 }
 EXPORT_SYMBOL_GPL(cancel_work_sync);
 
-
 static struct workqueue_struct *keventd_wq;
 
 /**
@@ -532,22 +534,34 @@ EXPORT_SYMBOL(flush_scheduled_work);
 /**
  * cancel_rearming_delayed_work - kill off a delayed work whose handler rearms the delayed work.
  * @dwork: the delayed work struct
- *
- * Note that the work callback function may still be running on return from
- * cancel_delayed_work(). Run flush_workqueue() or cancel_work_sync() to wait
- * on it.
  */
 void cancel_rearming_delayed_work(struct delayed_work *dwork)
 {
-   struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
-
-   /* Was it ever queued ? */
-   if (cwq != NULL) {
-   struct workqueue_struct *wq = cwq->wq;
-
-   while (!cancel_delayed_work(dwork))
-   flush_workqueue(wq);
-   }
+   struct work_struct *work = &dwork->work;
+   struct cpu_workqueue_struct *cwq = get_wq_data(work);
+   int retry;
+
+   if (!cwq)
+   return;
+
+   do {
+   retry = 1;
+   spin_lock_irq(&cwq->lock);
+   /* CPU_DEAD in progress may change cwq */
+   if (likely(cwq == get_wq_data(work))) {
+   list_del_init(&work->entry);
+   __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
+   retry = try_to_del_timer_sync(&dwork->timer) < 0;
+   }
+   spin_unlock_irq(&cwq->lock);
+   } while (unlikely(retry));
+
+   /*
+* Nobody can clear WORK_STRUCT_PENDING. This means that the
+* work can't be re-queued and the timer can't be re-started.
+*/
+   needs_a_good_name(cwq->wq, work);
+   work_clear_pending(work);
 }
 EXPORT_SYMBOL(cancel_rearming_delayed_work)
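
The role the pending bit plays in the patch above can be shown with a
single-threaded toy model. This is only a sketch under simplifying
assumptions: there is no locking, no timer and no per-CPU queue, and every
toy_* name is hypothetical. The point is that while the canceller holds the
pending bit, a self-rearming handler cannot queue the work again.

```c
#include <stdbool.h>

/* Single-threaded toy model; names are hypothetical and there is no
 * locking, so this only shows the role of the pending bit. */
struct toy_work {
	bool pending;   /* plays the part of WORK_STRUCT_PENDING */
	bool queued;    /* true while the work sits on the toy worklist */
};

/* Queue only if the pending bit was clear (test-and-set semantics). */
static bool toy_queue_work(struct toy_work *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->queued = true;
	return true;
}

/* Cancel step: pull the work off the list but leave pending set, so a
 * self-rearming handler cannot queue it again in the meantime. */
static void toy_cancel_begin(struct toy_work *w)
{
	w->queued = false;
	w->pending = true;
}

/* Once any running handler has finished, release the pending bit. */
static void toy_cancel_end(struct toy_work *w)
{
	w->pending = false;
}
```

Between toy_cancel_begin() and toy_cancel_end(), every toy_queue_work() call
fails, which corresponds to the "work can't be re-queued" comment in the
patch.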

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


Amit Gud writes:

Hello,

>
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about chunkfs repair process.

First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems, that during "final phase" fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires scan of all chunks (rather than of only "active during
crash"), which seems to return us back to the scalability problem
chunkfs tries to address.


not quite.

this checking is an O(n^2) or worse problem, and it can eat a lot of memory
in the process. with chunkfs you divide the problem by a large constant (100
or more) for the checks of individual chunks. after those are done, the final
pass checking the cross-chunk links doesn't have to keep track of everything;
it only needs to check those links and what they point to.

any ability to mark a filesystem as 'clean' and then not have to check it on
reboot is a bonus on top of this.


David Lang


Second, it is not clear how, under assumption of bugs in the file system
code (which paper makes at the very beginning), fsck can limit itself
only to the chunks that were active at the moment of crash.

[...]

>
> Best,
> AG

Nikita.
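
David Lang's estimate above can be made concrete with a little arithmetic.
The sketch below assumes the check cost is exactly quadratic and that the
objects are split evenly across chunks; both are simplifications, the
function names are hypothetical, and the cross-chunk link pass is left out.
Under those assumptions, k chunks of n/k objects each cost k * (n/k)^2 =
n^2 / k in total.

```c
#include <stdint.h>

/* Assumed cost model: checking n objects costs n * n units. */
static uint64_t toy_check_cost(uint64_t n)
{
	return n * n;
}

/* Total cost when n objects are split evenly into k chunks that are
 * checked independently (cross-chunk link checking is not counted).
 * k must be non-zero. */
static uint64_t toy_chunked_cost(uint64_t n, uint64_t k)
{
	return k * toy_check_cost(n / k);
}
```

With a million objects and 100 chunks, the per-chunk checks together cost
one hundredth of the monolithic check in this model.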


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Hugh Dickins
On Tue, 24 Apr 2007, Andrew Morton wrote:
> 
> From my reading it would be pretty simple to teach unmap_and_move()
> to pass mapping_gfp_mask(page_mapping(page)) down into
> (*get_new_page)() to get the correct type of page.

Or even simpler, since they're already passed the source page,
just get it from that.  However, I'm much less able to test this
patch in a hurry: it looks plausible, builds and runs okay in my
non-NUMA testing; but whether it does what I intend it to do,
without causing unexpected side-effects, I don't know.

I did check that the previous little vma_migratable() patch did the
right thing, and I think it is more suitable for last-minute and -stable.
This one would need real testing by real migrants with real NUMA.

Or Christoph may prevail in persuading there's no such problem.


Is there a problem with page migration to HIGHMEM, if pages were
mapped from a GFP_USER block device?  I failed to demonstrate any
problem, but here's a quick fix if needed.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---

 include/linux/migrate.h |    1 +
 mm/mempolicy.c          |    4 ++--
 mm/migrate.c            |   12 +++++++++---
 mm/swap_state.c         |    1 +
 4 files changed, 13 insertions(+), 5 deletions(-)

--- 2.6.21-rc7/include/linux/migrate.h  2007-03-07 14:17:59.0 +
+++ linux/include/linux/migrate.h   2007-04-24 19:19:01.0 +0100
@@ -14,6 +14,7 @@ static inline int vma_migratable(struct 
 }
 
 #ifdef CONFIG_MIGRATION
+extern gfp_t page_gfp_mask(struct page *page);
 extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
--- 2.6.21-rc7/mm/mempolicy.c   2007-03-07 14:18:01.0 +
+++ linux/mm/mempolicy.c2007-04-24 19:19:01.0 +0100
@@ -594,7 +594,7 @@ static void migrate_page_add(struct page
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)
 {
-   return alloc_pages_node(node, GFP_HIGHUSER, 0);
+   return alloc_pages_node(node, page_gfp_mask(page), 0);
 }
 
 /*
@@ -710,7 +710,7 @@ static struct page *new_vma_page(struct 
 {
struct vm_area_struct *vma = (struct vm_area_struct *)private;
 
-   return alloc_page_vma(GFP_HIGHUSER, vma, page_address_in_vma(page, vma));
+   return alloc_page_vma(page_gfp_mask(page), vma, page_address_in_vma(page, vma));
 }
 #else
 
--- 2.6.21-rc7/mm/migrate.c 2007-03-07 14:18:01.0 +
+++ linux/mm/migrate.c  2007-04-24 19:19:01.0 +0100
@@ -735,12 +735,18 @@ struct page_to_node {
int status;
 };
 
-static struct page *new_page_node(struct page *p, unsigned long private,
+gfp_t page_gfp_mask(struct page *page)
+{
+   struct address_space *mapping = page_mapping(page);
+   return mapping? mapping_gfp_mask(mapping): GFP_HIGHUSER;
+}
+
+static struct page *new_page_node(struct page *page, unsigned long private,
int **result)
 {
struct page_to_node *pm = (struct page_to_node *)private;
 
-   while (pm->node != MAX_NUMNODES && pm->page != p)
+   while (pm->node != MAX_NUMNODES && pm->page != page)
pm++;
 
if (pm->node == MAX_NUMNODES)
@@ -748,7 +754,7 @@ static struct page *new_page_node(struct
 
*result = &pm->status;
 
-   return alloc_pages_node(pm->node, GFP_HIGHUSER | GFP_THISNODE, 0);
+   return alloc_pages_node(pm->node, page_gfp_mask(page) | GFP_THISNODE, 0);
 }
 
 /*
--- 2.6.21-rc7/mm/swap_state.c  2006-09-20 04:42:06.0 +0100
+++ linux/mm/swap_state.c   2007-04-24 19:19:01.0 +0100
@@ -40,6 +40,7 @@ struct address_space swapper_space = {
.page_tree  = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
.tree_lock  = __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
.a_ops  = &swap_aops,
+   .flags  = (__force unsigned long) GFP_HIGHUSER,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
 };
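
The idea behind page_gfp_mask() in the patch above can be shown with a small
userspace model: take the allocation mask from the source page's mapping, and
fall back to the old hard-wired default when the page has no mapping. The
types and flag values below are hypothetical stand-ins, not the kernel's
definitions.

```c
#include <stddef.h>

/* Toy flag values and types: hypothetical, not the kernel's definitions. */
#define TOY_GFP_USER     0x1u
#define TOY_GFP_HIGHUSER 0x3u

struct toy_mapping {
	unsigned int gfp_mask;   /* allocation constraints for this mapping */
};

struct toy_page {
	struct toy_mapping *mapping;  /* NULL for pages with no mapping */
};

/* Mirror the patch's idea: take the mask from the source page's mapping,
 * falling back to the old hard-wired default when there is none. */
static unsigned int toy_page_gfp_mask(const struct toy_page *page)
{
	return page->mapping ? page->mapping->gfp_mask : TOY_GFP_HIGHUSER;
}
```

In this model a page that belongs to a mapping restricted to TOY_GFP_USER
gets a replacement allocated with the same restriction, while a page with no
mapping keeps the previous behaviour.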


[PATCH] use mutex instead of semaphore in TPM driver

2007-04-24 Thread Matthias Kaehlcke
The TPM driver uses two semaphores as mutexes. Use the mutex API
instead of the (binary) semaphores.

Signed-off-by: Matthias Kaehlcke <[EMAIL PROTECTED]> 

--

diff --git a/drivers/char/tpm/tpm.c b/drivers/char/tpm/tpm.c
index e5a254a..0805d39 100644
--- a/drivers/char/tpm/tpm.c
+++ b/drivers/char/tpm/tpm.c
@@ -328,10 +328,10 @@ static void timeout_work(struct work_struct *work)
 {
struct tpm_chip *chip = container_of(work, struct tpm_chip, work);
 
-   down(&chip->buffer_mutex);
+   mutex_lock(&chip->buffer_mutex);
atomic_set(&chip->data_pending, 0);
memset(chip->data_buffer, 0, TPM_BUFSIZE);
-   up(&chip->buffer_mutex);
+   mutex_unlock(&chip->buffer_mutex);
 }
 
 /*
@@ -380,7 +380,7 @@ static ssize_t tpm_transmit(struct tpm_chip *chip, const char *buf,
return -E2BIG;
}
 
-   down(&chip->tpm_mutex);
+   mutex_lock(&chip->tpm_mutex);
 
if ((rc = chip->vendor.send(chip, (u8 *) buf, count)) < 0) {
dev_err(chip->dev,
@@ -419,7 +419,7 @@ out_recv:
dev_err(chip->dev,
"tpm_transmit: tpm_recv: error %zd\n", rc);
 out:
-   up(&chip->tpm_mutex);
+   mutex_unlock(&chip->tpm_mutex);
return rc;
 }
 
@@ -966,14 +966,14 @@ ssize_t tpm_write(struct file *file, const char __user *buf,
while (atomic_read(&chip->data_pending) != 0)
msleep(TPM_TIMEOUT);
 
-   down(&chip->buffer_mutex);
+   mutex_lock(&chip->buffer_mutex);
 
if (in_size > TPM_BUFSIZE)
in_size = TPM_BUFSIZE;
 
if (copy_from_user
(chip->data_buffer, (void __user *) buf, in_size)) {
-   up(&chip->buffer_mutex);
+   mutex_unlock(&chip->buffer_mutex);
return -EFAULT;
}
 
@@ -981,7 +981,7 @@ ssize_t tpm_write(struct file *file, const char __user *buf,
out_size = tpm_transmit(chip, chip->data_buffer, TPM_BUFSIZE);
 
atomic_set(&chip->data_pending, out_size);
-   up(&chip->buffer_mutex);
+   mutex_unlock(&chip->buffer_mutex);
 
/* Set a timeout by which the reader must come claim the result */
mod_timer(&chip->user_read_timer, jiffies + (60 * HZ));
@@ -1004,10 +1004,10 @@ ssize_t tpm_read(struct file *file, char __user *buf,
if (size < ret_size)
ret_size = size;
 
-   down(&chip->buffer_mutex);
+   mutex_lock(&chip->buffer_mutex);
if (copy_to_user(buf, chip->data_buffer, ret_size))
ret_size = -EFAULT;
-   up(&chip->buffer_mutex);
+   mutex_unlock(&chip->buffer_mutex);
}
 
return ret_size;
@@ -1100,8 +1100,8 @@ struct tpm_chip *tpm_register_hardware(struct device *dev, const struct tpm_vend
if (chip == NULL)
return NULL;
 
-   init_MUTEX(&chip->buffer_mutex);
-   init_MUTEX(&chip->tpm_mutex);
+   mutex_init(&chip->buffer_mutex);
+   mutex_init(&chip->tpm_mutex);
INIT_LIST_HEAD(&chip->list);
 
INIT_WORK(&chip->work, timeout_work);
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index bb9a43c..bf58aac 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -95,11 +95,11 @@ struct tpm_chip {
/* Data passed to and from the tpm via the read/write calls */
u8 *data_buffer;
atomic_t data_pending;
-   struct semaphore buffer_mutex;
+   struct mutex buffer_mutex;
 
struct timer_list user_read_timer;  /* user needs to claim result */
struct work_struct work;
-   struct semaphore tpm_mutex; /* tpm is processing */
+   struct mutex tpm_mutex; /* tpm is processing */
 
struct tpm_vendor_specific vendor;
 
-- 
Matthias Kaehlcke
Linux Application Developer
Barcelona

  Dreams and reality are opposites. Action synthesizes them
 (Assata Shakur)
 .''`.
using free software / Debian GNU/Linux | http://debian.org  : :'  :
`. `'`
gpg --keyserver pgp.mit.edu --recv-keys 47D8E5D4  `-


Re: 2.6.20.7 locking up hard on boot

2007-04-24 Thread Marcos Pinto

I can confirm that reverting commit
7639e962234c76031d1ddf436def7fd9602be560 fixes the problem.  Also,
there seem to be plenty of other people reporting the same boot
locking:

http://groups.google.com/group/fa.linux.kernel/msg/cc0453677be44a9e
http://bbs.archlinux.org/viewtopic.php?pid=245313
http://bugs.archlinux.org/task/6845
http://www.usenetlinux.com/archive/topic.php/t-757515.html

Please consider reverting this patch in upstream.
Thank you for your time,
Marcos

On 4/23/07, Marcos Pinto <[EMAIL PROTECTED]> wrote:

On 4/23/07, Jan Beulich <[EMAIL PROTECTED]> wrote:
> Given that all of the reports are in cases when the adjustment is *not*
> being done (and only a message is being printed), I can only assume that
> the breakage results from the adding of PCI_BASE_ADDRESS_SPACE_IO
> into the resource flags. I considered this unconditional setting of the flags
> odd already in the original code, and added this extra flag only for
> consistency reasons (because the settings reported by X indicated that
> this was missing). Perhaps the adjustment (original and the added
> extra flag) shouldn't be done if IORESOURCE_IO wasn't already set.
> Perhaps one of those seeing the issue could try out returning from the
> function right after that printk(), without any adjustment to the flags.
>
> Jan
>
>




RE: Re[2]: sendfile to nonblocking socket

2007-04-24 Thread David Schwartz

> DS> Threads plus epoll is another.

> 20k threads and maybe more is too much :). Look at http://nginx.net/
> section "Architecture and scalability" for example.
> DS> It really depends upon how much performance you need
> all, that hardware can take and hold :)

Why would you want 20k threads? You aren't seriously suggesting that you
need to have 20,000 outstanding disk operations, are you? Surely you don't
think that would be efficient. If the disk is the limiting factor, it may
get slightly faster as you pend more concurrent requests, but surely 20,000
is not the best number! (256 is probably closer to the optimal value, and it
may be less.)

Your application has to manage the outstanding disk read requests. I don't
know of any way to foist this task on the kernel. Perhaps a pool of disk
read threads?

I would keep a flag for each connection to track whether the last write got
a 'would block' or was incomplete. So long as this flag is clear, let the
disk read thread attempt the socket 'write'. If the disk read thread gets a
partial write (or a would block indication), set the flag on the socket and
let the socket I/O threads take over the connection (based on 'epoll'
notification). When a write completes and you need more disk data, clear the
flag and let the disk read threads take over the connection until a write
blocks again. (The disk read threads can use 'sendfile' or 'splice' so long
as they don't block on the socket.)

Perhaps the disk read threads should be using 'mmap' with MAP_POPULATE.

There are certainly many possible approaches.

DS
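
The per-connection flag described above might look roughly like this. It is
a sketch of the bookkeeping only, with hypothetical names throughout; the
actual socket writes, the epoll loop and the thread pools are not shown, and
a would-block result is represented here by a written count of -1.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection state for the hand-off described above. */
struct toy_conn {
	bool write_blocked;  /* last write was partial or would have blocked */
};

/* Who may touch the socket right now. */
enum toy_owner { TOY_DISK_THREAD, TOY_SOCKET_THREAD };

static enum toy_owner toy_owner_of(const struct toy_conn *c)
{
	return c->write_blocked ? TOY_SOCKET_THREAD : TOY_DISK_THREAD;
}

/* Record the outcome of a non-blocking write of `wanted` bytes.
 * `written` is the byte count, or -1 for a would-block result. */
static void toy_note_write(struct toy_conn *c, size_t wanted, long written)
{
	c->write_blocked = (written < 0) || ((size_t)written < wanted);
}
```

A complete write clears the flag and hands the connection back to the disk
read threads; a partial or would-block write sets it and leaves the
connection to the socket I/O threads until the remainder has gone out.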




Re: [PATCH] use mutex instead of semaphore in SBPCD driver

2007-04-24 Thread Eberhard Moenkeberg
Hi,

OK for me.

Viele Grüße
Eberhard Mönkeberg ([EMAIL PROTECTED], [EMAIL PROTECTED])
 

On Tue, 24 Apr 2007, Matthias Kaehlcke wrote:

> the SBPCD driver uses a semaphore as mutex. use the mutex API
> instead of the (binary) semaphore
> 
> Signed-off-by: Matthias Kaehlcke <[EMAIL PROTECTED]>
> 
> --
> 
> diff --git a/drivers/cdrom/sbpcd.c b/drivers/cdrom/sbpcd.c
> index a1283b1..5c6a8d3 100644
> --- a/drivers/cdrom/sbpcd.c
> +++ b/drivers/cdrom/sbpcd.c
> @@ -602,7 +602,7 @@ static u_char xa_tail_buf[CD_XA_TAIL];
>  static volatile u_char busy_data;
>  static volatile u_char busy_audio; /* true semaphores would be safer */
>  #endif /* OLD_BUSY */ 
> -static DECLARE_MUTEX(ioctl_read_sem);
> +static DEFINE_MUTEX(ioctl_read_mtx);
>  static u_long timeout;
>  static volatile u_char timed_out_delay;
>  static volatile u_char timed_out_data;
> @@ -834,7 +834,7 @@ static void sbp_sleep(u_int time)
>   sti();
>  }
>  
> /*==*/
> -#define RETURN_UP(rc) {up(&ioctl_read_sem); return(rc);}
> +#define RETURN_UNLOCK(rc) {mutex_unlock(&ioctl_read_mtx); return(rc);}
>  
> /*==*/
>  /*
>   *  convert logical_block_address to m-s-f_number (3 bytes only)
> @@ -4174,7 +4174,7 @@ static int sbpcd_audio_ioctl(struct cdrom_device_info 
> *cdi, u_int cmd,
>   msg(DBG_INF, "ioctl: bad device: %s\n", cdi->name);
>   return (-ENXIO); /* no such drive */
>   }
> - down(&ioctl_read_sem);
> + mutex_lock(&ioctl_read_mtx);
>   if (p != current_drive)
>   switch_drive(p);
>   
> @@ -4193,19 +4193,19 @@ static int sbpcd_audio_ioctl(struct cdrom_device_info 
> *cdi, u_int cmd,
>   case audio_playing:
>   if (famL_drive) i=cc_ReadSubQ();
>   else i=cc_Pause_Resume(1);
> - if (i<0) RETURN_UP(-EIO);
> + if (i<0) RETURN_UNLOCK(-EIO);
>   if (famL_drive) i=cc_Pause_Resume(1);
>   else i=cc_ReadSubQ();
> - if (i<0) RETURN_UP(-EIO);
> + if (i<0) RETURN_UNLOCK(-EIO);
>   
> current_drive->pos_audio_start=current_drive->SubQ_run_tot;
>   current_drive->audio_state=audio_pausing;
> - RETURN_UP(0);
> + RETURN_UNLOCK(0);
>   case audio_pausing:
>   i=cc_Seek(current_drive->pos_audio_start,1);
> - if (i<0) RETURN_UP(-EIO);
> - RETURN_UP(0);
> + if (i<0) RETURN_UNLOCK(-EIO);
> + RETURN_UNLOCK(0);
>   default:
> - RETURN_UP(-EINVAL);
> + RETURN_UNLOCK(-EINVAL);
>   }
>  
>   case CDROMRESUME: /* resume paused audio play */
> @@ -4213,26 +4213,26 @@ static int sbpcd_audio_ioctl(struct cdrom_device_info 
> *cdi, u_int cmd,
>   /* resume playing audio tracks when a previous PLAY AUDIO call 
> has  */
>   /* been paused with a PAUSE command.
> */
>   /* It will resume playing from the location saved in 
> SubQ_run_tot.  */
> - if (current_drive->audio_state!=audio_pausing) 
> RETURN_UP(-EINVAL);
> + if (current_drive->audio_state!=audio_pausing) 
> RETURN_UNLOCK(-EINVAL);
>   if (famL_drive)
>   i=cc_PlayAudio(current_drive->pos_audio_start,
>  current_drive->pos_audio_end);
>   else i=cc_Pause_Resume(3);
> - if (i<0) RETURN_UP(-EIO);
> + if (i<0) RETURN_UNLOCK(-EIO);
>   current_drive->audio_state=audio_playing;
> - RETURN_UP(0);
> + RETURN_UNLOCK(0);
>  
>   case CDROMPLAYMSF:
>   msg(DBG_IOC,"ioctl: CDROMPLAYMSF entered.\n");
>  #ifdef SAFE_MIXED
> - if (current_drive->has_data>1) RETURN_UP(-EBUSY);
> + if (current_drive->has_data>1) RETURN_UNLOCK(-EBUSY);
>  #endif /* SAFE_MIXED */
>   if (current_drive->audio_state==audio_playing)
>   {
>   i=cc_Pause_Resume(1);
> - if (i<0) RETURN_UP(-EIO);
> + if (i<0) RETURN_UNLOCK(-EIO);
>   i=cc_ReadSubQ();
> - if (i<0) RETURN_UP(-EIO);
> + if (i<0) RETURN_UNLOCK(-EIO);
>   
> current_drive->pos_audio_start=current_drive->SubQ_run_tot;
>   i=cc_Seek(current_drive->pos_audio_start,1);
>   }
> @@ -4252,30 +4252,30 @@ static int sbpcd_audio_ioctl(struct cdrom_device_info 
> *cdi, u_int cmd,
>   msg(DBG_INF,"ioctl: cc_PlayAudio returns %d\n",i);
>   DriveReset();
>

Re: [PATCH 6/7] ds2760 W1 slave

2007-04-24 Thread Pavel Machek
Hi!

> +#if 0
> +/* Code below works, but unused currently, thus produces "defined but not
> + * used" warning. It stays here for reference and future needs, feel free
> + * to drop "#if 0" when you're about to use it. */

We normally don't do this.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [Intel IOMMU][patch 3/8] Generic hardware support for Intel IOMMU.

2007-04-24 Thread Andi Kleen
On Tuesday 24 April 2007 08:03:02 Ashok Raj wrote:
>
> +#ifdef CONFIG_DMAR
> +#ifdef CONFIG_SMP
> +static void dmar_msi_set_affinity(unsigned int irq, cpumask_t mask)


Why does it need its own interrupt type?

> +
> +config IOVA_GUARD_PAGE
> + bool "Enables gaurd page when allocating IO Virtual Address for IOMMU"
> + depends on DMAR
> +
> +config IOVA_NEXT_CONTIG
> + bool "Keeps IOVA allocations consequent between allocations"
> + depends on DMAR && EXPERIMENTAL

Needs a reference to Intel and a better description.

The file should have a high-level description of what it is good for, etc.

It also needs a high-level overview of which locks protect what, and
whether there is a locking order.

It doesn't seem to enable sg merging? Since you have enough space,
that should work.

> +static char *fault_reason_strings[] =
> +{
> + "Software",
> + "Present bit in root entry is clear",
> + "Present bit in context entry is clear",
> + "Invalid context entry",
> + "Access beyond MGAW",
> + "PTE Write access is not set",
> + "PTE Read access is not set",
> + "Next page table ptr is invalid",
> + "Root table address invalid",
> + "Context table ptr is invalid",
> + "non-zero reserved fields in RTP",
> + "non-zero reserved fields in CTP",
> + "non-zero reserved fields in PTE",
> + "Unknown"
> +};
> +
> +#define MAX_FAULT_REASON_IDX (12)


You've got 14 of them. Better to use ARRAY_SIZE.
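
A minimal userspace sketch of the ARRAY_SIZE suggestion. The macro is
defined locally as a stand-in for the kernel's, and fault_reason_name()
is a hypothetical helper, not something from the patch. The point is
that the bound comes from the table itself, so adding a string cannot
leave a hand-maintained index constant stale:

```c
#include <stddef.h>

/* Local stand-in for the kernel's ARRAY_SIZE(), so this builds in userspace. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char *const fault_reason_strings[] = {
	"Software",
	"Present bit in root entry is clear",
	"Present bit in context entry is clear",
	"Invalid context entry",
	"Access beyond MGAW",
	"PTE Write access is not set",
	"PTE Read access is not set",
	"Next page table ptr is invalid",
	"Root table address invalid",
	"Context table ptr is invalid",
	"non-zero reserved fields in RTP",
	"non-zero reserved fields in CTP",
	"non-zero reserved fields in PTE",
	"Unknown"
};

/*
 * Hypothetical helper: any reason code at or past the final slot
 * maps to the trailing "Unknown" entry.
 */
static const char *fault_reason_name(unsigned int reason)
{
	const size_t last = ARRAY_SIZE(fault_reason_strings) - 1;

	if (reason >= last)
		return fault_reason_strings[last];
	return fault_reason_strings[reason];
}
```

With this shape, no separate MAX_FAULT_REASON_IDX needs to be kept in
sync with the table by hand.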

> +#define IOMMU_NAME_LEN   (7)
> +
> +struct iommu {

call it intel_iommu or somesuch even when it's private.

> +static int __init intel_iommu_setup(char *str)
> +{
> + if (!str)
> + return -EINVAL;
> + while (*str) {
> + if (!strncmp(str, "off", 3)) {
> + dmar_disabled = 1;
> + printk(KERN_INFO"Intel-IOMMU: disabled\n");
> + }
> + str += strcspn(str, ",");
> + while (*str == ',')
> + str++;
> + }
> + return 0;
> +}
> +__setup("intel_iommu=", intel_iommu_setup);

Why can't you just use the normal iommu=off for this? 

> +
> +#define MIN_PGTABLE_PAGES (10)
> +static mempool_t *pgtable_mempool;
> +#define MIN_DOMAIN_REQ   (20)
> +static mempool_t *domain_mempool;
> +#define MIN_DEVINFO_REQ  (20)
> +static mempool_t *devinfo_mempool;

Lots of mempools. How much memory does this pin?

> +
> +#define alloc_pgtable_page() mempool_alloc(pgtable_mempool, GFP_ATOMIC)
> +#define free_pgtable_page(vaddr) mempool_free(vaddr, pgtable_mempool)
> +#define alloc_domain_mem() mempool_alloc(domain_mempool, GFP_ATOMIC)
> +#define free_domain_mem(vaddr) mempool_free(vaddr, domain_mempool)
> +#define alloc_devinfo_mem() mempool_alloc(devinfo_mempool, GFP_ATOMIC)
> +#define free_devinfo_mem(vaddr) mempool_free(vaddr, devinfo_mempool)

Do we need the macros? Better expand them in the caller.

> +static void __iommu_flush_cache(struct iommu *iommu, void *addr, int size)
> +{
> + if (!ecap_coherent(iommu->ecap))
> + clflush_cache_range(addr, size);
> +}
> +
> +#define iommu_flush_cache_entry(iommu, addr) \
> + __iommu_flush_cache(iommu, addr, 8)
> +#define iommu_flush_cache_page(iommu, addr) \
> + __iommu_flush_cache(iommu, addr, PAGE_SIZE_4K)

Similar.

And the 8 should probably be something more descriptive (sizeof?)

> +/* context entry handling */
> +static struct context_entry * device_to_context_entry(struct iommu *iommu,
> + u8 bus, u8 devfn)
> +{
> + struct root_entry *root;
> + struct context_entry *context;
> + unsigned long phy_addr;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&iommu->lock, flags);
> + root = &iommu->root_entry[bus];
> + if (!root_present(*root)) {
> + phy_addr = (unsigned long)alloc_pgtable_page();

A GFP_ATOMIC mempool is rather useless. mempool only works if it can block
for someone else freeing memory and if it can't do that it's not failsafe.
I'm afraid you need to revise the allocation strategy -- best would be
to somehow move the memory allocations outside the spinlock paths
and preallocate if possible.

Same problem in other code.
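
A rough userspace sketch of "allocate outside the lock", assuming a
pthread mutex in place of the spinlock and calloc() in place of the
kernel allocator. The names slots, table_lock and get_or_create_table()
are made up for illustration. The allocation happens with no lock held,
and the lock is taken only to publish the result:

```c
#include <pthread.h>
#include <stdlib.h>

struct root_slot {
	void *table;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct root_slot slots[256];

/*
 * Hypothetical helper: return the table for a bus, creating it on
 * first use. Returns NULL only if the allocation fails.
 */
static void *get_or_create_table(unsigned char bus, size_t size)
{
	void *fresh, *result;

	/* Fast path: the table already exists. */
	pthread_mutex_lock(&table_lock);
	result = slots[bus].table;
	pthread_mutex_unlock(&table_lock);
	if (result)
		return result;

	/* Allocate with no lock held, so the allocator is free to block. */
	fresh = calloc(1, size);
	if (!fresh)
		return NULL;

	/* Re-check under the lock: another thread may have won the race. */
	pthread_mutex_lock(&table_lock);
	if (!slots[bus].table) {
		slots[bus].table = fresh;
		fresh = NULL;	/* ownership moved into the slot */
	}
	result = slots[bus].table;
	pthread_mutex_unlock(&table_lock);

	free(fresh);	/* no-op unless we lost the race */
	return result;
}
```

In kernel terms the same shape would presumably let the allocation use
a blocking mask such as GFP_KERNEL, because nothing is held at that
point.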

> + if (!dma_pte_present(*pte)) {
> + tmp = alloc_pgtable_page();

Please don't name variable tmp. I know some other code does it, but it's
just bad style imho.


> + /* Make sure hardware complete it */
> + start_time = jiffies;
> + while (1) {
> + sts = dmar_readl(iommu->reg, DMAR_GSTS_REG);
> + if (sts & DMA_GSTS_RTPS)
> + break;
> + if (time_after(jiffies, start_time + DMAR_OPERATION_TIMEOUT))
> + panic("DMAR hardware is malfunctional, please disable IOMMU\n");
> + cpu_relax();
> + }

Could MWAIT/MONITOR be used for this?


> + spin_unlock_irqrestore(&iommu->register_lock, flag);
> +}
> +
> +static void iommu_flush_write_buffer(struct iommu *iommu)

Re: [Intel IOMMU][patch 1/8] ACPI support for Intel Virtualization Technology for Directed I/O

2007-04-24 Thread Andi Kleen

> +config DMAR
> + bool "Support for DMA Remapping Devices (EXPERIMENTAL)"
> + depends on PCI_MSI && ACPI && EXPERIMENTAL
> + help
> +   Support DMA Remapping Devices. The devices are reported via
> +   ACPI tables and includes pci device scope under each DMA
> +   remapping device.

The description needs to explain what a dma remapping device is.


And some high level comment here what this file does.

> +
> +LIST_HEAD(dmar_drhd_units);
> +LIST_HEAD(dmar_rmrr_units);

Comment describing what lock protects those lists?
In fact there seems to be no locking. What about hotplug?

>
> +
> + dmar = (struct acpi_table_dmar *)table;
> + if (!dmar) {
> + printk (KERN_WARNING PREFIX "Unable to map DMAR\n");
> + return -ENODEV;
> + }

Shouldn't that be wherever the table is mapped. Or is it not needed?

-Andi


Re: [PATCH 3/7] Universal battery class

2007-04-24 Thread Pavel Machek
Hi!

> Signed-off-by: Anton Vorontsov <[EMAIL PROTECTED]>

Yes please. Generic battery support is badly needed.

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: RFC: drivers/char/watchdog/pcwd.c + drivers/char/watchdog/pcwd_pci.c

2007-04-24 Thread Wim Van Sebroeck
Hi Andrew,

> >  /* module parameters */
> > +#define CELSIUS 0
> > +#define FAHRENHEIT 1
> > +static int proc_temp_mode = CELSIUS;
> > +module_param(proc_temp_mode, int, 0);
> > +MODULE_PARM_DESC(proc_temp_mode, "which temperature mode to use in /proc/ 0=Celsius, 1=Fahrenheit (default=0)");
> 
> hm, is that a good idea?  Making the contents of /proc files dependent upon
> some module parameter?
> 
> I'd have thought that it would be better to remove this option and either
> 
> a) offer two /proc files: one for Celcius, one for Fahrenheit
> 
> b) Put both values into the same /proc file (one per line)
> 
> c) Just remove the Fahrenheit option altogether - let it join cubits,
>furlongs, etc.
> 
> I mean, how is an application to make sense of that information if its units
> depend upon some module parameter, or a kernel boot option?

My opinion: the temperature measured by a watchdog device exists so that the
system can react to a "system overheat" situation. To me that makes it a hwmon
device, so it should be exposed via the hwmon interface. And that seems to be in
millidegrees Celsius... (whereas the watchdog drivers used to do things in
Fahrenheit). I propose to follow the hwmon interface setup.
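
For illustration only, a small sketch of the unit conversion this would
involve, assuming the hardware reports whole degrees Fahrenheit. The
function name is hypothetical and nothing here is taken from the pcwd
drivers:

```c
/*
 * Convert whole degrees Fahrenheit to millidegrees Celsius using
 * integer arithmetic only: C = (F - 32) * 5 / 9, scaled by 1000.
 * Multiplying before dividing keeps sub-degree precision; the
 * division truncates toward zero.
 */
static long fahrenheit_to_millicelsius(long deg_f)
{
	return (deg_f - 32) * 5000 / 9;
}
```

For example, 212 F becomes 100000 and 98 F becomes 36666, so reporting
in millidegrees keeps the precision that whole degrees Celsius would
lose.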

(BTW: My plan for the pcwd drivers is to 1) get one driver that is maintained
in the kernel instead of a basic driver in the kernel and an extended one that
is maintained separately, 2) convert the pcwd.c driver to the isa_driver,
3) change the temperature stuff to the hwmon interface and 4) create a sysfs
equivalent for the /proc interface.)

I will take all your other comments/input and rework the code so that it is up
to the kernel standards.

Greetings,
Wim.



Re: [Intel IOMMU][patch 4/8] Supporting Zero Length Reads in Intel IOMMU.

2007-04-24 Thread Andi Kleen
On Tuesday 24 April 2007 08:03:03 Ashok Raj wrote:
> PCI specs permit zero length reads (ZLR) even if the mapping for that region 
> is write only. Support for this feature is indicated by the presence of a bit 
> in the DMAR capability. If a particular DMAR does not support this capability
> we map write-only regions as read-write.
> 
> This option can also provides a workaround for some drivers that request
> a write-only mapping when they really should request a read-write.
> (We ran into one such case in eepro100.c in handling rx_ring_dma)

Better just fix the drivers instead of adding such hacks

-Andi


Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.

2007-04-24 Thread Andi Kleen
On Tuesday 24 April 2007 08:03:07 Ashok Raj wrote:
> Some devices may not support entire 64bit DMA. In a situation where such 
> devices are co-located in a shared domain, we need to ensure there is some 
> address space reserved for such devices without the low addresses getting
> depleted by other devices capable of handling high dma addresses.

Sorry, but you need to find some way to make this usually work without special
options. Otherwise users will be unhappy.

A possible way would be to allocate space top-down from the limit of the
device. Then the lower areas should usually be free.
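
A toy sketch of the top-down idea, assuming a tiny fixed-size address
space tracked by one flag per slot. NR_SLOTS, slot_used and
alloc_top_down() are invented for illustration and have nothing to do
with the actual IOVA allocator in the patch:

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 64

static bool slot_used[NR_SLOTS];

/*
 * Hypothetical allocator: hand out the highest free slot that the
 * device can still address (index <= limit). Returns -1 if nothing
 * at or below the limit is free.
 */
static int alloc_top_down(size_t limit)
{
	size_t i;

	if (limit >= NR_SLOTS)
		limit = NR_SLOTS - 1;

	/* Walk downward from the limit; i-- > 0 avoids unsigned wraparound. */
	for (i = limit + 1; i-- > 0; ) {
		if (!slot_used[i]) {
			slot_used[i] = true;
			return (int)i;
		}
	}
	return -1;
}
```

A device with a large limit consumes the top of the space first, so the
low slots tend to stay available for devices that can only reach them.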

-Andi


Re: [Intel IOMMU][patch 7/8] Support for legacy ISA devices

2007-04-24 Thread Andi Kleen
On Tuesday 24 April 2007 08:03:06 Ashok Raj wrote:
> Floppy disk drivers dont work well with DMA remapping.

What is the problem? You can't allocate mappings <16MB?

> Its possible to  
> extend the current use for x86_64, but the gain is very little. If someone
> feels compelled to clean this up, its up for grabs. Since these use 16M, we 
> just provide a unity map for the ISA bridge device.
> 

While it's probably not worth it for the floppy, there are other devices
with similar weird addressing limitations. Some generic handling of it
would be nice.

-Andi


Re: Getting the new RxRPC patches upstream

2007-04-24 Thread Oleg Nesterov
On 04/24, David Howells wrote:
>
> Oleg Nesterov <[EMAIL PROTECTED]> wrote:
> 
> > Sure, I'll grep for cancel_delayed_work(). But unless I missed something,
> > this change should be completely transparent for all users. Otherwise, it
> > is buggy.
> 
> I guess you will have to make sure that cancel_delayed_work() is always
> followed by a flush of the workqueue, otherwise you might get this situation:
> 
>   CPU 0                           CPU 1
>   ===========================     ===========================
>
>   cancel_delayed_work(x) == 0     -->delayed_work_timer_fn(x)
>   kfree(x);                       -->do_IRQ()
>   y = kmalloc(); // reuses x
>                                   <--do_IRQ()
>                                   __queue_work(x)
>                                   --- OOPS ---
> 
> That's my main concern.  If you are certain that can't happen, then fair
> enough.

Yes sure. Note that this is documented:

/*
 * Kill off a pending schedule_delayed_work().  Note that the work callback
 * function may still be running on return from cancel_delayed_work().  Run
 * flush_workqueue() or cancel_work_sync() to wait on it.
 */

This comment is not very precise though. If the work doesn't re-arm itself,
we need cancel_work_sync() only if cancel_delayed_work() returns 0.

So there is no difference with the proposed change. Except, return value == 0
means:

currently (del_timer_sync): callback may still be running or scheduled

with del_timer: may still be running, or scheduled, or will be scheduled
right now.

However, this is the same from the caller POV.

> Can you show me a patch illustrating exactly how you want to change
> cancel_delayed_work()?  I can't remember whether you've done so already, but
> if you have, I can't find it.  Is it basically this?:
> 
>  static inline int cancel_delayed_work(struct delayed_work *work)
>  {
>   int ret;
> 
> - ret = del_timer_sync(&work->timer);
> + ret = del_timer(&work->timer);
>   if (ret)
>   work_release(&work->work);
>   return ret;
>  }

Yes, exactly. The patch is trivial, but I need some time to write an
understandable changelog...

> I was thinking this situation might be a problem:
> 
>   CPU 0   CPU 1
>   === ===
>   
>   cancel_delayed_work(x) == 0 -->delayed_work_timer_fn(x)
>   schedule_delayed_work(x,0)  -->do_IRQ()
>   
>   x->work()
>   <--do_IRQ()
>   __queue_work(x)
> 
> But it won't, will it?

Yes, I think this should be OK. schedule_delayed_work() will notice
_PENDING and abort, so the last "x->work()" doesn't happen.

What can happen is


cancel_delayed_work(x) == 0
-->delayed_work_timer_fn(x)
__queue_work(x)

x->work()
schedule_delayed_work(x,0)


, so we can have an "unneeded schedule", but this is very unlikely.

Oleg.



Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Nikita Danilov
David Lang writes:
 > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > 
 > > Amit Gud writes:
 > >
 > > Hello,
 > >
 > > >
 > > > This is an initial implementation of ChunkFS technique, briefly discussed
 > > > at: http://lwn.net/Articles/190222 and
 > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
 > >
 > > I have a couple of questions about chunkfs repair process.
 > >
 > > First, as I understand it, each continuation inode is a sparse file,
 > > mapping some subset of logical file blocks into block numbers. Then it
 > > seems, that during "final phase" fsck has to check that these partial
 > > mappings are consistent, for example, that no two different continuation
 > > inodes for a given file contain a block number for the same offset. This
 > > check requires scan of all chunks (rather than of only "active during
 > > crash"), which seems to return us back to the scalability problem
 > > chunkfs tries to address.
 > 
 > not quite.
 > 
 > this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
 > the process. with chunkfs you divide the problem by a large constant (100 or
 > more) for the checks of individual chunks. after those are done then the final
 > pass checking the cross-chunk links doesn't have to keep track of everything, it
 > only needs to check those links and what they point to

Maybe I failed to describe the problem precisely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is present in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.
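
A simplified sketch of the cross-continuation check described above.
It assumes each continuation inode is reduced to a flat array of the
logical block offsets it maps; struct cont_inode and
chain_is_consistent() are invented for illustration and do not reflect
any chunkfs data structure. The visited counter makes the cost visible:
every mapping of every continuation has to be read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_LOGICAL_BLOCKS 1024

/* Hypothetical flattened view of one inode in a continuation chain. */
struct cont_inode {
	const unsigned int *logical;	/* logical block offsets it maps */
	size_t nr;			/* number of entries in logical[] */
};

/*
 * Return true if no logical block is mapped by more than one inode in
 * the chain. *visited is set to the number of mappings examined, which
 * grows with the total size of all continuations, not with one chunk.
 */
static bool chain_is_consistent(const struct cont_inode *chain,
				size_t nr_inodes, size_t *visited)
{
	bool seen[MAX_LOGICAL_BLOCKS];
	size_t i, j;

	memset(seen, 0, sizeof(seen));
	*visited = 0;

	for (i = 0; i < nr_inodes; i++) {
		for (j = 0; j < chain[i].nr; j++) {
			unsigned int lb = chain[i].logical[j];

			(*visited)++;
			/* Out-of-range or already-claimed offsets are errors. */
			if (lb >= MAX_LOGICAL_BLOCKS || seen[lb])
				return false;
			seen[lb] = true;
		}
	}
	return true;
}
```

In a real file system the inner loop would stand for walking the
indirect blocks of each continuation inode, which is where the io
comes from.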

 > 
 > any ability to mark a filesystem as 'clean' and then not have to check it on 
 > reboot is a bonus on top of this.
 > 
 > David Lang

Nikita.


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Hugh Dickins wrote:

> I was certainly ignorant of that; but I'm not convinced it eliminates
> the potential issue.  For a start, sys_move_pages seems not to involve
> mempolicies at all - I don't see what prevents it migrating blockdev
> pages away from the only node which has NORMAL memory.

There is no need to. If the user does something stupid (and 32bit NUMA
systems are so stupid by default) as you say then this means that the
pages would have to use bounce buffers to be written back.

> > Metadata is not movable nor subject to memory policies.
> > It will never be mapped into a process space.
> 
> Not as metadata, no.  But someone (let's hope only root, though I may
> be wrong on that) can map any part of the block device into userspace.

Concurrent access to a block device by a filesystem and the user? That 
cannot go over well. If one just reads then I would expect that a copy
of the metadata becomes available to the user. Also you cannot migrate 
pages that have multiple references (which is the case here if the 
filesystem uses the page cache for the metadata) unless the user has 
special privileges and uses special command options.

A page that has references that cannot be accounted for by page migration 
is never migrated. I would assume that the filesystem at minimum takes a 
refcount on the page used for metadata.

If the filesystem would not take a refcount then it would already be in 
trouble because the page may then be evicted at any time.

> > Yes but before we get there we will bounce pagecache pages into an area 
> > where we do not need kmap.
> 
> Again, I'm not convinced: bouncing gets done for the I/O,
> but where is it done to meet the filesystem's expectations?

Bouncing gets done for the block device and the block device determines 
the zones from which it can do I/O. Thus the block device determines the 
allocation restrictions.

> On the other hand, as I said, I've seen no problem myself in practice.
> However, if there is no problem, why do block devices demand GFP_USER?

GFP_USER requires allocation in non highmem areas. These are typically 
reachable by device DMA. I think there is no problem if one would open a 
block device with GFP_HIGHMEM. The pages may be allocated in highmem 
which would require bounce buffers but otherwise we will be fine. A 
ramdisk device may just work fine with highmem.

If the system has both high memory and normal memory then only allocations 
to highmemory are subject to memory policies etc etc. The block device 
allocations would be in zone normal/dma and thus be exempt from NUMA 
placement.


Re: [Kernel-discuss] Re: [PATCH 3/7] [RFC] Battery monitoring class

2007-04-24 Thread Pavel Machek
Hi!

> > > > That said, you may need to use uWh and uAh instead of mAh and mWh, 
> > > > though.
> > > 
> > > Not sure. Is there any existing chip that can report uAh/uWh? That is
> > > great precision.
> > 
> > The way things are going, it should be feasible for small embedded systems
> > quite soon.  Refer to the previous thread.
> 
> I see... is it also applicable to currents and voltages? I.e. should we
> use uA and uV from the start?

AFAICT, a mobile phone in standby can draw less than 1000 uW... so uA/uV
would indeed be nice.
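
To make the resolution argument concrete, a small arithmetic sketch.
The function names and the sample figures in the note below (250 uA at
3.7 V) are made up for illustration, not measurements:

```c
/* Integer conversion truncates: anything below 1000 uW reads as 0 mW. */
static long microwatts_to_milliwatts(long uw)
{
	return uw / 1000;
}

/*
 * Power in uW from current in uA and voltage in uV.
 * uA * uV is 10^-12 W, i.e. 10^-6 uW, hence the division by 10^6.
 */
static long long power_uw(long long current_ua, long long voltage_uv)
{
	return current_ua * voltage_uv / 1000000LL;
}
```

With those made-up inputs, power_uw(250, 3700000) is 925 uW, which an
interface limited to whole mW would report as 0.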
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCH RFD] alternative kobject release wait mechanism

2007-04-24 Thread Alan Stern
On Sun, 22 Apr 2007, Greg KH wrote:

> > Looking some more, kobject_get_path() is used for kobject renaming,
> > uevent handling, and a little bit in the input core.  None of these things 
> > should try to access a kobject after it has been del()ed.  After all, it's 
> > no longer present in the filesystem so it doesn't _have_ a path.
> 
> But we _have_ to have a full path at that time to tell userspace what
> just went away.  That is the main reason we enforce this (there were
> tons of issues with scsi devices and this in the past which is what
> caused us to enforce this.)

The SCSI subsystem has undergone many, many changes since 2.6.0.  In 
particular, it has implemented a full-fledged device-state model, complete 
with spinlock-protected state transitions.

I'll have to check with James Bottomley, but I bet that SCSI now always 
unregisters all the children of a device before unregistering the device 
itself.


For everyone:

We ought to make it explicitly clear that _all_ subsystems should behave
this way.  Maybe it isn't necessary to go as far as having device_del()
call itself recursively; doing that would open up lots of possible races.  
But I think it would be a good idea to add a WARN_ON in device_del, right
after the call to bus_remove_device(), that would be triggered if the
device still had any children.

It would also be good to document (but where?) some lifetime rules for 
device drivers.  Something like this:

When a driver's remove() method returns, the driver must no
longer try to use the device it was just unbound from.  The
device may be physically gone, or a different driver may be
bound to it.  Most importantly, remove() should unregister
all child devices created by the driver.

To accomplish all this safely, the driver should allocate a
private data structure containing at least a "gone" flag and 
a mutex or spinlock for synchronization.  Each time the driver
needs to use the device, it should first lock the mutex or 
spinlock and check the "gone" flag.

Ideally remove() should release all of the driver's references
to the device, in accord with the "Immediate Detach" principle.
However it is acceptable for the driver to retain a reference,
provided it meets the following conditions:

The reference must be dropped in a timely manner,
such as when the release() methods for all child
devices have run.

The driver must also retain a module reference to
the owner of the device.  In practice this means the
driver must contain static code references to the
subsystem which created the device, since struct
device doesn't have an "owner" field.

The driver must restrict itself to reading (not
writing!) the fields in the device structure.  The
only exception is that the driver may lock/unlock
dev->sem.
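
A userspace sketch of the "gone" flag rule above, assuming a pthread
mutex stands in for the kernel mutex or spinlock. struct drv_private,
drv_use_device() and drv_remove() are hypothetical names, not an
existing driver API:

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-device private data, as suggested in the rules. */
struct drv_private {
	pthread_mutex_t lock;
	bool gone;		/* set once remove() has run */
	int io_count;		/* stands in for real device work */
};

/* Every use of the device first takes the lock and checks the flag. */
static int drv_use_device(struct drv_private *priv)
{
	int ret = 0;

	pthread_mutex_lock(&priv->lock);
	if (priv->gone)
		ret = -ENODEV;
	else
		priv->io_count++;	/* touch the device only while bound */
	pthread_mutex_unlock(&priv->lock);
	return ret;
}

/* remove(): after this returns, no path may use the device again. */
static void drv_remove(struct drv_private *priv)
{
	pthread_mutex_lock(&priv->lock);
	priv->gone = true;
	pthread_mutex_unlock(&priv->lock);
}
```

Because the flag is only read and written under the lock, a caller that
saw gone == false finishes its access before drv_remove() can return.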

How does that sound?

Alan Stern



Re: Question about Reiser4

2007-04-24 Thread Andi Kleen
> Because there are unaddressed items in this todo list:
> http://pub.namesys.com/Reiser4/ToDo
> The main issues here are xattrs and support for blocksize != pagesize.

I would consider both to be optional. We have various file systems
in tree that don't support either (e.g. JFS only supports 4K blocks
and OCFS2 doesn't support xattr). They shouldn't block merging.

> 2. Who will maintain this?
> 
> Currently there are two namesys employees working mostly on
> enthusiasm. Divide them into 2 file systems, plus many people who
> really help with fixing problems.

Merging will probably be a peak of work for the necessary changes,
then hopefully the work will be less once you're in tree because
you don't need to track mainline anymore
(assuming not too many bugs come in from users).

-Andi


Re: [OOPS] 2.6.21-rc6-git5 in cfq_dispatch_insert

2007-04-24 Thread Brad Campbell

Jens Axboe wrote:


> Ok, can you try and reproduce with this one applied? It'll keep the
> system running (unless there are other corruptions going on), so it
> should help you a bit as well. It will dump some cfq state info when the
> condition triggers that can perhaps help diagnose this. So if you can
> apply this patch and reproduce + send the output, I'd much appreciate
> it!



I think I'm wearing holes in my platters. This is being a swine to hit, but I
finally got some.. It seems to respond to a combination of high cpu usage and
relatively high disk utilisation.

I tried all sorts of acrobatics with multiple readers and hammering the array
while reading from individual drives..

The only reliable way I can reproduce this seems to be on a degraded array
running a "while true ; do md5sum -c md5 ; done" on about a 180GB directory of
files. It is taking anywhere from 4 to 96 hours to hit it though.. but at least
it's reproducible.



[105449.653682] cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to [EMAIL PROTECTED]

[105449.683646] cfq: busy=1,drv=0,timer=0
[105449.694871] cfq rr_list:
[105449.702715]   3108: sort=0,next=,q=0/1,a=1/0,d=0/0,f=69
[105449.720693] cfq busy_list:
[105449.729054] cfq idle_list:
[105449.737418] cfq cur_rr:
[115435.022192] cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to [EMAIL PROTECTED]

[115435.052160] cfq: busy=1,drv=0,timer=0
[115435.063383] cfq rr_list:
[115435.071227]   3196: sort=0,next=,q=0/1,a=1/0,d=0/0,f=69
[115435.089205] cfq busy_list:
[115435.097566] cfq idle_list:
[115435.105930] cfq cur_rr:
[115616.651883] cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to [EMAIL PROTECTED]

[115616.681848] cfq: busy=1,drv=0,timer=0
[115616.693071] cfq rr_list:
[115616.700916]   3196: sort=0,next=,q=0/1,a=1/0,d=0/0,f=61
[115616.718893] cfq busy_list:
[115616.727253] cfq idle_list:
[115616.735617] cfq cur_rr:
[119679.564753] cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to [EMAIL PROTECTED]

[119679.594732] cfq: busy=1,drv=0,timer=0
[119679.605955] cfq rr_list:
[119679.613799]   3241: sort=0,next=,q=0/1,a=1/0,d=0/0,f=69
[119679.631778] cfq busy_list:
[119679.640136] cfq idle_list:
[119679.648502] cfq cur_rr:

Brad
--
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 12:34:53 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> > Not as metadata, no.  But someone (let's hope only root, though I may
> > be wrong on that) can map any part of the block device into userspace.
> 
> Concurrent access to a block device by a filesystem and the user? That 
> cannot go over well. If one just reads then I would expect that a copy
> of the metadata becomes available to the user. Also you cannot migrate 
> pages that have multiple references (which is the case here if the 
> filesystem uses the page cache for the metadata) unless the user has 
> special privileges and uses special command options.
> 
> A page that has references that cannot be accounted for by page migration 
> is never migrated. I would assume that the filesystem at minimum takes a 
> refcount on the page used for metadata.
> 
> If the filesystem would not take a refcount then it would already be in 
> trouble because the page may then be evicted at any time.

No, think of the following scenario:

- file I/O causes a read of an ext2 file's bitmap.  The bitmap is
  brought into /dev/hda1's pagecache using !__GFP_HIGHMEM

- references are released against that page and it's now just clean
  reclaimable pagecache

- someone (say, an online filesystem checker or something) mmaps
  /dev/hda1 and reads that page.

- migration comes along and migrates that page into highmem

- file I/O causes a read of that bitmap again.  We find it in
  /dev/hda's pagecache.

  Here's set_bh_page().

void set_bh_page(struct buffer_head *bh,
		struct page *page, unsigned long offset)
{
	bh->b_page = page;
	BUG_ON(offset >= PAGE_SIZE);
	if (PageHighMem(page))
		/*
		 * This catches illegal uses and preserves the offset:
		 */
		bh->b_data = (char *)(0 + offset);
	else
		bh->b_data = page_address(page) + offset;
}

- ext2 now tries to access the bits in the bitmap via page->bh->b_data

- game over
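
A userspace model of why the last step goes wrong, assuming a highmem
page is represented simply as "no permanent kernel address". The
page_model and bh_model structs and model_set_bh_page() are stand-ins
invented here; only the branch structure mirrors set_bh_page() above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096

/* Stand-in for struct page: highmem pages have no usable kaddr. */
struct page_model {
	bool highmem;
	char *kaddr;
};

/* Stand-in for struct buffer_head, reduced to the field that matters. */
struct bh_model {
	char *b_data;
};

static void model_set_bh_page(struct bh_model *bh,
			      const struct page_model *page, size_t offset)
{
	assert(offset < MODEL_PAGE_SIZE);
	if (page->highmem)
		/* Only the offset survives; this is not a valid pointer. */
		bh->b_data = (char *)(uintptr_t)offset;
	else
		bh->b_data = page->kaddr + offset;
}
```

For a lowmem page b_data points into the page, while for a highmem
page it is just the small integer offset, so any code that later
dereferences b_data is reading from a near-NULL address.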


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Hugh Dickins wrote:

> Or Christoph may prevail in persuading there's no such problem.

This is pointless. NUMA allocations can only be controlled for the highest 
zone. If we switch to a lower zone then we allocate on a different zone 
than the user requested.



Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Andrew Morton wrote:

> No, think of the following scenario:
> 
> - file I/O causes a read of an ext2 file's bitmap.  The bitmap is
>   brought into /dev/hda1's pagecache using !__GFP_HIGHMEM
> 
> - references are released against that page and it's now just clean
>   reclaimable pagecache
> 
> - someone (say, an online filesystem checker or something) mmaps
>   /dev/hda1 and reads that page.
> 
> - migration comes along and migrates that page into highmem
> 
> - file I/O causes a read of that bitmap again.  We find it in
>   /dev/hda's pagecache.

Read of the bitmap? How would that work? Page cache lookup right?

>   Here's set_bh_page().

A highmem page can have buffers???


[PATCH] use mutex instead of semaphore in Sony PI driver

2007-04-24 Thread Matthias Kaehlcke
The Sony Programmable I/O Control driver uses a semaphore as a
mutex. Use the mutex API instead of the (binary) semaphore.

Signed-off-by: Matthias Kaehlcke <[EMAIL PROTECTED]> 

--

diff --git a/drivers/char/sonypi.c b/drivers/char/sonypi.c
index 7823757..878d8d0 100644
--- a/drivers/char/sonypi.c
+++ b/drivers/char/sonypi.c
@@ -477,7 +477,7 @@ static struct sonypi_device {
u16 evtype_offset;
int camera_power;
int bluetooth_power;
-   struct semaphore lock;
+   struct mutex lock;
struct kfifo *fifo;
spinlock_t fifo_lock;
wait_queue_head_t fifo_proc_list;
@@ -884,7 +884,7 @@ int sonypi_camera_command(int command, u8 value)
if (!camera)
return -EIO;
 
-   down(&sonypi_device.lock);
+   mutex_lock(&sonypi_device.lock);
 
switch (command) {
case SONYPI_COMMAND_SETCAMERA:
@@ -919,7 +919,7 @@ int sonypi_camera_command(int command, u8 value)
   command);
break;
}
-   up(&sonypi_device.lock);
+   mutex_unlock(&sonypi_device.lock);
return 0;
 }
 
@@ -938,20 +938,20 @@ static int sonypi_misc_fasync(int fd, struct file *filp, int on)
 static int sonypi_misc_release(struct inode *inode, struct file *file)
 {
sonypi_misc_fasync(-1, file, 0);
-   down(&sonypi_device.lock);
+   mutex_lock(&sonypi_device.lock);
sonypi_device.open_count--;
-   up(&sonypi_device.lock);
+   mutex_unlock(&sonypi_device.lock);
return 0;
 }
 
 static int sonypi_misc_open(struct inode *inode, struct file *file)
 {
-   down(&sonypi_device.lock);
+   mutex_lock(&sonypi_device.lock);
/* Flush input queue on first open */
if (!sonypi_device.open_count)
kfifo_reset(sonypi_device.fifo);
sonypi_device.open_count++;
-   up(&sonypi_device.lock);
+   mutex_unlock(&sonypi_device.lock);
return 0;
 }
 
@@ -1001,7 +1001,7 @@ static int sonypi_misc_ioctl(struct inode *ip, struct file *fp,
 	u8 val8;
 	u16 val16;
 
-	down(&sonypi_device.lock);
+	mutex_lock(&sonypi_device.lock);
 	switch (cmd) {
 	case SONYPI_IOCGBRT:
 		if (sonypi_ec_read(SONYPI_LCD_LIGHT, &val8)) {
@@ -1101,7 +1101,7 @@ static int sonypi_misc_ioctl(struct inode *ip, struct file *fp,
 	default:
 		ret = -EINVAL;
 	}
-	up(&sonypi_device.lock);
+	mutex_unlock(&sonypi_device.lock);
 	return ret;
 }
 
@@ -1330,7 +1330,7 @@ static int __devinit sonypi_probe(struct platform_device *dev)
 	}
 
 	init_waitqueue_head(&sonypi_device.fifo_proc_list);
-	init_MUTEX(&sonypi_device.lock);
+	mutex_init(&sonypi_device.lock);
 	sonypi_device.bluetooth_power = -1;
 
 	if ((pcidev = pci_get_device(PCI_VENDOR_ID_INTEL,

-- 
Matthias Kaehlcke
Linux Application Developer
Barcelona

     If liberty means anything at all, it means the
   right to tell people what they do not want to hear
                    (George Orwell)
                                                                 .''`.
    using free software / Debian GNU/Linux | http://debian.org  : :'  :
                                                                `. `'`
gpg --keyserver pgp.mit.edu --recv-keys 47D8E5D4                  `-
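
[Editorial aside on the patch above. The conversion is mechanical because the
semaphore was only ever used as a binary lock: down() becomes mutex_lock(),
up() becomes mutex_unlock(), init_MUTEX() becomes mutex_init(). What follows is
a minimal userspace sketch of the same lock-around-a-counter pattern, assuming
POSIX threads in place of the kernel API. The names demo_device, demo_open and
demo_release are hypothetical and do not appear in sonypi.c.]

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for struct sonypi_device: only the lock and counter. */
struct demo_device {
	pthread_mutex_t lock;	/* plays the role of "struct mutex lock" */
	int open_count;
};

static struct demo_device demo_device = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.open_count = 0,
};

/* Mirrors the shape of sonypi_misc_open(): lock, bump the counter, unlock. */
static int demo_open(void)
{
	pthread_mutex_lock(&demo_device.lock);
	demo_device.open_count++;
	pthread_mutex_unlock(&demo_device.lock);
	return 0;
}

/* Mirrors the shape of sonypi_misc_release(): lock, drop the counter, unlock. */
static int demo_release(void)
{
	pthread_mutex_lock(&demo_device.lock);
	assert(demo_device.open_count > 0);	/* release without open is a bug */
	demo_device.open_count--;
	pthread_mutex_unlock(&demo_device.lock);
	return 0;
}
```

[A mutex, unlike a counting semaphore, is expected to be unlocked by the task
that locked it, which is exactly how the driver already used its semaphore.]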


Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
> Well, it _is_ mysterious.
>
> Did you try to locate the code which failed?  I got lost in macros and
> include files, and gave up very very easily.  Stop hiding, Ingo.
>   

OK, I've managed to reproduce it.  Removing the local_irq_save/restore
from sched_clock() makes it go away, as I'd expect (otherwise it would
really be magic).  But given that it never seems to touch the softlockup
during testing, I have no idea what difference it makes...

J


Re: [PATCH 16/25] xen: Use the hvc console infrastructure for Xen console

2007-04-24 Thread Jeremy Fitzhardinge
Olof Johansson wrote:
> On Mon, Apr 23, 2007 at 02:56:54PM -0700, Jeremy Fitzhardinge wrote:
>   
>> Implement a Xen back-end for hvc console.
>>
>> From: Gerd Hoffmann <[EMAIL PROTECTED]>
>> Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
>>
>> ---
>>  arch/i386/xen/Kconfig |1 
>>  arch/i386/xen/events.c|3 -
>>  drivers/Makefile  |3 +
>>  drivers/xen/Makefile  |1 
>>  drivers/xen/hvc-console.c |  134 +
>>  include/xen/events.h  |1 
>>  6 files changed, 142 insertions(+), 1 deletion(-)
>> 
>
> If you move the driver to drivers/char/hvc_xen.c instead, you won't have to 
> do...
>
>   
>> +#include "../char/hvc_console.h"
>> 
>
> ...this.
>
> Other single-platform backend hvc drivers are under drivers/char already.

Good point.

J


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


> David Lang writes:
> > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> >
> > > Amit Gud writes:
> > >
> > > Hello,
> > >
> > > >
> > > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > > at: http://lwn.net/Articles/190222 and
> > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> > >
> > > I have a couple of questions about chunkfs repair process.
> > >
> > > First, as I understand it, each continuation inode is a sparse file,
> > > mapping some subset of logical file blocks into block numbers. Then it
> > > seems, that during "final phase" fsck has to check that these partial
> > > mappings are consistent, for example, that no two different continuation
> > > inodes for a given file contain a block number for the same offset. This
> > > check requires scan of all chunks (rather than of only "active during
> > > crash"), which seems to return us back to the scalability problem
> > > chunkfs tries to address.
> >
> > not quite.
> >
> > this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
> > the process. with chunkfs you divide the problem by a large constant (100 or
> > more) for the checks of individual chunks. after those are done then the final
> > pass checking the cross-chunk links doesn't have to keep track of everything, it
> > only needs to check those links and what they point to
>
> Maybe I failed to describe the problem precisely.
>
> Suppose that all chunks have been checked. After that, for every inode
> I0 having continuations I1, I2, ... In, one has to check that every
> logical block is presented in at most one of these inodes. For this one
> has to read I0, with all its indirect (double-indirect, triple-indirect)
> blocks, then read I1 with all its indirect blocks, etc. And to repeat
> this for every inode with continuations.
>
> In the worst case (every inode has a continuation in every chunk) this
> obviously is as bad as un-chunked fsck. But even in the average case,
> total amount of io necessary for this operation is proportional to the
> _total_ file system size, rather than to the chunk size.


Actually, it should be proportional to the number of continuation nodes. The
expectation (and design) is that they are rare.


If you get into the worst-case situation of all of them being continuation
nodes, then you are actually worse off than you were to start with (as you are
saying), but numbers from people's real filesystems (assuming a chunk size equal
to a block cluster size) indicate that we are more on the order of a fraction
of a percent of the nodes. And the expectation is that, since the chunk sizes
will be substantially larger than the block cluster sizes, this should get
reduced even more.


David Lang
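
[Editorial aside on the exchange above. The cross-chunk check Nikita describes,
that no logical block is mapped by more than one of I0, I1, ... In, can be
sketched as a single pass over the continuation inodes of one file. This is a
simplified illustration, not ChunkFS code: struct demo_cont_inode, the flat
array of logical offsets and DEMO_MAX_BLOCKS are all assumptions made for the
example. The work done is proportional to the number of blocks mapped by
continuation inodes, which is the quantity David argues should stay small.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_BLOCKS 64	/* hypothetical per-file logical block limit */

/* Hypothetical continuation inode: the logical block offsets it maps. */
struct demo_cont_inode {
	const unsigned int *logical;
	size_t count;
};

/* Return true if no logical block is mapped by more than one inode. */
static bool demo_continuations_consistent(const struct demo_cont_inode *inodes,
					   size_t n)
{
	bool seen[DEMO_MAX_BLOCKS];
	size_t i, j;

	memset(seen, 0, sizeof(seen));
	for (i = 0; i < n; i++) {
		for (j = 0; j < inodes[i].count; j++) {
			unsigned int blk = inodes[i].logical[j];

			assert(blk < DEMO_MAX_BLOCKS);
			if (seen[blk])
				return false;	/* same offset mapped twice */
			seen[blk] = true;
		}
	}
	return true;
}
```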


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 12:59:17 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> On Tue, 24 Apr 2007, Andrew Morton wrote:
> 
> > No, think of the following scenario:
> > 
> > - file I/O causes a read of an ext2 file's bitmap.  The bitmap is
> >   brought into /dev/hda1's pagecache using !__GFP_HIGHMEM
> > 
> > - references are released against that page and it's now just clean
> >   reclaimable pagecache
> > 
> > - someone (say, an online filesystem checker or something) mmaps
> >   /dev/hda1 and reads that page.
> > 
> > - migration comes along and migrates that page into highmem
> > 
> > - file I/O causes a read of that bitmap again.  We find it in
> >   /dev/hda's pagecache.
> 
> Read of the bitmap? How would that work? Page cache lookup right?

yup.

sb_bread
->__bread
  ->__getblk
->__find_get_block
  ->__find_get_block_slow
->find_get_page

> >   Here's set_bh_page().
> 
> A highmem page can have buffers???

yep.  Take a 4k page which is stored in four discontiguous 1k disk blocks. The
data at page_buffers(page) is the sole way in which we track which parts of
the page belong to which blocks of the disk.

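
[Editorial aside on Andrew's 4k-page example. The sketch below shows why the
per-page buffer list is needed when one page is backed by four discontiguous
1k blocks. It is deliberately reduced: the kernel chains struct buffer_head
objects off the page, whereas this example assumes a plain array, and
struct demo_buffer, struct demo_page and demo_block_for_offset are
hypothetical names.]

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE		4096u
#define DEMO_BLOCK_SIZE		1024u
#define DEMO_BUFS_PER_PAGE	(DEMO_PAGE_SIZE / DEMO_BLOCK_SIZE)

/* Hypothetical, much-reduced stand-in for struct buffer_head. */
struct demo_buffer {
	unsigned long long blocknr;	/* disk block backing this 1k slice */
};

/* Hypothetical page: one buffer per 1k slice, in page-offset order. */
struct demo_page {
	struct demo_buffer buffers[DEMO_BUFS_PER_PAGE];
};

/* Map a byte offset within the page to the disk block that holds it. */
static unsigned long long demo_block_for_offset(const struct demo_page *page,
						size_t offset)
{
	assert(offset < DEMO_PAGE_SIZE);
	return page->buffers[offset / DEMO_BLOCK_SIZE].blocknr;
}
```

[Without the buffers there would be no record that, say, bytes 1024..2047 of
the page live in a different disk block than bytes 0..1023.]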


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Andrew Morton wrote:

> > A highmem page can have buffers???
> 
> yep.  Take a 4k page which is stored in four discontiguous 1k disk blocks. The
> data at page_buffers(page) is the sole way in which we track which parts of
> the page belong to which blocks of the disk.

But I see no use of kmap for buffer access? The data is not accessible.




Re: [ANNOUNCE] UidBind LSM 0.2

2007-04-24 Thread Gerhard Mack
On Tue, 24 Apr 2007, Roberto De Ioris wrote:

> Hi all,
> 
> this is the second release for UidBind LSM:
> 
> http://projects.unbit.it/uidbind/
> 
> UidBind allows call to bind() function only to the uid defined in a
> configfs tree.
> 
> It is now possible to specify different uid (for the same port) on
> different ipv4 addresses:
> 
> mkdir uidbind/8081
> mkdir uidbind/8081/192.168.1.17
> mkdir uidbind/8081/192.168.1.26
> echo 1017 > uidbind/8081/192.168.1.17/uid
> echo 1026 > uidbind/8081/192.168.1.26/uid
> 
> This version even fixes some leaks in version 0.1
> 
> Patch attached is still for vanilla 2.6.20.7

Is it possible to specify ranges as allowing everyone?  Is it possible to 
allow multiple users access to the same port?  Can ports be allowed by 
group?

I really like the idea of this patch.  It has the potential to solve a lot 
of my current administrative headaches.

Gerhard


--
Gerhard Mack

[EMAIL PROTECTED]

<>< As a computer I find your faith in technology amusing.
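
[Editorial aside on the announcement quoted above. The configfs layout
uidbind/<port>/<address>/uid amounts to a table keyed by port and IPv4
address. The sketch below shows one way such a lookup could behave. It is not
taken from the UidBind patch: struct demo_rule, demo_rules and
demo_bind_allowed are hypothetical, and denying a bind when no rule matches is
an assumption of this example, not something the announcement states.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory mirror of one uidbind/<port>/<addr>/uid entry. */
struct demo_rule {
	unsigned short port;
	const char *addr;	/* dotted-quad IPv4, as in the directory name */
	unsigned int uid;
};

/* The two rules created by the mkdir/echo commands in the announcement. */
static const struct demo_rule demo_rules[] = {
	{ 8081, "192.168.1.17", 1017 },
	{ 8081, "192.168.1.26", 1026 },
};

/* Allow bind() only when the rule for (port, addr) names the caller's uid. */
static bool demo_bind_allowed(unsigned short port, const char *addr,
			      unsigned int uid)
{
	size_t i;

	assert(addr != NULL);
	for (i = 0; i < sizeof(demo_rules) / sizeof(demo_rules[0]); i++) {
		if (demo_rules[i].port == port &&
		    strcmp(demo_rules[i].addr, addr) == 0)
			return demo_rules[i].uid == uid;
	}
	return false;	/* assumption: no matching rule means deny */
}
```

[Gerhard's questions about ranges, multiple users and groups would each
change the shape of this table, which is why they matter for the design.]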


Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 13:00:49 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> 
wrote:

> Andrew Morton wrote:
> > Well, it _is_ mysterious.
> >
> > Did you try to locate the code which failed?  I got lost in macros and
> > include files, and gave up very very easily.  Stop hiding, Ingo.
> >   
> 
> OK, I've managed to reproduce it.  Removing the local_irq_save/restore
> from sched_clock() makes it go away, as I'd expect (otherwise it would
> really be magic).

erm, why do you expect that?  A local_irq_save()/local_irq_restore() pair
shouldn't be affecting anything?

>  But given that it never seems to touch the softlockup
> during testing, I have no idea what difference it makes...

To what softlockup are you referring, and what does that have to do with
anything?




Re: [OOPS 2.6.21-rc7-mm1] kernel BUG at fs/sysfs/inode.c:272 (sysfs_drop_dentry)

2007-04-24 Thread Tejun Heo
On Tue, Apr 24, 2007 at 10:58:44AM -0700, Andrew Morton wrote:
> On Wed, 25 Apr 2007 02:51:43 +0900 Tejun Heo <[EMAIL PROTECTED]> wrote:
> 
> > Andrew Morton wrote:
> > > On Wed, 25 Apr 2007 01:33:59 +0900 Tejun Heo <[EMAIL PROTECTED]> wrote:
> > > 
> > >> Vincent Vanackere wrote:
> > >>> Hi,
> > >>>
> > >>> I'm getting the following oops at boot with the latest -mm kernel :
> > >>> ---
> > >>> kernel BUG at fs/sysfs/inode.c:272
> > >> Known problem.  Working on it.  Thanks.
> > >>
> > > 
> > > If it had been known by me I wouldn't have released known-buggy code to
> > > people who need to be able to test other people's code too. (argh)
> > > 
> > 
> > It's the problem Cornelia reported in the thread the patch was posted.
> 
> Is there a workaround?  What might happen if we just delete that BUG_ON()?

Okay, here's the workaround.  It leaks dentries and inodes if parent
is deleted first but other than that it should be okay.

diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index eea50a5..b466671 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -212,8 +212,11 @@ static struct dentry *sysfs_lookup_sd(struct sysfs_dirent *sd)
 	depth = sysfs_path_depth(sd);
 
 	while (depth--) {
-		/* negative intermediate node is a BUG */
-		BUG_ON(!dentry->d_inode);
+		/* XXX */ /* negative intermediate node is a BUG */
+		/* XXX */ /* BUG_ON(!dentry->d_inode); */
+		if (!dentry->d_inode)
+			return NULL;
+		/* XXX */
 
 		for (cur = sd, i = 0; i < depth; i++)
 			cur = cur->s_parent;
@@ -269,7 +272,7 @@ void sysfs_drop_dentry(struct sysfs_dirent *sd)
 	}
 
 	if (isdir) {
-		BUG_ON(!simple_empty(dentry));
+		/* XXX */ /* BUG_ON(!simple_empty(dentry)); */
 		drop_nlink(dir);
 		/* unpin if directory */
 		dput(dentry);


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Hugh Dickins
On Tue, 24 Apr 2007, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Hugh Dickins wrote:
> 
> > Or Christoph may prevail in persuading there's no such problem.
> 
> This is pointless. NUMA allocations can only be controlled for the highest 
> zone. If we switch to a lower zone then we allocate on a different zone 
> than the user requested.

Sorry, I'm not following you there.   Perhaps your comment is saying
that second patch is pointless, because it forces allocations elsewhere
than the policy requires?  Rather than commenting on the line you quote?

(When I said I didn't know if the patch would have unwanted side-effects
I was rather wondering what happens when you ask the allocator to do
some impossible combination.)

> If the system has both high memory and normal memory then only allocations 
> to highmemory are subject to memory policies etc etc. The block device 
> allocations would be in zone normal/dma and thus be exempt from NUMA 
> placement.

Please just point us to the line where sys_move_pages enforces this.

Hugh


Re: [Intel IOMMU][patch 1/8] ACPI support for Intel Virtualization Technology for Directed I/O

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 08:50:48PM +0200, Andi Kleen wrote:
> 
> > +
> > +LIST_HEAD(dmar_drhd_units);
> > +LIST_HEAD(dmar_rmrr_units);
> 
> Comment describing what lock protects those lists?
> In fact there seems to be no locking. What about hotplug?
> 

There is no support for IOMMU hotplug at this time. IOMMU hotplug requires
additional support via ACPI, which would need to be extended to handle this.

These definitions are scanned at boot time from BIOS tables. They are
pretty much static data that we process during boot, hence no locking is 
required. We pretty much treat this as read only, and the information never 
changes after the initial parsing.


Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 13:12:42 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> On Tue, 24 Apr 2007, Andrew Morton wrote:
> 
> > > A highmem page can have buffers???
> > 
> > yep.  Take a 4k page which is stored in four discontiguous 1k disk blocks. The
> > data at page_buffers(page) is the sole way in which we track which parts of
> > the page belong to which blocks of the disk.
> 
> But I see no use of kmap for buffer access? The data is not accessible.
> 

The kernel rarely has a need to actually read or write the page's contents
with the CPU.  On those occasions where it does, it will use kmap.  Search
for zero_user_page() and kmap in rc7-mm1's fs/buffer.c

But that's file pagecache, which can be in highmem.  File metadata is accessed
within the filesystems without kmapping.  This:

box:/usr/src/linux-2.6.21-rc7> grep '[-]>b_data' fs/*/*.c | wc -l 
1017

explains why that never got fixed.


suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)

2007-04-24 Thread Pavel Machek
Hi!

> > From subsequent emails, I think you already got your answer, but just 
> > in case...
> > 
> > Yes, if you enabled "Replace swsusp by default" and you already had it 
> > set up for getting swsusp to resume. If not, and you're using an 
> > initrd/ramfs, you'll need to modify it to echo
> > > /sys/power/suspend2/do_resume after /sys and /proc are mounted but
> > prior to mounting / and so on.
> 
> yeah, went with the default suggested by your patch:
> 
>CONFIG_SUSPEND2_REPLACE_SWSUSP=y
> 
> and it was pretty easy to set things up. I used "echo disk > 
> /sys/power/state" to trigger it.
> 
> In hindsight it was all pretty straightforward and suspend2 worked 
> beautifully on an UP and on an SMP system i tried. So in exchange for 
> suspend2 folks debugging a bug in CFS here's some suspend2 review 
> feedback ;) Any plans about moving suspend2 to the upstream kernel? It 
> should be pretty easy for it to co-exist with the current swsuspend 
> code.

Well, the current uswsusp code can do most of the stuff suspend2 can do, with
20% (or so) of the kernel code. 

The "major feature" that is missing is the ability to save 100% of memory if
it is all pagecache. I think that is not that important; we have a
200-line patch to do that, but no one was able to verify it is correct.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Jeremy Fitzhardinge
Jeremy Fitzhardinge wrote:
> Andrew Morton wrote:
>   
>> Well, it _is_ mysterious.
>>
>> Did you try to locate the code which failed?  I got lost in macros and
>> include files, and gave up very very easily.  Stop hiding, Ingo.
>>   
>> 
>
> OK, I've managed to reproduce it.  Removing the local_irq_save/restore
> from sched_clock() makes it go away, as I'd expect (otherwise it would
> really be magic).  But given that it never seems to touch the softlockup
> during testing, I have no idea what difference it makes...

And sched_clock's use of local_irq_save/restore appears to be absolutely
correct, so I think it must be triggering a bug in either the self-tests
or lockdep itself.

The only way I could actually extract the test code itself was to run
the whole thing through cpp+indent, but it doesn't shed much light.

It's also not clear to me if there are 6 independent failures, or if
they're a cascade.

J


Re: [PATCH] kthread: Enhance kthread_stop to abort interruptible sleeps

2007-04-24 Thread Oleg Nesterov
On 04/24, Eric W. Biederman wrote:
> 
> I don't know if this is the problem but it certainly needs to be fixed.

I guess you will re-submit these patches soon. May I suggest that you put
this

> + spin_lock_irq(&tsk->sighand->siglock);
> + signal_wake_up(tsk, 1);
> + spin_unlock_irq(&tsk->sighand->siglock);

and this

>  fastcall void recalc_sigpending_tsk(struct task_struct *t)
>  {
>   if (t->signal->group_stop_count > 0 ||
> - (freezing(t)) ||
> + (freezing(t)) || __kthread_should_stop(t) ||

into a separate patch?

Perhaps I am too paranoid, and most probably this change is good, but
I'm still afraid this very subtle change may break things. In that case
it would be easy to revert only that part (for example, for testing
purposes).

Consider,

	current->flags |= PF_NOFREEZE;

	while (!kthread_should_stop()) {

		begin_something();

		// I am a kernel thread, all signals are ignored.
		// I don't want to contribute to loadavg, so I am
		// waiting for the absolutely critical event in
		// TASK_INTERRUPTIBLE state.

		if (wait_event_interruptible(condition))
			panic("Impossible!");

		commit_something();
	}

Oleg.



Re: Pagecache: find_or_create_page does not call a proper page allocator function

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Hugh Dickins wrote:

> On Tue, 24 Apr 2007, Christoph Lameter wrote:
> > On Tue, 24 Apr 2007, Hugh Dickins wrote:
> > 
> > > Or Christoph may prevail in persuading there's no such problem.
> > 
> > This is pointless. NUMA allocations can only be controlled for the highest 
> > zone. If we switch to a lower zone then we allocate on a different zone 
> > than the user requested.
> 
> Sorry, I'm not following you there.   Perhaps your comment is saying
> that second patch is pointless, because it forces allocations elsewhere
> than the policy requires?  Rather than commenting on the line you quote?

NUMA allocation can only be controlled for the highest zone. If you switch 
to another zone then no control of the node is possible anymore.

> (When I said I didn't know if the patch would have unwanted side-effects
> I was rather wondering what happens when you ask the allocator to do
> some impossible combination.)

It falls back from HIGHMEM to NORMAL (on 32-bit NUMA that means node 0).
The result of your patch is that someone requiring an alloc on node 4 will
get memory on node 0.

> > If the system has both high memory and normal memory then only allocations 
> > to highmemory are subject to memory policies etc etc. The block device 
> > allocations would be in zone normal/dma and thus be exempt from NUMA 
> > placement.
> 
> Please just point us to the line where sys_move_pages enforces this.

It does not. It will happily move the pages into highmem. The filesystem 
cannot expect the page to remain in the same zone without holding a 
reference count. Swap may also move pages between zones. There is nothing 
special in what page migration does here.

I would say that the filesystem is broken if it has such expectations 
regardless of page migration.



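
[Editorial aside on Christoph's fallback remark. The toy model below restates
his point in code. It assumes a 32-bit NUMA machine where every node has
highmem but only node 0 has normal (low) memory, and it ignores memory
pressure entirely. enum demo_zone, DEMO_NODES and demo_alloc_node are
hypothetical names, not kernel interfaces.]

```c
#include <assert.h>

/* Hypothetical zone model: lowmem only on node 0, highmem on every node. */
enum demo_zone { DEMO_ZONE_NORMAL, DEMO_ZONE_HIGHMEM };

#define DEMO_NODES 8

/*
 * Return the node an allocation ends up on, given the node the caller
 * asked for and the highest zone the allocation is allowed to use.
 */
static int demo_alloc_node(int preferred_node, enum demo_zone highest_zone)
{
	assert(preferred_node >= 0 && preferred_node < DEMO_NODES);
	if (highest_zone == DEMO_ZONE_HIGHMEM)
		return preferred_node;	/* node placement can be honoured */
	return 0;			/* lowmem exists only on node 0 */
}
```

[In this model a request for node 4 that may not use highmem lands on node 0,
which is the outcome Christoph objects to.]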


Re: [Intel IOMMU][patch 7/8] Support for legacy ISA devices

2007-04-24 Thread Arjan van de Ven

Andi Kleen wrote:
> On Tuesday 24 April 2007 08:03:06 Ashok Raj wrote:
>> Floppy disk drivers don't work well with DMA remapping.
>
> What is the problem? You can't allocate mappings <16MB?

floppy doesn't use the DMA mapping API :)
that's the problem


Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 09:33:15PM +0200, Andi Kleen wrote:
> On Tuesday 24 April 2007 08:03:07 Ashok Raj wrote:
> > Some devices may not support entire 64bit DMA. In a situation where such 
> > devices are co-located in a shared domain, we need to ensure there is some 
> > address space reserved for such devices without the low addresses getting
> > depleted by other devices capable of handling high dma addresses.
> 
> Sorry, but you need to find some way to make this usually work without special
> options. Otherwise users will be unhappy.
> 
> An possible way would be to allocate space upside down from the limit of the
> device. Then the lower areas should be usually free.
> 
With PCIE there is some benefit to keeping DMA addresses low for performance
reasons, since it will use 32-bit Transaction Layer Packets instead of 64-bit
ones.

This reservation is only required if we have some legacy device under a p2p 
where it's required to share its address space with other devices. We could 
implement a default for when one is not specified, to keep things simple.
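
[Editorial aside on Andi's suggestion to allocate "upside down from the limit
of the device". The sketch below shows the idea on a toy address space of 16
page frames: each request searches downwards from the device's own limit, so
devices that can address high memory leave the low frames alone. It is an
illustration only, not the Intel IOMMU allocator. DEMO_NPAGES, demo_used and
demo_alloc_topdown are hypothetical.]

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NPAGES 16	/* hypothetical, tiny I/O virtual address space */

static bool demo_used[DEMO_NPAGES];

/*
 * Allocate one page frame at or below limit_pfn, searching downwards.
 * Returns the page frame number, or -1 if nothing is free.
 */
static int demo_alloc_topdown(int limit_pfn)
{
	int pfn;

	assert(limit_pfn >= 0);
	if (limit_pfn >= DEMO_NPAGES)
		limit_pfn = DEMO_NPAGES - 1;	/* clamp to the address space */
	for (pfn = limit_pfn; pfn >= 0; pfn--) {
		if (!demo_used[pfn]) {
			demo_used[pfn] = true;
			return pfn;
		}
	}
	return -1;
}
```

[The trade-off Ashok raises still applies: a device that could use low
addresses is pushed to high ones, which on PCIe means 64-bit rather than
32-bit addressing in its requests.]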


Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 13:24:24 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> 
wrote:

> Jeremy Fitzhardinge wrote:
> > Andrew Morton wrote:
> >   
> >> Well, it _is_ mysterious.
> >>
> >> Did you try to locate the code which failed?  I got lost in macros and
> >> include files, and gave up very very easily.  Stop hiding, Ingo.
> >>   
> >> 
> >
> > OK, I've managed to reproduce it.  Removing the local_irq_save/restore
> > from sched_clock() makes it go away, as I'd expect (otherwise it would
> > really be magic).  But given that it never seems to touch the softlockup
> > during testing, I have no idea what difference it makes...
> 
> And sched_clock's use of local_irq_save/restore appears to be absolutely
> correct, so I think it must be triggering a bug in either the self-tests
> or lockdep itself.

It's weird.  And I don't think the locking selftest code calls
sched_clock() (or any other time-related thing) at all, does it?

> The only way I could actually extract the test code itself was to run
> the whole thing through cpp+indent, but it doesn't shed much light.
> 
> It's also not clear to me if there are 6 independent failures, or if
> they're a cascade.

Oh well.  I'll restore the patches and when people hit problems we can
blame Ingo!



Re: [PATCH 3/7] Universal battery class

2007-04-24 Thread Anton Vorontsov
Hello Pavel,

On Tue, Apr 24, 2007 at 09:28:46PM +0200, Pavel Machek wrote:
> Hi!
> 
> > Signed-off-by: Anton Vorontsov <[EMAIL PROTECTED]>
> 
> Yes please. Generic battery support is badly needed.

I'm glad you like it, thanks for the review!

Also I've done some code split. It removes all #ifdefs from battery.c,
and also separates core from sysfs and leds.

If there are no objections, I'll send this version when the patch window
opens.

Here it is...


From: Anton Vorontsov <[EMAIL PROTECTED]>
Date: Fri, 20 Apr 2007 14:40:50 +0400
Subject: [PATCH] Universal battery class

Signed-off-by: Anton Vorontsov <[EMAIL PROTECTED]>
---
 Documentation/battery-class.txt |  150 +++
 drivers/Kconfig |2 +
 drivers/Makefile|1 +
 drivers/battery/Kconfig |   11 +++
 drivers/battery/Makefile|9 +++
 drivers/battery/battery.c   |  137 +++
 drivers/battery/battery.h   |   37 ++
 drivers/battery/battery_leds.c  |  106 +++
 drivers/battery/battery_sysfs.c |  122 +++
 include/linux/battery.h |  125 
 10 files changed, 700 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/battery-class.txt
 create mode 100644 drivers/battery/Kconfig
 create mode 100644 drivers/battery/Makefile
 create mode 100644 drivers/battery/battery.c
 create mode 100644 drivers/battery/battery.h
 create mode 100644 drivers/battery/battery_leds.c
 create mode 100644 drivers/battery/battery_sysfs.c
 create mode 100644 include/linux/battery.h

diff --git a/Documentation/battery-class.txt b/Documentation/battery-class.txt
new file mode 100644
index 000..6a7d591
--- /dev/null
+++ b/Documentation/battery-class.txt
@@ -0,0 +1,150 @@
+Linux battery class
+===================
+
+Synopsis
+~~~~~~~~
+Battery class used to export battery properties to userspace in consistent
+manner.
+
+It defines core set of battery attributes, available via sysfs, which
+should be applicable to (almost) every battery out there. Each attribute
+has well defined meaning, up to unit of measure used. While the attributes
+provided are believed to be universally applicable to any battery,
+specific monitoring hardware may not be able to provide them all, so
+any of them may be skipped.
+
+Battery class is extensible, and allows to define drivers own attributes.
+The core attribute set is subject to the standard Linux evolution (i.e.
+if it will be found that some attribute is applicable to many batteries
+or their drivers, it can be added to the core set).
+
+Battery class integrates with External Power framework, for the purpose of
+notification battery drivers when charging power is available. Note that
+specific charge control is left to the battery drivers.
+
+It also integrates with LED framework, for the purpose of providing
+typically expected (at least for portable devices) feedback of battery
+status (charging/fully charged) via LEDs. (Note that specific details of
+the indication (including whether to use it at all) are fully controllable
+by user and/or specific machine defaults, per design principles of LED
+framework).
+
+
+Attributes/properties
+~~~~~~~~~~~~~~~~~~~~~
+Battery class has predefined set of attributes, this eliminates code
+duplication across battery drivers. Battery class insist on reusing its
+predefined attributes *and* their units.
+
+So, userspace gets expected set of attributes and their units for
+any kind of battery, and can process/present them to a user in consistent
+manner. Results for different batteries and machines are also directly
+comparable.
+
+See drivers/battery/ds2760_battery.c for the example how to declare and
+handle attributes.
+
+
+Units
+~~~~~
+Quoting include/linux/battery.h:
+
+  All voltages, currents, charges, energies, time and temperatures in uV,
+  uA, uAh, uWh, seconds and tenths of degree Celsius unless otherwise
+  stated. It's driver's job to convert its raw values to units in which
+  this class operates.
+
+
+Attributes/properties detailed
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+~ ~ ~ ~ ~ ~ ~  Charge/Energy/Capacity - how to not confuse  ~ ~ ~ ~ ~ ~ ~
+~                                                                       ~
+~ Because both "charge" (uAh) and "energy" (uWh) represents "capacity"  ~
+~ of battery, battery class distinguish these terms. Don't mix them!    ~
+~                                                                       ~
+~ CHARGE_* attributes represents capacity in uAh only.                  ~
+~ ENERGY_* attributes represents capacity in uWh only.                  ~
+~ CAPACITY attribute represents capacity in *percents*, from 0 to 100.  ~
+~                                                                       ~
+~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
+
+Postfixes:
+_AVG - *hardware* averaged value, use it if your

Re: [ANNOUNCE] UidBind LSM 0.2

2007-04-24 Thread Casey Schaufler

--- Gerhard Mack <[EMAIL PROTECTED]> wrote:

> On Tue, 24 Apr 2007, Roberto De Ioris wrote:
> 
> > Hi all,
> > 
> > this is the second release for UidBind LSM:
> > 
> > http://projects.unbit.it/uidbind/
> > 
> > > UidBind allows calls to bind() only from the uid defined in a
> > > configfs tree.
> > > 
> > > It is now possible to specify different uids (for the same port) on
> > > different ipv4 addresses:
> > 
> > mkdir uidbind/8081
> > mkdir uidbind/8081/192.168.1.17
> > mkdir uidbind/8081/192.168.1.26
> > echo 1017 > uidbind/8081/192.168.1.17/uid
> > echo 1026 > uidbind/8081/192.168.1.26/uid
> > 
> > > This version also fixes some leaks in version 0.1
> > 
> > Patch attached is still for vanilla 2.6.20.7
> 
> Is it possible to specify ranges that allow everyone?  Is it possible to
> allow multiple users access to the same port?  Can ports be allowed by
> group?

If you're going to go beyond the simple owner access model it
probably makes sense to go all out, swipe the file system ACL
code and provide the whole nine yards of users, groups, and modes.
The only system that I know of that had socket ACLs was the 4.X
version of Trusted Irix, and socket ACLs were dropped in 5.0 because
they were unpopular.

If you're daring you could propose that low-numbered ports be treated
the same way as other ports, with the default ownership being root and
the default ACL allowing only root.
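
To make that ownership rule concrete, here is a minimal user-space sketch
in C. It is not taken from the UidBind patch: struct port_owner, may_bind()
and the 1024 cutoff for "low" ports are names and values assumed purely for
illustration, and leaving unlisted high ports open to everyone is an
assumption as well.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROOT_UID       0u
#define LOW_PORT_LIMIT 1024u   /* assumed cutoff for "low-numbered" ports */

/* Hypothetical rule: one (port, IPv4 address) pair is owned by one uid. */
struct port_owner {
    uint16_t port;
    uint32_t addr;   /* IPv4 address in host byte order */
    uint32_t uid;
};

/*
 * Decide whether `uid` may bind to (port, addr).
 * - If a rule exists for the pair, only its owner may bind.
 * - Otherwise low ports default to root ownership.
 * - Otherwise the port is treated as unowned (assumption: open to all).
 */
static bool may_bind(const struct port_owner *rules, size_t nrules,
                     uint32_t uid, uint16_t port, uint32_t addr)
{
    for (size_t i = 0; i < nrules; i++) {
        if (rules[i].port == port && rules[i].addr == addr)
            return rules[i].uid == uid;
    }
    if (port < LOW_PORT_LIMIT)
        return uid == ROOT_UID;
    return true;
}
```

With the two rules from the announcement (8081 on 192.168.1.17 owned by
uid 1017, and on 192.168.1.26 by uid 1026), may_bind() lets uid 1017 bind
only the first address, and port 80 stays root-only until a rule hands it
to someone else.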

> I really like the idea of this patch.  It has the potential to solve a lot
> of my current administrative headaches.

Putting access control on ports rather than sockets is a novel
approach. It is a lot simpler underneath and more consistent with
the way other object name spaces are treated.


Casey Schaufler
[EMAIL PROTECTED]


Re: [Intel IOMMU][patch 7/8] Support for legacy ISA devices

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 09:31:09PM +0200, Andi Kleen wrote:
> On Tuesday 24 April 2007 08:03:06 Ashok Raj wrote:
> > Floppy disk drivers don't work well with DMA remapping.
> 
> What is the problem? You can't allocate mappings <16MB?

No, these drivers don't call the DMA mapping APIs; that's the problem.

> 
> > It's possible to
> > extend the current use for x86_64, but the gain is very little. If someone
> > feels compelled to clean this up, it's up for grabs. Since these use 16M, we
> > just provide a unity map for the ISA bridge device.
> > 
> 
> While it's probably not worth it for the floppy, there are other devices
> with similar weird addressing limitations. Some generic handling of it
> would be nice.
> 

In the intro we outlined a way to handle this via a generic unity
map for all devices. We could do that, i.e.:

- implement a generic 1-1 map if the device is not calling the DMA APIs, and
dynamically dissociate it if the device does start using the DMA APIs.

For some of the address reservation as well, we could use set_dma_mask()
to ensure there is some DMA space. The problem is that some drivers may not
use the DMA APIs. It might also be difficult to handle a hotplugged device
that has a weird requirement.
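
As a rough illustration of the 1-1 map idea above, here is a small
user-space model in C. It is only a sketch, not the Intel IOMMU code:
struct model_dev, model_dma_map() and model_translate() are hypothetical
names, and capping the unity map at 16 MB is an assumption borrowed from
the ISA case discussed earlier in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNITY_LIMIT (16u << 20)   /* assumed: 16 MB, as in the ISA case */
#define MAX_MAPS    4

struct dma_range {
    uint64_t iova;   /* address the device uses */
    uint64_t phys;   /* physical address behind it */
    uint64_t len;
};

struct model_dev {
    bool uses_dma_api;               /* false: still on the 1-1 map */
    struct dma_range maps[MAX_MAPS];
    size_t nmaps;
};

/* Model of a driver calling the DMA API: the first call drops the 1-1 map. */
static bool model_dma_map(struct model_dev *d, uint64_t iova,
                          uint64_t phys, uint64_t len)
{
    if (d->nmaps == MAX_MAPS)
        return false;
    d->uses_dma_api = true;
    d->maps[d->nmaps++] = (struct dma_range){ iova, phys, len };
    return true;
}

/* Translate a device address; returns false if the access is blocked. */
static bool model_translate(const struct model_dev *d, uint64_t iova,
                            uint64_t *phys)
{
    if (!d->uses_dma_api) {
        if (iova >= UNITY_LIMIT)
            return false;
        *phys = iova;                /* unity map: address passes through */
        return true;
    }
    for (size_t i = 0; i < d->nmaps; i++) {
        const struct dma_range *r = &d->maps[i];
        if (iova >= r->iova && iova - r->iova < r->len) {
            *phys = r->phys + (iova - r->iova);
            return true;
        }
    }
    return false;
}
```

A device that never calls model_dma_map() keeps seeing addresses below
UNITY_LIMIT unchanged. After its first mapping call, only the ranges it
mapped explicitly translate, which is the "dynamically dissociate"
behaviour described above.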


Re: [PATCH 3/7] Universal battery class

2007-04-24 Thread Pavel Machek
Hi!
> > 
> > > Signed-off-by: Anton Vorontsov <[EMAIL PROTECTED]>
> > 
> > Yes please. Generic battery support is badly needed.
> 
> I'm glad you like it, thanks for the review!
> 
> I've also split the code up. This removes all #ifdefs from battery.c
> and separates the core from sysfs and LEDs.
> 
> If there are no objections, I'll send this version when the patch window
> opens.

Actually, you probably want to add some changelogs, and send it to
akpm _now_. He'll merge it into the -mm tree, and hopefully send it upstream
in the next window. Merging directly to Linus is for bigger projects
with git trees etc.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

