date:20070309

Re: [PATCH] dvb-core: Fix several locking related problems.

2007-03-09 Thread Johannes Stezenbach

On Sun, Mar 04, 2007 at 05:45:54PM +, Simon Arlott wrote:
> Fix several instances of dvb-core functions using mutex_lock_interruptible 
> and returning -ERESTARTSYS where the calling function will either never 
> retry or never check the return value.
> 
> These cause a race condition with dvb_dmxdev_filter_free and 
> dvb_dvr_release, both of which are filesystem release functions whose 
> return value is ignored and will never be retried. When this happens it 
> becomes impossible to open dvr0 again (-EBUSY) since it has not been 
> released properly.
> 
> Signed-off-by: Simon Arlott <[EMAIL PROTECTED]>

Acked-By: Johannes Stezenbach <[EMAIL PROTECTED]>

I can't test this but to me it looks good.
Mauro, could you please pick it up and keep it in the
linuxtv.org repository for a while for testing?


Thanks,
Johannes
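
The failure mode described in the patch summary can be modelled in userspace. The following is a minimal sketch in C, not the dvb-core code: `fake_dev`, `lock_interruptible()`, the `users` counter, and the local `ERESTARTSYS` definition are hypothetical stand-ins for a single-open device whose release handler's return value nobody checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* ERESTARTSYS is kernel-internal; defined here only for this model. */
#define ERESTARTSYS 512

/* Hypothetical stand-in for a device that allows a single opener. */
struct fake_dev {
	int users;		/* 0 = free, 1 = open */
	bool signal_pending;	/* models a signal arriving during release */
};

/* Models mutex_lock_interruptible(): fails when a signal is pending. */
static int lock_interruptible(const struct fake_dev *dev)
{
	return dev->signal_pending ? -EINTR : 0;
}

static int fake_open(struct fake_dev *dev)
{
	assert(dev != NULL);
	if (dev->users)
		return -EBUSY;
	dev->users = 1;
	return 0;
}

/*
 * Buggy variant: bails out before dropping the reference. A release
 * caller ignores the error and never retries, so users stays at 1.
 */
static int release_interruptible(struct fake_dev *dev)
{
	assert(dev != NULL);
	if (lock_interruptible(dev))
		return -ERESTARTSYS;
	dev->users = 0;
	return 0;
}

/* Fixed variant: models mutex_lock(), which always acquires the lock. */
static int release_uninterruptible(struct fake_dev *dev)
{
	assert(dev != NULL);
	dev->users = 0;
	return 0;
}
```

If a signal is pending when the buggy release runs, the next `fake_open()` returns -EBUSY, which matches the "impossible to open dvr0 again" symptom. The uninterruptible variant always completes the cleanup.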


> ---
> On 04/03/07 15:41, Andreas Oberritter wrote:
> >please send an updated patch together with
> >Signed-off-by line to Mauro <[EMAIL PROTECTED]> and ask him to apply
> >it for inclusion into the -mm tree for further testing.
> 
> Unless there are other -mm trees I've not heard about, presumably I should 
> just do this myself. Doesn't linux-dvb have its own development tree this 
> would get better tested in?
> 
> The dvb_dvr_release change has been working for me for 6 months and the 
> dvb_dmxdev_filter_free change looks equivalent.
> See http://www.linuxtv.org/pipermail/linux-dvb/2007-February/016120.html 
> for an example of the bug before and after fixing.
> 
> All the other changes run ok for me but should have lockdep enabled when 
> testing (if there's a possible deadlock somewhere, using _interruptible 
> will hide it).
> 
> drivers/media/dvb/dvb-core/dmxdev.c|   12 +++-
> drivers/media/dvb/dvb-core/dvb_demux.c |   21 +++--
> drivers/media/dvb/dvb-core/dvbdev.c|9 +++--
> 3 files changed, 13 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/media/dvb/dvb-core/dmxdev.c 
> b/drivers/media/dvb/dvb-core/dmxdev.c
> index fc77de4..a5c0e1a 100644
> --- a/drivers/media/dvb/dvb-core/dmxdev.c
> +++ b/drivers/media/dvb/dvb-core/dmxdev.c
> @@ -180,8 +180,7 @@ static int dvb_dvr_release(struct inode *inode, struct 
> file *file)
>   struct dvb_device *dvbdev = file->private_data;
>   struct dmxdev *dmxdev = dvbdev->priv;
> 
> - if (mutex_lock_interruptible(&dmxdev->mutex))
> - return -ERESTARTSYS;
> + mutex_lock(&dmxdev->mutex);
> 
>   if ((file->f_flags & O_ACCMODE) == O_WRONLY) {
>   dmxdev->demux->disconnect_frontend(dmxdev->demux);
> @@ -673,13 +672,8 @@ static int dvb_demux_open(struct inode *inode, struct 
> file *file)
> static int dvb_dmxdev_filter_free(struct dmxdev *dmxdev,
> struct dmxdev_filter *dmxdevfilter)
> {
> - if (mutex_lock_interruptible(&dmxdev->mutex))
> - return -ERESTARTSYS;
> -
> - if (mutex_lock_interruptible(&dmxdevfilter->mutex)) {
> - mutex_unlock(&dmxdev->mutex);
> - return -ERESTARTSYS;
> - }
> + mutex_lock(&dmxdev->mutex);
> + mutex_lock(&dmxdevfilter->mutex);
> 
>   dvb_dmxdev_filter_stop(dmxdevfilter);
>   dvb_dmxdev_filter_reset(dmxdevfilter);
> diff --git a/drivers/media/dvb/dvb-core/dvb_demux.c 
> b/drivers/media/dvb/dvb-core/dvb_demux.c
> index fcff5ea..6d8d1c3 100644
> --- a/drivers/media/dvb/dvb-core/dvb_demux.c
> +++ b/drivers/media/dvb/dvb-core/dvb_demux.c
> @@ -673,8 +673,7 @@ static int dmx_ts_feed_stop_filtering(struct 
> dmx_ts_feed *ts_feed)
>   struct dvb_demux *demux = feed->demux;
>   int ret;
> 
> - if (mutex_lock_interruptible(&demux->mutex))
> - return -ERESTARTSYS;
> + mutex_lock(&demux->mutex);
> 
>   if (feed->state < DMX_STATE_GO) {
>   mutex_unlock(&demux->mutex);
> @@ -748,8 +747,7 @@ static int dvbdmx_release_ts_feed(struct dmx_demux *dmx,
>   struct dvb_demux *demux = (struct dvb_demux *)dmx;
>   struct dvb_demux_feed *feed = (struct dvb_demux_feed *)ts_feed;
> 
> - if (mutex_lock_interruptible(&demux->mutex))
> - return -ERESTARTSYS;
> + mutex_lock(&demux->mutex);
> 
>   if (feed->state == DMX_STATE_FREE) {
>   mutex_unlock(&demux->mutex);
> @@ -916,8 +914,7 @@ static int dmx_section_feed_stop_filtering(struct 
> dmx_section_feed *feed)
>   struct dvb_demux *dvbdmx = dvbdmxfeed->demux;
>   int ret;
> 
> - if (mutex_lock_interruptible(&dvbdmx->mutex))
> - return -ERESTARTSYS;
> + mutex_lock(&dvbdmx->mutex);
> 
>   if (!dvbdmx->stop_feed) {
>   mutex_unlock(&dvbdmx->mutex);
> @@ -942,8 +939,7 @@ static int dmx_section_feed_release_filter(struct 
> dmx_section_feed *feed,
>   struct dvb_demux_feed *dvbdmxfeed = (struct dvb_demux_feed *)feed;
>   struct dvb_demux *dvbdmx = dvbdmxfeed->demux;
> 
> - if (mutex_lock_interruptible(&dvbdmx->mutex))
> - return -ERESTARTSYS;

Re: [PATCH] proc: maps protection

2007-03-09 Thread Andy Isaacson

On Tue, Mar 06, 2007 at 08:22:11PM -0800, Arjan van de Ven wrote:
> > How about using a reduced check, as is done for fd and environ?  This 
> > would allow root-running system monitors to still do their job.  
> > Effectively, this changes the test from "is ptracing" to just "can 
> > ptrace".
> > 
> > If this still isn't considered safe, I'll add the maps_protect file...
> 
> btw I consider it an information leak that any user can see which
> files/libraries any other user and root has mmap'd. (and with glibc's
> stdio mmap feature that goes even beyond direct mmap to fopen()'d).
> 
> If root or some other user wants to watch
> hillary-vs-obama-in-the-mud.avi, no other user has ANY business even
> seeing that. So at minimum it's a privacy issue showing the filenames...

Sure, I would be fine with making /proc//maps mode 0400 (so long as
root can still read it).  But please don't take away my ability to debug
applications using pmap(1).  It's so incredibly useful to be able to say
"ah, look, he mapped nss_nis.so rather than nss_nisplus.so".

-andy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] rcfs core patch

2007-03-09 Thread Paul Jackson

Herbert wrote:
> personally, I'd prefer to avoid hierarchical
> structures wherever possible,

Sure - avoid them if you like.  But sometimes they work out rather
well.  And file system API's are sometimes the best fit for them.

I'm all for choosing the simplest API topology that makes sense.

But encoding higher dimension topologies into lower dimension API's,
just because they seem "simpler" results in obfuscation, convolution
and obscurity, which ends up costing everyone more than getting the
natural fit.

"Make everything as simple as possible, but not simpler."
--  Albert Einstein

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Con Kolivas

On Saturday 10 March 2007 11:49, Matt Mackall wrote:
> On Sat, Mar 10, 2007 at 11:34:26AM +1100, Con Kolivas wrote:
> > Ok, so some of the basics then. Can you please give me the output of 'top
> > -b' running for a few seconds during the whole affair?
>
> Here you go:
>
> http://selenic.com/baseline
> http://selenic.com/underload
>
> This is with 2.6.20+rsdl+tickfix at HZ=250.
>
> Something I haven't mentioned about my setup is that I'm using ccache.
> And it turns out disabling ccache makes a large difference. Going to
> switch back to a NO_HZ kernel and see what that looks like.

Your X is reniced to -10 so try again with X nice 0 please.

-- 
-ck

[PATCH, take3] VFS : Delay the dentry name generation on sockets and pipes.

2007-03-09 Thread Eric Dumazet

Hi Andrew

Please find take3 of this patch: Linus suggested introducing a helper
function to factor out the work done by most d_dname() implementations.

Thank you

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all 
pipes/sockets) may need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen);

2) Adds a dynamic_dname() helper function that eases d_dname() implementations

3) Defines d_dname method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

4) Defines d_dname method for pipes : No more sprintf() at pipe creation.
This is delayed up to the moment someone does an access to /proc/pid/fd/...
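
To make the d_dname() contract concrete, here is a minimal userspace sketch in C of what a helper like dynamic_dname() has to do: format the name, then place it at the end of the caller's buffer and return a pointer to its first character. The name `format_at_end()`, the 64-byte scratch size, and the NULL-on-overflow convention are assumptions for illustration; the kernel helper takes a dentry and may report errors differently.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace model: format a short name into a scratch
 * buffer, copy it (including the trailing NUL) to the END of 'buffer',
 * and return a pointer to its first character. Returns NULL if the
 * formatted name does not fit.
 */
static char *format_at_end(char *buffer, int buflen, const char *fmt, ...)
{
	char temp[64];
	va_list args;
	int sz;

	assert(buffer != NULL && buflen > 0);

	va_start(args, fmt);
	sz = vsnprintf(temp, sizeof(temp), fmt, args) + 1; /* count the NUL */
	va_end(args);

	if (sz <= 0 || sz > (int)sizeof(temp) || sz > buflen)
		return NULL;

	buffer += buflen - sz;
	return memcpy(buffer, temp, (size_t)sz);
}
```

Placing the name at the tail matters because a path builder that works right to left can keep prepending components in front of whatever it is handed.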

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 GHz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Acked-by: Linus Torvalds <[EMAIL PROTECTED]>
--- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 
+0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h  2007-03-09 20:08:36.0 
+0100
@@ -133,6 +133,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char *(*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
@@ -293,6 +294,11 @@ extern struct dentry * d_hash_and_lookup
 /* validate "insecure" dentry pointer */
 extern int d_validate(struct dentry *, struct dentry *);
 
+/*
+ * helper function for dentry_operations.d_dname() members
+ */
+extern char *dynamic_dname(struct dentry *, char *, int, const char *, ...);
+
 extern char * d_path(struct dentry *, struct vfsmount *, char *, int);
   
 /* Allocation counts.. */
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt  2007-03-08 
10:14:38.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt   2007-03-09 
20:08:36.0 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char *(*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,26 @@ struct dentry_operations {
VFS calls iput(). If you define this method, you must call
iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+   Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay
+   pathname generation. (Instead of doing it when dentry is created,
+   it's done only when the path is needed.) Real filesystems probably
+   don't want to use it, because their dentries are present in global
+   dcache hash, so their hash should be an invariant. As no lock is
+   held, d_dname() should not try to modify the dentry itself, unless
+   appropriate SMP safety is used. CAUTION : d_path() logic is quite
+   tricky. The correct way to return for example "Hello" is to put it
+   at the end of the buffer, and return a pointer to the first char.
+   dynamic_dname() helper function is provided to take care of this.
+
+Example :
+
+static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+   return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
+   dentry->d_inode->i_ino);
+}
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking  2007-03-08 
10:29:04.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking   2007-03-08 
12:08:56.0 +0100
@@ -15,6 +15,7 @@ prototypes:
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char *(*d_dname)(struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
none hav

Re: [PATCH] dio: invalidate clean pages before dio write

2007-03-09 Thread Benjamin LaHaise

On Fri, Mar 09, 2007 at 02:35:57PM -0800, Zach Brown wrote:
> + if (rw == WRITE && mapping->nrpages) {
> + int err = invalidate_inode_pages2_range(mapping,
> +   offset >> PAGE_CACHE_SHIFT, end);
> + if (err && retval >= 0)
> + retval = err;
> + }

I don't think reporting the error is the correct thing to do in the presence 
of the write having completed.  It's a race that the caller can do nothing 
about and is arguably a kernel bug, so I'd rather do something like:

	if (err) {
		if (!retval)
			retval = err;
		else
			printk_ratelimited(KERN_DEBUG
				"dio sucks and hit the race %ld %ld\n",
				retval, err);
	}

Aside from that, I much prefer this approach to fix the problem than going 
around and changing semantics.  Feel free to add my Signed-off-by.
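
The precedence rule suggested above can be checked in isolation. This is a minimal userspace sketch in C: `merge_dio_result()` is a hypothetical name, and `fprintf()` stands in for the rate-limited kernel log call. It only illustrates which value wins, not the direct-I/O code itself.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: 'retval' is the result of the write so far
 * (bytes written, 0, or a negative errno); 'err' is the result of the
 * later invalidation (0 or a negative errno). The invalidation error is
 * reported only if the write has produced no result of its own;
 * otherwise it is just logged.
 */
static long merge_dio_result(long retval, int err)
{
	assert(err <= 0);

	if (err) {
		if (!retval)
			retval = err;
		else
			fprintf(stderr,
				"invalidate failed (%d), keeping result %ld\n",
				err, retval);
	}
	return retval;
}
```

With this rule a completed write of 4096 bytes still returns 4096 when invalidation fails, while a write that returned 0 reports the invalidation error.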

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Matt Mackall

On Sat, Mar 10, 2007 at 12:02:25PM +1100, Con Kolivas wrote:
> On Saturday 10 March 2007 09:12, Con Kolivas wrote:
> > On Saturday 10 March 2007 08:57, Willy Tarreau wrote:
> > > On Fri, Mar 09, 2007 at 03:39:59PM -0600, Matt Mackall wrote:
> > > > On Sat, Mar 10, 2007 at 08:19:18AM +1100, Con Kolivas wrote:
> > > > > On Saturday 10 March 2007 08:07, Con Kolivas wrote:
> > > > > > On Saturday 10 March 2007 07:46, Matt Mackall wrote:
> 
> > > > 5x memload: good
> > > > 5x execload: good
> > > > 5x forkload: good
> > > > 5 parallel makes: mostly good
> > > > make -j 5: bad
> > > >
> > > > So what's different between makes in parallel and make -j 5? Make's
> > > > job server uses pipe I/O to control how many jobs are running.
> > >
> > > Matt, could you check with plain 2.6.20 + Con's patch ? It is possible
> > > that he added bugs when porting to -mm, or that something in -mm causes
> > > the trouble. Your experience with -mm seems so much different from mine
> > > with mainline, there must be a difference somewhere !
> >
> > Good idea.
> 
> It's all very odd Matt. It really isn't behaving anything like you describe 
> for myself or others. It sounds more like a real bug than what the design 
> would do at all. The only things that are different on yours are Beryl and a 
> different graphics card. When you're comparing to mainline are you 
> comparing -mm1 to -mm2 to ensure something else from -mm isn't responsible? 
> Also have you tried rsdl on 2.6.20 as Willy suggested?

Haven't tried -mm2. So far I've tried 2.6.21-rc2-mm2 (aka 'stock'),
2.6.21-rc3-mm1, and 2.6.20+rsdl.

I also did a test with Metacity and saw the same issues under load. So
I think Beryl is not part of the problem.

Right now it's looking like the problem is caused by ccache. Disabling
ccache with 2.6.21-rc2-mm1+tickfix+noyield lets me run make -j 5
acceptably. So my new column would be:

RSDL+NO_HZ+tickfix+noyield+noccache
make -j 5
 beryl      ok/good
 galeon     ok/good
 mp3        good
 terminal   good
 mouse      good

So it's about on par with 2.6.20, maybe slightly better.

I suspect ccache lock contention is somehow involved though it doesn't
explain why my 5 independent makes test beats out make -j 5.
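
The difference between make -j 5 and five independent makes comes down to make's jobserver, which hands out job slots as single-byte tokens through a shared pipe. A minimal C sketch of that token protocol follows. The function names are hypothetical, and the non-blocking read is a simplification chosen so the sketch never stalls; it is not GNU make's implementation.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create the token pipe and preload it with jobs - 1 tokens; the
 * caller's own job is the implicit extra slot. Returns 0 on success.
 */
static int jobserver_init(int fds[2], int jobs)
{
	int i, flags;

	assert(jobs >= 1);
	if (pipe(fds) < 0)
		return -1;

	/* Simplification: non-blocking reads so an empty pipe returns at once. */
	flags = fcntl(fds[0], F_GETFL);
	if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) < 0)
		return -1;

	for (i = 0; i < jobs - 1; i++)
		if (write(fds[1], "+", 1) != 1)
			return -1;
	return 0;
}

/* Take one token before starting an extra job; 1 on success, 0 if none. */
static int jobserver_try_acquire(int fds[2])
{
	char token;

	return read(fds[0], &token, 1) == 1;
}

/* Put the token back when the job finishes. */
static int jobserver_release(int fds[2])
{
	return write(fds[1], "+", 1) == 1 ? 0 : -1;
}
```

Every job start and finish therefore involves a pipe read or write and, with blocking reads, a pipe wake-up. Independent makes do not share such a pipe.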

-- 
Mathematics is the supreme nostalgia of our time.

Re: [PATCH 1/2] rcfs core patch

2007-03-09 Thread Herbert Poetzl

On Fri, Mar 09, 2007 at 11:27:07PM +0530, Srivatsa Vaddagiri wrote:
> On Fri, Mar 09, 2007 at 01:38:19AM +0100, Herbert Poetzl wrote:
> > > 2) you allow a task to selectively reshare namespaces/subsystems with
> > >another task, i.e. you can update current->task_proxy to point to
> > >a proxy that matches your existing task_proxy in some ways and the
> > >task_proxy of your destination in others. In that case a trivial
> > >implementation would be to allocate a new task_proxy and copy some
> > >pointers from the old task_proxy and some from the new. But then
> > >whenever a task moves between different groupings it acquires a
> > >new unique task_proxy. So moving a bunch of tasks between two
> > >groupings, they'd all end up with unique task_proxy objects with
> > >identical contents.

> > this is exactly what Linux-VServer does right now, and I'm
> > still not convinced that the nsproxy really buys us anything
> > compared to a number of different pointers to various spaces
> > (located in the task struct)

> Are you saying that the current scheme of storing pointers to
> different spaces (uts_ns, ipc_ns etc) in nsproxy doesn't buy
> anything?

> Or are you referring to storage of pointers to resource 
> (name)spaces in nsproxy doesn't buy anything?

> In either case, doesn't it buy speed and storage space?

let's do a few examples here, just to illustrate the
advantages and disadvantages of nsproxy as separate
structure over nsproxy as part of the task_struct

1) typical setup, 100 guests as shell servers, 5
   tasks each when unused, 10 tasks when used, 10%
   used on average

   a) separate nsproxy, we need at least 100
  structs to handle that (saves some space)

  we might end up with ~500 nsproxies, if
  the shell clones a new namespace (so might
  not save that much space)

  we do a single inc/dec when the nsproxy
  is reused, but do the full N inc/dec when
  we have to copy an nsproxy (might save
  some refcounting)

  we need to do the indirection step, from
  task to nsproxy to space (and data)

   b) we have ~600 tasks with 600 times the
  nsproxy data (uses up some more space)

  we have to do the full N inc/dec when
  we create a new task (more refcounting)

  we do not need to do the indirection, we
  access spaces directly from the 'hot'
  task struct (makes hot paths quite fast)

   so basically we trade a little more space and
   overhead on task creation for having no 
   indirection to the data accessed quite often
   throughout the task's life (hopefully)

2) context migration: for whatever reason, we decide
   to migrate a task into a subset (space mix) of a
   context 1000 times

   a) separate nsproxy, we need to create a new one
  consisting of the 'new' mix, which will

  - allocate the nsproxy struct
  - inc refcounts to all copied spaces
  - inc refcount nsproxy and assign to task
  - dec refcount existing task nsproxy

  after task completion
  - dec nsproxy refcount
  - dec refcounts for all spaces  
  - free up nsproxy struct

   b) nsproxy data in task struct

  - inc/dec refcounts to changed spaces

  after task completion
  - dec refcounts to spaces

   so here we gain nothing with the nsproxy, unless
   the chosen subset is identical to the one already
   used, where we end up with a single refcount 
   instead of N 
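
The step lists in example 2 can be tallied mechanically. The sketch below, in C, counts refcount operations and allocations exactly as enumerated in the mail for one migration plus task exit. The struct and function names are hypothetical, and the numbers are a count of listed steps, not a measurement.

```c
#include <assert.h>

/* Hypothetical tally of the work done for one migration plus task exit. */
struct migrate_cost {
	int refcount_ops;
	int allocs;
};

/* Case a): separate nsproxy, a new one is built for the new mix. */
static struct migrate_cost cost_separate_nsproxy(int nspaces)
{
	struct migrate_cost c = { 0, 0 };

	assert(nspaces > 0);
	c.allocs = 1;			/* allocate the nsproxy struct */
	c.refcount_ops += nspaces;	/* inc refcounts of all copied spaces */
	c.refcount_ops += 1;		/* inc new nsproxy, assign to task */
	c.refcount_ops += 1;		/* dec the task's existing nsproxy */
	/* after task completion */
	c.refcount_ops += 1;		/* dec nsproxy refcount */
	c.refcount_ops += nspaces;	/* dec refcounts of all spaces */
	return c;
}

/* Case b): space pointers live directly in the task struct. */
static struct migrate_cost cost_embedded(int nspaces, int changed)
{
	struct migrate_cost c = { 0, 0 };

	assert(nspaces > 0 && changed >= 0 && changed <= nspaces);
	c.refcount_ops += 2 * changed;	/* inc new and dec old, per changed space */
	/* after task completion */
	c.refcount_ops += nspaces;	/* dec refcounts of all spaces */
	return c;
}
```

For N spaces with k of them changed, this gives 2N + 3 operations and one allocation for the separate nsproxy, against 2k + N operations and no allocation for the embedded layout.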

> > I'd prefer to do accounting (and limits) in a very simple
> > and especially performant way, and the reason for doing
> > so is quite simple:

> Can you elaborate on the relationship between data structures
> used to store those limits to the task_struct?

sure it is one to many, i.e. each task points to
exactly one context struct, while a context can
consist of zero, one or many tasks (no back- 
pointers there)

> Does task_struct store pointers to those objects directly?

it contains a single pointer to the context struct, 
and that contains (as a substruct) the accounting
and limit information
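
The layout described here can be sketched as follows. This is a minimal illustration in C with hypothetical type and field names, not the Linux-VServer data structures: each task holds one pointer to its context, the context embeds the accounting and limit data, and there are no back-pointers from context to tasks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accounting and limit data, embedded in the context. */
struct ctx_accounting {
	unsigned long nr_tasks;		/* current usage */
	unsigned long max_tasks;	/* limit */
};

/* One context serves zero, one, or many tasks; no list of tasks is kept. */
struct context {
	int id;
	struct ctx_accounting acct;
};

/* Each task points to exactly one context. */
struct task {
	struct context *ctx;
};

/* Charge the context for a new task; one pointer dereference to check. */
static int task_attach(struct task *t, struct context *ctx)
{
	assert(t != NULL && ctx != NULL);
	if (ctx->acct.nr_tasks >= ctx->acct.max_tasks)
		return -1;
	ctx->acct.nr_tasks++;
	t->ctx = ctx;
	return 0;
}

static void task_detach(struct task *t)
{
	assert(t != NULL && t->ctx != NULL);
	t->ctx->acct.nr_tasks--;
	t->ctx = NULL;
}
```

Because the limits sit inside the context struct, a limit check costs a single indirection from the task.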

HTC,
Herbert

> -- 
> Regards,
> vatsa
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.osdl.org/mailman/listinfo/containers

Re: Pluggable Schedulers (was: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler)

2007-03-09 Thread William Lee Irwin III

On Fri, Mar 09, 2007 at 05:18:31PM -0500, Ryan Hope wrote:
> from what I understood, there is a performance loss in plugsched
> schedulers because they have to share code
> even if pluggable schedulers is not a viable option, being able to
> choose which one was built into the kernel would be easy (only takes a
> few ifdefs), i too think competition would be good

Neither sharing code nor data structures is strictly necessary for a
pluggable scheduler. For instance, backing out per-cpu runqueues in
favor of a single locklessly-accessed queue or similar per-leaf-domain
queues is one potential design alternative (never mind difficulties
with ->cpus_allowed) explicitly considered for the sake of sched_yield()
semantics on SMP, among other concerns. What plugsched originally did
was to provide a set of driver functions and allow each scheduler to
play with its private data declared static in separate C files in what
were later intended to become kernel modules. As far as I know, runtime
switchover code to complement all that has never been written in such a
form. One possibility abandoned early-on was to have multiple schedulers
simultaneously active to manage different portions of the system with
different policies, in no small part due to the difficulty of load
balancing between the partitions associated with the different schedulers.
Some misguided attempts were made to export the lowest-level API possible,
which I rather quickly deemed a mistake, but they still largely held to
the design considerations I described above.

-- wli

Re: [3/6] 2.6.21-rc2: known regressions

2007-03-09 Thread Mathieu Bérard

Jeff Garzik a écrit :
> Adrian Bunk wrote:
>> Subject: NCQ problem with ahci and Hitachi drive
>> References : http://lkml.org/lkml/2007/3/4/178
>> Submitter  : Mathieu Bérard <[EMAIL PROTECTED]>
>> Status : unknown
>
> according to the last message in that thread, it sounds like ACPI and
> interrupt problems
>
Hi,
after more testing with 2.6.21-rc3, it appears that after several ATA
errors the boot process somehow continued as normal, following an "NCQ
disabled due to excessive errors" message.
The "pci=noacpi" or "noacpi" parameters work around the problem; "irqpoll"
does nothing.

lspci:
00:00.0 Host bridge: Intel Corporation Mobile 915GM/PM/GMS/910GML
Express Processor to DRAM Controller (rev 03)
00:01.0 PCI bridge: Intel Corporation Mobile 915GM/PM Express PCI
Express Root Port (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) PCI Express Port 2 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB UHCI #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB UHCI #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB UHCI #3 (rev 03)
00:1d.3 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB UHCI #4 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB2 EHCI Controller (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev d3)
00:1e.2 Multimedia audio controller: Intel Corporation
82801FB/FBM/FR/FW/FRW (ICH6 Family) AC'97 Audio Controller (rev 03)
00:1e.3 Modem: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family)
AC'97 Modem Controller (rev 03)
00:1f.0 ISA bridge: Intel Corporation 82801FBM (ICH6M) LPC Interface
Bridge (rev 03)
00:1f.1 IDE interface: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) IDE Controller (rev 03)
00:1f.2 IDE interface: Intel Corporation 82801FBM (ICH6M) SATA
Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family)
SMBus Controller (rev 03)
01:00.0 VGA compatible controller: ATI Technologies Inc M24 [Radeon
Mobility X600]
06:01.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)
06:02.0 Network controller: Intel Corporation PRO/Wireless 2200BG
Network Connection (rev 05)
06:04.0 CardBus bridge: Texas Instruments PCIxx21/x515 Cardbus Controller
06:04.2 FireWire (IEEE 1394): Texas Instruments OHCI Compliant IEEE 1394
Host Controller
06:04.3 Mass storage controller: Texas Instruments PCIxx21 Integrated
FlashMedia Controller
06:04.4 Generic system peripheral [0805]: Texas Instruments
PCI6411/6421/6611/6621/7411/7421/7611/7621 Secure Digital Controller

/proc/interrupts:
CPU0  
  0:   3242   IO-APIC-edge      timer
  1:    863   IO-APIC-edge      i8042
  8:      3   IO-APIC-edge      rtc
  9:      1   IO-APIC-fasteoi   acpi
 12:    116   IO-APIC-edge      i8042
 14:    128   IO-APIC-edge      libata
 15:      0   IO-APIC-edge      libata
 16:      1   IO-APIC-fasteoi   uhci_hcd:usb4, yenta
 17:      0   IO-APIC-fasteoi   tifm_7xx1, Intel ICH6
 18:    249   IO-APIC-fasteoi   eth0
 19:   2712   IO-APIC-fasteoi   libata, uhci_hcd:usb2, sdhci:slot0,
sdhci:slot1, sdhci:slot2
 20:     47   IO-APIC-fasteoi   uhci_hcd:usb1, ehci_hcd:usb5
 21:      3   IO-APIC-fasteoi   uhci_hcd:usb3, ohci1394
 22:      1   IO-APIC-fasteoi   ipw2200
NMI:  0
LOC:  15767
ERR:  0
MIS:  0

/proc/interrupts with pci=noacpi:
CPU0  
  0:   2886    XT-PIC-XT    timer
  1:     79    XT-PIC-XT    i8042
  2:      0    XT-PIC-XT    cascade
  8:      3    XT-PIC-XT    rtc
  9:      1    XT-PIC-XT    acpi
 10:      1    XT-PIC-XT    uhci_hcd:usb4, tifm_7xx1, yenta,
sdhci:slot0, sdhci:slot1, sdhci:slot2, Intel ICH6
 11:   3415    XT-PIC-XT    eth0, libata, uhci_hcd:usb1,
uhci_hcd:usb2, uhci_hcd:usb3, ehci_hcd:usb5, ohci1394, ipw2200
 12:    116    XT-PIC-XT    i8042
 14:    129    XT-PIC-XT    libata
 15:      0    XT-PIC-XT    libata
NMI:  0
LOC:   6594
ERR:  0
MIS:  0

Full 2.6.21-rc3 boot log:
[0.00] Linux version 2.6.21-rc3 ([EMAIL PROTECTED]) (gcc version 4.1.2
(Ubuntu 4.1.2-0ubuntu4)) #1 PREEMPT Fri Mar 9 01:54:11 CET 2007
[0.00] BIOS-provided physical RAM map:
[0.00] sanitize start
[0.00] sanitize end
[0.00] copy_e820_map() start:  size:
0009f800 end: 0009f800 type: 1
[0.00] copy_e820_map() type is E820_RAM
[0.00] copy_e820_map() start: 0009f800 size:
0800 end: 000a type: 2
[0.00] copy_e820_map() start: 000d2000 size:
2000 end: 000d4000 type: 2
[0.0

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Con Kolivas

On Saturday 10 March 2007 09:12, Con Kolivas wrote:
> On Saturday 10 March 2007 08:57, Willy Tarreau wrote:
> > On Fri, Mar 09, 2007 at 03:39:59PM -0600, Matt Mackall wrote:
> > > On Sat, Mar 10, 2007 at 08:19:18AM +1100, Con Kolivas wrote:
> > > > On Saturday 10 March 2007 08:07, Con Kolivas wrote:
> > > > > On Saturday 10 March 2007 07:46, Matt Mackall wrote:

> > > 5x memload: good
> > > 5x execload: good
> > > 5x forkload: good
> > > 5 parallel makes: mostly good
> > > make -j 5: bad
> > >
> > > So what's different between makes in parallel and make -j 5? Make's
> > > job server uses pipe I/O to control how many jobs are running.
> >
> > Matt, could you check with plain 2.6.20 + Con's patch ? It is possible
> > that he added bugs when porting to -mm, or that something in -mm causes
> > the trouble. Your experience with -mm seems so much different from mine
> > with mainline, there must be a difference somewhere !
>
> Good idea.

It's all very odd Matt. It really isn't behaving anything like you describe 
for myself or others. It sounds more like a real bug than what the design 
would do at all. The only things that are different on yours are Beryl and a 
different graphics card. When you're comparing to mainline are you 
comparing -mm1 to -mm2 to ensure something else from -mm isn't responsible? 
Also have you tried rsdl on 2.6.20 as Willy suggested? I would really love to 
get to the bottom of this as it really shouldn't behave that way under load 
no matter how the load is dished out.

Thanks!

-- 
-ck

Re: [PATCH] Use more gcc extensions in the Linux headers

2007-03-09 Thread Jan Engelhardt


On Mar 10 2007 09:57, Rusty Russell wrote:
>On Fri, 2007-03-09 at 16:56 +1100, Rusty Russell wrote:
>> __builtin_types_compatible_p() has been around since gcc 2.95, and we
>> don't use it anywhere.  This patch quietly fixes that.
>
>OK, many people complained that it needed a comment.  Good point!
>==
>Add comment to ARRAY_SIZE macro.
>
>Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
>
>diff -r 933e410f204f include/linux/kernel.h
>--- a/include/linux/kernel.h   Sat Mar 10 09:55:31 2007 +1100
>+++ b/include/linux/kernel.h   Sat Mar 10 09:55:53 2007 +1100
>@@ -35,6 +35,7 @@ extern const char linux_proc_banner[];
> #define ALIGN(x,a)__ALIGN_MASK(x,(typeof(x))(a)-1)
> #define __ALIGN_MASK(x,mask)  (((x)+(mask))&~(mask))
> 
>+/* GCC is awesome. */
> #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])   
>   \
>   + sizeof(typeof(int[1 - 2*!!__builtin_types_compatible_p(typeof(arr), \
>typeof(&arr[0]))]))*0)

Getting back at the macro, how would you like to have it merged?
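
For readers who want to see what the macro buys, here is a self-contained sketch in C (GCC or Clang) of the same trick. `MUST_BE_ARRAY` and `count_table_entries()` are hypothetical names introduced here, and `__typeof__` is used in place of `typeof` so the sketch also builds in strict ISO modes. It is an illustration of the technique, not the form that was merged.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Evaluates to 0 for a real array. For a pointer, typeof(arr) and
 * typeof(&arr[0]) are compatible, the array size below becomes -1,
 * and compilation fails.
 */
#define MUST_BE_ARRAY(arr) \
	(sizeof(__typeof__(int[1 - 2 * !!__builtin_types_compatible_p( \
		__typeof__(arr), __typeof__(&(arr)[0]))])) * 0)

#define ARRAY_SIZE(arr) \
	(sizeof(arr) / sizeof((arr)[0]) + MUST_BE_ARRAY(arr))

static size_t count_table_entries(void)
{
	static const int table[7] = { 0 };
	size_t n = ARRAY_SIZE(table);

	/* const int *p = table; ARRAY_SIZE(p) would not compile. */
	assert(n * sizeof(table[0]) == sizeof(table));
	return n;
}
```

The extra term always contributes 0 at run time; its only job is to turn the classic mistake of passing a pointer into a build error.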



Jan
-- 

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Matt Mackall

On Sat, Mar 10, 2007 at 11:34:26AM +1100, Con Kolivas wrote:
> On Saturday 10 March 2007 09:29, Matt Mackall wrote:
> > On Sat, Mar 10, 2007 at 09:18:05AM +1100, Con Kolivas wrote:
> > > On Saturday 10 March 2007 08:57, Con Kolivas wrote:
> > > > On Saturday 10 March 2007 08:39, Matt Mackall wrote:
> > > > > On Sat, Mar 10, 2007 at 08:19:18AM +1100, Con Kolivas wrote:
> > > > > > On Saturday 10 March 2007 08:07, Con Kolivas wrote:
> > > > > > > On Saturday 10 March 2007 07:46, Matt Mackall wrote:
> > > > > > > > My suspicion is the problem lies in giving too much quanta to
> > > > > > > > newly-started processes.
> > > > > > >
> > > > > > > Ah that's some nice detective work there. Mainline does some
> > > > > > > rather complex accounting on sched_fork including (possibly) a
> > > > > > > whole timer tick which rsdl does not do. make forks off
> > > > > > > continuously so what you say may well be correct. I'll see if I
> > > > > > > can try to revert to the mainline behaviour in sched_fork (which
> > > > > > > was obviously there for a reason).
> > > > > >
> > > > > > Wow! Thanks Matt. You've found a real bug too. This seems to fix
> > > > > > the qemu misbehaviour and bitmap errors so far too! Now can you
> > > > > > please try this to see if it fixes your problem?
> > > > >
> > > > > Sorry, it's about the same. I now suspect an accounting glitch
> > > > > involving pipe wake-ups.
> > > > >
> > > > > 5x memload: good
> > > > > 5x execload: good
> > > > > 5x forkload: good
> > > > > 5 parallel makes: mostly good
> > > > > make -j 5: bad
> > > > >
> > > > > So what's different between makes in parallel and make -j 5? Make's
> > > > > job server uses pipe I/O to control how many jobs are running.
> > > >
> > > > Hmm it must be those deep pipes again then. I removed any quirks
> > > > testing for those from mainline as I suspected it would be ok. Guess
> > > > I'm wrong.
> > >
> > > I shouldn't blame this straight up though if NO_HZ makes it better.
> > > Something else is going wrong... wtf though?
> >
> > Just so we're clear, dynticks has only 'fixed' the single non-parallel
> > make load so far.
> 
> Ok, so some of the basics then. Can you please give me the output of 'top -b' 
> running for a few seconds during the whole affair?

Here you go:

http://selenic.com/baseline
http://selenic.com/underload
 
This is with 2.6.20+rsdl+tickfix at HZ=250.

Something I haven't mentioned about my setup is that I'm using ccache.
And it turns out disabling ccache makes a large difference. Going to
switch back to a NO_HZ kernel and see what that looks like.
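
Since the pipe-based job server keeps coming up in the quoted discussion, here is a rough userspace sketch of the idea, under the assumption that it works as a token pool: one byte in a pipe per free job slot, read a byte to start a job, write it back when the job ends. struct job_pool and the job_pool_* functions are hypothetical names for this illustration, not make's actual code. A real job server would block in read(); the sketch uses a non-blocking read only so that the "no slot free" case is easy to observe.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical token pool: read fds[0] to take a slot, write fds[1] to return one. */
struct job_pool {
	int fds[2];
};

/* Create the pipe and preload one byte per allowed job. Returns 0 or -1. */
int job_pool_init(struct job_pool *p, int slots)
{
	if (pipe(p->fds) == -1)
		return -1;
	for (int i = 0; i < slots; i++)
		if (write(p->fds[1], "+", 1) != 1)
			return -1;
	return 0;
}

/* Try to take a slot: 1 if a token was read, 0 if none are free, -1 on error. */
int job_pool_try_acquire(struct job_pool *p)
{
	char token;
	ssize_t n;
	int flags = fcntl(p->fds[0], F_GETFL);

	if (flags == -1 || fcntl(p->fds[0], F_SETFL, flags | O_NONBLOCK) == -1)
		return -1;
	n = read(p->fds[0], &token, 1);
	if (n == 1)
		return 1;
	return (n == -1 && errno == EAGAIN) ? 0 : -1;
}

/* Give a slot back by writing a token into the pipe. Returns 0 or -1. */
int job_pool_release(struct job_pool *p)
{
	return write(p->fds[1], "+", 1) == 1 ? 0 : -1;
}
```

Every job start and finish is a small pipe read or write, typically from a different process, which is the kind of wake-up traffic five independent makes would not generate.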

-- 
Mathematics is the supreme nostalgia of our time.

Re: [PATCH 1/2] rcfs core patch

2007-03-09 Thread Herbert Poetzl

On Fri, Mar 09, 2007 at 11:25:47AM -0800, Paul Jackson wrote:
> > Ease of use maybe. Scripts can be more readily used with a fs-based
> > interface.
> 
> And, as I might have already stated, file system API's are a natural
> fit for hierarchically shaped data, especially if the nodes in the
> hierarchy would benefit from file system like permission attributes.

personally, I'd prefer to avoid hierarchical
structures wherever possible, because they tend
to make processing and checks a lot more complicated
than necessary, and if we really want hierarchical
structures, it might be more than sufficient to
keep the hierarchy in userspace, and use a flat
representation inside the kernel ...

but hey, I'm all for running a hypervisor under
a hypervisor running inside a hypervisor :)

best,
Herbert

> -- 
>   I won't rest till it's the best ...
>   Programmer, Linux Scalability
>   Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.osdl.org/mailman/listinfo/containers

Re: [linux-usb-devel] Question re hiddev

2007-03-09 Thread Gene Heskett

On Friday 09 March 2007, Adam Kropelin wrote:
>Gene Heskett wrote:
>> On Thursday 08 March 2007, Gene Heskett wrote:
>>> Greetings;
>>>
>>> Belkin is being non-responsive to requests for updated drivers for
>>> their line of UPS's, all of which now have a USB port which is the
>>> Belkin recommended way to talk to these things.
>>>
>>> Unforch, the belkin supplied *nix stuff was last compiled on an rh5.2
>>> machine using gcc-2.7.2, so there has been some bitrot.
>>>
>>> I believe the problem to be that when their version of upsd is
>>> trying to open the /dev/name its given, it is assuming and hard
>>> coded to do the ioctl's to set the ports speed in baudrate, width of
>>> word, parity etc.
>>>
>>> Getting failure messages for that, it retrys the open until it has
>>> 1024 links to /dev/hiddev0 according to an lsof|grep hiddev0, all of
>>> which presumably have failed so it never actually opens the
>>> /dev/hiddev0 port in r/w mode successfully.
>>>
>>> I can, from a shell, 'cat' the data from this port, its not very fast
>>> taking about 8-10 seconds to output all the integers or bytes to
>>> constitute a complete screen update when translated by the gui into
>>> sensible data.
>>>
>>> My proposal, and I'll see if I can make a patch, is to add to the
>>> hiddev.c code, stubs for these otherwise useless functions that do
>>> nothing but return a 0 indicating success so that these legacy
>>> drivers can make use of a port whose data is just fine but fails
>>> these configuration things that don't mean squat to hiddev anyway.
>>>
>>> Would this effort at making legacy drivers who think they are
>>> using /dev/ttySx, work with /dev/hiddev constitute an acceptable
>>> reason for such a patch to hiddev.c?
>>
>> Its been about a day now, and no one has commented.  Am I an idiot or
>> ??
>
>I think you fundamentally misunderstand hiddev. It's an interface to HID
>devices, which are not (NOT!) byte streams of the sort you'd get on
>/dev/ttySx. hiddev speaks in specific structures via read/write/ioctl as
>detailed in Documentation/usb/hiddev.txt. Any application which is
>making tty ioctls to set baud rate, etc. will never work (unmodified)
>with hiddev.
>
>Your Belkin UPS may follow the USB HID class spec for Power Devices, in
>which case a suite like NUT will be able to handle it with their
>USB-generic driver.
>
>--Adam

I *think* I had the NUT daemon working a couple of days ago; it at least 
ran without getting a tummy ache and upchucking into the logs.  But as 
for using that data usefully, NUT's docs might as well be written in 
Swahili.  It also doesn't seem to have a GUI like the Belkin programs 
give.  Or I don't know how to set it up to use a GUI.  Either is a strong 
possibility...

If anyone else has it working, please speak up.  The ups is a Belkin 
F6C-1500, with some more adjectives appended that have nothing to do with 
the electronics in it.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
UPS interrupted the server's power

Re: [PATCH 1/2] rcfs core patch

2007-03-09 Thread Herbert Poetzl

On Fri, Mar 09, 2007 at 11:44:22PM +0530, Srivatsa Vaddagiri wrote:
> On Fri, Mar 09, 2007 at 01:48:16AM +0100, Herbert Poetzl wrote:
> > > There have been various projects attempting to provide resource
> > > management support in Linux, including CKRM/Resource Groups and UBC.

> > let me note here, once again, that you forgot Linux-VServer
> > which does quite non-intrusive resource management ...

> Sorry, not intentionally. Maybe it slipped because I haven't
> seen much res mgmt related patches from Linux Vserver on 
> lkml recently.

mainly because I got the impression that we planned
to work on the various spaces first, and handle things
like resource management later .. but it seems that
resource management is now in focus, while the spaces
got somewhat delayed ...

> Note that I -did- talk about VServer at one point in past
> (http://lkml.org/lkml/2006/06/15/112)!

noted and appreciated (although this was about CPU
resources, which IMHO is a special resource like
the networking, as you are mostly interested in
'bandwidth' limitations there, not in resource
limits per se; and of course, it wasn't even cited
correctly, as it is Linux-VServer not vserver ...)

> > the basic 'context' (pid space) is the grouping mechanism
> > we use for resource management too

> so tasks sharing the same nsproxy->pid_ns is the fundamental
> unit of resource management (as far as vserver/container goes)?

we currently have a 'process' context, which holds
the administrative data (capabilities and flags) and
the resource accounting and limits, which basically
contains the pid namespace, so yes and no

it contains a reference to the 'main' nsproxy, which
is used to copy spaces from when you enter the guest
(or some set of spaces), and it defines the unit we
consider a process container

> > > As you know, the introduction of 'struct container' was objected
> > > to and was felt redundant as a means to group tasks. That's where
> > > I took a shot at converting over Paul Menage's patch to avoid
> > > 'struct container' abstraction and instead work with 'struct
> > > nsproxy'.
> > 
> > which IMHO isn't a step in the right direction, as
> > you will need to handle different nsproxies within
> > the same 'resource container' (see previous email)
> 
> Isn't that made simple because of the fact that we have pointers to
> namespace objects (and not actual objects themselves) in nsproxy?
> 
> I mean, all that is required to manage multiple nsproxy's
> is to have the pointer to the same resource object in all of them.
> 
> In system call terms, if someone does a unshare of uts namespace, 
> he will get into a new nsproxy object sure (which has a pointer to the
> new uts namespace) but the new nsproxy object will still be pointing
> to the old resource controlling objects.

yes, that is why I agreed, that the container (or
resource limit/accounting/controlling object) can
be seen as space too (and handled like that)

> > > When we support task movement across resource classes, we need to
> > > find a nsproxy which has the right combination of resource classes
> > > that the task's nsproxy can be hooked to.
> > 
> > no, not necessarily, we can simply create a new one
> > and give it the proper resource or whatever-spaces
> 
> That would be the simplest, agreeably. But not optimal in terms of
> storage?
> 
> Pls note that task-movement can be not-so-infrequent 
> (in other words, frequent) in context of non-container workload 
> management.

not only there, also with solutions like Linux-VServer
(it is quite common to enter guests or subsets of the
space mix assigned)

> > why is the filesystem approach so favored for this
> > kind of manipulations?
> > 
> > IMHO it is one of the worst interfaces I can imagine
> > (to move tasks between spaces and/or assign resources)
> > but yes, I'm aware that filesystems are 'in' nowadays
> 
> Ease of use maybe. Scripts can be more readily used with a fs-based
> interface.

correct, but what about security and/or atomicity?
i.e. how to assure that some action really was 
taken and/or how to wait for completion?

sure, all this _can_ be done, no doubt, but it
is much harder to do with a fs based interface than
with e.g. a syscall interface ...

> -- 
> Regards,
> vatsa
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.osdl.org/mailman/listinfo/containers

Re: 2.6.21-rc3 - oops on remove of USB dongle

2007-03-09 Thread John Stoffel


Greg> On Fri, Mar 09, 2007 at 10:40:21AM -0500, John Stoffel wrote:
>> 
>> Hi all,
>> 
>> I've just compiled and installed 2.6.21-rc3 on my Dual CPU Dell
>> Precision 610MT system.  Dual 550mhz Xeon, 768mb of RAM.  Mix of SCSI,
>> ATA drives.  I'm using the new ATA_ drivers for my PATA disks.  
>> 
>> After booting, I pulled my seldom used USB->serial device from the
>> system to toss in my bag to bring to work.  It's a Belkin F5U109
>> dongle.  I noticed the following oops:

Greg> Ugh, I have a fix for this in my tree, I'll send it to Linus in
Greg> a few hours.

Greg> sorry for taking so long with this...

No problem.  Oliver Neukum sent me a pair of patches to try (appended
below) and they seem to have done the trick just fine now.  I can
plug/unplug the device without any more oopses.

Hopefully this will make it into 2.6.21-rc4

Cheers,
John


Re: 2.6.21-rc3 - oops on remove of USB dongle

2007-03-09 Thread John Stoffel


Duh... forgot the patches:



2.6.21-rc-usb-serial.patch
Description: two patches for usbserial oops on 2.6.21-rc3

kernel bug #7674 - bad hard disk noise on shutdown

2007-03-09 Thread Giovanni Lovato

Hi all.
I'm also affected by this bug. I noticed it when upgrading from Ubuntu Dapper
to Edgy, and it now persists on Feisty.
I can't remember the latest kernel version I used on Dapper, but I'm
sure that the disk was parking fine!
Are we facing a regression?
- --
www.aldu.net/~heruan
[EMAIL PROTECTED]
ldaps://pgpkeys.aldu.net


Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Con Kolivas

On Saturday 10 March 2007 09:29, Matt Mackall wrote:
> On Sat, Mar 10, 2007 at 09:18:05AM +1100, Con Kolivas wrote:
> > On Saturday 10 March 2007 08:57, Con Kolivas wrote:
> > > On Saturday 10 March 2007 08:39, Matt Mackall wrote:
> > > > On Sat, Mar 10, 2007 at 08:19:18AM +1100, Con Kolivas wrote:
> > > > > On Saturday 10 March 2007 08:07, Con Kolivas wrote:
> > > > > > On Saturday 10 March 2007 07:46, Matt Mackall wrote:
> > > > > > > My suspicion is the problem lies in giving too much quanta to
> > > > > > > newly-started processes.
> > > > > >
> > > > > > Ah that's some nice detective work there. Mainline does some
> > > > > > rather complex accounting on sched_fork including (possibly) a
> > > > > > whole timer tick which rsdl does not do. make forks off
> > > > > > continuously so what you say may well be correct. I'll see if I
> > > > > > can try to revert to the mainline behaviour in sched_fork (which
> > > > > > was obviously there for a reason).
> > > > >
> > > > > Wow! Thanks Matt. You've found a real bug too. This seems to fix
> > > > > the qemu misbehaviour and bitmap errors so far too! Now can you
> > > > > please try this to see if it fixes your problem?
> > > >
> > > > Sorry, it's about the same. I now suspect an accounting glitch
> > > > involving pipe wake-ups.
> > > >
> > > > 5x memload: good
> > > > 5x execload: good
> > > > 5x forkload: good
> > > > 5 parallel makes: mostly good
> > > > make -j 5: bad
> > > >
> > > > So what's different between makes in parallel and make -j 5? Make's
> > > > job server uses pipe I/O to control how many jobs are running.
> > >
> > > Hmm it must be those deep pipes again then. I removed any quirks
> > > testing for those from mainline as I suspected it would be ok. Guess
> > > I'm wrong.
> >
> > I shouldn't blame this straight up though if NO_HZ makes it better.
> > Something else is going wrong... wtf though?
>
> Just so we're clear, dynticks has only 'fixed' the single non-parallel
> make load so far.

Ok, so some of the basics then. Can you please give me the output of 'top -b' 
running for a few seconds during the whole affair?

Thanks very much for your testing so far!

-- 
-ck

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Con Kolivas

On Saturday 10 March 2007 10:06, Matt Mackall wrote:
> On Sat, Mar 10, 2007 at 10:02:37AM +1100, Con Kolivas wrote:
> > On Saturday 10 March 2007 09:29, Matt Mackall wrote:
> > > On Sat, Mar 10, 2007 at 09:18:05AM +1100, Con Kolivas wrote:
> > > > On Saturday 10 March 2007 08:57, Con Kolivas wrote:
> > > > > On Saturday 10 March 2007 08:39, Matt Mackall wrote:
> > > > > > So what's different between makes in parallel and make -j 5?
> > > > > > Make's job server uses pipe I/O to control how many jobs are
> > > > > > running.
> > > > >
> > > > > Hmm it must be those deep pipes again then. I removed any quirks
> > > > > testing for those from mainline as I suspected it would be ok.
> > > > > Guess I'm wrong.
> > > >
> > > > I shouldn't blame this straight up though if NO_HZ makes it better.
> > > > Something else is going wrong... wtf though?
> > >
> > > Just so we're clear, dynticks has only 'fixed' the single non-parallel
> > > make load so far.
> >
> > Ok, back to the pipe idea. Without needing a kernel recompile, can you
> > try running the make -j5 as a SCHED_BATCH task?
>
> Seems the same.
>
> Oddly, nice make -j 5 is better than batch (but not quite up to stock).

Shouldn't be odd. SCHED_BATCH (as Ingo implemented it which is what I'm trying 
to reproduce for RSDL) is meant to give the same cpu as the same nice level, 
but not give low latency. Nice on the other hand will give much less cpu.
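
For anyone wanting to repeat the SCHED_BATCH experiment from their own launcher, a minimal sketch of switching a process to that policy follows. set_batch_policy() is a hypothetical helper name; sched_setscheduler() is the standard Linux call. The fallback definition of SCHED_BATCH is only there in case the libc headers do not expose it, and assumes the Linux value of 3.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* glibc exposes SCHED_BATCH under this macro */
#endif
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3		/* assumption: value used by the Linux kernel */
#endif

/*
 * Switch a process (0 means the caller) to SCHED_BATCH.
 * Returns 0 on success or a negative errno value on failure.
 * SCHED_BATCH requires a static priority of 0.
 */
int set_batch_policy(pid_t pid)
{
	struct sched_param param = { .sched_priority = 0 };

	if (sched_setscheduler(pid, SCHED_BATCH, &param) == -1)
		return -errno;
	return 0;
}
```

A launcher would call set_batch_policy(0) and then exec the make command. The nice value is left untouched, so this changes only the policy, not the CPU share implied by the nice level.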

-- 
-ck

Re: Pluggable Schedulers (was: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler)

2007-03-09 Thread David Lang


On Fri, 9 Mar 2007, Al Boldi wrote:

>> My preferred sphere of operation is the Manichean domain of faster vs.
>> slower, functionality vs. non-functionality, and the like. For me, such
>> design concerns are like the need for a kernel to format pagetables so
>> the x86 MMU decodes what was intended, or for a compiler to emit valid
>> assembly instructions, or for a programmer to write C the compiler
>> won't reject with parse errors.
>
> Sure, but I think, even from a technical point of view, competition is a good
> thing to have.  Pluggable schedulers give us this kind of competition, that
> forces each scheduler to refine or become obsolete.  Think evolution.

The point Linus is making is that with pluggable schedulers there isn't 
competition between them: the various developer teams would go off in their own 
direction, and any drawbacks to their scheduler could be answered with "that's 
not what we are good at, use a different scheduler", with the very real 
possibility that a person could get this answer from ALL schedulers, leaving 
them with nothing good to use.


David Lang

Re: ALIGN (Re: [PATCH] Fix get_order())

2007-03-09 Thread Oleg Verych

On Fri, Mar 09, 2007 at 03:15:10PM -0800, Linus Torvalds wrote:
> 
> 
> On Sat, 10 Mar 2007, Oleg Verych wrote:
> > 
> > OTOH, if i would write it this way
> > 
> > #define BALIGN(x,bits)  ((((x) >> (bits)) + 1) << (bits))
> 
> But that's *wrong*. It aligns something that is *already* aligned to 
> something else.

Indeed. I'm confused by the semantics of the ALIGN macro, both on the C
arithmetic side and on the usage side. The former confusion is also what
yields patches like (Fix 'ALIGN()' macro, take 2), which fix the thing I
was wondering about.

BALIGN is more of a do-this alignment, and should instead be

void do_align(what, bits)

While its arithmetic is clear (optimized in assembler, shifts are
C shifts!), it fails in the value(alignment(what, how)) kind of use.

> So you'd have to do it as something like
> 
>   #define ALIGN_TO_POW2(x,pow) \
>   ((((x)+(1<<(pow))-1)>>(pow))<<(pow))
>
> and the thing is, that's actually both (a) less readable than what ALIGN() 
> already does (b) unless the compiler notices that what you really want is 
> a "bitwise and", it's really inefficient too because shifts are generally 
> much more expensive than simple bitops.
> 
> So you're simply better off doing it as
> 
>   #define ALIGN_TO_POW2(x,pow) ALIGN(x, (1ull << pow))
> 
> but then you'd still have to depend on the "typeof()" magic in ALIGN() to 
> turn the final end result back to the type of "x".
>
> (the "1ull" is necessary exactly because you don't know what type "x" is 
> beforehand, so you need to make sure that the mask is at *least* as big a 
> type as x, and not overflow to undefined C semantics).

With the typeof() *feature* and the 1U, 1UL, 1ULL *things*, I (we?) have
what is described above.
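
To make the difference concrete, here is a small userspace sketch assuming GCC or a compatible compiler. ALIGN and __ALIGN_MASK follow the kernel.h definitions quoted earlier in this digest (with __typeof__ in place of typeof), ALIGN_TO_POW2 is the mask-based form from the quoted reply, and BALIGN is the shift-based variant under discussion. The two wrapper functions are hypothetical names added only so the macros can be exercised.

```c
#include <stdint.h>

/* Same shape as the kernel.h macros quoted earlier in the digest. */
#define __ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
#define ALIGN(x, a)		__ALIGN_MASK(x, (__typeof__(x))(a) - 1)

/* Mask-based power-of-two form from the quoted reply. */
#define ALIGN_TO_POW2(x, pow)	ALIGN(x, (1ull << (pow)))

/* Shift-based variant: always moves to the next boundary. */
#define BALIGN(x, bits)		((((x) >> (bits)) + 1) << (bits))

/* Round x up to a multiple of 2^pow; aligned values are left alone. */
uint32_t align_pow2_u32(uint32_t x, unsigned int pow)
{
	return ALIGN_TO_POW2(x, pow);
}

/* Same input through BALIGN; an aligned value is bumped a full step. */
uint32_t balign_u32(uint32_t x, unsigned int bits)
{
	return BALIGN(x, bits);
}
```

With pow = 12, align_pow2_u32(4096, 12) stays at 4096 while balign_u32(4096, 12) jumps to 8192, which is the "aligns something that is already aligned" objection; for 4097 both give 8192.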

Examples of using ALIGN. As that, i've picked earlier,

arch/powerpc/mm/hugetlbpage.c:  addr = 
ALIGN(addr+1,1UL

Re: [patch 1/9] signalfd/timerfd v1 - anonymous inode source ...

2007-03-09 Thread Davide Libenzi

On Sat, 10 Mar 2007, Jan Engelhardt wrote:

> 
> On Mar 9 2007 15:39, Davide Libenzi wrote:
> >
> >This patch adds an anonymous inode source, to be used for files that need 
> >an inode only in order to create a file*. We do not care about having an 
> >inode for each file, and we do not even care about having different names in 
> >the associated dentries (dentry names will be the same for classes of file*).
> >This allows code reuse, and will be used by epoll, signalfd and timerfd 
> >(and whatever else there'll be).
> 
> Perhaps procfs?

But procfs needs real inodes and dentries. The reason for the anonymous 
inode source patch is to 1) avoid code duplication (setup code is 
basically identical for all those file types) and 2) save memory for inodes 
in all those pseudo-files like epoll, signalfd, timerfd, ...



- Davide



Re: Pluggable Schedulers (was: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler)

2007-03-09 Thread William Lee Irwin III

William Lee Irwin III wrote:
>> The short translation of my message for you is "Linus, please don't
>> LART me too hard."

On Fri, Mar 09, 2007 at 11:43:46PM +0300, Al Boldi wrote:
> Right.

Given where the code originally came from, I've got bullets to dodge.

William Lee Irwin III wrote:
>> This sort of concern is too subjective for me to have an opinion on it.

On Fri, Mar 09, 2007 at 11:43:46PM +0300, Al Boldi wrote:
> How diplomatic.

Impoliteness doesn't accomplish anything I want to do.

William Lee Irwin III wrote:
>> My preferred sphere of operation is the Manichean domain of faster vs.
>> slower, functionality vs. non-functionality, and the like. For me, such
>> design concerns are like the need for a kernel to format pagetables so
>> the x86 MMU decodes what was intended, or for a compiler to emit valid
>> assembly instructions, or for a programmer to write C the compiler
>> won't reject with parse errors.

On Fri, Mar 09, 2007 at 11:43:46PM +0300, Al Boldi wrote:
> Sure, but I think, even from a technical point of view, competition is a good 
> thing to have.  Pluggable schedulers give us this kind of competition, that 
> forces each scheduler to refine or become obsolete.  Think evolution.

I'm more of a cooperative than competitive person, not to say that
flies well in Linux. There are more productive uses of time than having
everyone NIH'ing everyone else's code. If the result isn't so great,
I'd rather send them code or talk to them about what needs to be done.

William Lee Irwin III wrote:
>> If Linus, akpm, et al object to the
>> design, then invalid output was produced. Please refer to Linus, akpm,
>> et al for these sorts of design concerns.

On Fri, Mar 09, 2007 at 11:43:46PM +0300, Al Boldi wrote:
> Point taken.

Decisions with respect to overall kernel design are made from well
above my level. Similarly with coding style, release management, code
directory hierarchy, nomenclature, and more. These things are Linus'
and devolved to those who go along with him on those fronts. If I
made those decisions, you might as well call it "wlix" not "Linux."

Linus Torvalds wrote:
>> And hey, you can try to prove me wrong. Code talks. So far, nobody has
>> really ever come close.
>> So go and code it up, and show the end result. So far, nobody who actually
>> *does* CPU schedulers have really wanted to do it, because they all want
>> to muck around with their own private versions of the data structures.

On Fri, Mar 09, 2007 at 11:43:46PM +0300, Al Boldi wrote:
> What about PlugSched?

The extant versions of it fall well short of Linus' challenge as well
as my original goals for it. A useful exercise may also be enumerating
your expectations and having those who actually work with the code
describe how well those are actually met.

-- wli

Re: TCP MSG_PEEK assertion issue ...

2007-03-09 Thread David Miller


Keep trying, you might hit the proper mailing list after a
few more attempts. :-)

Please post networking issues to netdev@vger.kernel.org,
thank you.

Re: Sleeping thread not receive signal until it wakes up

2007-03-09 Thread Luong Ngo


On 3/8/07, Parav K Pandit <[EMAIL PROTECTED]> wrote:

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Luong Ngo
Sent: Friday, March 09, 2007 8:54 AM
To: Robert Hancock
Cc: linux-kernel; [EMAIL PROTECTED]
Subject: Re: Sleeping thread not receive signal until it wakes up

On 3/8/07, Robert Hancock <[EMAIL PROTECTED]> wrote:
> Luong Ngo wrote:
> > Hi Thomas and Dick,
> > I appreciate all the responses. They are very good information for me.
> > Actually, I wasn't the one working on the driver; it's been there a long
> > time. I thought I just needed to add the signal and signal handling
> > part, not expecting it would lead me into driver space.
> > Here is what I have in the driver. Maybe a race condition could happen
> > in the scenario where the ioctl releases the lock but, before going to
> > sleep, the ISR is invoked and calls the wake-up on the queue; hence the
> > ioctl will not be woken up, since the wake-up call has already executed.
> > But I believe in our system this could be tolerated, since the hardware
> > would keep raising the interrupt if the abnormal condition still exists
> > (because the ioctl is blocked, the user app never gets a chance to
> > service the device). But is this the reason why my signal handler does
> > not get executed at all? Theoretically, according to the Richard Stevens
> > book, I think the process should be woken up and receive the signal even
> > if it is blocked in the ioctl call, am I right?
>
> ..
>
> > static int ats89_ioctl(struct inode *inode, struct file *file, u_int
> > cmd, u_long arg)
> > {
> >
> >     switch(cmd){
> >     case GET_IRQ_CMD: {
> >         u32  regMask32;
> >
> >         spin_lock_irq(dev->lock);
> >         while ((dev->irqMask & dev->irqEvent) == 0) {
> >             // Sleep until board interrupt happens
> >             spin_unlock_irq(dev->lock);
> >             interruptible_sleep_on(&(dev->boardIRQWaitQueue));
> >             if (uncond_wakeup) {
> >                 /* don't go back to loop */
> >                 break;
> >             }
> >             spin_lock_irq(dev->lock);
> >         }
>
> Kernel code does not get pre-empted by signals. If the code needs to be
> interruptible by signals this has to be handled explicitly.
> interruptible_sleep_on will stop waiting if your task gets a signal, but
> your code doesn't check the signal_pending flag to know whether it
> should exit the loop. If signal_pending(current) is set after the sleep
> you should likely be returning -ERESTARTSYS to allow the task to handle
> the signal. Then after the signal handler from the task returns, the
> ioctl will get called again.
>
> Also, as was pointed out, you should not use the sleep_on family of
> functions; use the wait_event functions instead. sleep_on is racy: if the
> interrupt happened just before you do the sleep, you'll sit there
> waiting for something that already occurred.
>
> --
> Robert Hancock  Saskatoon, SK, Canada
> To email, remove "nospam" from [EMAIL PROTECTED]
> Home Page: http://www.roberthancock.com/
>
>
> Robert, thanks a lot for your suggestion
> But I have added the signal_pending(current) check and signal handler
> is not invoked

>     spin_lock_irq(dev->lock);
>     while ((dev->irqMask & dev->irqEvent) == 0) {
>         // Sleep until board interrupt happens
>         spin_unlock_irq(dev->lock);
>         interruptible_sleep_on(&(dev->boardIRQWaitQueue));
>
>         if (signal_pending(current)) {
>             return -ERESTARTSYS;
>         }
>
>         if (uncond_wakeup) {
>             /* don't go back to loop */
>             break;
>         }
>         spin_lock_irq(dev->lock);
>     }
>Still no luck yet.
>LNgo
>-

I guess you need to call allow_signal(xxx) before you go to sleep.
Parav
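
As a reference point for the wait_event suggestion quoted above, here is a rough sketch of how the wait loop might look with wait_event_interruptible(). It is a kernel-context fragment, not a standalone program, and has not been built or tested. struct ats89_dev and wait_for_board_irq() are hypothetical names; the field names and uncond_wakeup are taken from the quoted driver code, and dev->lock is assumed to be a spinlock pointer as in that code.

```c
/*
 * Kernel-context fragment, illustration only.
 * struct ats89_dev and wait_for_board_irq() are hypothetical names.
 */
static int wait_for_board_irq(struct ats89_dev *dev)
{
	int ret;

	/*
	 * wait_event_interruptible() queues the task on the wait queue
	 * before re-testing the condition, so a wake_up() issued by the
	 * ISR just before the sleep is not lost. It returns -ERESTARTSYS
	 * if a signal interrupted the wait and 0 once the condition holds.
	 */
	ret = wait_event_interruptible(dev->boardIRQWaitQueue,
			(dev->irqMask & dev->irqEvent) != 0 || uncond_wakeup);
	if (ret)
		return ret;	/* let the caller return this from the ioctl */

	spin_lock_irq(dev->lock);
	/* ... handle the pending event under the lock here ... */
	spin_unlock_irq(dev->lock);
	return 0;
}
```

The ISR side stays the same in this sketch: set the event bits, then call wake_up_interruptible() on the same queue.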



Thanks Parav, adding allow_signal(SIGALRM) wakes up the blocking
interruptible_sleep_on and checking the signal_pendi

Re: [PATCH] dma_ops as const

2007-03-09 Thread Muli Ben-Yehuda

On Fri, Mar 09, 2007 at 03:46:40PM -0800, Stephen Hemminger wrote:
> The dma_ops structure can be const since it never changes
> after boot.

Sounds reasonable. I haven't come up with a likely case where we would
want to change a dma_ops structure (as opposed to just pointing to a
different structure) so ...

Acked-by: Muli Ben-Yehuda <[EMAIL PROTECTED]>

Cheers,
Muli

Re: [PATCH] Software Suspend: Fix suspend when console is in VT_AUTO/KD_GRAPHICS mode

2007-03-09 Thread Andrew Johnson

On Fri, 2007-09-03 at 13:19 -0800, Pavel Machek wrote:
> Hi!
> 
> > 
> > -void set_console(int nr)
> > +extern char vt_dont_switch;
> > +
> 
> What does this variable do and why do we want to use it here?
> 
It's needed in set_console().  Console switch will fail if this is true.

> 'if ('
> 
> > + (vc->vt_mode.mode != VT_PROCESS && vc->vc_mode ==
> KD_GRAPHICS)) {
> > +
> > + return -EINVAL;
> > + }
> > +
> 
> I assume you want ...mode == VT_AUTO here?
> 
> And big comment explaining why we want this behaviour?
> 
> And another big comment explaining why this will not break existing
> set_console() users?
> 
See updated comment in attached patch.

> > - set_console(SUSPEND_CONSOLE);
> > + if (set_console(SUSPEND_CONSOLE)) {
> > + /* Unable to change to the new console */
> 
> That's not what the comment should say.
> 
> It should explain why it is okay to proceed when we can't change to
> text console.
> 
See updated comment in attached patch.  It's really up to the caller to
decide what to do if we can't switch the console - currently all callers
ignore the return code so I assume that it's okay to proceed anyway.

-- Andrew


Signed-off-by: Andrew Johnson <[EMAIL PROTECTED]>
---
diff -rup linux-2.6.20.1/drivers/char/vt.c linux/drivers/char/vt.c
--- linux-2.6.20.1/drivers/char/vt.c2007-02-19 22:34:32.0 -0800
+++ linux/drivers/char/vt.c 2007-03-09 15:48:29.0 -0800
@@ -2188,10 +2188,30 @@ static void console_callback(struct work
release_console_sem();
 }
 
-void set_console(int nr)
+extern char vt_dont_switch;
+
+int set_console(int nr)
 {
+   struct vc_data *vc = vc_cons[fg_console].d;
+
+   if(!vc_cons_allocated(nr) || vt_dont_switch || 
+   (vc->vt_mode.mode == VT_AUTO && vc->vc_mode == KD_GRAPHICS)) {
+
+   /* 
+* Console switch will fail in console_callback() or 
+* change_console() so there is no point scheduling 
+* the callback
+*
+* Existing set_console() users don't check the return
+* value so this shouldn't break anything 
+*/ 
+   return -EINVAL;
+   }
+
want_console = nr;
schedule_console_callback();
+
+   return 0;
 }
 
 struct tty_driver *console_driver;
diff -rup linux-2.6.20.1/drivers/char/vt_ioctl.c linux/drivers/char/vt_ioctl.c
--- linux-2.6.20.1/drivers/char/vt_ioctl.c  2007-02-19 22:34:32.0 -0800
+++ linux/drivers/char/vt_ioctl.c   2007-03-08 14:15:41.0 -0800
@@ -34,7 +34,7 @@
 #include 
 #include 
 
-static char vt_dont_switch;
+char vt_dont_switch;
 extern struct tty_driver *console_driver;
 
 #define VT_IS_IN_USE(i)	(console_driver->ttys[i] && console_driver->ttys[i]->count)
diff -rup linux-2.6.20.1/include/linux/kbd_kern.h linux/include/linux/kbd_kern.h
--- linux-2.6.20.1/include/linux/kbd_kern.h 2007-02-19 22:34:32.0 -0800
+++ linux/include/linux/kbd_kern.h  2007-03-08 14:15:41.0 -0800
@@ -75,7 +75,7 @@ extern int do_poke_blanked_console;
 
 extern void (*kbd_ledfunc)(unsigned int led);
 
-extern void set_console(int nr);
+extern int set_console(int nr);
 extern void schedule_console_callback(void);
 
 static inline void set_leds(void)
diff -rup linux-2.6.20.1/kernel/power/console.c linux/kernel/power/console.c
--- linux-2.6.20.1/kernel/power/console.c   2007-02-19 22:34:32.0 -0800
+++ linux/kernel/power/console.c2007-03-09 15:52:32.0 -0800
@@ -27,7 +27,15 @@ int pm_prepare_console(void)
return 1;
}
 
-   set_console(SUSPEND_CONSOLE);
+   if (set_console(SUSPEND_CONSOLE)) {
+   /*
+* We're unable to switch to the SUSPEND_CONSOLE. 
+* Let the calling function know so it can decide 
+* what to do.
+*/ 
+   release_console_sem();
+   return 1;
+   }
release_console_sem();
 
if (vt_waitactive(SUSPEND_CONSOLE)) {





Re: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset

2007-03-09 Thread Nathan Lynch

Hello-

Cliff Wickman wrote:
> This patch would insert a preference to migrate such a task to a cpu within
> its cpuset (and set its cpus_allowed to its cpuset).
> 
> With this patch, migrate the task to:
>  1) to any cpu on the same node as the disabled cpu, which is both online
> and among that task's cpus_allowed
>  2) to any online cpu within the task's cpuset
>  3) to any cpu which is both online and among that task's cpus_allowed

I think I disagree with this change.

The kernel shouldn't have to be any smarter than it already is about
moving tasks off an offlined cpu.  The only way case 2) can be reached
is if the user has changed a task's cpu affinity.  If the user is
sophisticated enough to manipulate tasks' cpu affinity then they can
arrange to migrate tasks as they see fit before offlining a cpu.

Furthermore:

> --- morton.070123.orig/kernel/sched.c
> +++ morton.070123/kernel/sched.c
> @@ -5170,6 +5170,12 @@ restart:
>   if (dest_cpu == NR_CPUS)
>   dest_cpu = any_online_cpu(p->cpus_allowed);
>  
> + /* try to stay on the same cpuset */
> + if (dest_cpu == NR_CPUS) {
> + p->cpus_allowed = cpuset_cpus_allowed(p);
> + dest_cpu = any_online_cpu(p->cpus_allowed);
> + }

It's not okay to call cpuset_cpus_allowed in this context -- local
irqs are supposed to have been disabled by the caller of
move_task_off_dead_cpu and cpuset_cpus_allowed acquires a mutex.


Re: [patch 1/9] signalfd/timerfd v1 - anonymoush inode source ...

2007-03-09 Thread Jan Engelhardt


On Mar 9 2007 15:39, Davide Libenzi wrote:
>
>This patch adds an anonymous inode source, to be used for files that need
>an inode only in order to create a file*. We do not care about having an
>inode for each file, and we do not even care about having different names in
>the associated dentries (dentry names will be the same for classes of file*).
>This allows code reuse, and will be used by epoll, signalfd and timerfd
>(and whatever else there'll be).

Perhaps procfs?



Jan
-- 

Re: question regarding the Linux block device cache

2007-03-09 Thread Xin Zhao


I read the code and found that a block buffer is not necessarily freed
even if the corresponding inode is released. It looks like a block buffer
can stay around as long as the system still has free memory. Is my
understanding correct?

-x

On 3/9/07, Xin Zhao <[EMAIL PROTECTED]> wrote:

Hi,

I am working on a file system that allows multiple files to share data
blocks. That is, a data block can be shared by two or more files. Now
my question is: suppose files A and B share the same data block D. A
process opens file A and reads block D, then this process closes file
A. If another process opens file B and reads block D right after the
first process closes A, is the data of block D read from some cache, or
does it have to be loaded from disk again? I think this has to do with the
Linux block device buffer cache, but I am not quite familiar with this
part.

Can someone help me or direct me to the right place to find the answer?

Thanks in advance!

-x



[PATCH] dma_ops as const

2007-03-09 Thread Stephen Hemminger

The dma_ops structure can be const since it never changes
after boot.

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/pci-calgary.c |2 +-
 arch/x86_64/kernel/pci-gart.c|2 +-
 arch/x86_64/kernel/pci-nommu.c   |2 +-
 arch/x86_64/kernel/pci-swiotlb.c |2 +-
 arch/x86_64/mm/init.c|2 +-
 include/asm-x86_64/dma-mapping.h |2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86_64/kernel/pci-calgary.c b/arch/x86_64/kernel/pci-calgary.c
index 04480c3..5bd20b5 100644
--- a/arch/x86_64/kernel/pci-calgary.c
+++ b/arch/x86_64/kernel/pci-calgary.c
@@ -507,7 +507,7 @@ error:
return ret;
 }
 
-static struct dma_mapping_ops calgary_dma_ops = {
+static const struct dma_mapping_ops calgary_dma_ops = {
.alloc_coherent = calgary_alloc_coherent,
.map_single = calgary_map_single,
.unmap_single = calgary_unmap_single,
diff --git a/arch/x86_64/kernel/pci-gart.c b/arch/x86_64/kernel/pci-gart.c
index 030eb37..f7723e6 100644
--- a/arch/x86_64/kernel/pci-gart.c
+++ b/arch/x86_64/kernel/pci-gart.c
@@ -552,7 +552,7 @@ static __init int init_k8_gatt(struct agp_kern_info *info)
 
 extern int agp_amd64_init(void);
 
-static struct dma_mapping_ops gart_dma_ops = {
+static const struct dma_mapping_ops gart_dma_ops = {
.mapping_error = NULL,
.map_single = gart_map_single,
.map_simple = gart_map_simple,
diff --git a/arch/x86_64/kernel/pci-nommu.c b/arch/x86_64/kernel/pci-nommu.c
index df09ab0..6dade0c 100644
--- a/arch/x86_64/kernel/pci-nommu.c
+++ b/arch/x86_64/kernel/pci-nommu.c
@@ -79,7 +79,7 @@ void nommu_unmap_sg(struct device *dev, struct scatterlist *sg,
 {
 }
 
-struct dma_mapping_ops nommu_dma_ops = {
+const struct dma_mapping_ops nommu_dma_ops = {
.map_single = nommu_map_single,
.unmap_single = nommu_unmap_single,
.map_sg = nommu_map_sg,
diff --git a/arch/x86_64/kernel/pci-swiotlb.c b/arch/x86_64/kernel/pci-swiotlb.c
index eb18be5..4b4569a 100644
--- a/arch/x86_64/kernel/pci-swiotlb.c
+++ b/arch/x86_64/kernel/pci-swiotlb.c
@@ -12,7 +12,7 @@
 int swiotlb __read_mostly;
 EXPORT_SYMBOL(swiotlb);
 
-struct dma_mapping_ops swiotlb_dma_ops = {
+const struct dma_mapping_ops swiotlb_dma_ops = {
.mapping_error = swiotlb_dma_mapping_error,
.alloc_coherent = swiotlb_alloc_coherent,
.free_coherent = swiotlb_free_coherent,
diff --git a/arch/x86_64/mm/init.c b/arch/x86_64/mm/init.c
index ec31534..5ca6173 100644
--- a/arch/x86_64/mm/init.c
+++ b/arch/x86_64/mm/init.c
@@ -46,7 +46,7 @@
 #define Dprintk(x...)
 #endif
 
-struct dma_mapping_ops* dma_ops;
+const struct dma_mapping_ops* dma_ops;
 EXPORT_SYMBOL(dma_ops);
 
 static unsigned long dma_reserve __initdata;
diff --git a/include/asm-x86_64/dma-mapping.h b/include/asm-x86_64/dma-mapping.h
index d2af227..6897e2a 100644
--- a/include/asm-x86_64/dma-mapping.h
+++ b/include/asm-x86_64/dma-mapping.h
@@ -52,7 +52,7 @@ struct dma_mapping_ops {
 };
 
 extern dma_addr_t bad_dma_address;
-extern struct dma_mapping_ops* dma_ops;
+extern const struct dma_mapping_ops* dma_ops;
 extern int iommu_merge;
 
 static inline int dma_mapping_error(dma_addr_t dma_addr)
-- 
1.5.0.2


Re: Kernel threads

2007-03-09 Thread Oleg Nesterov

On 03/09, Roland McGrath wrote:
>
> > Yes sure, this change should be tested in -mm tree (I'll send the patch
> > on Sunday after some testing). The only (afaics) problem is that with
> > this change a kernel thread must not do do_fork(CLONE_THREAD). 
> 
> To clarify, the danger here is that an exit_signal=-1 leader would
> self-reap and leave behind live threads with dangling ->group_leader
> pointers.  This danger doesn't exist for normal user group leaders with
> parents ignoring SIGCHLD, because exit_signal is never set to -1 until
> do_notify_parent, which is never called until the last thread in the group
> dies (except when ptrace'd, but then do_notify_parent never resets
> exit_signal at all).  Is that right?

I think yes.

> > I think it should not, but currently this is technically
> > possible. Perhaps it makes sense to add BUG_ON(CLONE_THREAD &&
> > group_leader->exit_signal==-1) in copy_process().
>
> It probably wouldn't hurt to make it:
>
>   if (user_mode(regs))
>   BUG_ON(current->group_leader->exit_signal == -1);

Well, this is of course right, but a bit strange. Because we can add
this check to any function which can't be called after exit_notify().

>   else
>   BUG_ON((clone_flags & (CLONE_THREAD|CLONE_UNTRACED))
>  != CLONE_UNTRACED);

I think this _should_ be right, but please note that fork_idle() does
copy_process(CLONE_VM). Also, we may have some external driver which
plays with do_fork/copy_process.

> > While we are talking about kernel threads, there is something I can't
> > understand. kthread/daemonize use sigprocmask(SIG_BLOCK) to protect
> > against signals. This doesn't look right to me, because this doesn't
> > prevent the signal delivery, this only blocks signal_wake_up(). Every
> > "killall -33 khelper" means a "struct siginfo" leak.
>
> It does prevent the delivery (signal_pending() never set), but not the 
> queuing.

Yep.

> > Imho, the kernel thread shouldn't play with ->blocked at all. Instead
> > it should set SIG_IGN for all handlers. If it really needs, say, SIGCHLD,
> > it should call allow_signal() anyway. Do you see any problems with this
> > approach?
>
> That sounds reasonable to me generally.  However, if kernel threads ever
> spawn user children, they may not want the self-reaping behavior of
> ignoring SIGCHLD even if they never dequeue the signal (because they want
> to call do_wait).

Yes. That is why wait_for_helper() does allow_signal(SIGCHLD). I think a
kernel thread must not make any assumption about ->action[SIGCHLD] if it
wants to call wait4, but we may break some "buggy" external driver.

In fact, most threads inherit action[SIGCHLD] == SIG_IGN from worker_thread().

BTW, wait_for_helper() does do_sigaction() before allow_signal(). Looks
unneeded to me.

>There might be other strange caveats like that I'm not
> thinking of.

Yes, this makes me worry too :)

Oleg.


[patch 9/9] signalfd/timerfd v1 - timerfd compat code ...

2007-03-09 Thread Davide Libenzi

This patch implements the necessary compat code for the timerfd system call.


Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/fs/compat.c
===
--- linux-2.6.20.ep2.orig/fs/compat.c   2007-03-09 12:56:04.0 -0800
+++ linux-2.6.20.ep2/fs/compat.c2007-03-09 15:36:21.0 -0800
@@ -2257,3 +2257,23 @@
return sys_signalfd(ufd, ksigmask, sizeof(sigset_t));
 }
 
+
+asmlinkage long compat_sys_timerfd(int ufd, int tmrtype,
+  const struct timespec __user *utmr)
+{
+   long res;
+   struct timespec t;
+   struct timespec __user *ut;
+
+   res = -EFAULT;
+   if (get_compat_timespec(&t, utmr))
+   goto err_exit;
+   ut = compat_alloc_user_space(sizeof(*ut));
+   if (copy_to_user(ut, &t, sizeof(t)) )
+   goto err_exit;
+
+   res = sys_timerfd(ufd, tmrtype, ut);
+err_exit:
+   return res;
+}
+


[patch 8/9] signalfd/timerfd v1 - wire up timerfd x86_64 arch ...

2007-03-09 Thread Davide Libenzi

This patch wires the timerfd system call to the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.20.ep2.orig/arch/x86_64/ia32/ia32entry.S  2007-03-09 12:56:04.0 -0800
+++ linux-2.6.20.ep2/arch/x86_64/ia32/ia32entry.S   2007-03-09 15:36:19.0 -0800
@@ -720,4 +720,5 @@
.quad sys_getcpu
.quad sys_epoll_pwait
.quad sys_signalfd  /* 320 */
+   .quad sys_timerfd
 ia32_syscall_end:
Index: linux-2.6.20.ep2/include/asm-x86_64/unistd.h
===
--- linux-2.6.20.ep2.orig/include/asm-x86_64/unistd.h   2007-03-09 12:56:04.0 -0800
+++ linux-2.6.20.ep2/include/asm-x86_64/unistd.h    2007-03-09 15:36:19.0 -0800
@@ -621,8 +621,10 @@
 __SYSCALL(__NR_move_pages, sys_move_pages)
 #define __NR_signalfd  280
 __SYSCALL(__NR_signalfd, sys_signalfd)
+#define __NR_timerfd   281
+__SYSCALL(__NR_timerfd, sys_timerfd)
 
-#define __NR_syscall_max __NR_signalfd
+#define __NR_syscall_max __NR_timerfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR

[patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-09 Thread Davide Libenzi

This patch introduces a new system call for timer events delivered
through file descriptors. This allows timer events to be used with
standard POSIX poll(2), select(2) and read(2). As a consequence of
supporting the Linux f_op->poll subsystem, they can be used with
epoll(2) too.
The system call is defined as:

int timerfd(int ufd, int tmrtype, const struct timespec *utmr);

The "ufd" parameter allows for re-use (re-programming) of an existing
timerfd without going through the close/open cycle (same as signalfd).
If "ufd" is -1, a new file descriptor will be created, otherwise the
existing "ufd" will be re-programmed.
The "tmrtype" parameter specifies the timer type. The following
values are supported:

TFD_TIMER_REL
The time specified in the "utmr" parameter is a relative time
from NOW.

TFD_TIMER_ABS
The time specified in the "utmr" parameter is an absolute time.

TFD_TIMER_SEQ
The time specified in the "utmr" parameter is an interval at
which a continuous clock rate will be generated.

The function returns the new (or same, in case "ufd" is a valid timerfd
descriptor) file descriptor, or -1 in case of error.
As stated before, the timerfd file descriptor supports poll(2), select(2)
and epoll(2). When a timer event has happened on the timerfd, a POLLIN mask
will be returned.
The read(2) call can be used, and it will return a u32 variable holding
the number of "ticks" that happened on the interface since the last call
to read(2). The read(2) call supports the O_NONBLOCK flag too, and EAGAIN
will be returned if no ticks happened.
A quick test program shows timerfd working correctly on my amd64 box:

http://www.xmailserver.org/timerfd-test.c




Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.20.ep2/fs/timerfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20.ep2/fs/timerfd.c   2007-03-09 15:09:52.0 -0800
@@ -0,0 +1,263 @@
+/*
+ *  fs/timerfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+struct timerfd_ctx {
+   struct hrtimer tmr;
+   ktime_t tval;
+   int tmrtype;
+   spinlock_t lock;
+   wait_queue_head_t wqh;
+   unsigned long ticks;
+};
+
+
+static int timerfd_tmrproc(struct hrtimer *htmr);
+static void timerfd_cleanup(struct timerfd_ctx *ctx);
+static int timerfd_close(struct inode *inode, struct file *file);
+static unsigned int timerfd_poll(struct file *file, poll_table *wait);
+static ssize_t timerfd_read(struct file *file, char *buf, size_t count, loff_t *ppos);
+
+
+
+static const struct file_operations timerfd_fops = {
+   .release= timerfd_close,
+   .poll   = timerfd_poll,
+   .read   = timerfd_read,
+};
+static struct kmem_cache *timerfd_ctx_cachep;
+
+
+
+static int timerfd_tmrproc(struct hrtimer *htmr)
+{
+   struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, tmr);
+   int rval = HRTIMER_NORESTART;
+   unsigned long flags;
+
+   spin_lock_irqsave(&ctx->lock, flags);
+   ctx->ticks++;
+   __wake_up_locked(&ctx->wqh, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE);
+   if (ctx->tmrtype == TFD_TIMER_SEQ) {
+   hrtimer_forward(htmr, htmr->base->softirq_time, ctx->tval);
+   rval = HRTIMER_RESTART;
+   }
+   spin_unlock_irqrestore(&ctx->lock, flags);
+
+   return rval;
+}
+
+/*
+ * Create a file descriptor that is associated with a timer.
+ * If "ufd" is -1 a new descriptor is created, otherwise the
+ * existing timerfd "ufd" is re-programmed.
+ */
+asmlinkage long sys_timerfd(int ufd, int tmrtype, const struct timespec __user *utmr)
+{
+   int error;
+   struct timerfd_ctx *ctx;
+   struct file *file;
+   struct inode *inode;
+   ktime_t tval, tnow;
+   struct timespec ktmr, tmrnow;
+
+   error = -EFAULT;
+   if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
+   goto err_exit;
+
+   tval = timespec_to_ktime(ktmr);
+   error = -EINVAL;
+   switch (tmrtype) {
+   case TFD_TIMER_REL:
+   case TFD_TIMER_SEQ:
+   break;
+   case TFD_TIMER_ABS:
+   getnstimeofday(&tmrnow);
+   tnow = timespec_to_ktime(tmrnow);
+   if (ktime_to_ns(tval) <= ktime_to_ns(tnow))
+   goto err_exit;
+   tval = ktime_sub(tval, tnow);
+   break;
+   default:
+   goto err_exit;
+   }
+
+   if (ufd == -1) {
+   error = -ENOMEM;
+   ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
+   if (!ctx)
+   goto err_exit;
+
+   init_waitqueue_head(&ctx->wqh);
+   spin_lock_init(&ctx-

[patch 7/9] signalfd/timerfd v1 - wire up timerfd i386 arch ...

2007-03-09 Thread Davide Libenzi

This patch wires the timerfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.20.ep2.orig/arch/i386/kernel/syscall_table.S  2007-03-09 12:56:05.0 -0800
+++ linux-2.6.20.ep2/arch/i386/kernel/syscall_table.S   2007-03-09 15:36:16.0 -0800
@@ -320,3 +320,4 @@
.long sys_getcpu
.long sys_epoll_pwait
.long sys_signalfd  /* 320 */
+   .long sys_timerfd
Index: linux-2.6.20.ep2/include/asm-i386/unistd.h
===
--- linux-2.6.20.ep2.orig/include/asm-i386/unistd.h 2007-03-09 12:56:05.0 -0800
+++ linux-2.6.20.ep2/include/asm-i386/unistd.h  2007-03-09 15:36:16.0 -0800
@@ -326,10 +326,11 @@
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
 #define __NR_signalfd  320
+#define __NR_timerfd   321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 321
+#define NR_syscalls 322
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR

[patch 1/9] signalfd/timerfd v1 - anonymoush inode source ...

2007-03-09 Thread Davide Libenzi

This patch adds an anonymous inode source, to be used for files that need
an inode only in order to create a file*. We do not care about having an
inode for each file, and we do not even care about having different names in
the associated dentries (dentry names will be the same for classes of file*).
This allows code reuse, and will be used by epoll, signalfd and timerfd
(and whatever else there'll be).



Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.20.ep2/fs/anon_inodes.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20.ep2/fs/anon_inodes.c   2007-03-07 15:58:01.0 -0800
@@ -0,0 +1,203 @@
+/*
+ *  fs/anon_inodes.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+static int ainofs_delete_dentry(struct dentry *dentry);
+static struct inode *aino_getinode(void);
+static struct inode *aino_mkinode(void);
+static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
+const char *dev_name, void *data, struct vfsmount *mnt);
+
+
+
+static struct vfsmount *aino_mnt __read_mostly;
+static struct inode *aino_inode;
+static struct file_operations aino_fops = { };
+static struct file_system_type aino_fs_type = {
+   .name   = "ainofs",
+   .get_sb = ainofs_get_sb,
+   .kill_sb= kill_anon_super,
+};
+static struct dentry_operations ainofs_dentry_operations = {
+   .d_delete   = ainofs_delete_dentry,
+};
+
+
+
+int aino_getfd(int *pfd, struct inode **pinode, struct file **pfile,
+  char const *name, const struct file_operations *fops, void *priv)
+{
+   struct qstr this;
+   struct dentry *dentry;
+   struct inode *inode;
+   struct file *file;
+   int error, fd;
+
+   error = -ENFILE;
+   file = get_empty_filp();
+   if (!file)
+   goto eexit_1;
+
+   inode = aino_getinode();
+   if (IS_ERR(inode)) {
+   error = PTR_ERR(inode);
+   goto eexit_2;
+   }
+
+   error = get_unused_fd();
+   if (error < 0)
+   goto eexit_3;
+   fd = error;
+
+   /*
+* Link the inode to a directory entry by creating a unique name
+* using the inode sequence number.
+*/
+   error = -ENOMEM;
+   this.name = name;
+   this.len = strlen(name);
+   this.hash = 0;
+   dentry = d_alloc(aino_mnt->mnt_sb->s_root, &this);
+   if (!dentry)
+   goto eexit_4;
+   dentry->d_op = &ainofs_dentry_operations;
+   /* Do not publish this dentry inside the global dentry hash table */
+   dentry->d_flags &= ~DCACHE_UNHASHED;
+   d_instantiate(dentry, inode);
+
+   file->f_path.mnt = mntget(aino_mnt);
+   file->f_path.dentry = dentry;
+   file->f_mapping = inode->i_mapping;
+
+   file->f_pos = 0;
+   file->f_flags = O_RDONLY;
+   file->f_op = fops;
+   file->f_mode = FMODE_READ;
+   file->f_version = 0;
+   file->private_data = priv;
+
+   fd_install(fd, file);
+
+   *pfd = fd;
+   *pinode = inode;
+   *pfile = file;
+   return 0;
+
+eexit_4:
+   put_unused_fd(fd);
+eexit_3:
+   iput(inode);
+eexit_2:
+   put_filp(file);
+eexit_1:
+   return error;
+}
+
+
+static int ainofs_delete_dentry(struct dentry *dentry)
+{
+   /*
+* We faked vfs to believe the dentry was hashed when we created it.
+* Now we restore the flag so that dput() will work correctly.
+*/
+   dentry->d_flags |= DCACHE_UNHASHED;
+   return 1;
+}
+
+
+static struct inode *aino_getinode(void)
+{
+   return igrab(aino_inode);
+}
+
+
+/*
+ * A single inode exists for all aino files. Unlike pipes,
+ * aino inodes have no per-instance data associated, so we can avoid
+ * allocating multiple of them.
+ */
+static struct inode *aino_mkinode(void)
+{
+   int error = -ENOMEM;
+   struct inode *inode = new_inode(aino_mnt->mnt_sb);
+
+   if (!inode)
+   goto eexit_1;
+
+   inode->i_fop = &aino_fops;
+
+   /*
+* Mark the inode dirty from the very beginning,
+* that way it will never be moved to the dirty
+* list because mark_inode_dirty() will think
+* that it already _is_ on the dirty list.
+*/
+   inode->i_state = I_DIRTY;
+   inode->i_mode = S_IRUSR | S_IWUSR;
+   inode->i_uid = current->fsuid;
+   inode->i_gid = current->fsgid;
+   inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+   return inode;
+
+eexit_1:
+   return ERR_PTR(error);
+}
+
+
+static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
+const char *dev_name, void *data, struct vfsmount *mnt)
+{
+   return get_sb_pseudo(fs_type, "aino:", NULL, AINOFS_MAGIC,

[patch 2/9] signalfd/timerfd v1 - signalfd core ...

2007-03-09 Thread Davide Libenzi

This patch series implements the new signalfd() and signalfd_dequeue()
system calls. I took part of the original Linus code (and you know how
badly it can be broken :), and I added even more breakage ;)
Signals are fetched from the same signal queue used by the process,
so signalfd will compete with standard kernel delivery in dequeue_signal().
If you want to reliably fetch signals on the signalfd file, you need to
block them with sigprocmask(SIG_BLOCK).
This seems to be working fine on my Dual Opteron machine. I made a quick 
test program for it:

http://www.xmailserver.org/signafd-test.c

The signalfd() system call implements signal delivery into a file
descriptor receiver. The signalfd file descriptor is created with the
following API:

int signalfd(int ufd, const sigset_t *mask, size_t masksize);

The "ufd" parameter allows changing an existing signalfd sigmask without
going through a close/create cycle (Linus' idea). Use "ufd" == -1 if you
want a brand new signalfd file.
The "mask" parameter specifies the set of signals that we are
interested in. The "masksize" parameter is the size of "mask".
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can be
used together with epoll(2) too.
The read(2) system call will return a "struct signalfd_siginfo" structure
in the userspace supplied buffer. The return value is the number of bytes
copied in the supplied buffer, or -1 in case of error. The read(2) call
can also return 0, in case the sighand structure to which the signalfd
was attached has been orphaned. The O_NONBLOCK flag is also supported, and
read(2) will return -EAGAIN in case no signal is available.
The format of struct signalfd_siginfo is shown below; which fields are
valid depends on the (->code & __SI_MASK) value, in the same way as for a
struct siginfo:

struct signalfd_siginfo {
__u32 signo;/* si_signo */
__s32 err;  /* si_errno */
__s32 code; /* si_code */
__u32 pid;  /* si_pid */
__u32 uid;  /* si_uid */
__s32 fd;   /* si_fd */
__u32 tid;  /* si_tid */
__u32 band; /* si_band */
__u32 overrun;  /* si_overrun */
__u32 trapno;   /* si_trapno */
__s32 status;   /* si_status */
__s32 svint;/* si_int */
__u64 svptr;/* si_ptr */
__u64 utime;/* si_utime */
__u64 stime;/* si_stime */
__u64 addr; /* si_addr */
};



Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.20.ep2/fs/signalfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20.ep2/fs/signalfd.c  2007-03-09 10:55:26.0 -0800
@@ -0,0 +1,358 @@
+/*
+ *  fs/signalfd.c
+ *
+ *  Copyright (C) 2003  Linus Torvalds
+ *
+ *  Mon Mar 5, 2007: Davide Libenzi 
+ *  Changed ->read() to return a siginfo structure instead of signal number.
+ *  Fixed locking in ->poll().
+ *  Added sighand-detach notification.
+ *  Added fd re-use in sys_signalfd() syscall.
+ *  Now using anonymous inode source.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+struct signalfd_ctx {
+   struct list_head lnk;
+   wait_queue_head_t wqh;
+   sigset_t sigmask;
+   struct task_struct *tsk;
+   struct sighand_struct *sighand;
+};
+
+
+
+static void signalfd_cleanup(struct signalfd_ctx *ctx);
+static int signalfd_close(struct inode *inode, struct file *file);
+static unsigned int signalfd_poll(struct file *file, poll_table *wait);
+static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
+siginfo_t const *kinfo);
+static ssize_t signalfd_read(struct file *file, char *buf, size_t count,
+loff_t *ppos);
+
+
+
+static const struct file_operations signalfd_fops = {
+   .release= signalfd_close,
+   .poll   = signalfd_poll,
+   .read   = signalfd_read,
+};
+static struct kmem_cache *signalfd_ctx_cachep;
+
+
+/*
+ * This must be called with the sighand lock held.
+ */
+int signalfd_deliver(struct sighand_struct *sighand, int sig,
+struct siginfo *info)
+{
+   int nsig = 0;
+   struct list_head *pos;
+   struct signalfd_ctx *ctx;
+
+   list_for_each(pos, &sighand->sfdlist) {
+   ctx = list_entry(pos, struct signalfd_ctx, lnk);
+   /*
+* We use a negative signal value as a way to broadcast that the
+* sighand has been orphaned, so that we can notify all the
+* listeners about this. Remember the ctx->sigmask is inverted,
+* so if the user is interested in a signal, that corresponding

[patch 3/9] signalfd/timerfd v1 - wire up signalfd i386 arch ...

2007-03-09 Thread Davide Libenzi

This patch wires the signalfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.20.ep2.orig/arch/i386/kernel/syscall_table.S  2007-03-09 10:43:20.0 -0800
+++ linux-2.6.20.ep2/arch/i386/kernel/syscall_table.S   2007-03-09 10:56:00.0 -0800
@@ -319,3 +319,4 @@
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_signalfd  /* 320 */
Index: linux-2.6.20.ep2/include/asm-i386/unistd.h
===
--- linux-2.6.20.ep2.orig/include/asm-i386/unistd.h 2007-03-09 10:43:20.0 -0800
+++ linux-2.6.20.ep2/include/asm-i386/unistd.h  2007-03-09 10:56:00.0 -0800
@@ -325,10 +325,11 @@
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_signalfd  320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR

[patch 5/9] signalfd/timerfd v1 - signalfd compat code ...

2007-03-09 Thread Davide Libenzi

This patch implements the necessary compat code for the signalfd system call.


Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/fs/compat.c
===
--- linux-2.6.20.ep2.orig/fs/compat.c   2007-03-09 10:43:18.0 -0800
+++ linux-2.6.20.ep2/fs/compat.c2007-03-09 10:56:05.0 -0800
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -2235,3 +2236,24 @@
return sys_ni_syscall();
 }
 #endif
+
+asmlinkage long compat_sys_signalfd(int ufd,
+   const compat_sigset_t __user *sigmask,
+   compat_size_t sigsetsize)
+{
+   compat_sigset_t ss32;
+   sigset_t tmp;
+   sigset_t __user *ksigmask;
+
+   if (sigsetsize != sizeof(compat_sigset_t))
+   return -EINVAL;
+   if (copy_from_user(&ss32, sigmask, sizeof(ss32)))
+   return -EFAULT;
+   sigset_from_compat(&tmp, &ss32);
+   ksigmask = compat_alloc_user_space(sizeof(sigset_t));
+   if (copy_to_user(ksigmask, &tmp, sizeof(sigset_t)))
+   return -EFAULT;
+
+   return sys_signalfd(ufd, ksigmask, sizeof(sigset_t));
+}
+
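
[Editor's sketch] The conversion step in compat_sys_signalfd() above
(sigset_from_compat) widens a signal mask stored as 32-bit words into the
kernel's native word size. The sketch below shows that packing for a 64-bit
kernel, assuming the low word comes first. The struct and function names are
hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define NSIG_BITS     64                 /* signals covered by the mask */
#define COMPAT_WORDS  (NSIG_BITS / 32)   /* 32-bit caller: two words */
#define NATIVE_WORDS  (NSIG_BITS / 64)   /* 64-bit kernel: one word */

/* Hypothetical stand-ins for compat_sigset_t and sigset_t. */
struct compat_sigset32 { uint32_t sig[COMPAT_WORDS]; };
struct native_sigset64 { uint64_t sig[NATIVE_WORDS]; };

/* Pack each pair of 32-bit words into one 64-bit word, low word first. */
void sigset_widen(struct native_sigset64 *dst,
                  const struct compat_sigset32 *src)
{
    for (int i = 0; i < NATIVE_WORDS; i++)
        dst->sig[i] = (uint64_t)src->sig[2 * i] |
                      ((uint64_t)src->sig[2 * i + 1] << 32);
}

/* Signal N is stored in bit N-1 of the mask. */
int sigset_has(const struct native_sigset64 *set, int signo)
{
    int bit = signo - 1;
    return (int)((set->sig[bit / 64] >> (bit % 64)) & 1u);
}

void sigset_widen_selfcheck(void)
{
    struct compat_sigset32 in = { { 1u << 9, 1u << 1 } }; /* signals 10, 34 */
    struct native_sigset64 out;

    sigset_widen(&out, &in);
    assert(sigset_has(&out, 10) == 1);
    assert(sigset_has(&out, 34) == 1);
    assert(sigset_has(&out, 11) == 0);
}
```

The patch then copies the widened mask back to user space because
sys_signalfd() expects a __user pointer; the sketch leaves that part out.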

[patch 4/9] signalfd/timerfd v1 - wire up signalfd x86_64 arch ...

2007-03-09 Thread Davide Libenzi

This patch wires the signalfd system call into the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/include/asm-x86_64/unistd.h
===
--- linux-2.6.20.ep2.orig/include/asm-x86_64/unistd.h   2007-03-09 10:43:19.0 -0800
+++ linux-2.6.20.ep2/include/asm-x86_64/unistd.h    2007-03-09 10:56:02.0 -0800
@@ -619,8 +619,10 @@
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_signalfd  280
+__SYSCALL(__NR_signalfd, sys_signalfd)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_signalfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.ep2/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.20.ep2.orig/arch/x86_64/ia32/ia32entry.S  2007-03-09 10:43:19.0 -0800
+++ linux-2.6.20.ep2/arch/x86_64/ia32/ia32entry.S   2007-03-09 10:56:02.0 -0800
@@ -714,8 +714,10 @@
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
-   .quad sys_tee
+   .quad sys_tee   /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
-ia32_syscall_end:  
+   .quad sys_epoll_pwait
+   .quad sys_signalfd  /* 320 */
+ia32_syscall_end:

Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Ingo Molnar

* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> The important part is that there's more to the story than just pv_ops. 
> If you wanted to make such a change, then you'd need to refactor the 
> i386 support code to add a vma->paging helper layer.  That layer would 
> be available for any pv_ops interface to use if it wishes.

no such change is needed to native. [ other than the removal of tons of 
lowlevel hooks ;-) ] Think of this in terms of a completely separate MM 
layer for guest kernels, with all memory management details done on the 
hypervisor side, ok? I dont think you can emulate that in an equivalent 
way via VMI, the kernel object in this model is on the hypervisor side - 
while with VMI that does not look possible.

Ingo

Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Zachary Amsden


Ingo Molnar wrote:
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
>> [...] If that is the case then my ABI worries would indeed be wrong 
>> and i'd owe Zach a big fat apology [and more] for my flames ;-)
>
> that apology i very much owe to Zach no matter what the outcome of the 
> discussion. Zach, some of my mindless characterisations of the quality 
> of VMI code (and of your intentions) were really out of bound and were 
> unfair, and i'd like to apologize for that :-/


That's fine.  Despite not having an open source hypervisor, we really 
don't have evil intentions, and we really don't want to create tension 
or impede the progress of anyone else.  We think this work has positive 
benefits for Linux, our hypervisor, and others as well.


Some of our code did very much need fixing, and the positive thing we 
can take away from such a heated discussion is that we probably got more 
eyes on our code, maybe trying to find the evil, and instead finding 
bugs or things we could have done better.


At least someone did need to play devil's advocate, as this is an 
important thing to get right for the future direction of Linux, and I 
respect you for doing so, and don't take offense.


Zach

Re: [2/6] 2.6.21-rc2: known regressions

2007-03-09 Thread Pavel Machek

Hi!

> > You're better off using the VGA console, and letting X re-initialize the 
> > graphics device. That generally at least has a reasonably good chance of 
> > working.
> > 
> > Re-initializing graphics modes really is very hard. You can try with the 
> > BIOS video hack (I forget the kernel command line to turn it on), but we 
> > really do end up depending on X doing it better.
> 
> acpi_sleep=s3_bios has always worked for me on my ThinkPad T42p.
> 
> Even if one doesn't use the fb console at all, radeonfb apparently
> is still required on some ThinkPad models to work around BIOS bugs:
> 
> http://www.thinkwiki.org/wiki/Problem_with_high_power_drain_in_ACPI_sleep#Radeon_GPU_not_powered_off


s2ram (suspend.sf.net) should be able to work around this; it includes
parts of radeontool.


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [patch 2/3] fs: introduce perform_write aop

2007-03-09 Thread Mark Fasheh

On Fri, Mar 09, 2007 at 10:39:13AM +, Christoph Hellwig wrote:
> > One problem with this interface is that it cannot be used to write into the
> > filesystem by any means other than already-initialised buffers via iovecs. So
> > prepare/commit have to stay around for non-user data... 
> 
> Actually I think that's a good thing to a certain extent.  It reminds
> us that all other users are horrible abuse of the interface.  I'd even
> go so far as to make batch_write a callback that the filesystem passes
> to generic_file_aio_write to make clear it's not a generic thing but
> a helper.  (It's not a generic thing because it's the upper layer writing
> into the pagecache, not a pagecache to fs below operation).
> 
> That still leaves open how to get rid of ->prepare_write and ->commit_write
> completely, and for that we'll probably need ->kernel_read and ->kernel_write
> file operations.  But that's a step you shouldn't consider yet when doing
> this work.

->kernel_write() as opposed to genericizing ->perform_write() would be fine
with me. Just so long as we get rid of ->prepare_write and ->commit_write in
the sense that other kernel code doesn't call them directly. That interface just
doesn't work for Ocfs2. There, we have the triple whammy of having to order
cluster locks with page locks, avoiding nesting cluster locks in the case
that the user data has to be paged in (causing a lock in ->readpage()) and
grabbing / zeroing adjacent pages to fill holes.

So, a combination of ->perform_write and ->kernel_write() could really help
me solve my write woes.

Right now I've got Ocfs2 implementing its own lowest-level buffered write
code - think generic_file_buffered_write() replacement for Ocfs2. With some
duplicated code above that layer. What's nice is that I can abstract away
the "copy data into some target pages" bits such that the majority of that
code is re-usable for ocfs2's splice write operation. I'm not sure we could
have that low a level of abstraction for anything above the individual file
system, though, which also has to deal with non-kernel writes. That's
where a ->kernel_write() might come in handy.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]

Re: netconsole system freeze when cable unplugged

2007-03-09 Thread Simon Arlott


On 09/03/07 20:42, Francois Romieu wrote:
> Simon Arlott <[EMAIL PROTECTED]> :
>> When I unplug the cable the system just stops responding to anything, 
>> at all. No message is printed to the console when the cable is plugged 
>> back in.
>
> rtl8139_interrupt (spin_lock(&tp->lock))
> -> rtl8139_weird_interrupt
>    -> rtl_check_media
>       -> mii_check_media (printk(KERN_INFO "%s: link down\n", ...))
>          [netpoll stuff here]
>          -> rtl8139_poll_controller
>             -> rtl8139_interrupt
>                *deadlock*
>
> See below for my random stuff of the day. Feel free to open a PR at
> bugzilla.kernel.org if the issue does not go away.


The patch doesn't fix it, nothing changes. I'm not sure how this can 
be debugged if printk won't work...



8<-

8139too: netconsole breakage when link changes

rtl8139_interrupt is not supposed to be reentrant but its link
management part can emit printk.

Signed-off-by: Francois Romieu <[EMAIL PROTECTED]>

diff --git a/drivers/net/8139too.c b/drivers/net/8139too.c
index 99304b2..64467ad 100644
--- a/drivers/net/8139too.c
+++ b/drivers/net/8139too.c
@@ -2215,9 +2215,16 @@ static irqreturn_t rtl8139_interrupt (int irq, void *dev_instance)
  */
 static void rtl8139_poll_controller(struct net_device *dev)
 {
-   disable_irq(dev->irq);
+   struct rtl8139_private *tp = netdev_priv(dev);
+   unsigned long flags;
+   int rc;
+
+   rc = spin_trylock_irqsave(&tp->lock, flags);
+   if (!rc)
+   return;
+   spin_unlock(&tp->lock);
rtl8139_interrupt(dev->irq, dev);
-   enable_irq(dev->irq);
+   local_irq_restore(flags);
 }
 #endif
 
-

To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Simon Arlott
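
[Editor's sketch] The call chain quoted in this message is a re-entrancy
deadlock: the interrupt handler logs while holding its lock, and logging
loops back into the handler through the poll hook. The userspace sketch
below models that shape with a C11 atomic_flag standing in for tp->lock and
a try-lock in the poll path, roughly the idea the quoted patch attempts.
All names are made up, and this says nothing about why the patch did not
help on the reporter's machine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct fake_nic {
    atomic_flag lock;     /* stands in for tp->lock */
    int handled;          /* interrupts actually processed */
    int skipped_polls;    /* polls dropped because the lock was busy */
};

void fake_poll_controller(struct fake_nic *nic);

void fake_nic_init(struct fake_nic *nic)
{
    atomic_flag_clear(&nic->lock);
    nic->handled = 0;
    nic->skipped_polls = 0;
}

/* Logging while the lock is held re-enters the driver through the poll
 * hook, the way printk -> netconsole -> netpoll does in the trace. */
void fake_log_link_down(struct fake_nic *nic)
{
    fake_poll_controller(nic);
}

void fake_interrupt(struct fake_nic *nic, bool link_changed)
{
    while (atomic_flag_test_and_set(&nic->lock))
        ;   /* spin; never ends if the caller already holds the lock */
    nic->handled++;
    if (link_changed)
        fake_log_link_down(nic);
    atomic_flag_clear(&nic->lock);
}

/* Probe the lock first and give up instead of spinning on ourselves. */
void fake_poll_controller(struct fake_nic *nic)
{
    if (atomic_flag_test_and_set(&nic->lock)) {
        nic->skipped_polls++;
        return;
    }
    atomic_flag_clear(&nic->lock);
    fake_interrupt(nic, false);
}

void reentry_selfcheck(void)
{
    struct fake_nic nic;

    fake_nic_init(&nic);
    fake_interrupt(&nic, true);    /* nested poll is skipped, no hang */
    fake_poll_controller(&nic);    /* lock is free: runs the handler */
    assert(nic.handled == 2);
    assert(nic.skipped_polls == 1);
}
```

The probe-then-release step is only safe here because the sketch is single
threaded; the quoted patch keeps interrupts disabled across the same window.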

RE: [PATCH 1/2] NET: Multiple queue network device support

2007-03-09 Thread Waskiewicz Jr, Peter P

> * Waskiewicz Jr, Peter P <[EMAIL PROTECTED]> 2007-03-09 11:25
> > > > +   }
> > > > +   } else {
> > > > +   /* We're not a multi-queue device. */
> > > > +   spin_lock(&dev->queue_lock);
> > > > +   q = dev->qdisc;
> > > > +   if (q->enqueue) {
> > > > +   rc = q->enqueue(skb, q);
> > > > +   qdisc_run(dev);
> > > > +   spin_unlock(&dev->queue_lock);
> > > > +   rc = rc == NET_XMIT_BYPASS
> > > > +  ? NET_XMIT_SUCCESS : rc;
> > > > +   goto out;
> > > > +   }
> > > > +   spin_unlock(&dev->queue_lock);
> > > 
> > > Please don't duplicate already existing code.
> > 
> > I don't want to mix multiqueue and non-multiqueue code in the transmit
> > path.  This was an attempt to allow MQ and non-MQ devices to coexist
> > in a machine having separate code paths.  Are you suggesting to
> > combine them?  That would get very messy trying to determine what type
> > of lock to grab (subqueue lock or dev->queue_lock), not to mention
> > grabbing the dev->queue_lock would block multiqueue devices in that
> > same codepath.
> 
> The piece of code I quoted above is the branch executed if 
> multi queue is not enabled. The code you added is 100% 
> identical to the already existing enqueue logic. Just execute 
> the existing branch if multi queue is not enabled.
> 

I totally agree this is identical code if multiqueue isn't enabled.
However, when multiqueue is enabled, I don't want to make all network
drivers implement the subqueue API just to be able to lock the queues.
So the first thing I check is netif_is_multiqueue(dev), and if it
*isn't* multiqueue, it will run the existing code.  This way
non-multiqueue devices don't need any knowledge of the MQ API.

> > This is another attempt to keep the two codepaths separate.  The only
> > way I see of combining them is to check netif_is_multiqueue()
> > everytime I need to grab a lock, which I think would be excessive.
> 
> The code added is 100% identical to the existing code, just 
> be a little smarter on how you do the ifdefs.

An example could be an on-board adapter isn't multiqueue, and an
expansion card you have in your system is.  If I handle multiqueue being
on with just ifdef's, then I could use the expansion card, but not the
on-board adapter as-is.  The on-board driver would need to be updated to
have 1 queue in the multiqueue context, which is what I tried to avoid.

> > > Your modified qdisc_restart() expects the queue_lock to be locked,
> > > how can this work?
> > 
> > No, it doesn't expect the lock to be held.  Because of the multiple
> > queues, enqueueing and dequeueing are now asynchronous, since I can
> > enqueue to queue 0 while dequeuing from queue 1.  dev->queue_lock
> > isn't held, so this can happen.  Therefore the spin_trylock() is used
> > in this dequeue because I don't want to wait for someone to finish
> > with that queue in question (e.g. enqueue working), since that will
> > block all other bands/queues after the band in question.  So if the
> > lock isn't available to grab, we move to the next band.  If I were to
> > wait for the lock, I'd serialize the enqueue/dequeue completely, and
> > block other traffic flows in other queues waiting for the lock.
> 
> The first thing you do in qdisc_restart() after dequeue()ing 
> is unlock the sub queue lock. You explicitly unlock it 
> before calling qdisc_run() so I assume dequeue() is expected 
> to keep it locked. Something doesn't add up.

That's the entire point of this extra locking.  enqueue() is going to
put an skb into a band somewhere that maps to some queue, and there is
no way to guarantee the skb I retrieve from dequeue() is headed for the
same queue.  Therefore, I need to unlock the queue after I finish
enqueuing, since having that lock makes little sense to dequeue().
dequeue() will then grab *a* lock on a queue; it may be the same one we
had during enqueue(), but it may not be.  And the placement of the
unlock of that queue is exactly where it happens in non-multiqueue,
which is right before the hard_start_xmit().

I concede that the locking model is complex and seems really nasty, but
to truly separate queue functionality from one another, I see no other
feasible option than to run locking like this.  I am very open to
suggestions.
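
[Editor's sketch] The "try the band's lock, otherwise move on to the next
band" behaviour described above can be modelled in userspace as follows.
This is an assumption-laden illustration, not the patch's code: an
atomic_flag stands in for each subqueue lock, a counter stands in for the
skb queue, and the sketch drops the lock as soon as a packet is taken,
whereas the patch holds it until just before hard_start_xmit().

```c
#include <assert.h>
#include <stdatomic.h>

#define NUM_BANDS 3

struct band {
    atomic_flag lock;   /* stands in for the per-subqueue lock */
    int backlog;        /* packets waiting in this band */
};

void band_init(struct band *b, int backlog)
{
    atomic_flag_clear(&b->lock);
    b->backlog = backlog;
}

/* Take one packet from the first band that is both unlocked and
 * non-empty. Returns the band index, or -1 if nothing could be taken. */
int dequeue_any(struct band bands[NUM_BANDS])
{
    for (int i = 0; i < NUM_BANDS; i++) {
        if (atomic_flag_test_and_set(&bands[i].lock))
            continue;   /* busy: do not wait, try the next band */
        if (bands[i].backlog > 0) {
            bands[i].backlog--;
            atomic_flag_clear(&bands[i].lock);
            return i;
        }
        atomic_flag_clear(&bands[i].lock);
    }
    return -1;
}

void dequeue_selfcheck(void)
{
    struct band bands[NUM_BANDS];

    band_init(&bands[0], 1);
    band_init(&bands[1], 1);
    band_init(&bands[2], 0);

    /* Pretend an enqueue currently holds band 0. */
    (void)atomic_flag_test_and_set(&bands[0].lock);
    assert(dequeue_any(bands) == 1);    /* band 0 skipped, not waited on */
    assert(bands[0].backlog == 1);

    atomic_flag_clear(&bands[0].lock);
    assert(dequeue_any(bands) == 0);
    assert(dequeue_any(bands) == -1);   /* everything drained */
}
```

The sketch keeps no queue-length total across bands, which is exactly the
qdisc->q.qlen accounting question raised in the reply that follows.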

> 
> BTW, which lock serializes your write access to 
> qdisc->q.qlen? It used to be dev->queue_lock but that is 
> apparently not true for multi queue.
> 

This is a very good catch, because it isn't being protected on the
entire qdisc right now for PRIO.  However, Chris Leech pointed out the
LINK_STATE_QDISC_RUNNING bit is serializing things at the qdisc_run()
level

question about periodic clocks

2007-03-09 Thread Jeremy Fitzhardinge

How does the clock period get set on periodic timers?  In my clock
driver, I'm seeing a call to ->set_mode(CLOCK_EVT_MODE_PERIODIC, evt),
but then... nothing.  I was expecting a call to set_next_event to set
the timer period.

The calltrace is:

#0  xen_new_set_mode (mode=CLOCK_EVT_MODE_PERIODIC, evt=0xc10a2ac0)
at arch/i386/xen/time.c:275
#1  0xc01323da in clockevents_set_mode (dev=0xc10a2ac0, 
mode=CLOCK_EVT_MODE_PERIODIC) at kernel/time/clockevents.c:64
#2  0xc0132854 in tick_setup_periodic (dev=0xc10a2ac0, 
broadcast=) at kernel/time/tick-common.c:111

and tick_setup_periodic does just call clockevents_set_mode, but nothing
to set a period.

Am I supposed to assume some default period?  HZ?  (That's what hpet
seems to do.)

Is set_next_event only ever called if the timer is in ONESHOT mode?

Thanks,
J

Re: irq nobody cared issue on 2.6.2x

2007-03-09 Thread Hans-Peter Jansen

On Wednesday, 7 March 2007 20:48, Mws wrote:
> hi all,
>
> i just moved my win tv dvb-s card (PCI) from my old to my actual pc.
>
> its an ASUS M2N32 WS Professional AMD64 X2 Board equiped with
> the nvidia nForce 590 SLI MCP chipset.
>
> in the past, i had to use the noapic kernel cmdline param to get linux
> booting and working properly.
>
> iirc versions >=2.6.18 of the kernel fixed all of my previous problems,
> thus i am not using noapic since then.

Hmm, I've been using the same board here for two weeks. openSUSE 10.2 (i586, 
kernel 2.6.18.{2,8}) didn't install properly even with noapic (problems 
with the eth device while installing from an NFS server :-(, temporarily 
worked around with an e1000). I found a BIOS update was necessary for proper 
APIC operation. No problems since then, but I'm not using a DVB card in there 
either, since my vdr is a separate system ;-)

Cheers,
   Pete

Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Rik van Riel


Ingo Molnar wrote:
> ok, sure, how about the one i mentioned: long-term i'd like to have a 
> paravirt model where the guest does not store /any/ page tables - all 
> paging is managed by the hypervisor. The guest has a vma tree, but 
> otherwise it does not process pagefaults, has no concept of a pte (if in 
> paravirt mode), has no concept of kernel page tables either: there are 
> hypercalls to allocate/free guest-kernel memory, etc. This needs some 
> (serious) MM surgery but it's doable and it's interesting as well.  How
> would you map this to the VMI backend?

Ugh!

In a situation like that, how does the guest handle pageouts?

What about the dirty and accessed bits?

I'm guessing VMI would deal with this kind of abstraction in
exactly the same way we do the "native hardware" layer behind
paravirt_ops.  Ie. this example is a red herring.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Matt Mackall

On Sat, Mar 10, 2007 at 10:02:37AM +1100, Con Kolivas wrote:
> On Saturday 10 March 2007 09:29, Matt Mackall wrote:
> > On Sat, Mar 10, 2007 at 09:18:05AM +1100, Con Kolivas wrote:
> > > On Saturday 10 March 2007 08:57, Con Kolivas wrote:
> > > > On Saturday 10 March 2007 08:39, Matt Mackall wrote:
> > > > > So what's different between makes in parallel and make -j 5? Make's
> > > > > job server uses pipe I/O to control how many jobs are running.
> > > >
> > > > Hmm it must be those deep pipes again then. I removed any quirks
> > > > testing for those from mainline as I suspected it would be ok. Guess
> > > > I'm wrong.
> > >
> > > I shouldn't blame this straight up though if NO_HZ makes it better.
> > > Something else is going wrong... wtf though?
> >
> > Just so we're clear, dynticks has only 'fixed' the single non-parallel
> > make load so far.
> 
> Ok, back to the pipe idea. Without needing a kernel recompile, can you try 
> running the make -j5 as a SCHED_BATCH task?

Seems the same.

Oddly, nice make -j 5 is better than batch (but not quite up to stock).

-- 
Mathematics is the supreme nostalgia of our time.

Re: ALIGN (Re: [PATCH] Fix get_order())

2007-03-09 Thread Linus Torvalds

On Sat, 10 Mar 2007, Oleg Verych wrote:
> 
> OTOH, if i would write it this way
> 
> #define BALIGN(x,bits)  ((((x) >> (bits)) + 1) << (bits))

But that's *wrong*. It aligns something that is *already* aligned to 
something else.

So you'd have to do it as something like

#define ALIGN_TO_POW2(x,pow) \
	((((x)+(1<<(pow))-1)>>(pow))<<(pow))

and the thing is, that's actually both (a) less readable than what ALIGN() 
already does (b) unless the compiler notices that what you really want is 
a "bitwise and", it's really inefficient too because shifts are generally 
much more expensive than simple bitops.

So you're simply better off doing it as

#define ALIGN_TO_POW2(x,pow) ALIGN(x, (1ull << pow))

but then you'd still have to depend on the "typeof()" magic in ALIGN() to 
turn the final end result back to the type of "x".

(the "1ull" is necessary exactly because you don't know what type "x" is 
beforehand, so you need to make sure that the mask is at *least* as big a 
type as x, and not overflow to undefined C semantics).

Linus
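
[Editor's sketch] The difference between the two approaches in this message
can be checked with a few values. The sketch below copies ALIGN() and
__ALIGN_MASK() from the include/linux/kernel.h hunk quoted elsewhere in this
archive, spelling typeof as __typeof__ so it also builds in strict ISO
modes, and assumes the garbled BALIGN macro was meant to have balanced
parentheses as written here. It needs GCC or a compatible compiler.

```c
#include <assert.h>

#define __ALIGN_MASK(x, mask)  (((x) + (mask)) & ~(mask))
#define ALIGN(x, a)            __ALIGN_MASK(x, (__typeof__(x))(a) - 1)

/* Shift-based proposal: always bumps to the NEXT boundary. */
#define BALIGN(x, bits)        ((((x) >> (bits)) + 1) << (bits))

/* Mask-based version built on ALIGN(), as suggested in the message. */
#define ALIGN_TO_POW2(x, pow)  ALIGN(x, (1ull << (pow)))

void align_selfcheck(void)
{
    unsigned long aligned = 4096, unaligned = 4097;

    /* An already aligned value must stay put; BALIGN moves it. */
    assert(BALIGN(aligned, 12) == 8192UL);
    assert(ALIGN_TO_POW2(aligned, 12) == 4096UL);

    /* Both agree on values that are not yet aligned. */
    assert(BALIGN(unaligned, 12) == 8192UL);
    assert(ALIGN_TO_POW2(unaligned, 12) == 8192UL);

    /* The cast inside ALIGN() keeps the result in the type of x even
     * though the alignment was written as a 1ull expression. */
    assert(sizeof(ALIGN_TO_POW2(5u, 3)) == sizeof(unsigned int));
}
```

The last assertion is the "typeof() magic" point: without the cast, the
1ull operand would widen the whole expression to unsigned long long.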

Re: [PATCH] Use more gcc extensions in the Linux headers

2007-03-09 Thread Robert P. J. Day

On Sat, 10 Mar 2007, Rusty Russell wrote:

> On Fri, 2007-03-09 at 16:56 +1100, Rusty Russell wrote:
> > __builtin_types_compatible_p() has been around since gcc 2.95, and we
> > don't use it anywhere.  This patch quietly fixes that.
>
> OK, many people complained that it needed a comment.  Good point!
> ==
> Add comment to ARRAY_SIZE macro.
>
> Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
>
> diff -r 933e410f204f include/linux/kernel.h
> --- a/include/linux/kernel.h  Sat Mar 10 09:55:31 2007 +1100
> +++ b/include/linux/kernel.h  Sat Mar 10 09:55:53 2007 +1100
> @@ -35,6 +35,7 @@ extern const char linux_proc_banner[];
>  #define ALIGN(x,a)   __ALIGN_MASK(x,(typeof(x))(a)-1)
>  #define __ALIGN_MASK(x,mask) (((x)+(mask))&~(mask))
>
> +/* GCC is awesome. */
>  #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) \
>   + sizeof(typeof(int[1 - 2*!!__builtin_types_compatible_p(typeof(arr), \
>typeof(&arr[0]))]))*0)

ah, but is that "universe" kind of awesome, or the "hot dogs" kind of
awesome?

http://youtube.com/watch?v=0rYT0YvQ3hs

rday

-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page
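
[Editor's sketch] What the quoted ARRAY_SIZE() change buys is a compile-time
rejection of pointers. A standalone version is sketched below; the
MUST_BE_ARRAY helper name is invented for the illustration, typeof is
spelled __typeof__, and the code relies on the GCC builtin
__builtin_types_compatible_p, so it is not portable ISO C.

```c
#include <assert.h>
#include <stddef.h>

/*
 * For a real array, typeof(arr) (e.g. int[5]) is not compatible with
 * typeof(&arr[0]) (int *), so the builtin yields 0 and the expression is
 * sizeof(int[1]) * 0 == 0. For a pointer the two types match, the array
 * size becomes -1, and compilation fails.
 */
#define MUST_BE_ARRAY(arr) \
    (sizeof(int[1 - 2 * !!__builtin_types_compatible_p( \
        __typeof__(arr), __typeof__(&(arr)[0]))]) * 0)

#define ARRAY_SIZE(arr) \
    (sizeof(arr) / sizeof((arr)[0]) + MUST_BE_ARRAY(arr))

void array_size_selfcheck(void)
{
    static const int primes[] = { 2, 3, 5, 7, 11 };
    char buf[32];

    assert(ARRAY_SIZE(primes) == 5u);
    assert(ARRAY_SIZE(buf) == 32u);

    /* const int *p = primes;
     * ARRAY_SIZE(p);   would not compile: negative array size */
}
```

The check costs nothing at run time because the extra term is a constant
zero that the compiler folds away.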


Re: [PATCH/RFC] Delete JFFS (version 1)

2007-03-09 Thread Stefan Monnier

> I argue that you can count the users (who aren't on 2.4) on one hand, and
> developers don't seem to have cared for it in ages.

Rather than ask on mailing-lists, it's probably easier to just make the jffs
compilation fail (with a #error).  This way, if someone uses it, he'll bump
into it, no matter how much he wants to ignore messages posted on
mailing-lists.


Stefan


Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> [...] If that is the case then my ABI worries would indeed be wrong 
> and i'd owe Zach a big fat apology [and more] for my flames ;-)

that apology i very much owe to Zach no matter what the outcome of the 
discussion. Zach, some of my mindless characterisations of the quality 
of VMI code (and of your intentions) were really out of bound and were 
unfair, and i'd like to apologize for that :-/

Ingo

Re: [PATCH] Use more gcc extensions in the Linux headers

2007-03-09 Thread Roland Dreier

Perhaps this patch can go into Wesnoth for testing for a while before
we merge it into the kernel?

Re: [PATCH 3/7] revoke: core code

2007-03-09 Thread Peter Zijlstra

On Fri, 2007-03-09 at 10:15 +0200, Pekka J Enberg wrote:

> +static int revoke_vma(struct vm_area_struct *vma, struct zap_details *details)
> +{
> + unsigned long restart_addr, start_addr, end_addr;
> + int need_break;
> +
> + start_addr = vma->vm_start;
> + end_addr = vma->vm_end;
> +
> + /*
> +  * Not holding ->mmap_sem here.
> +  */
> + vma->vm_flags |= VM_REVOKED;
> + smp_mb();

Hmm, i_mmap_lock pins the vma and excludes modifications, but doesn't
exclude concurrent faults.

I guess it's safe.

> +  again:
> + restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr,
> +   details);
> +
> + need_break = need_resched() || need_lockbreak(details->i_mmap_lock);
> + if (need_break)
> + goto out_need_break;
> +
> + if (restart_addr < end_addr) {
> + start_addr = restart_addr;
> + goto again;
> + }
> + return 0;
> +
> +  out_need_break:
> + spin_unlock(details->i_mmap_lock);
> + cond_resched();
> + spin_lock(details->i_mmap_lock);
> + return -EINTR;

I'm not sure this scheme works, given a sufficiently loaded machine,
this might never complete.

> +}
> +
> +static int revoke_mapping(struct address_space *mapping, struct file *to_exclude)
> +{
> + struct vm_area_struct *vma;
> + struct prio_tree_iter iter;
> + struct zap_details details;
> + int err = 0;
> +
> + details.i_mmap_lock = &mapping->i_mmap_lock;
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
> + if (vma->vm_flags & VM_SHARED && vma->vm_file != to_exclude) {

I'm never sure of operator precedence and prefer:

 (vma->vm_flags & VM_SHARED) && ...

which leaves no room for error.
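
[Editor's sketch] For reference, the condition in the patch does parse as
intended, because & binds tighter than &&. The trap that explicit
parentheses protect against is mixing & with == or !=, which bind tighter
than &. The flag value and function names below are illustrative only.

```c
#include <assert.h>

#define VM_SHARED 0x08u   /* illustrative flag value */

/* As written in the patch: parses as (flags & VM_SHARED) && (a != b). */
int shared_and_other(unsigned flags, const void *file, const void *exclude)
{
    return flags & VM_SHARED && file != exclude;
}

/* The spelling preferred in the review; same meaning, nothing to recall. */
int shared_and_other_explicit(unsigned flags, const void *file,
                              const void *exclude)
{
    return (flags & VM_SHARED) && (file != exclude);
}

/* How `flags & VM_SHARED == VM_SHARED` would parse: == goes first, so
 * the test collapses to `flags & 1`. */
int flag_test_as_parsed(unsigned flags)
{
    return flags & (VM_SHARED == VM_SHARED);
}

int flag_test_right(unsigned flags)
{
    return (flags & VM_SHARED) == VM_SHARED;
}

void precedence_selfcheck(void)
{
    int a, b;

    assert(shared_and_other(VM_SHARED, &a, &b) == 1);
    assert(shared_and_other(VM_SHARED, &a, &a) == 0);
    assert(shared_and_other(0, &a, &b) ==
           shared_and_other_explicit(0, &a, &b));

    assert(flag_test_right(VM_SHARED) == 1);
    assert(flag_test_as_parsed(VM_SHARED) == 0);   /* silently wrong */
}
```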

> + err = revoke_vma(vma, &details);
> + if (err)
> + goto out;
> + }
> + }
> +
> + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
> + if (vma->vm_flags & VM_SHARED && vma->vm_file != to_exclude) {

Idem.

> + err = revoke_vma(vma, &details);
> + if (err)
> + goto out;
> + }
> + }
> +  out:
> + spin_unlock(&mapping->i_mmap_lock);
> + return err;
> +}



Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Ingo Molnar

* Chris Wright <[EMAIL PROTECTED]> wrote:

> > ok, sure, how about the one i mentioned: long-term i'd like to have 
> > a paravirt model where the guest does not store /any/ page tables - 
> > all paging is managed by the hypervisor. The guest has a vma tree, 
> > but otherwise it does not process pagefaults, has no concept of a 
> > pte (if in paravirt mode), has no concept of kernel page tables 
> > either: there are hypercalls to allocate/free guest-kernel memory, 
> > etc. This needs some (serious) MM surgery but it's doable and it's 
> > interesting as well. How would you map this to the VMI backend?
> 
> Sounds a lot like a userspace process.  My immediate thought is, why 
> not use containers, a more natural fit.  [...]

easy: in my model the hypervisor is isolated from the guest kernel. In 
the container model it is not. [ This is a basic quality requirement for 
virtualization: a guest kernel does not get to read any hypervisor 
crypto keys to HD-DVD smut! ;-) ]

> [...] But if you have _any_ hope of booting this kernel on native 
> hardware when it's not running under a hypervisor then I'd expect the 
> same pv_ops interfaces that allow it to run on native would allow VMI 
> to build and handle the shadow (since you'd have taken it out of the 
> kernel).  Heh, so in order to run this on native we had to add 
> fork/mmap pv ops?  I agree it might be interesting, but it's still not 
> clear that it's useful w/out some code to back it up, and see the 
> value.

progress ;-) But yes, some /really/ high-level pv_ops would be needed.

[ in the end we might be able to simplify it down to a single hook! That 
  would be: run_native_image / run_guest_image ;-) ]

seriously, most of the body of x86 kernel code is in filesystems, VFS, 
networking, scheduler and the core kernel - much of which can be shared 
between native and guest. The MM is a significant and very central 
chunk, but it is less than 3% of the total codesize.

Ingo

TCP MSG_PEEK assertion issue ...

2007-03-09 Thread davef1624


Hello,

I periodically see the following TCP kernel assertion errors in /var/log/message
(it does seem that networking is eventually able to recover from these errors):

kernel: KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1171)
kernel: KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1171)
kernel: KERNEL: assertion (tp->copied_seq == tp->rcv_nxt || (flags & (MSG_PEEK | MSG_TRUNC))) failed at net/ipv4/tcp.c (1235)
kernel: KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1171)
kernel: KERNEL: assertion (tp->copied_seq == tp->rcv_nxt || (flags & (MSG_PEEK | MSG_TRUNC))) failed at net/ipv4/tcp.c (1235)

These errors only seem to occur when using the sk98lin (SysKonnect) network
device driver and a Yukon Gigabit Ethernet 10/100/1000Base-T adapter (from
Marvell, I believe). The other NICs we're using in-house don't exhibit this
issue.

I've seen postings concerning this issue over the past year or so, but have
not seen a clear resolution or an explanation of its root cause.

I'm in the process of upgrading my sk98lin network driver to the latest from
SysKonnect to see if this corrects the problem.

Does anybody have any insight into this problem?

Best regards,
Dave F


Re: SATA resume slowness, e1000 MSI warning

2007-03-09 Thread Kok, Auke


Eric W. Biederman wrote:

[CHOP]

> Below is an additional set of warnings that should help debug this.
> The old code just got lucky that it triggered a warning when this happens.

I'm trying this patch together with the other 2 that you sent out a few days
ago. I'm seeing some minor issues with this and, as far as I can see, lots of
bogus warnings.

If I suspend/resume and unload e1000, then reinsert e1000.ko, I immediately hit
the WARN_ON at `msi.c:516: WARN_ON(!hlist_empty(&dev->saved_cap_space));`

I'm not sure that's useful debugging information. Even though saved state
exists, the module has been removed, so you might want to purge the state table
when the driver gets removed?
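
A minimal userspace sketch of the "purge on removal" idea, assuming a plain
singly linked list in place of the kernel's hlist. Every name here
(model_dev, saved_cap, the model_* helpers) is hypothetical and only models
the behaviour; it is not the PCI core's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one saved capability entry (not the kernel's type). */
struct saved_cap {
	int cap_id;
	struct saved_cap *next;
};

/* Hypothetical stand-in for the per-device state; not struct pci_dev. */
struct model_dev {
	struct saved_cap *saved_caps;	/* plays the role of saved_cap_space */
};

/* Record saved state for one capability, as a suspend path would. */
int model_save_cap(struct model_dev *dev, int cap_id)
{
	struct saved_cap *cap;

	assert(dev != NULL);
	cap = malloc(sizeof(*cap));
	if (!cap)
		return -1;
	cap->cap_id = cap_id;
	cap->next = dev->saved_caps;
	dev->saved_caps = cap;
	return 0;
}

/* Drop all saved state; the idea is to call this when the driver unbinds. */
void model_purge_saved_caps(struct model_dev *dev)
{
	assert(dev != NULL);
	while (dev->saved_caps) {
		struct saved_cap *cap = dev->saved_caps;

		dev->saved_caps = cap->next;
		free(cap);
	}
}

/* Equivalent of the emptiness check that the WARN_ON relies on. */
int model_saved_caps_empty(const struct model_dev *dev)
{
	assert(dev != NULL);
	return dev->saved_caps == NULL;
}
```

With a purge in the removal path, a later reload would start from an empty
list and the emptiness check would stay quiet.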



anyway, back to testing.

Auke



diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 01869b1..5113913 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -613,6 +613,7 @@ int pci_enable_msi(struct pci_dev* dev)
 		return -EINVAL;
 
 	WARN_ON(!!dev->msi_enabled);
+	WARN_ON(!hlist_empty(&dev->saved_cap_space));
 
 	/* Check whether driver already requested for MSI-X irqs */
 	if (dev->msix_enabled) {
@@ -638,6 +639,8 @@ void pci_disable_msi(struct pci_dev* dev)
 	if (!dev->msi_enabled)
 		return;
 
+	WARN_ON(!hlist_empty(&dev->saved_cap_space));
+
 	msi_set_enable(dev, 0);
 	pci_intx(dev, 1);		/* enable intx */
 	dev->msi_enabled = 0;
@@ -739,6 +742,7 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 		}
 	}
 	WARN_ON(!!dev->msix_enabled);
+	WARN_ON(!hlist_empty(&dev->saved_cap_space));
 
 	/* Check whether driver already requested for MSI irq */
 	if (dev->msi_enabled) {
@@ -763,6 +767,8 @@ void pci_disable_msix(struct pci_dev* dev)
 	if (!dev->msix_enabled)
 		return;
 
+	WARN_ON(!hlist_empty(&dev->saved_cap_space));
+
 	msix_set_enable(dev, 0);
 	pci_intx(dev, 1);		/* enable intx */
 	dev->msix_enabled = 0;
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bd44a48..4418839 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -677,6 +677,7 @@ pci_restore_state(struct pci_dev *dev)
 	}
 	pci_restore_pcix_state(dev);
 	pci_restore_msi_state(dev);
+	WARN_ON(!hlist_empty(&dev->saved_cap_space));
 
 	return 0;
 }


Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Zachary Amsden


Linus Torvalds wrote:
>> but ... maybe because VMI is so lowlevel and covers /all/ of x86 today,
>> it will always be able to emulate whatever different concept we can come
>> up with? Do we really know this absolutely sure?
>
> "For sure"? Absolutely not. But since any new interfaces we come up with
> for doing timers etc had better work perfectly fine on an old hardware
> platform too, we can't exactly require any interfaces that do things that
> a bog-standard old dual-PPro didn't do 10 years ago, can we?
>
> So assuming that the VMI interface is roughly as powerful (by virtue of
> basically emulating it) as the old single-ioapic/single-lapic systems we
> used to use, I don't think it should ever be a real problem. Hmm?


Sorry to keep you in a thread you want nothing more to do with, Linus, but
to answer the question completely directly:

There are four design requirements which are inviolate for achieving a
large measurable performance gain, which is the primary benefit of i386
paravirt-ops for us.  These changes are required for /ALL/ high
performance paravirtualized kernels not running in hardware
virtualization, across a broad spectrum of hypervisors, and have either
zero negative impact or opportunity for additional gain when hardware
virtualization is enabled.

1) Ability to run the kernel at non-zero CPL
2) Ability to replace hardware interrupt masking functions with
   virtualizable equivalents
3) Notification when page tables are allocated and released
4) Notification in some form when page table entries are updated (or
   vma's are changed)

Everything after these is an incremental gain, some more valuable than
others, but none as significant as the above four (apic_write,
incidentally, _is_ one of the more substantial gains for us).


These don't seem to be a major burden on the kernel at all.

#1 is already the default case now for even native hardware.

#2 requires a lot of hooks because interrupt masking is a common
function, and this is where the large number of hook points Ingo was
demonstrating came from, but these icache effects and costs are on
already expensive instructions.  In fact, it appears that on some hardware
the nop padding around cli / sti / pushf / popf contributes to a
mysterious performance gain, perhaps due to some pipelining anomaly.

#3 is not a common enough operation to be of performance concern.  It
does, however, require pagetables, just as native hardware does.  We can
implement that perfectly well in our backend anyway, just as the
native backend would if some reckless madman removed all notions of page
tables from a paravirt kernel.

#4 involves an extra call in page fault paths and from some points in
the mm layer.


There are no ABI requirements tied to these, merely the presence of any 
usable API for them in paravirt-ops.
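
As a rough userspace illustration of requirement #2, the sketch below routes
interrupt masking through a table of function pointers so that a "native"
and a "guest" backend can be swapped without touching the callers. All names
(irq_mask_ops, select_irq_ops, the two backends) are hypothetical, and plain
integer flags stand in for the hardware and hypervisor state; this is not
the paravirt-ops interface itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced ops table: one slot per masking primitive. */
struct irq_mask_ops {
	void (*irq_disable)(void);
	void (*irq_enable)(void);
	int  (*irqs_enabled)(void);
};

/* "Native" backend: stands in for cli/sti acting on the real flag register. */
static int native_if_flag = 1;
static void native_irq_disable(void) { native_if_flag = 0; }
static void native_irq_enable(void)  { native_if_flag = 1; }
static int  native_irqs_enabled(void) { return native_if_flag; }

/* "Guest" backend: flips a software flag that a hypervisor would consult,
 * so no privileged instruction is needed at non-zero CPL. */
static int guest_virtual_if = 1;
static void guest_irq_disable(void) { guest_virtual_if = 0; }
static void guest_irq_enable(void)  { guest_virtual_if = 1; }
static int  guest_irqs_enabled(void) { return guest_virtual_if; }

const struct irq_mask_ops native_irq_ops = {
	native_irq_disable, native_irq_enable, native_irqs_enabled,
};
const struct irq_mask_ops guest_irq_ops = {
	guest_irq_disable, guest_irq_enable, guest_irqs_enabled,
};

/* Backend chosen once at boot; callers only ever go through this pointer. */
static const struct irq_mask_ops *irq_ops = &native_irq_ops;

void select_irq_ops(const struct irq_mask_ops *ops)
{
	assert(ops != NULL);
	irq_ops = ops;
}

/* Backend-independent wrappers, as the rest of the kernel would see them. */
void masked_section_begin(void) { irq_ops->irq_disable(); }
void masked_section_end(void)   { irq_ops->irq_enable(); }
int  current_irqs_enabled(void) { return irq_ops->irqs_enabled(); }
```

The callers never learn which backend is installed, which is the sense in
which only an API, and no ABI, is required.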


Linus is right - our virtual hardware is an exact replica of real 
hardware.  So no matter how you change paravirtualization in the kernel, 
anything that runs on real hardware will continue to run on VMware.  VMI 
is tied very closely to the hardware, on purpose, and follows the rules 
of native hardware extremely closely.  So you can pretty much twist and 
abuse paravirt-ops in a number of ways, and as long as it continues to 
run on real hardware with the above four requirements, it still runs 
even on VMI.  Violate the above four requirements, and it costs a lot of 
performance, but we still continue to run.


Zach

Re: ALIGN (Re: [PATCH] Fix get_order())

2007-03-09 Thread Oleg Verych

On Wed, Mar 07, 2007 at 08:38:27AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 7 Mar 2007, Oleg Verych wrote:
> >
> > Probably it can be used to get rid of gccisms and "type fluff" due to
> > bitwise arithmetics in ALIGN?
> 
> Hell no.
> 
> The typeof is there to make sure we have the right type, and it's simple.
> 
> The current ALIGN() macro is efficient as hell (generating just a simple 
> mask+add). Turning it into some kind of horrible thing that uses ilog2() 
> would be a total mistake.

OTOH, if i would write it this way

#define BALIGN(x,bits)  ((((x) >> (bits)) + 1) << (bits))

that would give a more convenient way of expressing alignment (whose most
widely used values are powers of two) without the log2(:) requirement, no?

arch/powerpc/mm/hugetlbpage.c:  addr = 
ALIGN(addr+1,1UL
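
A small sketch comparing the two forms, assuming BALIGN() is read as
((((x) >> (bits)) + 1) << (bits)). Plain functions on unsigned long are used
instead of macros to stay clear of typeof; align_up() and balign() are
names chosen here, not kernel identifiers.

```c
#include <assert.h>

/*
 * Same arithmetic as the kernel's ALIGN(): round x up to a multiple of a,
 * where a must be a power of two. An already aligned x is returned as is.
 */
unsigned long align_up(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + (a - 1)) & ~(a - 1);
}

/*
 * The shift-based form: always moves to the next 2^bits boundary,
 * even when x is already aligned.
 */
unsigned long balign(unsigned long x, unsigned int bits)
{
	return ((x >> bits) + 1) << bits;
}
```

Under that reading, balign(x, bits) matches align_up(x + 1, 1UL << bits)
rather than align_up(x, 1UL << bits): balign(8, 3) gives 16 while
align_up(8, 8) gives 8.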

Re: [PATCH] Use more gcc extensions in the Linux headers

2007-03-09 Thread Randy Dunlap

On Sat, 10 Mar 2007 09:57:32 +1100 Rusty Russell wrote:

> On Fri, 2007-03-09 at 16:56 +1100, Rusty Russell wrote:
> > __builtin_types_compatible_p() has been around since gcc 2.95, and we
> > don't use it anywhere.  This patch quietly fixes that.

Bah.  Just because gcc has a "feature" doesn't mean we should use it
(in this form).


> OK, many people complained that it needed a comment.  Good point!
> ==
> Add comment to ARRAY_SIZE macro.
> 
> Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
> 
> diff -r 933e410f204f include/linux/kernel.h
> --- a/include/linux/kernel.h  Sat Mar 10 09:55:31 2007 +1100
> +++ b/include/linux/kernel.h  Sat Mar 10 09:55:53 2007 +1100
> @@ -35,6 +35,7 @@ extern const char linux_proc_banner[];
>  #define ALIGN(x,a)   __ALIGN_MASK(x,(typeof(x))(a)-1)
>  #define __ALIGN_MASK(x,mask) (((x)+(mask))&~(mask))
>  
> +/* GCC is awesome. */
>  #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])  
>   \
>   + sizeof(typeof(int[1 - 2*!!__builtin_types_compatible_p(typeof(arr), \
>typeof(&arr[0]))]))*0)


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Con Kolivas

On Saturday 10 March 2007 09:29, Matt Mackall wrote:
> On Sat, Mar 10, 2007 at 09:18:05AM +1100, Con Kolivas wrote:
> > On Saturday 10 March 2007 08:57, Con Kolivas wrote:
> > > On Saturday 10 March 2007 08:39, Matt Mackall wrote:
> > > > So what's different between makes in parallel and make -j 5? Make's
> > > > job server uses pipe I/O to control how many jobs are running.
> > >
> > > Hmm it must be those deep pipes again then. I removed any quirks
> > > testing for those from mainline as I suspected it would be ok. Guess
> > > > I'm wrong.
> >
> > I shouldn't blame this straight up though if NO_HZ makes it better.
> > Something else is going wrong... wtf though?
>
> Just so we're clear, dynticks has only 'fixed' the single non-parallel
> make load so far.

Ok, back to the pipe idea. Without needing a kernel recompile, can you try 
running the make -j5 as a SCHED_BATCH task?

This wrapper will make it possible:
http://freequaos.host.sk/schedtool/schedtool-1.2.9.tar.bz2

then
schedtool -B -e make -j5

If that helps it gives me something to work with.
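
For anyone without schedtool at hand, a minimal sketch of what such a
wrapper boils down to, assuming Linux and glibc: switch the current process
to SCHED_BATCH with sched_setscheduler(), then exec the command. The
function names are made up for illustration; this is not schedtool's source.

```c
#define _GNU_SOURCE		/* for SCHED_BATCH in <sched.h> on glibc */
#include <assert.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3		/* Linux policy number, in case the header hides it */
#endif

/*
 * Put the calling process under SCHED_BATCH. The static priority must be 0
 * for this policy. Returns 0 on success, -1 with errno set on failure.
 */
int become_batch_task(void)
{
	struct sched_param param;

	memset(&param, 0, sizeof(param));
	param.sched_priority = 0;
	return sched_setscheduler(0, SCHED_BATCH, &param);
}

/*
 * Switch to SCHED_BATCH, then replace this process with the given command.
 * The policy survives execve() and is inherited by children, so a whole
 * make -j5 tree would run as batch. Only returns on error.
 */
int exec_as_batch(char *const argv[])
{
	assert(argv != NULL && argv[0] != NULL);

	if (become_batch_task() != 0)
		return -1;
	execvp(argv[0], argv);
	return -1;
}
```

A caller would build an argv such as { "make", "-j5", NULL } and pass it to
exec_as_batch().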

-- 
-ck

Re: [PATCH] Fix building kernel under Solaris 11_snv

2007-03-09 Thread Jan Engelhardt


On Mar 9 2007 23:23, Sam Ravnborg wrote:

>Date: Fri, 9 Mar 2007 23:23:32 +0100
>From: Sam Ravnborg <[EMAIL PROTECTED]>
>To: Jan Engelhardt <[EMAIL PROTECTED]>,
>Paulo Marques <[EMAIL PROTECTED]>
>Cc: [EMAIL PROTECTED], Christoph Hellwig <[EMAIL PROTECTED]>,
>Deepak Saxena <[EMAIL PROTECTED]>,
>Andrew Morton <[EMAIL PROTECTED]>, linux-kernel@vger.kernel.org
>Subject: Re: [PATCH] Fix building kernel under Solaris 11_snv
>

Reference: http://lkml.org/lkml/2007/3/8/368
Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]>

Happy cherrypicking.
More comments below.

>> >> Index: linux-2.6.21-rc3/scripts/kallsyms.c
>> >> ===
>> >> --- linux-2.6.21-rc3.orig/scripts/kallsyms.c  2007-03-07 
>> >> 05:41:20.0 +0100
>> >> +++ linux-2.6.21-rc3/scripts/kallsyms.c   2007-03-07 23:46:46.249005000 
>> >> +0100
>> >> @@ -378,6 +378,40 @@
>> >>   table_cnt = pos;
>> >>  }
>> >>  
>> >> +#ifdef __sun__
>> >> +/* Return the first occurrence of NEEDLE in HAYSTACK.  */
>> >> +void *
>> >> +memmem (haystack, haystack_len, needle, needle_len)
>> >> + const void *haystack;
>> >> +  return (void *) begin;
>> >> +
>> >> +  return NULL;
>> >> +}
>> >> +#endif
>> >> +
>> >>  /* replace a given token in all the valid symbols. Use the sampled 
>> >> symbols
>> >>   * to update the counts */
>> >>  static void compress_symbols(unsigned char *str, int idx)
>> 
>> This one, I am just waiting for someone to object to the extra #if-#endif.
>
>I was planning to ask Paulo if strstr could not be used - Paulo?

I am not sure, but I would tend to say "no".

kallsyms.c:
p1 = table[i].sym;  

/* find the token on the symbol */  
p2 = memmem(p1, len, str, 2);   
if (!p2) continue;  

My first impression would be that 'p1' is a multi-nul-terminated string
array ("foo\0bar\0next\0symbol\0..."), much like the entire .strtab
section (I have not actually bothered to check, so it is a raw guess), and
I do not think strstr would replace memmem here if that was the case.
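
To make the difference concrete, here is a simple byte-wise search that
could serve as a fallback where libc lacks memmem(). It is a sketch written
for this note (hence the name my_memmem), not the code from the patch, and
the assumption that the symbol data may contain NUL bytes is still the raw
guess stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simple O(n*m) stand-in for memmem(): find needle[0..needle_len) inside
 * haystack[0..haystack_len), treating both as raw bytes (NULs included).
 * Named my_memmem to avoid clashing with a libc that already has memmem.
 */
void *my_memmem(const void *haystack, size_t haystack_len,
		const void *needle, size_t needle_len)
{
	const unsigned char *h = haystack;
	size_t i;

	assert(haystack != NULL && needle != NULL);

	if (needle_len == 0)
		return (void *)h;
	if (needle_len > haystack_len)
		return NULL;

	for (i = 0; i + needle_len <= haystack_len; i++) {
		if (memcmp(h + i, needle, needle_len) == 0)
			return (void *)(h + i);
	}
	return NULL;
}
```

On a buffer like "foo\0bar\0next", strstr() stops at the first NUL and never
sees "next", while the length-based search does.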

>> >> Index: linux-2.6.21-rc3/scripts/kconfig/Makefile
>> >> ===
>> >> --- linux-2.6.21-rc3.orig/scripts/kconfig/Makefile2007-03-07 
>> >> 05:41:20.0 +0100
>> >> +++ linux-2.6.21-rc3/scripts/kconfig/Makefile 2007-03-07 
>> >> 23:21:19.730679000 +0100
>> >> @@ -88,7 +88,7 @@
>> >>  HOST_EXTRACFLAGS = $(shell $(CONFIG_SHELL) $(check-lxdialog) -ccflags)
>> >>  HOST_LOADLIBES   = $(shell $(CONFIG_SHELL) $(check-lxdialog) -ldflags 
>> >> $(HOSTCC))
>> >>  
>> >> -HOST_EXTRACFLAGS += -DLOCALE
>> >> +HOST_EXTRACFLAGS += -DLOCALE -std=c99 -D__EXTENSIONS__
>> >>  
>> >>  PHONY += $(obj)/dochecklxdialog
>> >>  $(obj)/dochecklxdialog:
>> 
>> The error message for this one was:  only valid in C99 mode.
>> Linux GCC 4.1.2 does not print that, Solaris GCC 3.4.3 does. I do not
>> know offhand who is right.
>
>The -std= looks safe. The __EXTENSIONS__ part seems safe and needed after a
>bit of googling.

Humm humm. Linux's /usr/include does not have any (glibc) file with the
string __EXTENSIONS__ in it. Though, the linux manpage for fileno() does
not mention fileno() conforming to anything. Under Linux, fileno() is
wrapped inside __USE_POSIX, and the comments in Solaris's headers also
indicate it is something posixy.

Perhaps have something like

#ifdef __sun__
#   define __EXTENSIONS__ 1
#endif

>> >> Index: linux-2.6.21-rc3/scripts/mod/file2alias.c
>> >> ===
>> >> --- linux-2.6.21-rc3.orig/scripts/mod/file2alias.c2007-03-07 
>> >> 05:41:20.0 +0100
>> >> +++ linux-2.6.21-rc3/scripts/mod/file2alias.c 2007-03-07 
>> >> 23:41:23.772026000 +0100
>> >> @@ -32,6 +32,8 @@
>> >>  typedef uint32_t __u32;
>> >>  typedef uint16_t __u16;
>> >>  typedef unsigned char__u8;
>> >> +typedef int32_t  __s32;
>> >> +typedef int16_t  __s16;
>> >>  
>> >>  /* Big exception to the "don't include kernel headers into userspace, 
>> >> which
>> >>   * even potentially has different endianness and word sizes, since
>> 
>> HAX.
>
>This goes away when we do not include input.h I think.

Exactly.

>> >> Index: linux-2.6.21-rc3/scripts/mod/modpost.h
>> >> ===
>> >> --- linux-2.6.21-rc3.orig/scripts/mod/modpost.h   2007-03-07 
>> >> 05:41:20.0 +0100
>> >> +++ linux-2.6.21-rc3/scripts/mod/modpost.h2007-03-07 
>> >> 23:37:01.31529 +0100
>> >> @@ -41,6 +41,11 @@
>> >>  #define ELF_R_TYPE  ELF64_R_TYPE
>> >>  #endif
>> >>  
>> >> +#ifdef __sun__
>> >> +typedef uint16_t Elf32_Section;
>> >> +typedef uint16_t Elf64_Section;
>> >

Re: [PATCH 1/2] NET: Multiple queue network device support

2007-03-09 Thread Thomas Graf

* Waskiewicz Jr, Peter P <[EMAIL PROTECTED]> 2007-03-09 11:25
> > > + }
> > > + } else {
> > > + /* We're not a multi-queue device. */
> > > + spin_lock(&dev->queue_lock);
> > > + q = dev->qdisc;
> > > + if (q->enqueue) {
> > > + rc = q->enqueue(skb, q);
> > > + qdisc_run(dev);
> > > + spin_unlock(&dev->queue_lock);
> > > + rc = rc == NET_XMIT_BYPASS
> > > +? NET_XMIT_SUCCESS : rc;
> > > + goto out;
> > > + }
> > > + spin_unlock(&dev->queue_lock);
> > 
> > Please don't duplicate already existing code.
> 
> I don't want to mix multiqueue and non-multiqueue code in the transmit
> path.  This was an attempt to allow MQ and non-MQ devices to coexist in
> a machine having separate code paths.  Are you suggesting to combine
> them?  That would get very messy trying to determine what type of lock
> to grab (subqueue lock or dev->queue_lock), not to mention grabbing the
> dev->queue_lock would block multiqueue devices in that same codepath.

The piece of code I quoted above is the branch executed if multi queue
is not enabled. The code you added is 100% identical to the already
existing enqueue logic. Just execute the existing branch if multi queue
is not enabled.

> This is another attempt to keep the two codepaths separate.  The only
> way I see of combining them is to check netif_is_multiqueue() every time
> I need to grab a lock, which I think would be excessive.

The code added is 100% identical to the existing code, just be a little
smarter on how you do the ifdefs.

> > >   }
> > >  
> > >   return NULL;
> > > @@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch)
> > >   struct sk_buff *skb;
> > >   struct prio_sched_data *q = qdisc_priv(sch);
> > >   int prio;
> > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > > + int queue;
> > > +#endif
> > >   struct Qdisc *qdisc;
> > >  
> > > + /*
> > > +  * If we're multiqueue, the basic approach is try the 
> > lock on each
> > > +  * queue.  If it's locked, either we're already 
> > dequeuing, or enqueue
> > > +  * is doing something.  Go to the next band if we're 
> > locked.  Once we
> > > +  * have a packet, unlock the queue.  NOTE: the 
> > underlying qdisc CANNOT
> > > +  * be a PRIO qdisc, otherwise we will deadlock.  FIXME
> > > +  */
> > >   for (prio = 0; prio < q->bands; prio++) {
> > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > > + if (netif_is_multiqueue(sch->dev)) {
> > > + queue = q->band2queue[prio];
> > > + if 
> > (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) {
> > > + qdisc = q->queues[prio];
> > > + skb = qdisc->dequeue(qdisc);
> > > + if (skb) {
> > > + sch->q.qlen--;
> > > + skb->priority = prio;
> > > + 
> > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
> > > + return skb;
> > > + }
> > > + 
> > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
> > > + }
> > 
> > Your modified qdisc_restart() expects the queue_lock to be 
> > locked, how can this work?
> 
> No, it doesn't expect the lock to be held.  Because of the multiple
> queues, enqueueing and dequeueing are now asynchronous, since I can
> enqueue to queue 0 while dequeuing from queue 1.  dev->queue_lock isn't
> held, so this can happen.  Therefore the spin_trylock() is used in this
> dequeue because I don't want to wait for someone to finish with that
> queue in question (e.g. enqueue working), since that will block all
> other bands/queues after the band in question.  So if the lock isn't
> available to grab, we move to the next band.  If I were to wait for the
> lock, I'd serialize the enqueue/dequeue completely, and block other
> traffic flows in other queues waiting for the lock.

The first thing you do in qdisc_restart() after dequeue()ing is unlock
the sub queue lock. You explicitly unlock it before calling qdisc_run()
so I assume dequeue() is expected to keep it locked. Something doesn't
add up.

BTW, which lock serializes your write access to qdisc->q.qlen? It used
to be dev->queue_lock but that is apparently not true for multi queue.
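
A userspace model of the trylock-per-band dequeue under discussion, to make
the qlen question concrete. It assumes C11 atomics: an atomic_flag stands in
for each subqueue spinlock, and the shared counter is made atomic as one
possible answer to the serialization question. That choice is an assumption
of this sketch, not what the patch does, and all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_BANDS 3

/* One band: a try-lockable flag plus a trivial packet counter as its queue. */
struct model_band {
	atomic_flag busy;	/* stands in for the per-subqueue spinlock */
	int packets;		/* only touched while busy is held */
};

struct model_prio {
	struct model_band band[MODEL_BANDS];
	atomic_int qlen;	/* total across bands; no band lock covers it */
};

void model_init(struct model_prio *q)
{
	int i;

	assert(q != NULL);
	for (i = 0; i < MODEL_BANDS; i++) {
		atomic_flag_clear(&q->band[i].busy);
		q->band[i].packets = 0;
	}
	atomic_init(&q->qlen, 0);
}

static int band_trylock(struct model_band *b)
{
	return !atomic_flag_test_and_set(&b->busy);
}

static void band_unlock(struct model_band *b)
{
	atomic_flag_clear(&b->busy);
}

/* Enqueue one packet into a band; spins until the band lock is free. */
void model_enqueue(struct model_prio *q, int prio)
{
	struct model_band *b;

	assert(q != NULL && prio >= 0 && prio < MODEL_BANDS);
	b = &q->band[prio];
	while (!band_trylock(b))
		;
	b->packets++;
	atomic_fetch_add(&q->qlen, 1);
	band_unlock(b);
}

/* Returns the band dequeued from, or -1 if every band was empty or busy. */
int model_dequeue(struct model_prio *q)
{
	int prio;

	assert(q != NULL);
	for (prio = 0; prio < MODEL_BANDS; prio++) {
		struct model_band *b = &q->band[prio];

		if (!band_trylock(b))
			continue;	/* held elsewhere: skip, do not wait */
		if (b->packets > 0) {
			b->packets--;
			atomic_fetch_sub(&q->qlen, 1);
			band_unlock(b);
			return prio;
		}
		band_unlock(b);
	}
	return -1;
}
```

Note that skipping a busy band means a lower-priority band can be served
while a higher-priority one still holds packets.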

Re: [PATCH] Use more gcc extensions in the Linux headers

2007-03-09 Thread Rusty Russell

On Fri, 2007-03-09 at 16:56 +1100, Rusty Russell wrote:
> __builtin_types_compatible_p() has been around since gcc 2.95, and we
> don't use it anywhere.  This patch quietly fixes that.

OK, many people complained that it needed a comment.  Good point!
==
Add comment to ARRAY_SIZE macro.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>

diff -r 933e410f204f include/linux/kernel.h
--- a/include/linux/kernel.hSat Mar 10 09:55:31 2007 +1100
+++ b/include/linux/kernel.hSat Mar 10 09:55:53 2007 +1100
@@ -35,6 +35,7 @@ extern const char linux_proc_banner[];
 #define ALIGN(x,a) __ALIGN_MASK(x,(typeof(x))(a)-1)
 #define __ALIGN_MASK(x,mask)   (((x)+(mask))&~(mask))
 
+/* GCC is awesome. */
 #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])
  \
+ sizeof(typeof(int[1 - 2*!!__builtin_types_compatible_p(typeof(arr), \
 typeof(&arr[0]))]))*0)
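
Since the one-line comment does not say much, here is the same trick split
into named pieces, as a sketch that needs GCC or a compatible compiler. The
macro names IS_POINTER, MUST_BE_ARRAY and CHECKED_ARRAY_SIZE are invented
for this note; the kernel macro is the one in the diff above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * 1 if arr is really a pointer: for a pointer p, the types of p and &p[0]
 * are the same; for a true array they differ (T[N] versus T *).
 * __typeof__ and __builtin_types_compatible_p are GCC extensions.
 */
#define IS_POINTER(arr) \
	__builtin_types_compatible_p(__typeof__(arr), __typeof__(&(arr)[0]))

/*
 * int[1 - 2*1] is int[-1], which the compiler rejects; int[1 - 2*0] is
 * int[1], whose sizeof is multiplied by 0, so the guard adds nothing.
 */
#define MUST_BE_ARRAY(arr) (sizeof(int[1 - 2 * !!IS_POINTER(arr)]) * 0)

#define CHECKED_ARRAY_SIZE(arr) \
	(sizeof(arr) / sizeof((arr)[0]) + MUST_BE_ARRAY(arr))

size_t demo_count(void)
{
	int table[7] = {0};
	/* int *p = table; CHECKED_ARRAY_SIZE(p) would fail to compile here. */
	size_t n = CHECKED_ARRAY_SIZE(table);

	assert(n == 7);
	return n;
}
```

The plain sizeof division would silently return a wrong count for a pointer;
with the guard the mistake becomes a build error.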




Re: [patch 29/30] sata_nv: don't read shadow registers when in ADMA mode

2007-03-09 Thread Robert Hancock


Alistair John Strachan wrote:

I lean towards "yes", since it is a needed-by-hardware fix, but I also
am interested in testing feedback since it is so late in the 2.6.21-rc
game.

I would lean toward that as well, but it would be good to get some
testing from some sata_nv ADMA users to make sure it doesn't do anything
funny for them..


Since I've been a bit of a problem case this time, I'd be happy to test it.

Can I assume that I can apply the patch you sent to Jeff "[PATCH] sata_nv: 
revert use of notifiers for now", and apply this one, to -rc3, and then be 
able to usefully test?


Yes, you should be able to.

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > but ... maybe because VMI is so lowlevel and covers /all/ of x86 
> > today, it will always be able to emulate whatever different concept 
> > we can come up with? Do we really know this absolutely sure?
> 
> "For sure"? Absolutely not. But since any new interfaces we come up 
> with for doing timers etc had better work perfectly fine on an old 
> hardware platform too, we can't exactly require any interfaces that do 
> things that a bog-standard old dual-PPro didn't do 10 years ago, can 
> we?

i'm not really thinking in terms of extending the native kernel - 
whatever we do on the native kernel we'll indeed have to be able to do 
on 'real' hardware too - including really old boxes.

i'm thinking more in terms of having some new and more intelligent 
virtualization-only abstractions.

For example i think people are way too obsessed with building virtual 
'machines' that look like PCs, while i've got some truly pie-in-the-sky 
ideas floating to simplify virtual machines, like the one i just 
outlined to Chris:

|| long-term i'd like to have a paravirt model where the guest does not 
|| store /any/ page tables - all paging is managed by the hypervisor. 
|| The guest has a vma tree, but otherwise it does not process 
|| pagefaults, has no concept of a pte (if in paravirt mode), has no 
|| concept of kernel page tables either: there are hypercalls to 
|| allocate/free guest-kernel memory, etc. This needs some (serious) MM 
|| surgery but it's doable and it's interesting as well. How would you 
|| map this to the VMI backend?

Clearly the ugliest and most complex part of hypervisors is MMU support. 
So the above model would avoid all the shadow page table complexities 
and other MMU nasties, and keep that stuff in the hypervisor. [ in 
exchange for a whole set of other complexities and problems ;-) ] This 
would be even more highlevel than UML (UML emulates pagetables in 
user-space), in that regard.

even in such a drastic model we could share like 90% of the x86 binary 
code with the rest of the kernel (all the filesystem support, networking 
stack, etc. would still be reusable by the guest kernel), so a paravirt 
approach, where native and guest is the same image (as opposed to the 
UML model, where they are separate) still makes sense and is preferred 
by distributions.

Mapping this into VMI calls looks ... near impossible to me. VMI really 
assumes that there is no hypervisor state for kernel objects - while the 
above model _shares_ a very substantial kernel object with the 
hypervisor - and in fact 100% delegates its handling to the hypervisor 
altogether. Hm?

Ingo

Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Lee Revell


On 3/9/07, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> I think the sound example to the right really shows it. /dev/dsp has a
> consistent ABI on a ton of systems. The API below it, varies. Linux got
> file_operations and ALSA. Solaris/BSD may have its
> vnode-and-so-on-functions and some sort of OSS.

I think this is a poor example as applications lose a lot of
functionality (multiple stream mixing, software volume control, etc)
by going through the legacy /dev/dsp interface vs. using native ALSA.

Lee

Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Chris Wright

* Ingo Molnar ([EMAIL PROTECTED]) wrote:
> * Chris Wright <[EMAIL PROTECTED]> wrote:
> > i'm not really one to argue on behalf of VMI, but i don't think it's 
> > as dire as you make it out. [...]
> 
> hey, that's what i thought when i helped do the vDSO, until i got 
> slapped with cold reality called "CONFIG_COMPAT_VDSO". I'm a bit more 
> careful about ABIs since then =B-)

heh, once bitten twice shy, or however that goes ;-)

> > [...] the VMI is client code of pv_ops, and as the kernel changes that 
> > client code will simply have to adapt.  of course there are 
> > theoretical limitations, but let's keep it grounded to practical 
> > reality. the whole premise is evolution.  so throw out specific 
> > issues, and let's adapt rather than fall deep into theoretical 
> > rhetoric.
> 
> ok, sure, how about the one i mentioned: long-term i'd like to have a 
> paravirt model where the guest does not store /any/ page tables - all 
> paging is managed by the hypervisor. The guest has a vma tree, but 
> otherwise it does not process pagefaults, has no concept of a pte (if in 
> paravirt mode), has no concept of kernel page tables either: there are 
> hypercalls to allocate/free guest-kernel memory, etc. This needs some 
> (serious) MM surgery but it's doable and it's interesting as well. How 
> would you map this to the VMI backend?

Sounds a lot like a userspace process.  My immediate thought is, why
not use containers, a more natural fit.  But if you have _any_ hope
of booting this kernel on native hardware when it's not running under
a hypervisor then I'd expect the same pv_ops interfaces that allow it
to run on native would allow VMI to build and handle the shadow (since
you'd have taken it out of the kernel).  Heh, so in order to run this on
native we had to add fork/mmap pv ops?  I agree it might be interesting,
but it's still not clear that it's useful w/out some code to back it up,
and see the value.

thanks,
-chris

Re: [PATCH] swsusp: Disable nonboot CPUs before entering platform suspend

2007-03-09 Thread Pavel Machek

On Fri 2007-03-09 23:34:00, Rafael J. Wysocki wrote:
> On Friday, 9 March 2007 23:13, Pavel Machek wrote:
> > Hi!
> > 
> > > > > Index: linux-2.6.21-rc3/kernel/power/user.c
> > > > > ===
> > > > > --- linux-2.6.21-rc3.orig/kernel/power/user.c
> > > > > +++ linux-2.6.21-rc3/kernel/power/user.c
> > > > > @@ -402,9 +402,10 @@ static int snapshot_ioctl(struct inode *
> > > > >  
> > > > >   case PMOPS_ENTER:
> > > > >   if (data->platform_suspend) {
> > > > > + disable_nonboot_cpus();
> > > > >   
> > > > > kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
> > > > >   error = pm_ops->enter(PM_SUSPEND_DISK);
> > > > > - error = 0;
> > > > > + enable_nonboot_cpus();
> > > > 
> > > > Why did we discard return code in previous versions? Do we still want
> > > > to do that?
> > > 
> > > I think it was a mistake.
> > 
> > I took a look at git-annotate, and it is your code, so I assume you
> > are right. ACK, then.
> 
> Thanks!
> 
> BTW, what about the patch at http://lkml.org/lkml/2007/3/8/363?

Seems obviously correct to me (ACK), but I did not have time to test it.
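
The control flow of the patched PMOPS_ENTER branch, reduced to a userspace
sketch: take the secondary CPUs down, call the platform hook, bring the CPUs
back, and hand the hook's return code to the caller instead of overwriting
it with 0. The model_* names and the integer flag are stand-ins invented for
this note, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel helpers named in the patch. */
static int nonboot_cpus_online = 1;
static void model_disable_nonboot_cpus(void) { nonboot_cpus_online = 0; }
static void model_enable_nonboot_cpus(void)  { nonboot_cpus_online = 1; }

/* Shape of a platform "enter" hook: 0 on success, negative errno on failure. */
typedef int (*enter_fn)(void);

/*
 * Disable secondary CPUs, enter the platform state, re-enable the CPUs,
 * and report whatever enter() returned.
 */
int model_platform_enter(enter_fn enter)
{
	int error;

	assert(enter != NULL);
	model_disable_nonboot_cpus();
	error = enter();
	model_enable_nonboot_cpus();
	return error;
}

int model_nonboot_cpus_online(void) { return nonboot_cpus_online; }

/* Two sample hooks for trying the function out. */
int model_enter_ok(void)   { return 0; }
int model_enter_fail(void) { return -5; /* -EIO */ }
```

With the old "error = 0;" line a failing hook would have been reported as
success; here the failure reaches the caller and the CPUs still come back.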

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: Trouble using some (fast) compact flash as ide device on an embedded system

2007-03-09 Thread Bartlomiej Zolnierkiewicz


[ cc:ing Alan who may have a better idea what is wrong with this CF than me ]

On Friday 09 March 2007, Marco Lazzarotto wrote:
> Hallo! :-)
> 
> Bartlomiej Zolnierkiewicz wrote:
> > Czesc!
> > 
> > On Tuesday 06 March 2007, Marco Lazzarotto wrote:
> > 
> >>Ciao!
> >>
> >>Bartlomiej Zolnierkiewicz wrote:
> >>
> >>>On Friday 02 March 2007, Pavel Machek wrote:
> >>>
> >>>
> Hi!
> 
> 
> 
> >As I reported in bug 8036 in bugzilla.kernel.org,
> >
> >Hardware Environment:
> >
> >- Use a compact flash SanDisk SDCFB-128 Firmware revision HDX 2.15
> >  (we used other compact flashes with the same hw and sw for years
> >   with  no trouble)
> >
> >It happens on both etx boards:
> >- VIA SOM-ETX (4475)
> >- Gene-4312
> >>
> >>ERRATA CORRIGE: Gene-4312 is not a etx board ;-) but a pc/104
> > 
> > 
> > What IDE hardware / host driver is used by this system?
> 
> NB: I'm using the VIA SOM-ETX (4475) for debugging
> 
> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> VP_IDE: IDE controller at PCI slot :00:07.1
> PCI: Calling quirk c01dc1e8 for :00:07.1
> VP_IDE: chipset revision 6
> VP_IDE: not 100% native mode: will probe irqs later
> VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci:00:07.1
> ide1: BM-DMA at 0xe408-0xe40f, BIOS settings: hdc:DMA, hdd:pio
> 
> (I disabled DMA in the bios, why does it say it is enabled?)

BIOS "forgot" to disable DMA bits...

> >Doing the command
> >sfdisk -R /dev/hdc
> >
> >gives:
> >
> >* * *
> >ide1: start_request: current=0xc6ebe754 (rq->sect=0,block 0)
> >hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
> >ide: failed opcode was: unknown
> >hdc: drive not ready for command
> >ide1: start_request: current=0xc6ebe754 (rq->sect=0,block 0)
> >hdc: do_special: 0x02
> >hdc: do_special: recalibrate
> >ide1: start_request: current=0xc6ebe754 (rq->sect=0,block 0)
> >hdc: reading: block=0 sectors=8, buffer = 0xc6cd4
> >ide1: end_request: current=0xc6ebe754
> >* * *
> >
> >the 'bad bit' in status error is DataRequest
> > 
> > 
> > Seems like the device wants data from/to host and I have no idea why this
> > is happening.  It might be that this particular CF has problems with one
> > of the commands that IDE driver issues during device initialization.
> > 
> > I assume that device is recognized properly by the driver during probe, 
> > right?
> > If so probably adding some debugging printks (i.e. dumping status register)
> > to ide-disk.c:idedisk_setup() would shed some more light at the problem...
> 
> The device seems to be recognized properly.
> Here's (part of) the dmesg output:
> 
>  * * *
> 
> ide1: BM-DMA at 0xe408-0xe40f, BIOS settings: hdc:DMA, hdd:pio
> Probing IDE interface ide1...
> probing for hdc: present=0, media=32, probetype=ATA
> hdc: SanDisk SDCFB-128, CFA DISK drive ()
> Before ide_disk_init_chs() and ide_disk_init_mult_count()
> IDE_STATUS_REG=0x50
> After ide_disk_init_chs() and ide_disk_init_mult_count() IDE_STATUS_REG=0x50
> probing for hdd: present=0, media=32, probetype=ATA
> probing for hdd: present=0, media=32, probetype=ATAPI
> ide_init_queue()
> ide1 at 0x170-0x177,0x376 on irq 15
> Probing IDE interface ide0...
> probing for hda: present=0, media=32, probetype=ATA
> probing for hda: present=0, media=32, probetype=ATAPI
> probing for hdb: present=0, media=32, probetype=ATA
> probing for hdb: present=0, media=32, probetype=ATAPI
> Probing IDE interface ide2...
> probing for hde: present=0, media=32, probetype=ATA
> probing for hde: present=0, media=32, probetype=ATAPI
> probing for hdf: present=0, media=32, probetype=ATA
> probing for hdf: present=0, media=32, probetype=ATAPI
> Probing IDE interface ide3...
> probing for hdg: present=0, media=32, probetype=ATA
> probing for hdg: present=0, media=32, probetype=ATAPI
> probing for hdh: present=0, media=32, probetype=ATA
> probing for hdh: present=0, media=32, probetype=ATAPI
> hdc: max request size: 128KiB
> After init_idedisk_capacity() IDE_STATUS_REG=0x50
> After idedisk_capacity() IDE_STATUS_REG=0x50
> hdc: 250880 sectors (128 MB) w/1KiB Cache (buf_size=2), CHS=980/8/32
> After write_cache(drive,1) IDE_STATUS_REG=0x50
>  hdc:
> ide1: start_request: current=0xc1190804 (rq->sect=0,block 0, SECTOR_SIZE=512
> hdc: do_special: 0x03
> hdc: do_special: set_geometry
> ide1: start_request: current=0xc1190804 (rq->sect=0,block 0, SECTOR_SIZE=512
> hdc: do_special: 0x02
> hdc: do_special: recalibrate
> hdc : recal_intr() IDE_STATUS_REG=50
> ide1: start_request: current=0xc1190804 (rq->sect=0,block 0, SECTOR_SIZE=512
> hdc: reading: block=0, sectors=8, buffer=0xc6c2d000
> ide1: end_request:   current=0xc1190804
>  hdc1
> 
>  * * *
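
For reference, the status values in the logs above can be decoded bit by bit. The following is a minimal sketch, not kernel code: the helper name ata_status_decode() and the table are made up for illustration, and the bit names follow the strings the IDE driver prints (the driver itself prints only "Busy" when the busy bit is set, which this sketch does not imitate).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* ATA status register bits, named as in the driver's log output. */
struct ata_status_bit {
    unsigned char mask;
    const char *name;
};

static const struct ata_status_bit ata_status_bits[] = {
    { 0x80, "Busy" },
    { 0x40, "DriveReady" },
    { 0x20, "DeviceFault" },
    { 0x10, "SeekComplete" },
    { 0x08, "DataRequest" },
    { 0x04, "CorrectedError" },
    { 0x02, "Index" },
    { 0x01, "Error" },
};

/*
 * Hypothetical helper: write the names of all set bits into buf,
 * separated by single spaces. Output is silently truncated if buf
 * is too small. Returns buf.
 */
static char *ata_status_decode(unsigned char status, char *buf, size_t len)
{
    size_t i, used;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof(ata_status_bits) / sizeof(ata_status_bits[0]); i++) {
        if (!(status & ata_status_bits[i].mask))
            continue;
        used = strlen(buf);
        /* len - used is at least 1, so this never writes past buf. */
        snprintf(buf + used, len - used, "%s%s",
                 used ? " " : "", ata_status_bits[i].name);
    }
    return buf;
}
```

With this table, 0x50 decodes to DriveReady plus SeekComplete, and 0x58 differs from it only by the 0x08 DataRequest bit, which matches the error line quoted earlier.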
> 
> I dump IDE_STATUS_REG with e.g. 'printk("%s : recal_intr()
> IDE_STATUS_REG=%02x\n",drive->name,stat)'
> where

Re: [git patches] libata (and devres) fixes

2007-03-09 Thread Alan Cox

> scsi1 : ata_piix
> ata2: port disabled. ignoring.
> ata2: reset failed, giving up<--- THIS IS NEW.
> 
> However, I think it's just bogus as ata2 is disabled on this laptop.

This is expected behaviour and it is what every controller except the
PIIX has done for some time. I'm not sure it's perfect but we could return
0 from the -ENOENT case in ata_eh_reset() if that is preferred.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Matt Mackall

On Sat, Mar 10, 2007 at 09:18:05AM +1100, Con Kolivas wrote:
> On Saturday 10 March 2007 08:57, Con Kolivas wrote:
> > On Saturday 10 March 2007 08:39, Matt Mackall wrote:
> > > On Sat, Mar 10, 2007 at 08:19:18AM +1100, Con Kolivas wrote:
> > > > On Saturday 10 March 2007 08:07, Con Kolivas wrote:
> > > > > On Saturday 10 March 2007 07:46, Matt Mackall wrote:
> > > > > > My suspicion is the problem lies in giving too much quanta to
> > > > > > newly-started processes.
> > > > >
> > > > > Ah that's some nice detective work there. Mainline does some rather
> > > > > complex accounting on sched_fork including (possibly) a whole timer
> > > > > tick which rsdl does not do. make forks off continuously so what you
> > > > > say may well be correct. I'll see if I can try to revert to the
> > > > > mainline behaviour in sched_fork (which was obviously there for a
> > > > > reason).
> > > >
> > > > Wow! Thanks Matt. You've found a real bug too. This seems to fix the
> > > > qemu misbehaviour and bitmap errors so far too! Now can you please try
> > > > this to see if it fixes your problem?
> > >
> > > Sorry, it's about the same. I now suspect an accounting glitch involving
> > > pipe wake-ups.
> > >
> > > 5x memload: good
> > > 5x execload: good
> > > 5x forkload: good
> > > 5 parallel makes: mostly good
> > > make -j 5: bad
> > >
> > > So what's different between makes in parallel and make -j 5? Make's
> > > job server uses pipe I/O to control how many jobs are running.
> >
> > Hmm it must be those deep pipes again then. I removed any quirks testing
> > > for those from mainline as I suspected it would be ok. Guess I'm wrong.
> 
> I shouldn't blame this straight up though if NO_HZ makes it better. Something 
> else is going wrong... wtf though?

Just so we're clear, dynticks has only 'fixed' the single non-parallel
make load so far.
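
For readers unfamiliar with the job server mentioned in the quoted text, here is a rough sketch of the idea: a pipe preloaded with one byte per free job slot, where starting a job means reading a byte and finishing one means writing it back. This is an illustration under assumptions, not GNU make's code. The struct and function names are invented, and the read end is made non-blocking here only so the sketch can report "no token" instead of sleeping.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* A token pool backed by a pipe: one byte in the pipe is one free job slot. */
struct jobserver {
    int rfd; /* read end: take a token */
    int wfd; /* write end: return a token */
};

/* Create the pipe and preload it with `slots` tokens. Returns 0 or -1.
 * `slots` is assumed small enough to fit in the pipe buffer. */
static int jobserver_init(struct jobserver *js, int slots)
{
    int fds[2];
    int i;

    assert(js != NULL && slots >= 0);
    if (pipe(fds) < 0)
        return -1;
    /* Non-blocking reads, so an empty pool is reported rather than waited on. */
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0)
        goto fail;
    for (i = 0; i < slots; i++) {
        if (write(fds[1], "+", 1) != 1)
            goto fail;
    }
    js->rfd = fds[0];
    js->wfd = fds[1];
    return 0;
fail:
    close(fds[0]);
    close(fds[1]);
    return -1;
}

/* Try to take one token: 1 on success, 0 if none is free, -1 on error. */
static int jobserver_try_acquire(struct jobserver *js)
{
    char token;
    ssize_t n = read(js->rfd, &token, 1);

    if (n == 1)
        return 1;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return -1;
}

/* Hand a token back when a job finishes. Returns 0 or -1. */
static int jobserver_release(struct jobserver *js)
{
    return write(js->wfd, "+", 1) == 1 ? 0 : -1;
}
```

The point for this thread is that every job start and finish under make -j goes through a pipe read or write, and therefore through pipe wake-ups, which five independent makes do not do.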

-- 
Mathematics is the supreme nostalgia of our time.

Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Jeremy Fitzhardinge

Ingo Molnar wrote:
> ok, sure, how about the one i mentioned: long-term i'd like to have a 
> paravirt model where the guest does not store /any/ page tables - all 
> paging is managed by the hypervisor. The guest has a vma tree, but 
> otherwise it does not process pagefaults, has no concept of a pte (if in 
> paravirt mode), has no concept of kernel page tables either: there are 
> hypercalls to allocate/free guest-kernel memory, etc. This needs some 
> (serious) MM surgery but it's doable and it's interesting as well. How 
> would you map this to the VMI backend?

You wouldn't.  Why would you?  It might be a useful interface - and it's
the perfect kind of high-level interface for pv_ops.  It might be worth
adapting a hypervisor to suit it, but you still need to support the
i386's pagetables.  So, you present the pv_ops interface with your
vma-based mappings, and it runs it through the vma->pagetable
translation layer to feed into either the i386 pagetables directly, or
to a hypervisor's page-based interface.

The important part is that there's more to the story than just pv_ops. 
If you wanted to make such a change, then you'd need to refactor the
i386 support code to add a vma->paging helper layer.  That layer would
be available for any pv_ops interface to use if it wishes.

(Remember, in the pv_ops model, bare hardware is a "hypervisor" too.)
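
To make the layering concrete, here is a toy sketch of what is described above: one high-level operation, a shared helper that breaks it into per-page work, and two backends that consume the pages. Every name in it (mmu_ops_sketch, range_to_pages, the counters) is hypothetical and exists only for illustration; none of it is the real paravirt_ops interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical high-level mapping hook, in the spirit of a pv_ops entry. */
struct mmu_ops_sketch {
    const char *name;
    /* Map npages pages starting at virtual page vpage; returns pages mapped. */
    unsigned long (*map_range)(unsigned long vpage, unsigned long npages);
};

static unsigned long pte_writes; /* entries written straight into page tables */
static unsigned long hypercalls; /* calls into a page-based hypervisor */

/* Shared helper layer: turn a range into per-page operations. */
static unsigned long range_to_pages(unsigned long vpage, unsigned long npages,
                                    void (*per_page)(unsigned long))
{
    unsigned long i;

    assert(per_page != NULL);
    for (i = 0; i < npages; i++)
        per_page(vpage + i);
    return npages;
}

/* Backend 1: bare hardware, which is just another "hypervisor" here. */
static void native_set_pte(unsigned long vpage)
{
    (void)vpage;
    pte_writes++;
}

/* Backend 2: a hypervisor that exposes a page-based interface. */
static void hv_map_page(unsigned long vpage)
{
    (void)vpage;
    hypercalls++;
}

static unsigned long native_map_range(unsigned long vpage, unsigned long npages)
{
    return range_to_pages(vpage, npages, native_set_pte);
}

static unsigned long hv_map_range(unsigned long vpage, unsigned long npages)
{
    return range_to_pages(vpage, npages, hv_map_page);
}

static const struct mmu_ops_sketch native_ops = { "native", native_map_range };
static const struct mmu_ops_sketch hv_ops = { "page-based hypervisor", hv_map_range };
```

A hypervisor that understood ranges natively would install its own map_range and skip the helper entirely, which is the flexibility being argued for.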

Next problem?

J

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Matt Mackall

On Sat, Mar 10, 2007 at 09:12:07AM +1100, Con Kolivas wrote:
> On Saturday 10 March 2007 08:57, Willy Tarreau wrote:
> > On Fri, Mar 09, 2007 at 03:39:59PM -0600, Matt Mackall wrote:
> > > On Sat, Mar 10, 2007 at 08:19:18AM +1100, Con Kolivas wrote:
> > > > On Saturday 10 March 2007 08:07, Con Kolivas wrote:
> > > > > On Saturday 10 March 2007 07:46, Matt Mackall wrote:
> > > > > > My suspicion is the problem lies in giving too much quanta to
> > > > > > newly-started processes.
> > > > >
> > > > > Ah that's some nice detective work there. Mainline does some rather
> > > > > complex accounting on sched_fork including (possibly) a whole timer
> > > > > tick which rsdl does not do. make forks off continuously so what you
> > > > > say may well be correct. I'll see if I can try to revert to the
> > > > > mainline behaviour in sched_fork (which was obviously there for a
> > > > > reason).
> > > >
> > > > Wow! Thanks Matt. You've found a real bug too. This seems to fix the
> > > > qemu misbehaviour and bitmap errors so far too! Now can you please try
> > > > this to see if it fixes your problem?
> > >
> > > Sorry, it's about the same. I now suspect an accounting glitch involving
> > > pipe wake-ups.
> > >
> > > 5x memload: good
> > > 5x execload: good
> > > 5x forkload: good
> > > 5 parallel makes: mostly good
> > > make -j 5: bad
> > >
> > > So what's different between makes in parallel and make -j 5? Make's
> > > job server uses pipe I/O to control how many jobs are running.
> >
> > Matt, could you check with plain 2.6.20 + Con's patch ? It is possible
> > > that he added bugs when porting to -mm, or that something in -mm causes
> > the trouble. Your experience with -mm seems so much different from mine
> > with mainline, there must be a difference somewhere !
> 
> Good idea.

2.6.20+RSDL+tickfix+noyield behaves more or less the same under make
-j5 as 2.6.21-rc3-mm1. A bit worse, perhaps. There's no tickless on
2.6.20, so that could explain that.

-- 
Mathematics is the supreme nostalgia of our time.

[PATCH] dio: invalidate clean pages before dio write

2007-03-09 Thread Zach Brown

dio: invalidate clean pages before dio write

This patch fixes a user-triggerable oops that was reported by Leonid Ananiev as
archived at http://lkml.org/lkml/2007/2/8/337.

dio writes invalidate clean pages that intersect the written region so that
subsequent buffered reads go to disk to read the new data.  If this fails the
interface tries to tell the caller that the cache is inconsistent by returning
EIO.

Before this patch we had the problem where this invalidation failure would
clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c.  Both
fs/aio.c and bio completion call aio_complete() and we reference freed memory,
usually oopsing.

This patch addresses this problem by invalidating before the write so that we
can cleanly return -EIO before ->direct_IO() has had a chance to return
-EIOCBQUEUED.

There is a compromise here.  During the dio write we can fault in mmap()ed
pages which intersect the written range with get_user_pages() if the user
provided them for the source buffer.  This is a crazy thing to do, but we can
make it mostly work in most cases by trying the invalidation again.   The
compromise is that we won't return an error if this second invalidation fails
if it's an AIO write and we have -EIOCBQUEUED.

This was tested by having two processes race performing large O_DIRECT and
buffered ordered writes.  Within minutes ext3 would see a race between
ext3_releasepage() and jbd holding a reference on ordered data buffers and
would cause invalidation to fail, panicking the box.  The test can be found in
the 'aio_dio_bugs' test group in test.kernel.org/autotest.  After this patch
the test passes.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
---

 mm/filemap.c |   48 
 1 file changed, 36 insertions(+), 12 deletions(-)

--- a/mm/filemap.c  Wed Feb 28 06:00:13 2007 +
+++ b/mm/filemap.c  Thu Mar 08 17:48:58 2007 -0800
@@ -2379,7 +2379,8 @@ generic_file_direct_IO(int rw, struct ki
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
ssize_t retval;
-   size_t write_len = 0;
+   size_t write_len;
+   pgoff_t end = 0; /* silence gcc */
 
/*
 * If it's a write, unmap all mmappings of the file up-front.  This
@@ -2388,23 +2389,46 @@ generic_file_direct_IO(int rw, struct ki
 */
if (rw == WRITE) {
write_len = iov_length(iov, nr_segs);
+   end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
if (mapping_mapped(mapping))
unmap_mapping_range(mapping, offset, write_len, 0);
}
 
retval = filemap_write_and_wait(mapping);
-   if (retval == 0) {
-   retval = mapping->a_ops->direct_IO(rw, iocb, iov,
-   offset, nr_segs);
-   if (rw == WRITE && mapping->nrpages) {
-   pgoff_t end = (offset + write_len - 1)
-   >> PAGE_CACHE_SHIFT;
-   int err = invalidate_inode_pages2_range(mapping,
+   if (retval)
+   goto out;
+
+   /*
+* After a write we want buffered reads to be sure to go to disk to get
+* the new data.  We invalidate clean cached page from the region we're
+* about to write.  We do this *before* the write so that we can return
+* -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
+*/
+   if (rw == WRITE && mapping->nrpages) {
+   retval = invalidate_inode_pages2_range(mapping,
offset >> PAGE_CACHE_SHIFT, end);
-   if (err)
-   retval = err;
-   }
-   }
+   if (retval)
+   goto out;
+   }
+
+   retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
+   if (retval)
+   goto out;
+
+   /* 
+* Finally, try again to invalidate clean pages which might have been
+* faulted in by get_user_pages() if the source of the write was an
+* mmap()ed region of the file we're writing.  That's a pretty crazy
+* thing to do, so we don't support it 100%.  If this invalidation
+* fails and we have -EIOCBQUEUED we ignore the failure.
+*/
+   if (rw == WRITE && mapping->nrpages) {
+   int err = invalidate_inode_pages2_range(mapping,
+ offset >> PAGE_CACHE_SHIFT, end);
+   if (err && retval >= 0)
+   retval = err;
+   }
+out:
return retval;
 }
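
As a worked example of the range arithmetic in the patch, the sketch below computes the first and last page index touched by a byte range, using the same shift expressions as the `end` calculation above. It is standalone illustration code: SKETCH_PAGE_SHIFT assumes 4 KiB pages, and the struct and function names are invented.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

struct page_range {
    unsigned long first; /* index of the first page touched */
    unsigned long last;  /* index of the last page touched, inclusive */
};

/* Page indices covered by the byte range [offset, offset + len); len > 0. */
static struct page_range byte_range_to_pages(unsigned long long offset,
                                             unsigned long long len)
{
    struct page_range r;

    assert(len > 0);
    r.first = (unsigned long)(offset >> SKETCH_PAGE_SHIFT);
    /* The "- 1" keeps a write ending exactly on a page boundary
     * from spilling into the next page. */
    r.last = (unsigned long)((offset + len - 1) >> SKETCH_PAGE_SHIFT);
    return r;
}
```

For instance, a 4096-byte write at offset 0 covers only page 0, while a 2-byte write at offset 4095 straddles pages 0 and 1.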
 

Re: [PATCH] swsusp: Disable nonboot CPUs before entering platform suspend

2007-03-09 Thread Rafael J. Wysocki

On Friday, 9 March 2007 23:13, Pavel Machek wrote:
> Hi!
> 
> > > > Index: linux-2.6.21-rc3/kernel/power/user.c
> > > > ===
> > > > --- linux-2.6.21-rc3.orig/kernel/power/user.c
> > > > +++ linux-2.6.21-rc3/kernel/power/user.c
> > > > @@ -402,9 +402,10 @@ static int snapshot_ioctl(struct inode *
> > > >  
> > > > case PMOPS_ENTER:
> > > > if (data->platform_suspend) {
> > > > +   disable_nonboot_cpus();
> > > > 
> > > > kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
> > > > error = pm_ops->enter(PM_SUSPEND_DISK);
> > > > -   error = 0;
> > > > +   enable_nonboot_cpus();
> > > 
> > > Why did we discard return code in previous versions? Do we still want
> > > to do that?
> > 
> > I think it was a mistake.
> 
> I took a look at git-annotate, and it is your code, so I assume you
> are right. ACK, then.

Thanks!

BTW, what about the patch at http://lkml.org/lkml/2007/3/8/363?

Rafael

do_generic_mapping_read performance issue

2007-03-09 Thread Ashif Harji



Hi, I am encountering a performance problem, which I have tracked into the 
Linux kernel. The problem occurs with my experimental web server that uses 
sendfile to repeatedly transmit files.  The files are based on the static 
portion of the SPECweb99 fileset and range in size to model a reasonable 
workload.  With this workload, a significant number of the requests are 
for files of size 4 KB or less.


I have determined that the performance problem occurs in the function
do_generic_mapping_read in file mm/filemap.c for kernel version 2.6.20.1.
Here is the specific code fragment:

/*
 * When (part of) the same page is read multiple times
 * in succession, only mark it as accessed the first time.
 */
if (prev_index != index)
mark_page_accessed(page);


The implication of this code is that for files of size less than or equal 
to a single page, the page associated with such a file is likely to get 
evicted from the cache regardless of how frequently it is accessed.  The 
reason is that after the first access, prev_index is always zero and index 
can only be zero. Hence, mark_page_accessed is never called after the 
first time the file is requested.  As a result, the page is evicted from 
the cache no matter how frequently it is used.  By changing the kernel to 
always call mark_page_accessed for these files, the server throughput is 
increased by as much as 20%.


I was wondering if anyone could explain why the call to mark_page_accessed 
is conditional? That is, what problem it is trying to solve. It would seem 
that in many scenarios, if the same page is accessed repeatedly, then it 
would be appropriate to keep that page cached.


Please personally CC me on any responses.

thanks,
ashif.

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Willy Tarreau

On Sat, Mar 10, 2007 at 09:12:07AM +1100, Con Kolivas wrote:
(...)
> > Matt, could you check with plain 2.6.20 + Con's patch ? It is possible
> > that he added bugs when porting to -mm, or that something in -mm causes
> > the trouble. Your experience with -mm seems so much different from mine
> > with mainline, there must be a difference somewhere !
> 
> Good idea.

OK, so let me summarize :

  plain 2.6.20
  + http://www.kernel.org/pub/linux/kernel/v2.6/patch-2.6.20.2.bz2
  + http://ck.kolivas.org/patches/staircase-deadline/sched-rsdl-0.26.patch
  + http://marc.theaimsgroup.com/?l=linux-kernel&m=117347544926731&q=raw

should be a good starting point.

> > Con, is your patch necessary for mainline patch too ? I see that it
> > should apply, but sometimes -mm may justify changes.
> 
> Yes it will be necessary for the mainline patch too.

OK Thanks Con.

Best regards,
Willy


Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Jeremy Fitzhardinge

Ingo Molnar wrote:
> yep. That's precisely my worry. And it doesn't have to be a 'great' thing 
> - just any random small change in the kernel that makes sense: what is 
> the likelihood that it cannot be implemented, no matter what amount of 
> insight, paravirt_ops + hyper-ABI emulation hackery, for FoobieVisor, 
> because FoobieVisor messed up its ABI.
>
> that likelihood is a pure function of how FoobieVisor's hypercall ABI is 
> shaped. Wow! So can you guess where my fixation about not having too 
> many ABIs could possibly originate from? ;-)
>   

OK, so it's a problem that's happened before.  "It's a great idea, it's
so nice, but it breaks X."  Your options are:

   1. Well, nobody is really using X.  We can stop supporting it.
   2. X makes up 50% of the users, we'll just have to do without your
  great idea.
   3. Maybe we can get X updated so this idea works.

If X is a piece of hardware, then you're probably stuck with options 1
and 2.  If it's something like firmware or a hypervisor, you might have a
chance with option 3.

The hypervisor interface is not at all special in this regard; you may
as well be arguing "We can't allow a port of Linux to the FoobieTron2000
CPU, because it might constrain some future development"; that's true,
it might.  But I don't think I've ever seen anyone make that argument
for not accepting a new architecture port.

I don't really understand what your overall argument is though.  Sure, I
guess it's that if there's one ABI for all hypervisors, then you've only
got one hypervisor-related constraint to consider when evaluating a new
kernel change.  But that ABI is going to be as constraining as its
most constraining hypervisor, so you're not really in a better position
than if you have N hypervisor ABIs.  In fact you're worse off, because
you have no flexibility to drop/adapt/whatever the real blocker.

> _Now_ at least i've got this minimal 
> admission that FoobieVisor _might_ break. Quite a breakthrough =B-)

If you went to all that typing to get that much of a concession, then
you have way too much time ;)

J


Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Ingo Molnar

* Chris Wright <[EMAIL PROTECTED]> wrote:

> * Ingo Molnar ([EMAIL PROTECTED]) wrote:
> > i am worried whether /any/ future change to the upstream kernel's design 
> > can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i 
> > mean truly any. And whether that can be done is not a function of the 
> > flexibility of paravirt_ops, it's a function of the flexibility of the 
> > VMI ABI.
> 
> i'm not really one to argue on behalf of VMI, but i don't think it's 
> as dire make it out. [...]

hey, that's what i thought when i helped do the vDSO, until i got 
slapped with cold reality called "CONFIG_COMPAT_VDSO". I'm a bit more 
careful about ABIs since then =B-)

> [...] the VMI is client code of pv_ops, and as the kernel changes that 
> client code will simply have to adapt.  of course there are 
> theoretical limitations, but let's keep it grounded to practical 
> reality. the whole premise is evolution.  so throw out specific 
> issues, and let's adapt rather than fall deep into theoretical 
> rhetoric.

ok, sure, how about the one i mentioned: long-term i'd like to have a 
paravirt model where the guest does not store /any/ page tables - all 
paging is managed by the hypervisor. The guest has a vma tree, but 
otherwise it does not process pagefaults, has no concept of a pte (if in 
paravirt mode), has no concept of kernel page tables either: there are 
hypercalls to allocate/free guest-kernel memory, etc. This needs some 
(serious) MM surgery but it's doable and it's interesting as well. How 
would you map this to the VMI backend?

Ingo

Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Linus Torvalds

On Fri, 9 Mar 2007, Ingo Molnar wrote:
> 
> hm. So your point is that VMI is in essence a Turing machine (a 
> near-complete one)? No matter what redesign we do on the Linux side, the 
> VMI paravirt_ops will always be able to adopt to it?

No, I don't think it's turing-complete ;)

But since it tries to basically come fairly close to emulating the 
hardware we already use, any higher-level abstraction we do (which 
obviously has to work on real hardware too!) is likely to be translatable 
into the "pseudo-hardware" thing that is the VMI interfaces.

> but ... maybe because VMI is so lowlevel and covers /all/ of x86 today, 
> it will always be able to emulate whatever different concept we can come 
> up with? Do we really know this absolutely sure?

"For sure"? Absolutely not. But since any new interfaces we come up with 
for doing timers etc had better work perfectly fine on an old hardware 
platform too, we can't exactly require any interfaces that do things that 
a bog-standard old dual-PPro didn't do 10 years ago, can we?

So assuming that the VMI interface is roughly as powerful (by virtue of 
basically emulating it) as the old single-ioapic/single-lapic systems we 
used to use, I don't think it should ever be a real problem. Hmm?

Linus

Re: [PATCH] Fix building kernel under Solaris 11_snv

2007-03-09 Thread Sam Ravnborg

On Fri, Mar 09, 2007 at 09:16:35PM +0100, Jan Engelhardt wrote:
> 
> On Mar 9 2007 20:00, Sam Ravnborg wrote:
> >On Thu, Mar 08, 2007 at 11:01:57PM +0100, Jan Engelhardt wrote:
> >> 
> >> Since Solaris seems to be on the run, I did myself try compile it. 
> >> However, unlike the original poster who said he did so on SunOS 4.8, I 
> >> did it on 5.11_snv39, yielding a bigger changeset. I thought I just 
> >> share the diff that piled up so far. It needs a lot of hacks on the 
> >> Solaris side - prioritizing GNU names, then, second, gnu ld has a 
> >> glitch, then, gcc has a missing file... it's fun fun fun!
> >
> >Can I please have a signed-off version of this patch.
> 
> _Are you sure_ you want all these hacks without further
> review from other people? Also note the patch is incomplete,
> for example I could not compile the acpi pieces because
> acsolaris.h -- which is referenced in the acpi includes --
> does not exist. (Yet another piece of software that has
> crossplatform compatibility stuff, like XFS.)

The Signed-off-by: line documents the origin of the patch.
I'm planning only to take the sensible bits of the patch anyway.

> >> --- linux-2.6.21-rc3.orig/include/linux/input.h2007-03-07 
> >> 05:41:20.0 +0100
> >> +++ linux-2.6.21-rc3/include/linux/input.h 2007-03-07 23:40:39.417339000 
> >> +0100
> >> @@ -16,7 +16,9 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> -#include 
> >> +#ifndef __sun__
> >> +# include 
> >> +#endif
> >>  #endif
> 
> This is not a proper fix for sure. The problem lies in
> file2alias.c, see (your own) http://lkml.org/lkml/2007/3/8/339
I already committed my own fix - so this chunk can be ignored.

> 
> >> Index: linux-2.6.21-rc3/scripts/genksyms/genksyms.c
> >> ===
> >> --- linux-2.6.21-rc3.orig/scripts/genksyms/genksyms.c  2007-03-07 
> >> 05:41:20.0 +0100
> >> +++ linux-2.6.21-rc3/scripts/genksyms/genksyms.c   2007-03-07 
> >> 23:28:35.659555000 +0100
> >> @@ -21,6 +21,7 @@
> >> along with this program; if not, write to the Free Software Foundation,
> >> Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.  */
> >>  
> >> +#include 
> >>  #include 
> >>  #include 
> >>  #include 
> 
> This one, however, is valid. Can I give sign-offs for single hunks?
Agree and yes.

> 
> >> Index: linux-2.6.21-rc3/scripts/kallsyms.c
> >> ===
> >> --- linux-2.6.21-rc3.orig/scripts/kallsyms.c   2007-03-07 
> >> 05:41:20.0 +0100
> >> +++ linux-2.6.21-rc3/scripts/kallsyms.c2007-03-07 23:46:46.249005000 
> >> +0100
> >> @@ -378,6 +378,40 @@
> >>table_cnt = pos;
> >>  }
> >>  
> >> +#ifdef __sun__
> >> +/* Return the first occurrence of NEEDLE in HAYSTACK.  */
> >> +void *
> >> +memmem (haystack, haystack_len, needle, needle_len)
> >> + const void *haystack;
> >> +  return (void *) begin;
> >> +
> >> +  return NULL;
> >> +}
> >> +#endif
> >> +
> >>  /* replace a given token in all the valid symbols. Use the sampled symbols
> >>   * to update the counts */
> >>  static void compress_symbols(unsigned char *str, int idx)
> 
> This one, I am just waiting for someone to object to the extra #if-#endif.
I was planning to ask Paulo if strstr could not be used - Paulo?
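
For context on that question, here is a sketch of what a portable fallback could look like, and where it differs from strstr. The function is named memmem_fallback() to avoid clashing with a libc memmem(); it is not the code from the quoted patch, which is cut off above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the first occurrence of needle in haystack,
 * or NULL if there is none. Lengths are explicit, so both buffers
 * may contain zero bytes.
 */
static const void *memmem_fallback(const void *haystack, size_t haystack_len,
                                   const void *needle, size_t needle_len)
{
    const unsigned char *h = haystack;
    size_t i;

    assert(haystack != NULL && needle != NULL);
    if (needle_len == 0)
        return haystack;
    if (needle_len > haystack_len)
        return NULL;
    for (i = 0; i + needle_len <= haystack_len; i++) {
        if (memcmp(h + i, needle, needle_len) == 0)
            return h + i;
    }
    return NULL;
}
```

strstr stops at the first zero byte, so it is a drop-in replacement only if both buffers are NUL-terminated strings without embedded zeros. Whether that holds for the callers in kallsyms.c is not something this sketch establishes.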

> 
> >> Index: linux-2.6.21-rc3/scripts/kconfig/Makefile
> >> ===
> >> --- linux-2.6.21-rc3.orig/scripts/kconfig/Makefile 2007-03-07 
> >> 05:41:20.0 +0100
> >> +++ linux-2.6.21-rc3/scripts/kconfig/Makefile  2007-03-07 
> >> 23:21:19.730679000 +0100
> >> @@ -88,7 +88,7 @@
> >>  HOST_EXTRACFLAGS = $(shell $(CONFIG_SHELL) $(check-lxdialog) -ccflags)
> >>  HOST_LOADLIBES   = $(shell $(CONFIG_SHELL) $(check-lxdialog) -ldflags 
> >> $(HOSTCC))
> >>  
> >> -HOST_EXTRACFLAGS += -DLOCALE
> >> +HOST_EXTRACFLAGS += -DLOCALE -std=c99 -D__EXTENSIONS__
> >>  
> >>  PHONY += $(obj)/dochecklxdialog
> >>  $(obj)/dochecklxdialog:
> 
> The error message for this one was:  only valid in C99 mode.
> Linux GCC 4.1.2 does not print that, Solaris GCC 3.4.3 does. I do not
> know offhand who is right.
The -std= looks safe. The __EXTENSIONS__ part seems safe and needed after a bit of
googling.

> 
> >> Index: linux-2.6.21-rc3/scripts/kconfig/lxdialog/dialog.h
> >> ===
> >> --- linux-2.6.21-rc3.orig/scripts/kconfig/lxdialog/dialog.h
> >> 2007-03-07 05:41:20.0 +0100
> >> +++ linux-2.6.21-rc3/scripts/kconfig/lxdialog/dialog.h 2007-03-07 
> >> 23:14:48.462956000 +0100
> >> @@ -222,3 +222,7 @@
> >>   *   -- uppercase chars are used to invoke the button (M_EVENT + 'O')
> >>   */
> >>  #define M_EVENT (KEY_MAX+1)
> >> +
> >> +#ifndef KEY_RESIZE
> >> +# define KEY_RESIZE 0632
> >> +#endif
> 
> Solaris only has curses, not ncurses. Consider this a supreme hack.
> In fact, menuconfig has some weird display errors still.
T

question regarding the Linux block device cache

2007-03-09 Thread Xin Zhao


Hi,

I am working on a file system that allows multiple files to share data
blocks. That is, a data block can be shared by two or more files. Now
my question is: suppose files A and B share the same data block D. Now
a process opens file A and reads block D, then this process closes file
A. If another process opens file B and reads block D right after the
first process closes A, is the data of block D read from some cache or
does it have to be loaded from disk again? I think this has to do with the
Linux block device buffer cache. But I am not quite familiar with this
part.

Can someone help me or direct me to the right place to find the answer?

Thanks in advance!

-x

Re: [PATCH] drivers: PMC MSP71xx LED driver

2007-03-09 Thread Marc St-Jean


Andrew Morton wrote:
>  > On Mon, 26 Feb 2007 17:48:55 -0600 Marc St-Jean 
> <[EMAIL PROTECTED]> wrote:
>  > [PATCH] drivers: PMC MSP71xx LED driver
>  >
>  > Patch to add LED driver for the PMC-Sierra MSP71xx devices.
>  >
>  > This patch references some platform support files previously
>  > submitted to the [EMAIL PROTECTED] list.

Thanks for the feedback Andrew, I've implemented your recommendations.
A few comments/answers below.

[...]

>  > + /* determine the progress into the current cycle, relative to 
> the POLL_PERIOD */
>  > + initialPeriod = (u8)(*ledRegPtr >> MSP_LED_INITIALPERIOD_SHIFT);
>  > + finalPeriod = (u8)(*ledRegPtr >> MSP_LED_FINALPERIOD_SHIFT);
>  > + ledTimeOut = (u8)(*ledRegPtr >> MSP_LED_WATCHDOG_SHIFT);
>  > + timer = (u8)(private_msp_led_register[ledId] >> 
> MSP_LED_WATCHDOG_SHIFT);
> 
> I assume all these (u8) casts are unneeded.
> 
>  > + totalPeriod = (u16)initialPeriod + (u16)finalPeriod;
> 
> And here.

I assume the author didn't expect the integer promotion to occur until
after the addition.
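
A small standalone check of the promotion rule, for anyone following along: in C both u8 operands are promoted to int before the addition, so the sum cannot wrap at 255 regardless of the casts. The function names below are invented for the example and uint8_t/uint16_t stand in for the kernel's u8/u16.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sum two 8-bit periods. Both operands are promoted to int before
 * the addition, so no (u16) casts on the operands are needed.
 */
static uint16_t total_period(uint8_t initial_period, uint8_t final_period)
{
    /* The promotion relies on int being wider than 8 bits. */
    assert(sizeof(int) > sizeof(uint8_t));
    return (uint16_t)(initial_period + final_period);
}

/* For contrast: wrap-around happens only when the sum is stored in 8 bits. */
static uint8_t total_period_truncated(uint8_t initial_period, uint8_t final_period)
{
    return (uint8_t)(initial_period + final_period);
}
```

With periods of 200 and 100 the first function yields 300, while the second wraps to 44.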

[...]

> 
>  > +{
>  > + int pin;
>  > + u8 currDirectionBits, currDataBits, prevDataBits, 
> prevDirectionBits;
>  > + currDirectionBits = currDataBits = prevDataBits = 
> prevDirectionBits = 0;
> 
> The unneeded initialisations here are just to suppress the incorrect gcc
> warning, yes?

No, initialization is needed as they are passed by reference to functions
setting/clearing bits.

> If so, that should at least be commented.  And try to avoid declarations of
> this form as well as multiple assignments.  So you want:
> 
> u8 curr_direction_bits = 0; /* Suppress gcc warning */
> u8 curr_data_bits = 0;  /* Suppress gcc warning */
> u8 prev_data_bits = 0;  /* Suppress gcc warning */
> u8 prev_direction_bits = 0; /* Suppress gcc warning */
> 
> the initialisation does cause extra code to be generated and we usually just
> let the warning come out.  I think later gcc's fixed it.

OK, I've split them onto separate lines but without the comment.

[...]

> 
>  > +void __init pmctwiled_setup(void)
>  > +{
>  > + static int called;
>  > + int dev;
>  > +
>  > + /* check if already initialized */
>  > + if( called )
>  > + return;
> 
> This cannot happen (can it?)

Yes it can happen. Platform code can call pmctwiled_setup (that's why
the function was written) before the pmctwiled_init function runs.
This is so various sub-system init functions can ensure initialization
has occurred before setting start-up values.

If you have an idea on a better way to accomplish this I'm all ears.


>  > + /* initialize LEDs to default state */
>  > + for( dev = 0; dev < MSP_LED_NUM_DEVICES; dev++ ) {
>  > + int pin;
>  > + pmctwiled_device[dev] = NULL;
>  > +
>  > + for( pin = 0; pin < 8; pin++ ) {
>  > + int led = MSP_LED_DEVPIN(dev,pin);
>  > + if (mspLedInitialInputState[dev] & (1 << pin)) {
>  > + msp_led_disable(led);
>  > + } else {
>  > + msp_led_enable(led);
>  > + if (mspLedInitialPinState[dev] & (1 << pin))
>  > + msp_led_turn_on(led);
>  > + else
>  > + msp_led_turn_off(led);
>  > + }
>  > +
>  > + /* Initialize the private led register memory */
>  > + private_msp_led_register[led] = 0;
>  > + }
>  > + }
>  > +
>  > + /* indicate initialised */
>  > + called++;
>  > +}

[...]

>  > +typedef enum {
>  > + MSP_LED_INPUT = 0,
>  > + MSP_LED_OUTPUT,
>  > +} msp_led_direction_t;
> 
> No typedefs, please.   Convert this to
> 
> enum msp_led_direction {
> ...
> };

Alright I'll change it but it wasn't mentioned in the review of
the previous drivers and they've been resubmitted with some.
A quick search shows several drivers typedef'ing enums with and
without *_t suffixes.

Is there a new style rule or are only core kernel types allowed to
use _t?


>  > +/* Output modes */
>  > +typedef enum {
>  > + MSP_LED_OFF = 0,/* Off steady */
>  > + MSP_LED_ON, /* On steady */
>  > + MSP_LED_BLINK,  /* On for initialPeriod, off for finalPeriod */
>  > + MSP_LED_BLINK_INVERT,   /* Off for initialPeriod, on for finalPeriod */
>  > +} msp_led_mode_t;
> 
> Ditto.
> 
>  > +/* For non-LED pins, these macros set HI and LO accordingly */
>  > +#define msp_led_pin_hi   msp_led_turn_off
>  > +#define msp_led_pin_lo   msp_led_turn_on
> 
> eww.
> 
> s

Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Con Kolivas

On Saturday 10 March 2007 08:57, Con Kolivas wrote:
> On Saturday 10 March 2007 08:39, Matt Mackall wrote:
> > On Sat, Mar 10, 2007 at 08:19:18AM +1100, Con Kolivas wrote:
> > > On Saturday 10 March 2007 08:07, Con Kolivas wrote:
> > > > On Saturday 10 March 2007 07:46, Matt Mackall wrote:
> > > > > My suspicion is the problem lies in giving too much quanta to
> > > > > newly-started processes.
> > > >
> > > > Ah that's some nice detective work there. Mainline does some rather
> > > > complex accounting on sched_fork including (possibly) a whole timer
> > > > tick which rsdl does not do. make forks off continuously so what you
> > > > say may well be correct. I'll see if I can try to revert to the
> > > > mainline behaviour in sched_fork (which was obviously there for a
> > > > reason).
> > >
> > > Wow! Thanks Matt. You've found a real bug too. This seems to fix the
> > > qemu misbehaviour and bitmap errors so far too! Now can you please try
> > > this to see if it fixes your problem?
> >
> > Sorry, it's about the same. I now suspect an accounting glitch involving
> > pipe wake-ups.
> >
> > 5x memload: good
> > 5x execload: good
> > 5x forkload: good
> > 5 parallel makes: mostly good
> > make -j 5: bad
> >
> > So what's different between makes in parallel and make -j 5? Make's
> > job server uses pipe I/O to control how many jobs are running.
>
> Hmm it must be those deep pipes again then. I removed any quirks testing
> for those from mainline as I suspected it would be ok. Guess I'm wrong.

I shouldn't blame this straight up though if NO_HZ makes it better. Something 
else is going wrong... wtf though?
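
For readers unfamiliar with the job server mentioned above, here is a simplified model of token passing over a pipe. The function names are hypothetical and this is not GNU make's actual code or its exact protocol. The pipe holds one byte per free job slot; a worker reads a byte before starting a job and writes it back when done. Every job start and finish is therefore a small pipe read or write with a possible wake-up.

```c
#include <assert.h>
#include <unistd.h>

/* Create the pipe and fill it with one token byte per job slot.
 * Assumes a small slot count that fits in the pipe buffer. */
int jobserver_init(int fds[2], int slots)
{
	if (pipe(fds) < 0)
		return -1;
	for (int i = 0; i < slots; i++) {
		if (write(fds[1], "+", 1) != 1)
			return -1;
	}
	return 0;
}

/* Take a token before starting a job; blocks while none are available. */
int jobserver_acquire(int read_fd)
{
	char token;

	return read(read_fd, &token, 1) == 1 ? 0 : -1;
}

/* Return the token when the job finishes, waking one blocked reader. */
int jobserver_release(int write_fd)
{
	return write(write_fd, "+", 1) == 1 ? 0 : -1;
}
```

Independent parallel makes do not share such a pipe, which is one visible difference from a single make -j 5.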

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Pluggable Schedulers (was: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler)

2007-03-09 Thread Ryan Hope


From what I understood, there is a performance loss in plugsched
schedulers because they have to share code.

Even if pluggable schedulers are not a viable option, being able to
choose which one is built into the kernel would be easy (it only takes a
few ifdefs). I too think competition would be good.

On 3/9/07, Al Boldi <[EMAIL PROTECTED]> wrote:

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> >> I consider policy issues to be hopeless political quagmires and
> >> therefore stick to mechanism. So even though I may have started the
> >> code in question, I have little or nothing to say about that sort of
> >> use for it.
> >> There's my longwinded excuse for having originated that tidbit of code.
>
> On Fri, Mar 09, 2007 at 04:25:55PM +0300, Al Boldi wrote:
> > I've no idea what both of you are talking about.
>
> The short translation of my message for you is "Linus, please don't
> LART me too hard."

Right.

> On Fri, Mar 09, 2007 at 04:25:55PM +0300, Al Boldi wrote:
> > How can giving people the freedom of choice be in any way
> > counter-productive?
>
> This sort of concern is too subjective for me to have an opinion on it.

How diplomatic.

> My preferred sphere of operation is the Manichean domain of faster vs.
> slower, functionality vs. non-functionality, and the like. For me, such
> design concerns are like the need for a kernel to format pagetables so
> the x86 MMU decodes what was intended, or for a compiler to emit valid
> assembly instructions, or for a programmer to write C the compiler
> won't reject with parse errors.

Sure, but I think, even from a technical point of view, competition is a good
thing to have.  Pluggable schedulers give us this kind of competition, that
forces each scheduler to refine or become obsolete.  Think evolution.

> If Linus, akpm, et al object to the
> design, then invalid output was produced. Please refer to Linus, akpm,
> et al for these sorts of design concerns.

Point taken.

Linus Torvalds wrote:
> And hey, you can try to prove me wrong. Code talks. So far, nobody has
> really ever come close.
>
> So go and code it up, and show the end result. So far, nobody who actually
> *does* CPU schedulers have really wanted to do it, because they all want
> to muck around with their own private versions of the data structures.

What about PlugSched?


Thanks!

--
Al


Re: 2.6.21-rc3-mm1 RSDL results

2007-03-09 Thread Con Kolivas

On Saturday 10 March 2007 08:57, Willy Tarreau wrote:
> On Fri, Mar 09, 2007 at 03:39:59PM -0600, Matt Mackall wrote:
> > On Sat, Mar 10, 2007 at 08:19:18AM +1100, Con Kolivas wrote:
> > > On Saturday 10 March 2007 08:07, Con Kolivas wrote:
> > > > On Saturday 10 March 2007 07:46, Matt Mackall wrote:
> > > > > My suspicion is the problem lies in giving too much quanta to
> > > > > newly-started processes.
> > > >
> > > > Ah that's some nice detective work there. Mainline does some rather
> > > > complex accounting on sched_fork including (possibly) a whole timer
> > > > tick which rsdl does not do. make forks off continuously so what you
> > > > say may well be correct. I'll see if I can try to revert to the
> > > > mainline behaviour in sched_fork (which was obviously there for a
> > > > reason).
> > >
> > > Wow! Thanks Matt. You've found a real bug too. This seems to fix the
> > > qemu misbehaviour and bitmap errors so far too! Now can you please try
> > > this to see if it fixes your problem?
> >
> > Sorry, it's about the same. I now suspect an accounting glitch involving
> > pipe wake-ups.
> >
> > 5x memload: good
> > 5x execload: good
> > 5x forkload: good
> > 5 parallel makes: mostly good
> > make -j 5: bad
> >
> > So what's different between makes in parallel and make -j 5? Make's
> > job server uses pipe I/O to control how many jobs are running.
>
> Matt, could you check with plain 2.6.20 + Con's patch ? It is possible
> that he added bugs when porting to -mm, or that something in -mm causes
> the trouble. Your experience with -mm seems so much different from mine
> with mainline, there must be a difference somewhere !

Good idea.

> Con, is your patch necessary for mainline patch too ? I see that it
> should apply, but sometimes -mm may justify changes.

Yes it will be necessary for the mainline patch too.

> Best regards,
> Willy

-- 
-ck

Re: [PATCH] swsusp: Disable nonboot CPUs before entering platform suspend

2007-03-09 Thread Pavel Machek

Hi!

> > > Index: linux-2.6.21-rc3/kernel/power/user.c
> > > ===
> > > --- linux-2.6.21-rc3.orig/kernel/power/user.c
> > > +++ linux-2.6.21-rc3/kernel/power/user.c
> > > @@ -402,9 +402,10 @@ static int snapshot_ioctl(struct inode *
> > >  
> > >   case PMOPS_ENTER:
> > >   if (data->platform_suspend) {
> > > + disable_nonboot_cpus();
> > >   kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
> > >   error = pm_ops->enter(PM_SUSPEND_DISK);
> > > - error = 0;
> > > + enable_nonboot_cpus();
> > 
> > Why did we discard return code in previous versions? Do we still want
> > to do that?
> 
> I think it was a mistake.

I took a look at git-annotate, and it is your code, so I assume you
are right. ACK, then.
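
The shape of the change under discussion can be sketched in user space. The stubs below are hypothetical stand-ins for disable_nonboot_cpus(), pm_ops->enter() and enable_nonboot_cpus(), not the kernel functions themselves. The CPUs are re-enabled on every path and the result of the enter call is returned instead of being overwritten with 0.

```c
#include <assert.h>

/* Hypothetical state and stubs standing in for the kernel calls. */
int cpus_offline;
int fake_enter_result;

void disable_nonboot_cpus_stub(void) { cpus_offline = 1; }
void enable_nonboot_cpus_stub(void)  { cpus_offline = 0; }
int platform_enter_stub(void)        { return fake_enter_result; }

int enter_platform_suspend(void)
{
	int error;

	disable_nonboot_cpus_stub();
	error = platform_enter_stub();
	enable_nonboot_cpus_stub();	/* undone on success and on failure */

	return error;			/* no "error = 0" hiding a failure */
}
```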
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Ingo Molnar

* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> Now it may be that you've got a change that's absolutely great for 
> everyone, and the only blocker is that the FoobieVisor can't deal with 
> it.  OK, great, then you'd have a point.

yep. That's precisely my worry. And it doesn't have to be a 'great' thing
- just any random small change in the kernel that makes sense: what is
the likelihood that it cannot be implemented, no matter what amount of
insight, paravirt_ops + hyper-ABI emulation hackery, for FoobieVisor,
because FoobieVisor messed up its ABI.

that likelihood is a pure function of how FoobieVisor's hypercall ABI is
shaped. Wow! So can you guess where my fixation about not having too
many ABIs could possibly originate from? ;-)

Until today everyone on the hypervisor side of the argument pretended 
that paravirt_ops solves all problems and acted stupid when i said an 
ABI is an ABI is an ABI, and that "backwards compatibility" does have 
some technological consequences. _Now_ at least i've got this minimal 
admission that FoobieVisor _might_ break. Quite a breakthrough =B-)

Ingo

Re: ABI coupling to hypervisors via CONFIG_PARAVIRT

2007-03-09 Thread Chris Wright

* Ingo Molnar ([EMAIL PROTECTED]) wrote:
> i am worried whether /any/ future change to the upstream kernel's design 
> can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i 
> mean truly any. And whether that can be done is not a function of the 
> flexibility of paravirt_ops, it's a function of the flexibility of the 
> VMI ABI.

i'm not really one to argue on behalf of VMI, but i don't think it's as
dire as you make it out.  the VMI is client code of pv_ops, and as the kernel
changes that client code will simply have to adapt.  of course there are
theoretical limitations, but let's keep it grounded to practical reality.
the whole premise is evolution.  so throw out specific issues, and let's
adapt rather than fall deep into theoretical rhetoric.

thanks,
-chris

Lockdep report against pktcdvd

2007-03-09 Thread Blaisorblade

When booting my laptop with a 2.6.20.1 kernel with lockdep enabled to test my 
code, I got a lockdep warning on pktcdvd on setup. It seems that do_open on a 
pktcdvd device causes another do_open on the underlying device, and that 
mutex_lock_nested is called with the same subclass (the for_part argument to 
do_open). So this may be a false positive, after all, but I'll let you 
decide.
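
To make the subclass point concrete, here is a toy model of the check that produces a "possible recursive locking" report. It is an illustration with hypothetical names, not how lockdep is implemented. A second acquisition is flagged when a lock of the same class is already held with the same subclass, and a different subclass is treated as a distinct nesting level.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HELD 8

struct held_lock {
	int lock_class;	/* e.g. every bdev->bd_mutex shares one class */
	int subclass;	/* nesting level supplied by the caller */
};

struct held_lock held[MAX_HELD];
int nr_held;

/* Returns true if this acquisition would be flagged as possible
 * recursion; otherwise records the lock as held and returns false. */
bool toy_acquire(int lock_class, int subclass)
{
	for (int i = 0; i < nr_held; i++) {
		if (held[i].lock_class == lock_class &&
		    held[i].subclass == subclass)
			return true;
	}
	if (nr_held < MAX_HELD) {
		held[nr_held].lock_class = lock_class;
		held[nr_held].subclass = subclass;
		nr_held++;
	}
	return false;
}
```

In this model the nested do_open on the underlying device is flagged only because both calls use the same subclass, which matches the reading that the report may be a false positive.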

I've installed and configured Ubuntu udftools so that pktcdvd0 is linked 
to /dev/cdrw, i.e. /dev/sr0, on my system.

This is an extract from my /proc/config.gz - it shows that both LOCKDEP and 
FRAME_POINTER are enabled, so the stack trace below ought to be correct.

#
# Kernel hacking
#
CONFIG_DEBUG_KERNEL=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_STACKTRACE=y
CONFIG_FRAME_POINTER=y
CONFIG_DEBUG_RODATA=y
CONFIG_DEBUG_STACKOVERFLOW=y
~

[   56.517353] pktcdvd: writer pktcdvd0 mapped to sr0
[   56.525469] 
[   56.525471] =
[   56.525476] [ INFO: possible recursive locking detected ]
[   56.525479] 2.6.20.1-rfp+skas-v9-pre9+skas-dbg #3
[   56.525498] -
[   56.525501] vol_id/4536 is trying to acquire lock:
[   56.525503]  (&bdev->bd_mutex){--..}, at: [] do_open+0x7b/0x2c4
[   56.525515] 
[   56.525521] but task is already holding lock:
[   56.525536]  (&bdev->bd_mutex){--..}, at: [] do_open+0x7b/0x2c4
[   56.525560] 
[   56.525561] other info that might help us debug this:
[   56.525579] 2 locks held by vol_id/4536:
[   56.525593]  #0:  (&bdev->bd_mutex){--..}, at: [] do_open+0x7b/0x2c4
[   56.525610]  #1:  (&ctl_mutex#2){--..}, at: [] mutex_lock+0x22/0x26
[   56.525634] 
[   56.525634] stack backtrace:
[   56.525646] 
[   56.525652] Call Trace:
[   56.525666]  [] __lock_acquire+0x137/0xa62
[   56.525687]  [] __mutex_unlock_slowpath+0x129/0x14f
[   56.525712]  [] lock_acquire+0x4d/0x69
[   56.525732]  [] do_open+0x7b/0x2c4
[   56.525750]  [] mutex_lock_nested+0x106/0x2cd
[   56.525774]  [] do_open+0x7b/0x2c4
[   56.525795]  [] __blkdev_get+0x7b/0x8d
[   56.525830]  [] blkdev_get+0xb/0xd
[   56.525853]  [] :pktcdvd:pkt_open+0xb5/0xd52
[   56.525876]  [] __d_lookup+0x116/0x142
[   56.525897]  [] debug_check_no_locks_freed+0x12b/0x13a
[   56.525922]  [] trace_hardirqs_on+0x11a/0x13e
[   56.525944]  [] lockdep_init_map+0xa6/0x326
[   56.525968]  [] __mutex_lock_slowpath+0x281/0x2b4
[   56.525990]  [] mark_held_locks+0x53/0x71
[   56.526010]  [] __mutex_lock_slowpath+0x281/0x2b4
[   56.526034]  [] __mutex_unlock_slowpath+0x129/0x14f
[   56.526054]  [] mutex_lock_nested+0x298/0x2cd
[   56.526075]  [] mark_held_locks+0x53/0x71
[   56.526095]  [] mutex_lock_nested+0x298/0x2cd
[   56.526117]  [] debug_mutex_free_waiter+0x58/0x5c
[   56.526141]  [] mutex_lock_nested+0x2be/0x2cd
[   56.526165]  [] do_open+0xae/0x2c4
[   56.526184]  [] _spin_unlock+0x2d/0x4b
[   56.526205]  [] blkdev_open+0x0/0x6b
[   56.526225]  [] blkdev_open+0x34/0x6b
[   56.526247]  [] __dentry_open+0x128/0x201
[   56.526270]  [] nameidata_to_filp+0x2a/0x3c
[   56.526291]  [] do_filp_open+0x3d/0x4f
[   56.526315]  [] _spin_unlock+0x2d/0x4b
[   56.526335]  [] get_unused_fd+0xfa/0x10b
[   56.526356]  [] do_sys_open+0x4d/0xd5
[   56.526377]  [] sys_open+0x1b/0x1d
[   56.526396]  [] system_call+0x7e/0x83
[   56.526417] 
-- 
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade

Re: [PATCH] swsusp: Disable nonboot CPUs before entering platform suspend

2007-03-09 Thread Pavel Machek

Hi!

> > ...so, if pm_ops is non-null, power_down does nonboot cpu disabling,
> > otherwise we proceed with cpus enabled?
> > 
> > That looks ugly.
> > 
> > Is the warning bogus? Or maybe we should *always* disable nonboot cpus
> > in powerdown path?
> 
> Is disable_nonboot_cpus() assuming that first_cpu(cpu_present_map) is
> the boot cpu? Just wondering why disable_nonboot_cpus() isn't using just

I'd  say so. It is nicer (and required on some APM systems?) to do
shutdown from the boot cpu.
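
The assumption being asked about can be illustrated with a toy bitmask. The types and names are hypothetical, not the kernel's cpumask API. If "boot CPU" is taken to mean the first CPU set in the present mask, the CPUs to take offline are simply all the others; should the real boot CPU not be the lowest-numbered present CPU, this policy would pick the wrong one.

```c
#include <assert.h>

/* Toy CPU mask: bit n set means CPU n is present. */
typedef unsigned long toy_cpumask;

/* Isolate the lowest set bit, i.e. the "first" CPU in the mask. */
toy_cpumask toy_first_cpu(toy_cpumask present)
{
	return present & (~present + 1UL);
}

/* CPUs that a "keep only the first CPU" policy would take offline. */
toy_cpumask toy_nonboot_cpus(toy_cpumask present)
{
	return present & ~toy_first_cpu(present);
}
```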
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [PATCH 0/2] resource control file system - aka containers on top of nsproxy!

2007-03-09 Thread Paul Menage


On 3/9/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:


1. What is the fundamental unit over which resource-management is
applied? Individual tasks or individual containers?

/me thinks latter.


Yes


In which case, it makes sense to stick
resource control information in the container somewhere.


Yes, that's what all my patches have been doing.


2. Regarding space savings, if 100 tasks are in a container (I don't know
   what is a typical number) -and- let's say that all tasks are to share
   the same resource allocation (which seems to be natural), then having
   a 'struct container_group *' pointer in each task_struct seems to be not
   very efficient (simply because we don't need that task-level granularity of
   managing resource allocation).


I think you should re-read my patches.

Previously, each task had N pointers, one for its container in each
potential hierarchy. The container_group concept means that each task
has 1 pointer, to a set of container pointers (one per hierarchy)
shared by all tasks that have exactly the same set of containers (in
the various different hierarchies).

It doesn't give task-level granularity of resource management (unless
you create a separate container for each task), it just gives a space
saving.
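
The space argument can be sketched with two hypothetical layouts. The struct and function names and the hierarchy count are illustrative assumptions, not the definitions in the patches. In the first layout each task carries one pointer per hierarchy; in the second it carries a single pointer to a reference-counted set shared by every task with the same memberships.

```c
#include <assert.h>
#include <stddef.h>

#define NR_HIERARCHIES 4	/* hypothetical number of hierarchies */

struct container { int id; };

/* Earlier layout: every task carries one pointer per hierarchy. */
struct task_per_hierarchy {
	struct container *containers[NR_HIERARCHIES];
};

/* One set of container pointers, shared by all tasks with the same
 * combination of containers across the hierarchies. */
struct container_group {
	int refcount;
	struct container *containers[NR_HIERARCHIES];
};

/* Grouped layout: every task carries a single pointer. */
struct task_grouped {
	struct container_group *group;
};

/* Point a task at a shared group and account for the new user. */
void task_attach(struct task_grouped *task, struct container_group *group)
{
	group->refcount++;
	task->group = group;
}
```

The per-task cost drops from NR_HIERARCHIES pointers to one, while the granularity of resource management stays at the container level.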



3. This next leads me to think that 'tasks' file in each directory doesn't make
   sense for containers. In fact it can lend itself to error situations (by
   administrator/script mistake) when some tasks of a container are in one
   resource class while others are in a different class.

Instead, from a containers pov, it may be useful to write
a 'container id' (if such a thing exists) into the tasks file
which will move all the tasks of the container into
the new resource class. This is the same requirement we
discussed long back of moving all threads of a process into new
resource class.


I think you need to give a more concrete example and use case of what
you're trying to propose here. I don't really see what advantage
you're getting.

Paul

Re: [ckrm-tech] [PATCH 0/2] resource control file system - aka containers on top of nsproxy!

2007-03-09 Thread Paul Jackson

> the emphasis here is on 'from inside' which basically
> boils down to the following:
> 
>  if you create a 'resource container' to limit the
>  usage of a set of resources for the processes
>  belonging to this container, it would be kind of
>  defeating the purpose, if you'd allow the processes
>  to manipulate their limits, no?

Wrong - this is not the only way.

For instance in cpusets, -any- task in the system, regardless of what
cpuset it is currently assigned to, might be able to manipulate -any-
cpuset in the system.

Yes -- some sufficient mechanism is required to keep tasks from
escalating their resources or capabilities beyond an allowed point.

But that mechanism might not be strictly based on position in some
hierarchy.

In the case of cpusets, it is based on the permissions on files in
the cpuset file system (normally mounted at /dev/cpuset), versus
the current privileges and capabilities of the task.

A root-privileged task in the smallest leaf node cpuset can manipulate
every cpuset in the system.  This is an ordinary and common occurrence.

I say again, as you seem to be skipping over this detail, one
advantage of basing an API on a file system is the usefulness of
the file system permission model (the -rwxrwxrwx permissions and
the uid/gid owners on each file and directory node).
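
A simplified sketch of the permission model referred to above, with a hypothetical function name. It looks only at the owner, group and other write bits and treats uid 0 as always allowed. Real kernels also consider capabilities, supplementary groups and ACLs, which are deliberately left out here. The point is that the decision depends on the file's mode and owners versus the task's credentials, not on which cpuset the task sits in.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/stat.h>

/* May a task with (uid, gid) write a file with the given mode and owners? */
bool may_write(mode_t mode, uid_t file_uid, gid_t file_gid,
	       uid_t uid, gid_t gid)
{
	if (uid == 0)
		return true;			/* root bypasses the mode bits */
	if (uid == file_uid)
		return (mode & S_IWUSR) != 0;	/* owner class */
	if (gid == file_gid)
		return (mode & S_IWGRP) != 0;	/* group class */
	return (mode & S_IWOTH) != 0;		/* everyone else */
}
```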

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401

Re: [patch 2/3] fs: introduce perform_write aop

2007-03-09 Thread Anton Altaparmakov

On 9 Mar 2007, at 12:52, Nick Piggin wrote:

Hi Christoph,

On Fri, Mar 09, 2007 at 10:39:13AM +, Christoph Hellwig wrote:

Hi Nick,

sorry for my late reply, this has been on my to-answer list for the last
month and I only managed to get back to it now.

No worries, I haven't had much time to work on it since then anyway.
Thanks for taking a look.

On Thu, Feb 08, 2007 at 02:07:36PM +0100, Nick Piggin wrote:
as a single call to copy a given amount of userdata at the given offset. This
is more flexible, because the implementation can determine how to best handle
errors, or multi-page ranges (eg. it may use a gang lookup), and only requires
one call into the fs.
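
A back-of-the-envelope sketch of the "one call into the fs" point, assuming a 4096-byte page and hypothetical function names. A page-at-a-time write makes one prepare and one commit call per page touched, whereas a single perform_write-style call covers the whole range.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size for illustration */

/* Calls into the fs for a page-at-a-time write: one prepare and one
 * commit for every page the byte range [pos, pos + len) touches. */
unsigned long calls_per_page(unsigned long pos, unsigned long len)
{
	unsigned long first, last;

	if (len == 0)
		return 0;
	first = pos / TOY_PAGE_SIZE;
	last = (pos + len - 1) / TOY_PAGE_SIZE;
	return 2 * (last - first + 1);
}

/* Calls into the fs when the whole range is handed over at once. */
unsigned long calls_single(unsigned long len)
{
	return len ? 1 : 0;
}
```

A ten-page write costs twenty calls in the first model and one in the second, and the single call lets the filesystem allocate for the whole range up front.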

I really like this idea, especially for avoiding to call into the allocator
for every block.

Indeed.  FWIW my NTFS driver does not use the generic file write helper
function. It has its own code that does the allocation first and then does
the writing a page at a time, much the same as generic file write does. So,
depending on what this new interface ends up looking like exactly, I may well
be able to switch the NTFS driver to use it instead of doing everything by
myself and duplicating a ton of code from the VFS...

Best regards,

Anton

Have you contacted the reiser4 folks whether this would
supersede their batch_write op completely?

I haven't yet, although that's been on my todo list when I get the API
into a more final state.

batch_write seems quite similar, however theirs is still page based, and
a bit crufty, IMO. I found it to be really clean to just pass down offsets,
but that may be a matter for debate.

What they _do_ have is a write actor function that will do the data copy.
This could be one possible way to get rid of ->prepare_write and
->commit_write, but I haven't tried that yet, because I don't like adding
more redirection and complexity if possible...

One problem with this interface is that it cannot be used to write into the
filesystem by any means other than already-initialised buffers via iovecs. So
prepare/commit have to stay around for non-user data...

Actually I think that's a good thing to a certain extent.  It reminds
us that all other users are horrible abuse of the interface.  I'd even
go so far as to make batch_write a callback that the filesystem passes
to generic_file_aio_write to make clear it's not a generic thing but
a helper.  (It's not a generic thing because it's the upper layer writing
into the pagecache, not a pagecache to fs below operation).

OK, if you think that's reasonable, then that is one hurdle out of the way ;)

That still leaves open how to get rid of ->prepare_write and ->commit_write
completely, and for that we'll probably need ->kernel_read and ->kernel_write
file operations.  But that's a step you shouldn't consider yet when doing
this work.

I had a couple of possibilities for that. First is passing in a write actor
(eg. defaulting to the normal iovec usercopy), but as I said I consider this
more like fixing the problem with brute force (ie. just making the interface
more complex). Maybe as a last resort, though.

Another thing that would be much nicer from _my_ point of view would be to
just make all kernel users set up their data in an iovec, and use the normal
call with KERNEL_DS. Unfortunately, this is not the expected way for a lot
of code to work, and it might require extra copying of the data.

Another thing is that it seems to be less able to be implemented in generic,
reusable code. It should be possible to introduce a new 2-op interface (or
maybe just a new error handler op) which can be used correctly in generic code.

We should be able to find a nice abstraction for this, see my next mails.

+   /*
+* perform_write replaces prepare and commit_write callbacks.
+*/

This is a rather useless comment :)  Better remove it and add a proper
description to Documentation/filesystems/vfs.txt and
Documentation/filesystems/Locking

Will do. Thanks!

--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
